API rate limiting is the discipline of capping how often a client may call your service in a window of time — and the choice of algorithm decides whether you protect downstream systems gracefully or block legitimate users by accident. Five algorithms dominate production practice, and they are not interchangeable: each trades memory, burst tolerance, and accuracy differently.

The set was codified for a generation of engineers by Alex Xu's System Design Interview, which treats Fixed Window Counter, Sliding Window Log, Sliding Window Counter, Token Bucket, and Leaky Bucket as the canonical reference family. What that book frames as an interview answer is, in practice, the actual menu every platform team picks from when it builds a limiter.

This reference does three things. It compares the five algorithms on the axes that matter in production — memory cost, burst behavior, accuracy, and Redis implementation. It grounds the recommendation in real data, including Cloudflare's published analysis of sliding window counters at global scale. And it maps the messy reality of rate-limit headers, where the emerging IETF standard and the vendor-specific conventions used by GitHub and Stripe diverge sharply.

Key takeaways

01
Five algorithms, five distinct load shapes.Fixed Window is cheapest but allows 2x bursts at the boundary; Sliding Window Log is exact but O(n) memory; Sliding Window Counter is the near-exact O(1) default; Token Bucket permits controlled bursts; Leaky Bucket enforces a strict constant output rate.
02
Sliding Window Counter is the production default.It blends current- and previous-window counts with elapsed-time weighting, needs only two keys per client regardless of traffic, and is what Cloudflare runs across its global network — with a published 0.003% error rate across 400 million requests.
03
Redis limiters need Lua, not MULTI/EXEC.Lua scripts via EVAL execute atomically server-side in one round trip, can read a value and branch on it inside the same atomic operation, and avoid the TOCTOU races that WATCH/MULTI/EXEC retry loops are designed to paper over.
04
The 'standard' header you know is legacy.The de-facto X-RateLimit-* convention predates the actual standard. IETF draft-11 (May 23, 2026) defines RateLimit and RateLimit-Policy with structured parameters. Cloudflare adopted it in September 2025; GitHub and Stripe still use vendor-prefixed headers.
05
Client-side resilience is half the system.Server-side limits are only useful if clients back off well. AWS research found Full Jitter exponential backoff reduced call counts by over 50% versus non-jittered backoff with 100 contending clients — non-jittered backoff was the clear loser.

01 — The Five AlgorithmsThe canonical five, and what each is actually for.

Before the comparison table, it is worth stating plainly what each algorithm does and where its single biggest weakness lives. The differences are not academic — picking the wrong one shows up as either accidental blocking of real users or an attacker doubling your intended limit.

Fixed Window Counter

Count requests per fixed clock window (per minute, per hour) in a single key, reset at the boundary. It is the cheapest option — one key, one increment. Its core vulnerability is boundary amplification: a client can send roughly twice the limit by straddling the window edge, for example 100 requests at the end of one minute and 100 more at the start of the next, all within a couple of seconds.

Sliding Window Log

Store every request timestamp in a sorted set, then count entries inside the trailing window on each request. This is the most accurate algorithm — no boundary artifact at all — but it carries O(n) memory per client, where n is the number of requests in the window. Under high traffic it becomes impractical, because a single noisy client can hold thousands of timestamps.

Sliding Window Counter

The pragmatic middle ground, and the recommended default for most production APIs. It blends the current and previous fixed-window counts using elapsed-time weighting: estimate = prev_count × (1 − elapsed/window) + curr_count. The result is near-exact accuracy at O(1) memory — just two keys per client, regardless of how much traffic that client sends.

Token Bucket

A bucket accumulates tokens over time up to a maximum (the bucket size); each request deducts one token, and requests with no token available are rejected. This is the best default for developer-facing APIs that need to permit short bursts while still enforcing a long-term average. State per client is just two fields: the current token count and the last refill time.

Leaky Bucket

Requests enter a queue and drain at a strict constant rate. In policing mode excess requests are dropped; in shaping mode they are delayed. It is best suited to downstream-service protection where burst absorption is undesirable and you want a smooth, predictable output rate. Shopify uses a leaky-bucket model for its API.

Cheapest

Fixed Window Counter

O(1) · 1 key

One counter per clock window. Trivial to implement, but allows roughly 2x the limit at the window boundary. Fine for internal limits or login throttles where the boundary edge case does not matter.

Best for: internal limits, login throttle

Most accurate

Sliding Window Log

O(n) · 1 sorted set

Stores every request timestamp. Exact with no boundary artifact, but memory grows with request volume per client. Reserve it for strict audit or high-value B2B APIs where exactness justifies the cost.

Best for: strict audit APIs

The default

Sliding Window Counter

O(1) · 2 keys

Weighted blend of current and previous window counts. Near-exact accuracy at constant memory. Smooths the boundary burst and scales to any traffic. This is what most production public APIs should run.

Best for: general public APIs

Burst-friendly

Token Bucket

O(1) · 1 hash

Accumulates capacity up to a cap and deducts per request. Permits controlled bursts while enforcing a long-term average. The natural fit for developer APIs and SDKs that expect spiky usage.

Best for: developer APIs, SDKs

Strict shaping

Leaky Bucket

O(1) · 1 hash

Drains at a strict constant rate; excess is dropped (policing) or delayed (shaping). Best for protecting fragile downstream services where a smooth, predictable output rate matters more than burst tolerance.

Best for: traffic shaping

02 — Comparison MatrixOne 5x6 matrix, sourced cell by cell.

Most comparisons cover two or three algorithms, omit memory complexity, or skip real production users. The matrix below pulls all five onto a single grid with the axes that actually drive the decision: memory cost in Big-O terms, burst behavior, accuracy, the Redis structures each implementation needs, and a real-world example user for each.

Algorithm

Fixed Window CounterSTRING + Lua

Memory · accuracy

O(1), 1 key · Approximate

Bursts · best for

2x burst at the window boundary. Best for internal limits and login throttles where the boundary edge case is acceptable. Example: simple middleware limiters.

Algorithm

Sliding Window LogSORTED SET + Lua

Memory · accuracy

O(n), 1 sorted set · Exact

Bursts · best for

No burst artifact. Best for strict audit APIs and high-value B2B endpoints. Memory grows with request volume, so it is impractical under high traffic.

Algorithm

Sliding Window Counter2 STRINGs + Lua

Memory · accuracy

O(1), 2 keys · Near-exact

Bursts · best for

Smoothed boundary, constant memory. Best for general public APIs. Example: Cloudflare runs this across its global network.

Algorithm

Token BucketHASH + Lua

Memory · accuracy

O(1), 1 hash (2 fields) · Exact

Bursts · best for

Controlled bursts up to the bucket cap. Best for developer APIs and SDKs. Example: Stripe recommends a client-side token-bucket limiter.

Algorithm

Leaky BucketHASH + Lua

Memory · accuracy

O(1), 1 hash · Exact output rate

Bursts · best for

No bursts — excess dropped or delayed. Best for traffic shaping and downstream protection. Example: Shopify uses a leaky-bucket model for its API.

Algorithm	Memory · accuracy	Bursts · best for
Fixed Window CounterSTRING + Lua	O(1), 1 key · Approximate	2x burst at the window boundary. Best for internal limits and login throttles where the boundary edge case is acceptable. Example: simple middleware limiters.
Sliding Window LogSORTED SET + Lua	O(n), 1 sorted set · Exact	No burst artifact. Best for strict audit APIs and high-value B2B endpoints. Memory grows with request volume, so it is impractical under high traffic.
Sliding Window Counter2 STRINGs + Lua	O(1), 2 keys · Near-exact	Smoothed boundary, constant memory. Best for general public APIs. Example: Cloudflare runs this across its global network.
Token BucketHASH + Lua	O(1), 1 hash (2 fields) · Exact	Controlled bursts up to the bucket cap. Best for developer APIs and SDKs. Example: Stripe recommends a client-side token-bucket limiter.
Leaky BucketHASH + Lua	O(1), 1 hash · Exact output rate	No bursts — excess dropped or delayed. Best for traffic shaping and downstream protection. Example: Shopify uses a leaky-bucket model for its API.

How to read this matrix

The two columns that decide most production choices are memory and burst behavior. Sliding Window Counter is the default precisely because it is the only row that pairs O(1) memory with near-exact accuracy and a smoothed boundary. Reach for Token Bucket when bursts are a feature, and Leaky Bucket when bursts are a threat.

03 — Redis ImplementationWhy every limiter wants Lua, not MULTI/EXEC.

All five algorithms share the same correctness hazard: a rate-limit check is a read-modify-write. Read the current count, decide whether to allow, write the new count. If two requests interleave between the read and the write, both can be allowed past a limit that should have stopped one of them — a classic time-of-check-to-time-of-use (TOCTOU) race.

The Redis-native fix is a Lua script run via EVAL. Lua executes atomically on the server in a single round trip, so there is no window for another client to slip in between the read and the write — no TOCTOU race, and no WATCH/MULTI/EXEC retry loop. Crucially, and unlike MULTI/EXEC, a Lua script can read a value and branch on it within the same atomic operation, which is exactly what every one of these algorithms needs to do.

Distributed rate limiting is fundamentally a consistency problem.Arcjet engineering blog

That framing is the right one to keep in mind. A limiter is only as correct as the consistency of the counter behind it. The single-node Lua approach gives you atomicity for free; the harder problem, covered in Section 07, is keeping counters consistent across many instances without making the counter store itself the bottleneck.

04 — Production EvidenceCloudflare's published accuracy data.

The strongest case for choosing approximate sliding-window counting over exact sliding-window logging at scale is Cloudflare's own published analysis. In a historical analysis of its sliding-window counter running across its global network, Cloudflare measured the following over 400 million requests from 270,000 sources.

Sliding Window Counter accuracy · Cloudflare production

Source: Cloudflare published analysis

Total error rateAcross 400M requests · 270,000 sources

0.003%

Average estimate differenceEstimated rate vs real rate

False positivesLegitimate sources incorrectly blocked

Two numbers carry the recommendation. A 0.003% total error rate with a 6% average difference between the estimated and real rate means the approximation is comfortably good enough for enforcement — and the zero false positives means no legitimate source was wrongly blocked, which is the failure mode that actually costs you users. The lesson is that exactness is rarely worth O(n) memory when an O(1) approximation lands this close.

The architecture underneath is as instructive as the accuracy. Rather than a single centralised counter store, Cloudflare runs per-PoP counters — each point of presence keeps its own counts, because a single central store would collapse under Layer-7 attack traffic and add a single point of failure. Anycast routing tends to send the same client to the same PoP, which is what makes isolated per-PoP counters viable in the first place. The system handles several billion rate-limited requests per day and has mitigated attacks of 400,000 requests per second aimed at a single domain.

The architectural insight

The per-PoP versus centralised tradeoff is the part most reference posts skip. A centralised counter store is the obvious design — and the wrong one at attack scale, because it adds latency to every request and becomes the thing an attacker takes down first. Isolated counters plus consistent routing beats a single source of truth when the threat model is volume.

05 — Headers StandardThe standard you think you know is legacy.

Most engineering writing describes X-RateLimit-* as "the standard." It is not. It is a widely-copied de-facto convention that predates any actual specification. The real emerging standard is draft-ietf-httpapi-ratelimit-headers-11, published May 23, 2026 on the Standards Track by the IETF HTTPAPI Working Group — and it uses a completely different field syntax.

The draft defines two header fields. RateLimit reports the remaining service limit — a structured value with r for available quota and t for the effective window. RateLimit-Policy reports the quota policy — q for the total quota, w for the window, and an optional pk partition key. Cloudflare adopted this draft in September 2025; GitHub and Stripe have not, and still emit their own vendor-prefixed headers.

Provider

Cloudflare

Headers · format

Ratelimit + Ratelimit-Policy"default";r=50;t=30

429 & compliance

Returns Retry-After on 429 (seconds until refresh). Compliant with the IETF draft — adopted September 2025.

Provider

GitHub

Headers · format

x-ratelimit-*x-ratelimit-remaining: 4999

429 & compliance

limit / remaining / used / reset / resource. reset is a UTC epoch timestamp, not a delay. Not adopted — legacy X- prefix.

Provider

Stripe

Headers · format

Stripe-Rate-Limited-Reasonglobal-rate

429 & compliance

Reason values: global-concurrency, global-rate, endpoint-concurrency, endpoint-rate, resource-specific. SDKs handle retries internally. Not adopted — vendor prefix.

Provider

IETF draft-11

Headers · format

RateLimit + RateLimit-Policyr=50;t=30 · q=100;w=60

429 & compliance

Separate Retry-After. This is the standard itself — Standards Track, not yet an RFC, so treat field names as still-stabilising.

Provider	Headers · format	429 & compliance
Cloudflare	`Ratelimit` + `Ratelimit-Policy`"default";r=50;t=30	Returns Retry-After on 429 (seconds until refresh). Compliant with the IETF draft — adopted September 2025.
GitHub	`x-ratelimit-*`x-ratelimit-remaining: 4999	limit / remaining / used / reset / resource. reset is a UTC epoch timestamp, not a delay. Not adopted — legacy X- prefix.
Stripe	`Stripe-Rate-Limited-Reason`global-rate	Reason values: global-concurrency, global-rate, endpoint-concurrency, endpoint-rate, resource-specific. SDKs handle retries internally. Not adopted — vendor prefix.
IETF draft-11	`RateLimit` + `RateLimit-Policy`r=50;t=30 · q=100;w=60	Separate Retry-After. This is the standard itself — Standards Track, not yet an RFC, so treat field names as still-stabilising.

Three header gotchas

Three details break naive clients. GitHub's x-ratelimit-reset is a Unix epoch timestamp, not a seconds-until-reset delay — you must compute reset − now() yourself. Stripe does not document a Retry-After header; its SDKs retry internally, so do not assume one is present. And the IETF draft is Standards Track but not yet an RFC — it is not RFC 7807 (that is a different spec, Problem Details for HTTP APIs).

Clients MUST NOT consider the available quota parameter as a service level agreement.IETF draft-ietf-httpapi-ratelimit-headers-11

That line is the whole philosophy of the spec in one sentence. The headers are a hint to be polite with, not a contract you can lean on. The draft even warns that "many throttled clients may come back at the very moment specified," which is why effective-window calculations need jitter — the same principle that governs client-side retries in Section 08.

06 — Provider LimitsGitHub, Stripe, and Cloudflare, by the numbers.

Three of the most-integrated APIs publish their limits in detail, and the shapes are instructively different. GitHub layers a complex set of secondary limits over a simple hourly bucket; Stripe overrides a global per-second cap with much lower per-endpoint limits; Cloudflare enforces a hard five-minute window that locks out all API calls on breach.

GitHub primary

Authenticated REST tier

5,000/hr

Unauthenticated is 60 req/hour; a personal access token gets 5,000/hour; GitHub Enterprise Cloud apps get 15,000/hour; the GITHUB_TOKEN in Actions is 1,000/hour per repository.

Primary hourly bucket

GitHub secondary

The limit you actually hit

900pts/min

On top of the hourly bucket: <=100 concurrent requests, <=900 REST points/minute (GET=1pt, POST/PATCH/PUT/DELETE=5pt), <=90 CPU seconds per 60 real seconds, <=80 content-creating requests/minute.

POST/DELETE cost 5x GET

Stripe

Global live-mode limit

100/s

Live mode is 100 req/s (sandbox 25 req/s). Per-endpoint overrides go lower: Files and Search at 20 read/s, Create Payout at 15 req/s. Meter events are a separate 1,000 calls/s pool.

429 on breach

Cloudflare API

Per user/account token

1,200/5min

1,200 requests per 5-minute window per token, 200 req/s per IP, GraphQL capped at 320 per 5 minutes. Exceeding any limit returns 429 and blocks all API calls for the next five minutes.

Enterprise can raise

The pattern worth internalising is cost-aware rate limiting: limit by the resource cost of a request, not just its count. GitHub's secondary limit charges a POST, PUT, PATCH, or DELETE five points against a GET's one — so a write-heavy integration can exhaust the points budget long before it touches the hourly request bucket. Stripe applies the same idea differently, capping resource-intensive endpoints like Search and Files well below the global per-second limit. If you are building a limiter for your own API, weighting expensive operations is usually more important than the raw request ceiling.

07 — Distributed EnforcementCounters across many instances.

A single-process limiter is easy; the hard version is enforcing one shared limit across many instances. Two patterns dominate. The first is a shared counter store — typically Redis — synchronised across all nodes. Kong Gateway's rate-limiting-advanced plugin takes this route, using Redis clusters to synchronise counters across all Kong data-plane nodes, and can identify clients by consumer, credential, IP address, service, route, or a custom header value. That is the standard gateway-layer pattern for multi-instance deployments.

The second pattern is isolated counters plus consistent routing, which is what Cloudflare's per-PoP design demonstrates at the far end of the scale curve. Each node enforces independently, and routing keeps a given client landing on the same node often enough that isolated counts stay close to a global view. It trades a little accuracy for resilience — the right call when the threat model is volume rather than precision.

One sharp warning applies here. Nginx is not distributed. Its limit_req_zone and limit_req directives implement a leaky-bucket limiter, where the burst parameter queues excess requests and nodelay drops them immediately — but state is per-process, per-server. For distributed enforcement you need Redis behind it via OpenResty and lua-resty-limit-traffic, or an upstream API gateway. Presenting plain Nginx as a distributed solution is a common and costly mistake.

Multi-instance API

Shared Redis counters

Synchronise counts across all nodes through a Redis cluster — the Kong rate-limiting-advanced approach. Identify clients by consumer, credential, IP, service, route, or header. The standard gateway-layer pattern.

Pick a gateway + Redis

Extreme scale / attack

Per-node isolated counters

Each node counts independently; consistent routing keeps clients on the same node. Cloudflare's per-PoP design. Trades a little accuracy for resilience under Layer-7 attack volume.

Pick per-node + routing

Single server

Nginx leaky bucket

limit_req_zone + limit_req gives a per-process leaky bucket with burst queueing. Fine for one box — but not distributed. Add Redis via OpenResty or an upstream gateway the moment you scale horizontally.

Nginx alone = single node only

08 — Client-SideBackoff, jitter, and graceful degradation.

Server-side limits are only half the system. How clients react to a 429 decides whether a transient limit becomes a self-inflicted outage. The canonical mistake is naive retry — every throttled client retrying on the same fixed schedule, re-colliding in lockstep. The fix is exponential backoff with jitter.

AWS research on backoff strategies is the reference here. With 100 contending clients, "Full Jitter" — where the sleep is random(0, min(cap, base × 2^attempt)) — reduced the total call count by over 50% compared to non-jittered exponential backoff, and non-jittered backoff was "the clear loser" among the options tested. AWS describes three variants: Full Jitter, Equal Jitter, and Decorrelated Jitter. The mechanism is simple: randomness spreads clustered retry attempts into a steady-state pattern, flattening the contention spikes that synchronised retries create.

Stripe's own guidance

Stripe's documentation puts the client-side burden plainly: treat limits as maximums and avoid unnecessary load, and consider implementing a token-bucket limiter on the client side. The headers tell you where you stand; the discipline of not pushing to the ceiling is what keeps you out of 429 territory in the first place.

When backoff is not enough — sustained throttling rather than a transient spike — graceful degradation takes over. The practical toolkit is to serve stale cached responses, reduce response fidelity by dropping non-essential fields, fan out to a secondary provider, and wrap the dependency in a circuit breaker that fails fast once retries are consistently futile rather than queueing requests that will never succeed. Circuit breakers exist precisely for sustained unavailability, where continuing to retry only deepens the contention.

None of this works blind. Observability for rate limiting spans three layers — the proxy or gateway statistics (Envoy, Kong, Nginx), the rate-limit service counters in Redis, and Redis's own health metrics. The signals worth alerting on are the rate-limited request percentage per endpoint, retry success rates, latency percentiles under throttling, and circuit-breaker state transitions. The client-side complement — how to make retries idempotent so a backoff loop never double-applies an effect — is its own discipline, covered in our guide to webhook reliability, idempotency, and retries.

09 — Decision GuidePicking an algorithm by workload.

The algorithm choice should follow the load shape and the failure you most want to avoid, not a default copied from the last project. Four common workloads map cleanly onto the five algorithms.

Public API at scale

General-purpose throttling

Sliding Window Counter — O(1) memory, near-exact accuracy, a smoothed boundary, and Cloudflare's production track record. This is the default unless a specific requirement pushes you elsewhere.

Pick Sliding Window Counter

Developer API / SDK

Bursty, average-bounded usage

Token Bucket. It lets clients spend a burst of accumulated capacity for spiky workloads while still enforcing a long-term average — the behavior developers expect from a metered API.

Pick Token Bucket

Fragile downstream

Strict output shaping

Leaky Bucket. When bursts are a threat to a downstream service rather than a feature, the constant drain rate is exactly what you want — drop or delay, never amplify.

Pick Leaky Bucket

Audit / billing-grade

Exactness is non-negotiable

Sliding Window Log when correctness must be perfect and traffic per client is bounded; Fixed Window only for internal limits where the 2x boundary burst genuinely does not matter.

Log for exact, Fixed for cheap

The forward-looking signal is the header standard, not the algorithms. The algorithm set has been stable for years and is unlikely to change; what is shifting is interoperability. As more providers follow Cloudflare onto the IETF RateLimit / RateLimit-Policy headers, the cost of building a client that talks to many APIs should fall — but for now, any multi-provider client still has to special-case GitHub's epoch-timestamp resets and Stripe's SDK-internal retries alongside the new standard. Teams running AI inference APIs feel this most acutely, where per-tenant quota enforcement is a core cost-control lever; we cover that economics in our AI inference cost optimization playbook, and the shared Redis infrastructure pattern in our Redis caching strategies guide. If you are designing this layer for a production system, our web development engagements start with exactly this kind of architecture review.

10 — ConclusionA solved problem with one moving part.

The state of rate limiting, May 2026

Pick the algorithm for the load shape, and standardise the headers.

Rate limiting is a mature discipline with a stable answer set. Five algorithms cover the field, and the choice among them is governed by two questions: how much memory you can spend per client, and whether bursts are a feature or a threat. For most public APIs the answer is Sliding Window Counter — O(1) memory, near-exact accuracy, and a production track record at Cloudflare scale that few alternatives can match.

The implementation details that bite are well-understood once named: use a Lua script for atomicity rather than a MULTI/EXEC retry loop, weight expensive operations the way GitHub charges five points for a write, and never mistake Nginx's per-process limiter for a distributed one. On the client side, exponential backoff with jitter is not optional — AWS's own research shows non-jittered backoff is the clear loser, and a circuit breaker is what keeps sustained throttling from becoming an outage.

The one genuinely moving part is interoperability. The X-RateLimit-* convention most engineers treat as standard is legacy; the actual standard, IETF draft-11 from May 2026, uses different field names and is only beginning to see adoption. Build your limiter on the algorithm that fits your load, instrument all three observability layers, and write your client to tolerate the header divergence — because that divergence is the part of this problem still being settled.

API Rate Limiting: Five Algorithms, One 2026 Reference

01 — The Five AlgorithmsThe canonical five, and what each is actually for.

Fixed Window Counter

Sliding Window Log

Sliding Window Counter

Token Bucket

Leaky Bucket

Fixed Window Counter

Sliding Window Log

Sliding Window Counter

Token Bucket

Leaky Bucket

02 — Comparison MatrixOne 5x6 matrix, sourced cell by cell.

03 — Redis ImplementationWhy every limiter wants Lua, not MULTI/EXEC.

04 — Production EvidenceCloudflare's published accuracy data.

Sliding Window Counter accuracy · Cloudflare production

05 — Headers StandardThe standard you think you know is legacy.

06 — Provider LimitsGitHub, Stripe, and Cloudflare, by the numbers.

Authenticated REST tier

The limit you actually hit

Global live-mode limit

Per user/account token

07 — Distributed EnforcementCounters across many instances.

Shared Redis counters

Per-node isolated counters

Nginx leaky bucket

08 — Client-SideBackoff, jitter, and graceful degradation.

09 — Decision GuidePicking an algorithm by workload.

General-purpose throttling

Bursty, average-bounded usage

Strict output shaping

Exactness is non-negotiable

10 — ConclusionA solved problem with one moving part.

Pick the algorithm for the load shape, and standardise the headers.

A rate-limiting layer that stays graceful under real traffic.

Platform & API engineering

The questions we get every week.

Continue exploring engineering references.

Cloudflare Temporary Accounts for AI Agents, Explained

GraphQL vs REST in 2026: API Architecture Decision

REST API Design in 2026: A Full Engineering Reference

Background Jobs and Queues: 2026 Engineering Reference