API rate limiting is the discipline of capping how often a client may call your service in a window of time — and the choice of algorithm decides whether you protect downstream systems gracefully or block legitimate users by accident. Five algorithms dominate production practice, and they are not interchangeable: each trades memory, burst tolerance, and accuracy differently.
The set was codified for a generation of engineers by Alex Xu's System Design Interview, which treats Fixed Window Counter, Sliding Window Log, Sliding Window Counter, Token Bucket, and Leaky Bucket as the canonical reference family. What that book frames as an interview answer is, in practice, the actual menu every platform team picks from when it builds a limiter.
This reference does three things. It compares the five algorithms on the axes that matter in production — memory cost, burst behavior, accuracy, and Redis implementation. It grounds the recommendation in real data, including Cloudflare's published analysis of sliding window counters at global scale. And it maps the messy reality of rate-limit headers, where the emerging IETF standard and the vendor-specific conventions used by GitHub and Stripe diverge sharply.
- 01Five algorithms, five distinct load shapes.Fixed Window is cheapest but allows 2x bursts at the boundary; Sliding Window Log is exact but O(n) memory; Sliding Window Counter is the near-exact O(1) default; Token Bucket permits controlled bursts; Leaky Bucket enforces a strict constant output rate.
- 02Sliding Window Counter is the production default.It blends current- and previous-window counts with elapsed-time weighting, needs only two keys per client regardless of traffic, and is what Cloudflare runs across its global network — with a published 0.003% error rate across 400 million requests.
- 03Redis limiters need Lua, not MULTI/EXEC.Lua scripts via EVAL execute atomically server-side in one round trip, can read a value and branch on it inside the same atomic operation, and avoid the TOCTOU races that WATCH/MULTI/EXEC retry loops are designed to paper over.
- 04The 'standard' header you know is legacy.The de-facto X-RateLimit-* convention predates the actual standard. IETF draft-11 (May 23, 2026) defines RateLimit and RateLimit-Policy with structured parameters. Cloudflare adopted it in September 2025; GitHub and Stripe still use vendor-prefixed headers.
- 05Client-side resilience is half the system.Server-side limits are only useful if clients back off well. AWS research found Full Jitter exponential backoff reduced call counts by over 50% versus non-jittered backoff with 100 contending clients — non-jittered backoff was the clear loser.
01 — The Five AlgorithmsThe canonical five, and what each is actually for.
Before the comparison table, it is worth stating plainly what each algorithm does and where its single biggest weakness lives. The differences are not academic — picking the wrong one shows up as either accidental blocking of real users or an attacker doubling your intended limit.
Fixed Window Counter
Count requests per fixed clock window (per minute, per hour) in a single key, reset at the boundary. It is the cheapest option — one key, one increment. Its core vulnerability is boundary amplification: a client can send roughly twice the limit by straddling the window edge, for example 100 requests at the end of one minute and 100 more at the start of the next, all within a couple of seconds.
Sliding Window Log
Store every request timestamp in a sorted set, then count entries inside the trailing window on each request. This is the most accurate algorithm — no boundary artifact at all — but it carries O(n) memory per client, where n is the number of requests in the window. Under high traffic it becomes impractical, because a single noisy client can hold thousands of timestamps.
Sliding Window Counter
The pragmatic middle ground, and the recommended default for most production APIs. It blends the current and previous fixed-window counts using elapsed-time weighting: estimate = prev_count × (1 − elapsed/window) + curr_count. The result is near-exact accuracy at O(1) memory — just two keys per client, regardless of how much traffic that client sends.
Token Bucket
A bucket accumulates tokens over time up to a maximum (the bucket size); each request deducts one token, and requests with no token available are rejected. This is the best default for developer-facing APIs that need to permit short bursts while still enforcing a long-term average. State per client is just two fields: the current token count and the last refill time.
Leaky Bucket
Requests enter a queue and drain at a strict constant rate. In policing mode excess requests are dropped; in shaping mode they are delayed. It is best suited to downstream-service protection where burst absorption is undesirable and you want a smooth, predictable output rate. Shopify uses a leaky-bucket model for its API.
Fixed Window Counter
One counter per clock window. Trivial to implement, but allows roughly 2x the limit at the window boundary. Fine for internal limits or login throttles where the boundary edge case does not matter.
Sliding Window Log
Stores every request timestamp. Exact with no boundary artifact, but memory grows with request volume per client. Reserve it for strict audit or high-value B2B APIs where exactness justifies the cost.
Sliding Window Counter
Weighted blend of current and previous window counts. Near-exact accuracy at constant memory. Smooths the boundary burst and scales to any traffic. This is what most production public APIs should run.
Token Bucket
Accumulates capacity up to a cap and deducts per request. Permits controlled bursts while enforcing a long-term average. The natural fit for developer APIs and SDKs that expect spiky usage.
Leaky Bucket
Drains at a strict constant rate; excess is dropped (policing) or delayed (shaping). Best for protecting fragile downstream services where a smooth, predictable output rate matters more than burst tolerance.
02 — Comparison MatrixOne 5x6 matrix, sourced cell by cell.
Most comparisons cover two or three algorithms, omit memory complexity, or skip real production users. The matrix below pulls all five onto a single grid with the axes that actually drive the decision: memory cost in Big-O terms, burst behavior, accuracy, the Redis structures each implementation needs, and a real-world example user for each.
| Algorithm | Memory · accuracy | Bursts · best for |
|---|---|---|
| Fixed Window CounterSTRING + Lua | O(1), 1 key · Approximate | 2x burst at the window boundary. Best for internal limits and login throttles where the boundary edge case is acceptable. Example: simple middleware limiters. |
| Sliding Window LogSORTED SET + Lua | O(n), 1 sorted set · Exact | No burst artifact. Best for strict audit APIs and high-value B2B endpoints. Memory grows with request volume, so it is impractical under high traffic. |
| Sliding Window Counter2 STRINGs + Lua | O(1), 2 keys · Near-exact | Smoothed boundary, constant memory. Best for general public APIs. Example: Cloudflare runs this across its global network. |
| Token BucketHASH + Lua | O(1), 1 hash (2 fields) · Exact | Controlled bursts up to the bucket cap. Best for developer APIs and SDKs. Example: Stripe recommends a client-side token-bucket limiter. |
| Leaky BucketHASH + Lua | O(1), 1 hash · Exact output rate | No bursts — excess dropped or delayed. Best for traffic shaping and downstream protection. Example: Shopify uses a leaky-bucket model for its API. |
03 — Redis ImplementationWhy every limiter wants Lua, not MULTI/EXEC.
All five algorithms share the same correctness hazard: a rate-limit check is a read-modify-write. Read the current count, decide whether to allow, write the new count. If two requests interleave between the read and the write, both can be allowed past a limit that should have stopped one of them — a classic time-of-check-to-time-of-use (TOCTOU) race.
The Redis-native fix is a Lua script run via EVAL. Lua executes atomically on the server in a single round trip, so there is no window for another client to slip in between the read and the write — no TOCTOU race, and no WATCH/MULTI/EXEC retry loop. Crucially, and unlike MULTI/EXEC, a Lua script can read a value and branch on it within the same atomic operation, which is exactly what every one of these algorithms needs to do.
Distributed rate limiting is fundamentally a consistency problem.Arcjet engineering blog
That framing is the right one to keep in mind. A limiter is only as correct as the consistency of the counter behind it. The single-node Lua approach gives you atomicity for free; the harder problem, covered in Section 07, is keeping counters consistent across many instances without making the counter store itself the bottleneck.
04 — Production EvidenceCloudflare's published accuracy data.
The strongest case for choosing approximate sliding-window counting over exact sliding-window logging at scale is Cloudflare's own published analysis. In a historical analysis of its sliding-window counter running across its global network, Cloudflare measured the following over 400 million requests from 270,000 sources.
Sliding Window Counter accuracy · Cloudflare production
Source: Cloudflare published analysisTwo numbers carry the recommendation. A 0.003% total error rate with a 6% average difference between the estimated and real rate means the approximation is comfortably good enough for enforcement — and the zero false positives means no legitimate source was wrongly blocked, which is the failure mode that actually costs you users. The lesson is that exactness is rarely worth O(n) memory when an O(1) approximation lands this close.
The architecture underneath is as instructive as the accuracy. Rather than a single centralised counter store, Cloudflare runs per-PoP counters — each point of presence keeps its own counts, because a single central store would collapse under Layer-7 attack traffic and add a single point of failure. Anycast routing tends to send the same client to the same PoP, which is what makes isolated per-PoP counters viable in the first place. The system handles several billion rate-limited requests per day and has mitigated attacks of 400,000 requests per second aimed at a single domain.
05 — Headers StandardThe standard you think you know is legacy.
Most engineering writing describes X-RateLimit-*as "the standard." It is not. It is a widely-copied de-facto convention that predates any actual specification. The real emerging standard is draft-ietf-httpapi-ratelimit-headers-11, published May 23, 2026 on the Standards Track by the IETF HTTPAPI Working Group — and it uses a completely different field syntax.
The draft defines two header fields. RateLimit reports the remaining service limit — a structured value with r for available quota and t for the effective window. RateLimit-Policy reports the quota policy — q for the total quota, w for the window, and an optional pk partition key. Cloudflare adopted this draft in September 2025; GitHub and Stripe have not, and still emit their own vendor-prefixed headers.
Ratelimit + Ratelimit-Policy"default";r=50;t=30x-ratelimit-*x-ratelimit-remaining: 4999Stripe-Rate-Limited-Reasonglobal-rateRateLimit + RateLimit-Policyr=50;t=30 · q=100;w=60| Provider | Headers · format | 429 & compliance |
|---|---|---|
| Cloudflare | Ratelimit + Ratelimit-Policy"default";r=50;t=30 | Returns Retry-After on 429 (seconds until refresh). Compliant with the IETF draft — adopted September 2025. |
| GitHub | x-ratelimit-*x-ratelimit-remaining: 4999 | limit / remaining / used / reset / resource. reset is a UTC epoch timestamp, not a delay. Not adopted — legacy X- prefix. |
| Stripe | Stripe-Rate-Limited-Reasonglobal-rate | Reason values: global-concurrency, global-rate, endpoint-concurrency, endpoint-rate, resource-specific. SDKs handle retries internally. Not adopted — vendor prefix. |
| IETF draft-11 | RateLimit + RateLimit-Policyr=50;t=30 · q=100;w=60 | Separate Retry-After. This is the standard itself — Standards Track, not yet an RFC, so treat field names as still-stabilising. |
x-ratelimit-reset is a Unix epoch timestamp, not a seconds-until-reset delay — you must compute reset − now() yourself. Stripe does not document a Retry-After header; its SDKs retry internally, so do not assume one is present. And the IETF draft is Standards Track but not yet an RFC — it is not RFC 7807 (that is a different spec, Problem Details for HTTP APIs).Clients MUST NOT consider the available quota parameter as a service level agreement.IETF draft-ietf-httpapi-ratelimit-headers-11
That line is the whole philosophy of the spec in one sentence. The headers are a hint to be polite with, not a contract you can lean on. The draft even warns that "many throttled clients may come back at the very moment specified," which is why effective-window calculations need jitter — the same principle that governs client-side retries in Section 08.
06 — Provider LimitsGitHub, Stripe, and Cloudflare, by the numbers.
Three of the most-integrated APIs publish their limits in detail, and the shapes are instructively different. GitHub layers a complex set of secondary limits over a simple hourly bucket; Stripe overrides a global per-second cap with much lower per-endpoint limits; Cloudflare enforces a hard five-minute window that locks out all API calls on breach.
Authenticated REST tier
Unauthenticated is 60 req/hour; a personal access token gets 5,000/hour; GitHub Enterprise Cloud apps get 15,000/hour; the GITHUB_TOKEN in Actions is 1,000/hour per repository.
The limit you actually hit
On top of the hourly bucket: <=100 concurrent requests, <=900 REST points/minute (GET=1pt, POST/PATCH/PUT/DELETE=5pt), <=90 CPU seconds per 60 real seconds, <=80 content-creating requests/minute.
Global live-mode limit
Live mode is 100 req/s (sandbox 25 req/s). Per-endpoint overrides go lower: Files and Search at 20 read/s, Create Payout at 15 req/s. Meter events are a separate 1,000 calls/s pool.
Per user/account token
1,200 requests per 5-minute window per token, 200 req/s per IP, GraphQL capped at 320 per 5 minutes. Exceeding any limit returns 429 and blocks all API calls for the next five minutes.
The pattern worth internalising is cost-aware rate limiting: limit by the resource cost of a request, not just its count. GitHub's secondary limit charges a POST, PUT, PATCH, or DELETE five points against a GET's one — so a write-heavy integration can exhaust the points budget long before it touches the hourly request bucket. Stripe applies the same idea differently, capping resource-intensive endpoints like Search and Files well below the global per-second limit. If you are building a limiter for your own API, weighting expensive operations is usually more important than the raw request ceiling.
07 — Distributed EnforcementCounters across many instances.
A single-process limiter is easy; the hard version is enforcing one shared limit across many instances. Two patterns dominate. The first is a shared counter store — typically Redis — synchronised across all nodes. Kong Gateway's rate-limiting-advanced plugin takes this route, using Redis clusters to synchronise counters across all Kong data-plane nodes, and can identify clients by consumer, credential, IP address, service, route, or a custom header value. That is the standard gateway-layer pattern for multi-instance deployments.
The second pattern is isolated counters plus consistent routing, which is what Cloudflare's per-PoP design demonstrates at the far end of the scale curve. Each node enforces independently, and routing keeps a given client landing on the same node often enough that isolated counts stay close to a global view. It trades a little accuracy for resilience — the right call when the threat model is volume rather than precision.
One sharp warning applies here. Nginx is not distributed. Its limit_req_zone and limit_req directives implement a leaky-bucket limiter, where the burst parameter queues excess requests and nodelay drops them immediately — but state is per-process, per-server. For distributed enforcement you need Redis behind it via OpenResty and lua-resty-limit-traffic, or an upstream API gateway. Presenting plain Nginx as a distributed solution is a common and costly mistake.
Shared Redis counters
Synchronise counts across all nodes through a Redis cluster — the Kong rate-limiting-advanced approach. Identify clients by consumer, credential, IP, service, route, or header. The standard gateway-layer pattern.
Per-node isolated counters
Each node counts independently; consistent routing keeps clients on the same node. Cloudflare's per-PoP design. Trades a little accuracy for resilience under Layer-7 attack volume.
Nginx leaky bucket
limit_req_zone + limit_req gives a per-process leaky bucket with burst queueing. Fine for one box — but not distributed. Add Redis via OpenResty or an upstream gateway the moment you scale horizontally.
08 — Client-SideBackoff, jitter, and graceful degradation.
Server-side limits are only half the system. How clients react to a 429 decides whether a transient limit becomes a self-inflicted outage. The canonical mistake is naive retry — every throttled client retrying on the same fixed schedule, re-colliding in lockstep. The fix is exponential backoff with jitter.
AWS research on backoff strategies is the reference here. With 100 contending clients, "Full Jitter" — where the sleep is random(0, min(cap, base × 2^attempt))— reduced the total call count by over 50% compared to non-jittered exponential backoff, and non-jittered backoff was "the clear loser" among the options tested. AWS describes three variants: Full Jitter, Equal Jitter, and Decorrelated Jitter. The mechanism is simple: randomness spreads clustered retry attempts into a steady-state pattern, flattening the contention spikes that synchronised retries create.
When backoff is not enough — sustained throttling rather than a transient spike — graceful degradation takes over. The practical toolkit is to serve stale cached responses, reduce response fidelity by dropping non-essential fields, fan out to a secondary provider, and wrap the dependency in a circuit breaker that fails fast once retries are consistently futile rather than queueing requests that will never succeed. Circuit breakers exist precisely for sustained unavailability, where continuing to retry only deepens the contention.
None of this works blind. Observability for rate limiting spans three layers — the proxy or gateway statistics (Envoy, Kong, Nginx), the rate-limit service counters in Redis, and Redis's own health metrics. The signals worth alerting on are the rate-limited request percentage per endpoint, retry success rates, latency percentiles under throttling, and circuit-breaker state transitions. The client-side complement — how to make retries idempotent so a backoff loop never double-applies an effect — is its own discipline, covered in our guide to webhook reliability, idempotency, and retries.
09 — Decision GuidePicking an algorithm by workload.
The algorithm choice should follow the load shape and the failure you most want to avoid, not a default copied from the last project. Four common workloads map cleanly onto the five algorithms.
General-purpose throttling
Sliding Window Counter — O(1) memory, near-exact accuracy, a smoothed boundary, and Cloudflare's production track record. This is the default unless a specific requirement pushes you elsewhere.
Bursty, average-bounded usage
Token Bucket. It lets clients spend a burst of accumulated capacity for spiky workloads while still enforcing a long-term average — the behavior developers expect from a metered API.
Strict output shaping
Leaky Bucket. When bursts are a threat to a downstream service rather than a feature, the constant drain rate is exactly what you want — drop or delay, never amplify.
Exactness is non-negotiable
Sliding Window Log when correctness must be perfect and traffic per client is bounded; Fixed Window only for internal limits where the 2x boundary burst genuinely does not matter.
The forward-looking signal is the header standard, not the algorithms. The algorithm set has been stable for years and is unlikely to change; what is shifting is interoperability. As more providers follow Cloudflare onto the IETF RateLimit / RateLimit-Policyheaders, the cost of building a client that talks to many APIs should fall — but for now, any multi-provider client still has to special-case GitHub's epoch-timestamp resets and Stripe's SDK-internal retries alongside the new standard. Teams running AI inference APIs feel this most acutely, where per-tenant quota enforcement is a core cost-control lever; we cover that economics in our AI inference cost optimization playbook, and the shared Redis infrastructure pattern in our Redis caching strategies guide. If you are designing this layer for a production system, our web development engagements start with exactly this kind of architecture review.
10 — ConclusionA solved problem with one moving part.
Pick the algorithm for the load shape, and standardise the headers.
Rate limiting is a mature discipline with a stable answer set. Five algorithms cover the field, and the choice among them is governed by two questions: how much memory you can spend per client, and whether bursts are a feature or a threat. For most public APIs the answer is Sliding Window Counter — O(1) memory, near-exact accuracy, and a production track record at Cloudflare scale that few alternatives can match.
The implementation details that bite are well-understood once named: use a Lua script for atomicity rather than a MULTI/EXEC retry loop, weight expensive operations the way GitHub charges five points for a write, and never mistake Nginx's per-process limiter for a distributed one. On the client side, exponential backoff with jitter is not optional — AWS's own research shows non-jittered backoff is the clear loser, and a circuit breaker is what keeps sustained throttling from becoming an outage.
The one genuinely moving part is interoperability. The X-RateLimit-* convention most engineers treat as standard is legacy; the actual standard, IETF draft-11 from May 2026, uses different field names and is only beginning to see adoption. Build your limiter on the algorithm that fits your load, instrument all three observability layers, and write your client to tolerate the header divergence — because that divergence is the part of this problem still being settled.