Robust API error handling is the difference between a slow dependency and a cascading outage. Every distributed system eventually returns a timeout, a 503, or a duplicate request — the question is whether your client absorbs that failure gracefully or amplifies it. This 2026 engineering reference assembles the resilience stack that the major cloud providers and the IETF now treat as table stakes.
The stakes have risen because retries are not free. The Amazon Builders’ Library is blunt about the math: if each layer of a five-deep call stack independently retries three times, a single failing request can become a 243× load increase on the downstream database (3^5 = 243). Get the retry policy wrong and your own failover logic becomes the attack. Standards have caught up to the problem: RFC 9457 gives errors a machine-readable shape, and an IETF Standards Track draft is formalizing the idempotency header that Stripe popularized.
This guide covers six pillars: standardized error bodies with RFC 9457, a status-code retry guide, backoff with jitter, idempotency keys, circuit breakers, and timeout strategy unified under observability. Each section names the failure mode you inherit if you skip it. Every number and quote below is sourced from a primary specification or a first-party vendor engineering document.
- 01. RFC 9457 is the standard for machine-readable errors. Published July 2023, it obsoletes RFC 7807 and defines five fields (type, title, status, detail, instance) served as application/problem+json. Spring Framework 6+ (Spring Boot 3+) and ASP.NET Core 7+ ship it natively, so adoption is near-zero cost.
- 02. Retry without jitter and you build a thundering herd. AWS, Google Cloud, and Stripe all mandate exponential backoff plus random jitter. The AWS Well-Architected Framework classifies missing jitter as a 'Common anti-pattern' carrying a 'High' level of risk.
- 03. Retry at exactly one layer, not several. The Amazon Builders' Library warns that three retries at each of five layers compound into a 243× load increase on the database. Pick one well-defined layer to own the retry, and disable it everywhere else.
- 04. Idempotency keys are the prerequisite to safe retries. You can only safely retry an operation that is idempotent. Stripe uses a UUID v4 in the Idempotency-Key header; an IETF Standards Track draft (idempotency-key-header-07, October 2025) is formalizing the field for everyone.
- 05. Circuit breakers and observability close the loop. Martin Fowler's circuit breaker trips after a failure threshold so calls fail fast instead of piling up; OpenTelemetry unifies the logs, traces, and metrics you need to see retry storms and breaker state changes as they happen.
01. The Model: Error handling is a six-pillar system, not a checklist.
Most guidance treats API errors as a list of status codes to memorize. That misses the architecture. Resilient APIs layer six patterns that each defend against a distinct failure mode, and the value of the stack comes from how they compose: standardized errors tell the client what happened, retries with backoff recover from transient faults, idempotency makes those retries safe, circuit breakers stop you from hammering a service that is already down, timeouts bound how long you wait, and observability lets you see all of it in production.
The framing below names each pillar by the failure mode you inherit if you omit it. That column is the whole point — it forces you to justify each pattern by its blast radius, not by its benefits.
| Pillar | What it does | Failure mode if omitted |
|---|---|---|
| Problem Details | Machine-readable error bodies with five standard fields. | Every client writes bespoke parsing for your one-off JSON error shapes. |
| Retry + Jitter | Recover from transient faults without synchronizing clients. | Without jitter, synchronized retries become a thundering herd on a recovering service. |
| Idempotency Keys | Make a retried write safe to repeat. | A retried payment or order creation charges the customer twice or duplicates the record. |
| Circuit Breaker | Fail fast when a dependency is already down. | Every request waits for a full timeout against a service that cannot respond. |
| Timeout Strategy | Bound how long any call may block. | One slow downstream pins your threads and exhausts the connection pool. |
| Observability Layer | See retry storms and breaker trips as they happen. | A retry storm looks like a healthy traffic spike until the database falls over. |
02. Problem Details: RFC 9457 gives errors a machine-readable shape.
RFC 9457, “Problem Details for HTTP APIs,” was published in July 2023 and obsoletes the older RFC 7807. It is an Internet Standards Track document approved by the IESG, and it solves a small but pervasive problem: every API used to invent its own JSON error shape, so every client had to write bespoke parsing. RFC 9457 defines one canonical shape, served with the content type application/problem+json (or application/problem+xml for XML APIs).
A problem detail object carries five standard members: type (a URI that identifies the problem type, defaulting to about:blank), title, status (the HTTP status code repeated as a JSON number), detail, and instance. Problem-type definitions may add custom extension members, and clients must ignore any extension they do not recognize — which is what makes the format forward-compatible.
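As a rough illustration of the five members and the ignore-unknown-extensions rule, the sketch below builds a problem detail body and reads it back on the client side. The helper names `problem_detail` and `parse_problem`, the example.com type URI, and the `balance` extension member are all hypothetical, chosen only for this example.

```python
import json

PROBLEM_JSON = "application/problem+json"  # content type named by RFC 9457

def problem_detail(status, title, detail=None, type_uri="about:blank",
                   instance=None, **extensions):
    """Build a problem detail object as a plain dict (hypothetical helper)."""
    body = {"type": type_uri, "title": title, "status": status}
    if detail is not None:
        body["detail"] = detail
    if instance is not None:
        body["instance"] = instance
    # Extension members sit beside the standard five.
    body.update(extensions)
    return body

def parse_problem(raw):
    """Client side: keep the standard members, silently drop unknown extensions."""
    data = json.loads(raw)
    known = ("type", "title", "status", "detail", "instance")
    parsed = {name: data.get(name) for name in known}
    parsed["type"] = parsed["type"] or "about:blank"  # default when type is absent
    return parsed

body = problem_detail(
    status=403,
    title="You do not have enough credit.",
    detail="Your current balance is 30, but that costs 50.",
    type_uri="https://example.com/probs/out-of-credit",  # illustrative URI
    instance="/account/12345/msgs/abc",
    balance=30,  # hypothetical extension member
)
payload = json.dumps(body)
```

A client that only knows the standard members can still act on `status` and `type`, which is the forward-compatibility property the paragraph above describes.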
"This document defines a 'problem detail' to carry machine-readable details of errors in HTTP response content to avoid the need to define new error response formats for HTTP APIs."— IETF RFC 9457, Abstract
What changed from RFC 7807 is documented in Appendix D of the new spec: a registry for common problem-type URIs, clarified guidance for returning multiple problems in a single response, and explicit handling for non-dereferenceable type URIs. None of these break the 7807 wire format, so migration is additive rather than a rewrite.
The adoption story is the reason to act now. Spring Framework 6+ (Spring Boot 3+) supports the Problem Details format natively: set spring.mvc.problemdetails.enabled=true in application.properties, or extend ResponseEntityExceptionHandler in a @ControllerAdvice class. ASP.NET Core 7+ also ships built-in Problem Details support. For most modern stacks, the cost of moving from ad-hoc error JSON to a standard machine-readable body is close to a configuration flag. If your team is standardizing API contracts across services, our web development engineering work treats error-body consistency as part of the API design, not an afterthought.
03. Status-Code Retry Guide: Which codes are safe to retry.
Retry logic begins with classification. Transient errors — the service was momentarily overloaded or a gateway timed out — are safe to retry. Non-transient errors — a malformed request, a missing resource, a failed authorization — will fail identically no matter how many times you resend them, so retrying only wastes resources and delays the real error reaching the user. Google Cloud and Zuplo converge on the same split: retry on 408, 429, 500, 502, 503, and 504; do not retry 400, 401, 403, 404, or 422.
The table below is our proprietary decision matrix. It goes beyond a plain retryable list by adding two columns most guides leave out: whether the code commonly carries a Retry-After header you should honor, and whether retrying the request safely requires idempotency. Those three decisions — retry, wait, deduplicate — rarely appear together in one place.
| Code | Name | Retry safe? | Honor Retry-After? | Idempotency needed? | Typical cause |
|---|---|---|---|---|---|
| Retryable: transient failures | | | | | |
| 408 | Request Timeout | Yes | No | For writes | Client took too long to send the request |
| 429 | Too Many Requests | Yes | Yes | For writes | Rate limit exceeded |
| 500 | Internal Server Error | Yes | No | For writes | Unhandled server-side fault |
| 502 | Bad Gateway | Yes | No | For writes | Upstream returned an invalid response |
| 503 | Service Unavailable | Yes | Yes | For writes | Service overloaded or in maintenance |
| 504 | Gateway Timeout | Yes | No | For writes | Upstream did not respond in time |
| Non-retryable: fix the request, do not resend | | | | | |
| 400 | Bad Request | No | No | N/A | Malformed or invalid request payload |
| 401 | Unauthorized | No | No | N/A | Missing or invalid credentials |
| 403 | Forbidden | No | No | N/A | Authenticated but not permitted |
| 404 | Not Found | No | No | N/A | Resource does not exist |
| 422 | Unprocessable Entity | No | No | N/A | Semantically invalid request data |
The Retry-After column maps to the two response formats MDN documents: a delta-seconds integer or an absolute HTTP-date. For a 429, short values in the 5-to-120-second range are typical; for a 503 in planned maintenance, anything from 30 seconds to several minutes is reasonable. The “for writes” idempotency column is the one engineers most often miss — a retried GET is harmless, but a retried POST against a non-idempotent endpoint can create a duplicate, which is exactly what the next two sections address.
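The retry and wait decisions from the matrix can be sketched in a few lines. This is a minimal illustration, not a complete client: the function names are hypothetical, and the code only handles the two Retry-After forms mentioned above (delta-seconds and HTTP-date).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

RETRYABLE = {408, 429, 500, 502, 503, 504}
HONOR_RETRY_AFTER = {429, 503}

def is_retryable(status):
    """True for the transient status codes listed in the matrix."""
    return status in RETRYABLE

def retry_after_seconds(status, header_value, now=None):
    """Return the server-requested wait in seconds, or None if absent or unusable."""
    if status not in HONOR_RETRY_AFTER or not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():  # delta-seconds form, e.g. "120"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", never a negative wait.
    return max(0.0, (when - now).total_seconds())
```

When the header is missing or unparseable the function returns None, which a caller would typically treat as "fall back to the client's own backoff schedule".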
04. Backoff and Jitter: Backoff with jitter, and the retry tax.
Once you know a code is retryable, the question is how to space the retries. A fixed-interval retry synchronizes every failing client onto the same schedule, so they all hammer the recovering service at the same instant — the thundering-herd problem. The fix has two parts. Exponential backoff increases the wait after each attempt; jitter randomizes that wait so clients spread out across time instead of retrying in lockstep.
Google Cloud’s Storage client libraries document a concrete baseline: a 1-second initial delay, a 2× multiplier per iteration, and a maximum per-attempt delay in the 30-to-60-second range depending on the library. The maximum total retry time is a client-library default, not a universal recommendation — it ranges from 120 seconds in the Python client to 15 minutes in the C++ client. Treat those as starting points to tune against your own latency budget, not as gospel.
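One common way to combine the two parts is "full jitter": compute the exponential ceiling, then pick a uniformly random delay between zero and that ceiling. The sketch below uses the 1-second base and 2× multiplier quoted above; the 32-second cap is an assumed value inside the 30-to-60-second range, and `backoff_delay` is a hypothetical name.

```python
import random

def backoff_delay(attempt, base=1.0, multiplier=2.0, cap=32.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based), with full jitter."""
    # Exponential ceiling: 1s, 2s, 4s, ... capped at `cap` (assumed value).
    ceiling = min(cap, base * (multiplier ** attempt))
    # Full jitter: any delay in [0, ceiling] so clients do not retry in lockstep.
    return rng.uniform(0.0, ceiling)
```

Other jitter variants exist (for example, adding a small random amount on top of the full exponential delay); which one fits depends on how much minimum spacing the downstream service needs.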
First backoff interval
Google Cloud Storage client libraries start the backoff at a 1-second delay before the first retry, then grow it from there.
Growth per iteration
Each successive retry doubles the wait. Add random jitter on top so synchronized clients de-correlate instead of retrying in lockstep.
3 retries × 5 layers
The Amazon Builders' Library shows how retries stacked across application layers compound: three attempts at each of five layers is 3^5, a 243× load increase on a downstream database.
The retry tax is the part most teams underestimate. The Amazon Builders’ Library puts a hard number on it: when each layer of a five-deep call stack retries three times, the attempts compound into a 243× load increase on the downstream database, because each layer re-multiplies the attempts of the one below it (3 × 3 × 3 × 3 × 3 = 243). The remedy is structural, not numeric: retry at exactly one well-defined layer of the stack and disable retries everywhere else, so a single transient blip cannot snowball into a self-inflicted denial-of-service.
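The compounding is plain exponentiation, which a short check makes concrete. The sketch assumes every layer makes the same fixed number of attempts per incoming call and that every attempt fails; under that assumption 243 corresponds to three attempts at each of five layers, while three layers would give 27. The function name is hypothetical.

```python
def worst_case_amplification(attempts_per_layer, layers):
    """Calls reaching the bottom of the stack per user request when every attempt fails."""
    return attempts_per_layer ** layers

five_deep = worst_case_amplification(3, 5)    # each of five layers tries three times
one_layer = worst_case_amplification(3, 1)    # retries owned by a single layer
```

Moving all retries to one layer turns the exponent into 1, which is the structural argument for a single retry owner.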
"Retries are selfish — they demand more server resources to increase individual request success rates."— Amazon Builders' Library
That framing changes how the whole policy should be read. A retry is a bet that spends shared server capacity to improve one client’s odds. The AWS Well-Architected Framework is explicit that adding jitter is not optional polish: it classifies failing to add jitter as a Common anti-pattern with a High level of risk, and warns specifically against retrying at multiple layers, which it says compounds attempts into a retry storm. Capping the maximum number of attempts and the total elapsed retry time bounds the bet so a degraded dependency cannot bankrupt the caller.
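A retry loop that enforces both caps might look like the sketch below. Everything here is illustrative: `TransientError` is a hypothetical marker exception, and the default attempt count, time budget, base, and cap are placeholder values to tune, not recommendations from any of the cited sources.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""

def call_with_retries(op, max_attempts=4, budget_s=10.0, base=0.5, cap=8.0,
                      sleep=time.sleep, clock=time.monotonic, rng=random):
    """Run `op`, retrying transient failures within an attempt cap and a time budget."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    start = clock()
    for attempt in range(max_attempts):
        try:
            return op()
        except TransientError:
            last_attempt = attempt == max_attempts - 1
            # Full-jitter delay for the next attempt.
            delay = rng.uniform(0.0, min(cap, base * 2 ** attempt))
            # Give up if waiting would push us past the total retry budget.
            out_of_budget = (clock() - start) + delay > budget_s
            if last_attempt or out_of_budget:
                raise
            sleep(delay)
```

Non-transient exceptions propagate immediately because only `TransientError` is caught, which mirrors the retryable versus non-retryable split in the status-code matrix.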
05. Idempotency Keys: Idempotency makes a retry safe to repeat.
Backoff tells you when to retry; idempotency tells you whether you may. An idempotent operation produces the same result whether it runs once or ten times — a property that GETs and most DELETEs have by nature, but that a POST creating a payment or an order does not. The standard fix is an idempotency key: a unique token the client attaches to a write so the server can recognize a repeat and return the original result instead of executing the work twice.
"An idempotent operation is one where a request can be retransmitted or retried with no additional side effects."— Malcolm Featonby, Principal Engineer at AWS
Stripe’s implementation is the reference most engineers know: a V4 UUID supplied in the Idempotency-Key request header, with keys capped at 255 characters and retained for at least 24 hours. A subtle but important detail — Stripe does not serve failed validation responses from the idempotency cache; only responses generated after endpoint execution begins are stored, so a request rejected before it does any work can be safely corrected and resent.
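On the client side, the important habit is generating the key once per logical operation and reusing it on every retry of that operation. The sketch below shows that shape only; the function names and the dictionary request representation are hypothetical, and no real HTTP client is involved.

```python
import uuid

def new_idempotency_key():
    """One V4 UUID per logical write, as in the Stripe convention described above."""
    return str(uuid.uuid4())

def build_attempts(payload, attempts):
    """Prepare every attempt at one logical write with the same key (illustrative)."""
    key = new_idempotency_key()
    headers = {"Idempotency-Key": key, "Content-Type": "application/json"}
    # Each retry is a copy of the same request: same key, same body.
    return [{"headers": dict(headers), "body": payload} for _ in range(attempts)]
```

Generating a fresh key inside the retry loop would defeat the purpose, because the server would see each retry as a new operation.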
What many practitioners miss is that this pattern is being standardized. An IETF Standards Track draft, draft-ietf-httpapi-idempotency-key-header-07 (published October 15, 2025), formalizes the Idempotency-Key header as an Item Structured Header per RFC 8941 and recommends UUID v4. It is a draft RFC in progress, not a finalized standard — the version-07 document is on the Standards Track and was set to expire in April 2026 — and it deliberately leaves key length and retention to individual API specifications rather than mandating them globally.
UUID per request
Stripe uses a V4 UUID in the Idempotency-Key header; the IETF draft recommends the same. Composite keys like order-create-{orderId} are a valid alternative.
Stripe key cap
Stripe caps the Idempotency-Key value at 255 characters. The IETF draft leaves length limits to each API spec rather than mandating one.
Minimum lifetime
Stripe retains keys for at least 24 hours; Google Cloud guidance suggests a 24-to-48-hour lifetime, stored with a TTL. Same key + different body must be rejected.
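The server-side half of the pattern can be sketched as a small in-memory store: replay the stored response for a repeated key, reject a reused key with a different body, and expire entries after a TTL. This is an assumption-laden illustration (class and exception names are hypothetical, storage is a plain dict, and it is not thread-safe); a production version would need an atomic check-and-set in a shared store.

```python
import hashlib
import time

class IdempotencyConflict(Exception):
    """Raised when a key is reused with a different request body."""

class IdempotencyStore:
    def __init__(self, ttl_s=24 * 3600, clock=time.time):
        self._ttl_s = ttl_s          # 24 hours, matching the minimum lifetime above
        self._clock = clock
        self._entries = {}           # key -> (body_hash, response, expires_at)

    def execute(self, key, body, handler):
        """Run `handler(body)` at most once per live key; `body` is bytes."""
        now = self._clock()
        fingerprint = hashlib.sha256(body).hexdigest()
        entry = self._entries.get(key)
        if entry and entry[2] > now:
            if entry[0] != fingerprint:
                raise IdempotencyConflict(key)
            return entry[1]          # replay the stored response, handler not run
        response = handler(body)     # first sighting, or the old entry expired
        self._entries[key] = (fingerprint, response, now + self._ttl_s)
        return response
```

Because the entry is written only after the handler returns, a handler that raises leaves nothing cached, so the client can correct the request and resend with the same key.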
The cloud providers bake idempotency into their own control planes. AWS EC2 RunInstances and ECS RunTask both accept an optional ClientToken parameter for exactly this purpose, and the AWS CLI auto-generates one when you do not supply it. Google Cloud classifies its Cloud Storage operations into three tiers: always idempotent (gets, lists, bucket insert and delete, IAM policy tests), conditionally idempotent (updates and patches gated on an IfMetagenerationMatch or etag precondition), and never idempotent (HMAC key creation, Pub/Sub notification creation, ACL changes without preconditions). The Temporal warning below is the right rule of thumb to carry into design reviews.
"Never, ever retry non-idempotent operations."— Temporal Engineering Blog
If you are building webhook delivery, payment flows, or any write that a client might resend, idempotency is non-negotiable. Our deeper treatment of delivery semantics lives in our companion guide on webhook reliability and idempotency patterns, and the 429-handling side of the equation is covered in API rate limiting strategies.
06. Circuit Breakers: When to stop calling a failing service.
Retries help when a fault is momentary. They actively hurt when a dependency is genuinely down, because every retry burns a thread waiting for a timeout against a service that cannot answer. The circuit breaker, popularized in software by Martin Fowler, is the pattern that detects that condition and stops calling.
"Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all."— Martin Fowler
A circuit breaker has three states. In Closed, calls pass through normally while the breaker counts failures. When failures cross a threshold the breaker moves to Open, where every call fails fast without ever reaching the protected service, sparing both the caller’s threads and the dependency’s recovery. After a cool-down it enters Half-Open and allows a single trial call: if that succeeds, the breaker closes and normal traffic resumes; if it fails, the breaker re-opens. The exact threshold and cool-down are tunable to your traffic. There is no universal “five failures in ten seconds” default, and you should set them from your own error rates rather than copying an example.
Closed
Calls flow through to the protected service while the breaker monitors the failure rate. This is the steady state for a healthy dependency.
Open (fail fast)
All calls return an error immediately without reaching the service. Fast failure protects caller threads and gives the dependency room to recover.
Half-Open
After a cool-down, a single test call is allowed through. Success closes the breaker and resumes traffic; failure re-opens it and the cool-down restarts.
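The three states map onto a small state machine. The sketch below is a single-threaded illustration that counts consecutive failures; the class name, the exception, and the default threshold and cool-down are placeholders for this example, not recommended settings.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking it."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # placeholder, tune to your traffic
        self.cooldown_s = cooldown_s                # placeholder, tune to your traffic
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self.state = "closed"

    def call(self, fn):
        if self.state == "open":
            if self._clock() - self._opened_at < self.cooldown_s:
                raise CircuitOpenError("failing fast")
            self.state = "half_open"  # cool-down elapsed: allow one trial call
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success resets the count; a half-open success closes the breaker.
        self._failures = 0
        self.state = "closed"

    def _on_failure(self):
        self._failures += 1
        # A failed trial call re-opens at once; otherwise wait for the threshold.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

A concurrent version would need a lock around the state transitions and a guard so that only one caller at a time gets the half-open trial.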
The circuit breaker pairs naturally with the bulkhead pattern, which isolates resources into separate pools so one failing dependency cannot starve the others — give Service A 50 threads, Service B 30, Service C 20, and a flood of timeouts against A leaves B and C untouched. Resilience4j implements this through its BulkheadRegistry. Together, breaker and bulkhead convert a cascading failure into a contained, observable one.
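The same isolation idea can be approximated with one bounded semaphore per dependency. This is a Python analogue for illustration only, not the Resilience4j API; the `Bulkhead` class and the service names are hypothetical, and the pool sizes reuse the 50/30/20 split from the example above.

```python
import threading

class BulkheadFullError(Exception):
    """Raised when a dependency's pool has no free slot."""

class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn):
        # Reject immediately instead of queueing behind a struggling dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot")
        try:
            return fn()
        finally:
            self._slots.release()

# One independent pool per dependency, so exhausting one leaves the others alone.
bulkheads = {
    "service_a": Bulkhead(50),
    "service_b": Bulkhead(30),
    "service_c": Bulkhead(20),
}
```

Rejecting rather than blocking is a design choice in this sketch; a variant with a short acquire timeout trades a little latency for fewer rejections.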
07. Timeouts: Bounding the wait with deadlines.
Every pattern above assumes a call eventually returns. Timeouts are what guarantee that. The combined AWS and Temporal guidance is specific: set the connection timeout and the request timeout separately, because a connection that never establishes is a different failure from a request that hangs after connecting. Pick the request timeout from the downstream service’s own latency distribution — its p99 or p99.9 — rather than a round number, so a normal-but-slow response is not mistaken for a failure.
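Deriving the request timeout from observed latency can be sketched with the standard library. The function and dataclass names are hypothetical, the 1-second connection timeout is a placeholder, and the 1.2× headroom factor is an assumption added for this example rather than guidance from AWS or Temporal.

```python
import statistics
from dataclasses import dataclass

@dataclass(frozen=True)
class Timeouts:
    connect_s: float   # how long to wait for the connection to establish
    request_s: float   # how long to wait for a response once connected

def timeouts_from_latency(latencies_s, connect_s=1.0, percentile=99, headroom=1.2):
    """Pick the request timeout from a latency sample instead of a round number."""
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be an integer from 1 to 99")
    # quantiles(n=100) returns 99 cut points; index 98 is the p99.
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return Timeouts(connect_s=connect_s, request_s=cuts[percentile - 1] * headroom)
```

This helper stops at whole percentiles; a p99.9 target would need a finer quantile split and a correspondingly larger sample.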
The most overlooked technique is deadline propagation. When an inbound request carries a deadline, propagate it to every downstream call so a service does not waste compute finishing work for a request the caller has already abandoned. Without propagation, an upstream timeout leaves a chain of downstream services grinding through work whose result no one will ever read — the silent waste that turns a small latency spike into a resource exhaustion event.
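Deadline propagation reduces to passing one shared expiry time down the call chain and checking it before each unit of work. The sketch below is a minimal in-process illustration; `Deadline`, `DeadlineExceeded`, and `handle_request` are hypothetical names, and each step is assumed to accept a `timeout_s` keyword argument.

```python
import time

class DeadlineExceeded(Exception):
    """Raised when the caller's deadline has already passed."""

class Deadline:
    def __init__(self, timeout_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + timeout_s

    def remaining(self):
        """Seconds left before the caller gives up (may be negative)."""
        return self._expires_at - self._clock()

    def check(self):
        if self.remaining() <= 0:
            raise DeadlineExceeded("caller has already given up")

def handle_request(deadline, steps):
    """Run downstream steps, giving each one only the time that is left."""
    results = []
    for step in steps:
        deadline.check()  # skip work whose result nobody will read
        results.append(step(timeout_s=deadline.remaining()))
    return results
```

Across process boundaries the remaining time would travel in request metadata, and the receiving service would rebuild its own deadline from it.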
08. Observability: You cannot fix what you cannot see.
The five preceding pillars are only as good as your ability to watch them in production. OpenTelemetry has become the vendor-neutral standard for that: logs correlated by trace and span ID, structured attributes following semantic conventions, and all three signals (logs, traces, and metrics) unified so you can pivot from a metric anomaly to the exact trace that caused it. The OpenTelemetry community marked its database semantic conventions stable in 2025, which makes cross-signal correlation reliable enough to build alerts on.
Resilience has its own telemetry. Beyond generic request rates, the signals that tell you whether your error handling is working are: timeout rate and p99 latency; retry rate and how often you hit the max-attempt cap; circuit-breaker state changes and rejected calls; dead-letter-queue depth and message age; and fallback invocation and success rates. A retry storm shows up as a retry-rate spike with a flat success rate long before it shows up as a customer complaint.
Retry rate + max-attempt hits
Watch retry rate against success rate. A climbing retry rate with flat success is the early signature of a thundering herd before the database falls over.
Circuit-breaker state changes
Track Open/Half-Open transitions and rejected calls. Frequent flapping between states means your threshold or cool-down is mistuned for current traffic.
Dead-letter-queue depth and age
A growing DLQ depth or message age means downstream processing is failing faster than it recovers. This is your buffer of last resort filling up.
Timeout rate and p99 latency
Rising timeout rate at the p99 means a downstream is degrading. Correlate by trace ID via OpenTelemetry to find which dependency, not just that one exists.
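The retry-storm signature (retry rate climbing while success rate stays flat) can be expressed as a comparison between two measurement windows. This is a plain-Python sketch, not OpenTelemetry API usage; the names and both thresholds are hypothetical values to tune against real traffic.

```python
from dataclasses import dataclass

@dataclass
class WindowStats:
    requests: int    # first attempts seen in the window
    retries: int     # retry attempts seen in the window
    successes: int   # requests that eventually succeeded

    @property
    def retry_rate(self):
        return self.retries / self.requests if self.requests else 0.0

    @property
    def success_rate(self):
        return self.successes / self.requests if self.requests else 0.0

def looks_like_retry_storm(prev, curr, retry_jump=2.0, success_gain=0.01):
    """Flag a window where retries spike but success does not improve."""
    # The 0.01 floor avoids treating a jump from zero as an infinite increase.
    retries_spiking = curr.retry_rate >= retry_jump * max(prev.retry_rate, 0.01)
    success_flat = curr.success_rate - prev.success_rate < success_gain
    return retries_spiking and success_flat
```

In practice the window counts would come from whatever metrics pipeline is already in place, with an alert on the boolean result.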
Instrumented well, these signals turn the six-pillar model from a design document into an operational dashboard. The reliability-metrics and SLO side of this — turning these raw signals into error budgets and alert thresholds — is the subject of our companion reference on SLO design and reliability metrics, and the broader contract design that frames all of it lives in our REST API design best practices guide.
09. Conclusion: Resilience is a system, built from standards.
Error handling has graduated from tribal knowledge to a spec-backed system.
The throughline of this reference is that API resilience is no longer a grab-bag of war stories. It is six composable pillars — standardized errors, retry with jitter, idempotency, circuit breakers, timeouts, and observability — each backed by a primary specification or a first-party vendor document, and each defending against a named failure mode. RFC 9457 standardized the error body in 2023; the IETF is now standardizing the idempotency header; and libraries like Resilience4j collapse the runtime patterns into one toolkit.
The most expensive lesson in the whole stack is the cheapest to state: naive retries amplify failure rather than absorbing it. Three retries at each of five layers become a 243× load increase, jitter is not optional, and you should retry at exactly one layer of the stack. Most self-inflicted outages trace back to a retry policy that was trying to help.
The practical move is to treat these six pillars as a design-review checklist with teeth: for each one, name the failure mode you accept if you skip it. A standardized error body costs a configuration flag in modern frameworks; an idempotency key costs a header and a cache; jitter costs a single line of randomization. The blast radius of omitting any of them is measured in outages. In 2026 the patterns are settled and the libraries are mature — the remaining work is operating them deliberately under real load.