Webhook reliability comes down to a single uncomfortable fact: every major provider delivers events at least once, never exactly once, which means your consumer will eventually receive the same event twice and must handle it without double-charging, double-shipping, or double-emailing. Idempotency is not a nice-to-have — it is the load-bearing wall of any production webhook integration.
The cost of getting this wrong is concrete. A payment event processed twice becomes a duplicate charge. An order-created event processed twice becomes a duplicate fulfillment. And the most common cause is not a network glitch you can see — it is your own handler finishing the work but answering the provider a few milliseconds too late, so the provider assumes failure and retries an operation that already succeeded.
This reference assembles the four reliability layers — deduplication, retry with backoff, dead-letter queues, and signature verification — into decision tables you can build against. We compare Stripe, Shopify, Svix, and Amazon SQS side by side, then give a status-code decision tree, recommended DLQ thresholds, and the accept-then-queue architecture that absorbs all of it.
- 01 At-least-once delivery is universal; exactly-once is not. Stripe's own docs state endpoints 'might occasionally receive the same event more than once.' AWS SQS Standard queues say the same. Treating delivery as exactly-once is a category error; build for duplicates by default.
- 02 Idempotency is the consumer's job, keyed on a stable event ID. Every well-designed webhook carries a unique event ID that stays constant across retries: Stripe's event id, Shopify's X-Shopify-Webhook-Id, Svix's webhook-id. Store it; skip reprocessing on a repeat. The dedup cache TTL must outlive the provider's full retry window.
- 03 The timeout-induced duplicate is the most dangerous failure mode. If your handler completes the work but the HTTP response lands after the provider's timeout (often 5–15s), the provider retries and your business logic runs twice. The fix is counter-intuitive: acknowledge with 200/202 immediately, then process asynchronously.
- 04 Retry policy differs sharply by provider, so read the matrix, not the headline. Stripe retries for up to 3 days in live mode; Shopify retries 8 times over 4 hours and may auto-delete Admin-API subscriptions after 8 straight failures; Svix runs roughly 8 attempts across about a day. Configure DLQ retention and dedup TTL to each provider, not to a single global number.
- 05 Signature verification needs constant-time comparison. HMAC-SHA256 is near-universal, but headers and signed-payload composition are not standardized: Stripe prepends a timestamp, Shopify does not. Always compare signatures with hmac.compare_digest or crypto.timingSafeEqual to avoid timing-attack leakage.
01 — The Delivery Guarantee
Why exactly-once delivery is impossible.
Before designing anything, internalize the constraint: no webhook provider can guarantee exactly-once delivery, because exactly-once delivery is a proven impossibility in distributed systems, not a feature waiting to be built. The result traces back to the Two Generals Problem and the FLP impossibility theorem published in 1985, which showed that consensus cannot be guaranteed in an asynchronous network where even one participant may fail.
The practical consequence is a clean distinction worth carrying into every design review: exactly-once is achievable as a processing guarantee, never as a delivery guarantee. The wire will sometimes deliver a duplicate. What you control is whether processing that duplicate produces a second side effect. That is the entire job of idempotency, and it sits on the consumer side — the receiver — not the sender.
"Exactly-once is not a delivery guarantee. It is a processing guarantee." — Hookdeck, At-Least-Once vs. Exactly-Once Webhook Delivery Guarantees
Both the vendor documentation and the queueing layer say so explicitly. Stripe's webhook docs warn that an endpoint "might occasionally receive the same event more than once." Amazon SQS Standard queues describe the same behavior at the infrastructure level: because messages are stored redundantly across multiple servers, a copy can reappear if the server holding a given message is unavailable during deletion. AWS's guidance is blunt: design applications to be idempotent so that processing the same message more than once does no harm.
02 — Provider Decision Matrix
One table for four providers, every reliability column.
The single most useful artifact for a multi-vendor webhook integration is a matrix that puts retry policy, signature scheme, and deduplication semantics side by side. No two providers agree on header names, signed-payload composition, or retry windows, so the integration cost is real, and it compounds per provider. Truto, a multi-SaaS integration platform, reports in-house integration maintenance running upward of $50,000 per integration, and these inconsistencies are the hidden driver of that cost. Treat that figure as vendor-stated context rather than an independent benchmark, but the direction is right: heterogeneity is expensive.
| Dimension | Stripe | Shopify | Svix | AWS SQS (Standard) |
|---|---|---|---|---|
| Delivery guarantee | At-least-once | At-least-once | At-least-once | At-least-once |
| Retry window | Up to 3 days (live mode) | ~4 hours | ~1 day (≈27h) | Bounded by visibility timeout + maxReceiveCount |
| Retry count | Many (3 in sandbox) | 8 attempts | ~8 attempts | Per maxReceiveCount |
| Backoff | Exponential | Exponential | Stepped schedule | Visibility-timeout driven |
| Signature header | Stripe-Signature | X-Shopify-Hmac-SHA256 | svix-signature | IAM / SQS auth (no HMAC header) |
| Signing algorithm | HMAC-SHA256 | HMAC-SHA256 (base64) | HMAC-SHA256 | AWS SigV4 (transport) |
| Signed payload | timestamp.body | raw body | id.timestamp.body | n/a |
| Replay tolerance | 5 minutes (default) | Use X-Shopify-Triggered-At | Timestamp in signature | n/a |
| Dedup ID (header or field) | Event id (payload field) | X-Shopify-Webhook-Id | webhook-id (stable) | MessageId / dedup ID (FIFO) |
| Auto-disable | After sustained failure | After 8 fails (Admin-API subs) | Per endpoint config | DLQ via maxReceiveCount |
| Min dedup cache TTL | ≥ 3 days | ≥ 4 hours (24h safe) | ≥ 24 hours | ≥ retention window |
03 — Consumer-Side Idempotency
Deduplicate on a stable event ID.
The mechanism is simple and the discipline is everything: every well-designed webhook carries a unique event ID that stays the same across retries. That identifier is precisely what distinguishes a retry from a genuinely new event. On receipt, you record the ID; if you have seen it before within your dedup window, you acknowledge and do nothing. The provider gets its 200, and your business logic runs exactly once.
Two storage strategies cover almost every case. A database unique constraint on the event ID suits transactional operations — the insert either succeeds (first time) or violates the constraint (duplicate), and you branch on that atomically. A Redis key with a TTL suits high-throughput streams where a database round-trip per event is too costly. The choice is about throughput and consistency needs, not correctness — both work.
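The database variant can be sketched in a few lines. This is a minimal illustration using SQLite from the Python standard library; the table name `processed_events` and the helper names `open_dedup_store` and `claim_event` are hypothetical, and a production store would also need a cleanup job that expires rows older than the provider's retry window.

```python
import sqlite3


def open_dedup_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the dedup table. The schema is an assumption for this sketch."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events ("
        " event_id TEXT PRIMARY KEY,"
        " received_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def claim_event(conn: sqlite3.Connection, event_id: str) -> bool:
    """Return True the first time an event ID is seen, False on a repeat.

    The primary-key constraint makes the check-and-record step atomic:
    the insert either succeeds or raises IntegrityError.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = open_dedup_store()
first = claim_event(conn, "evt_123")   # new event: run the business logic
repeat = claim_event(conn, "evt_123")  # retry of the same event: acknowledge and skip
```

The Redis variant expresses the same claim as a single `SET` with the `NX` option and an expiry, which is why both strategies are equally correct and differ only in throughput and durability.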
"Every well-designed webhook includes a unique event ID. This identifier stays the same across retries, which is what distinguishes a retry from a genuinely new event." — Svix Webhook University, Idempotency and Deduplication
Sender-side idempotency keys are the mirror image and worth understanding even as a consumer. Stripe's API accepts an Idempotency-Key header (max 255 characters, a V4 UUID or other high-entropy string, never containing sensitive data such as an email). Stripe saves the resulting status code and body of the first request for that key — including a 500 — and replays it for repeat requests, with a 24-hour TTL after which the key becomes purgeable. The idempotency layer also compares incoming parameters to the original request and errors if they differ, preventing accidental key reuse across different operations.
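To make the sender-side behavior concrete, here is a toy in-memory model of an idempotency-key layer. It is not Stripe's implementation; the class name `IdempotencyStore`, the exception `IdempotencyConflict`, and the parameter fingerprinting scheme are all assumptions chosen to mirror the behavior described above: save the first result, replay it for repeats, and reject a reused key whose parameters differ.

```python
import hashlib
import json
import time


class IdempotencyConflict(Exception):
    """Raised when a key is reused with different parameters."""


class IdempotencyStore:
    def __init__(self, ttl_seconds: float = 24 * 3600, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (stored_at, fingerprint, result)

    def execute(self, key: str, params: dict, operation):
        """Run operation(params) once per key; replay the saved result afterwards."""
        if len(key) > 255:
            raise ValueError("idempotency key longer than 255 characters")
        fingerprint = hashlib.sha256(
            json.dumps(params, sort_keys=True).encode()
        ).hexdigest()
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            if entry[1] != fingerprint:
                raise IdempotencyConflict(key)
            return entry[2]  # replay the first result, even if it was an error status
        result = operation(params)  # expected to return (status_code, body)
        self._entries[key] = (now, fingerprint, result)
        return result
```

The result is stored whatever its status code, so a saved 500 is replayed just like a saved 200; that is the property that makes a client retry with the same key safe.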
04 — Retry, Backoff & Jitter
Exponential backoff, plus jitter to break the herd.
When a delivery fails, the sender retries on an increasing delay — exponential backoff — so a struggling consumer is not hammered. But naive exponential backoff has a failure mode of its own: if a thousand events fail at the same instant, they all retry at the same computed delay, producing a synchronized thundering herd that knocks the recovering endpoint over again. Jitter randomizes those delays so the load spreads out.
Full jitter
Picks a delay uniformly between zero and the full exponential value. Maximum spread, the most common choice in production senders because it most effectively dissolves synchronized retries.
Equal jitter
Guarantees at least half the exponential delay, then randomizes the rest. A compromise when you want some minimum spacing but still want to break synchronization.
Decorrelated jitter
Each delay is computed from the previous one rather than the attempt number, producing a smoother growth curve. Useful when you want adaptive spacing without a hard exponential ceiling.
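The three strategies differ by only a line or two each. The sketch below assumes a base delay of 1 second and a cap of 1 hour, both arbitrary illustration values, and the multiplier of 3 in the decorrelated variant is the commonly cited choice rather than anything a specific provider documents.

```python
import random


def exponential_delay(attempt: int, base: float = 1.0, cap: float = 3600.0) -> float:
    """Plain exponential backoff: base * 2^attempt, capped."""
    return min(cap, base * (2 ** attempt))


def full_jitter(attempt: int, base: float = 1.0, cap: float = 3600.0, rng=random) -> float:
    """Uniform between zero and the full exponential value."""
    return rng.uniform(0.0, exponential_delay(attempt, base, cap))


def equal_jitter(attempt: int, base: float = 1.0, cap: float = 3600.0, rng=random) -> float:
    """Keep half the exponential delay, randomize the other half."""
    half = exponential_delay(attempt, base, cap) / 2
    return half + rng.uniform(0.0, half)


def decorrelated_jitter(previous_delay: float, base: float = 1.0,
                        cap: float = 3600.0, rng=random) -> float:
    """Derive the next delay from the previous delay, not the attempt number."""
    return min(cap, rng.uniform(base, previous_delay * 3))


# Example schedule: seed decorrelated jitter with the base delay.
delay = 1.0
schedule = []
for _ in range(5):
    delay = decorrelated_jitter(delay)
    schedule.append(delay)
```

Passing a seeded `random.Random` instance as `rng` makes a schedule reproducible in tests, which is the main reason the generator is a parameter here.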
On the receiving end of a managed gateway, the configurable budget is generous. Production webhook systems typically cap individual retry intervals between 6 and 12 hours with a total window of 1 to 3 days, and disable an endpoint after roughly 3 to 5 days of sustained failure. As a point of reference for what is achievable, Hookdeck's gateway documents up to 50 delivery attempts over as long as a week — that is a product maximum, not an industry norm, and it is far more aggressive than what Stripe or Shopify do natively.
05 — Status-Code Decision Tree
Which response codes are retriable — and the two surprises.
Most guides collapse this into "2xx good, 4xx don't retry, 5xx retry." That heuristic is wrong in two important places: a 408 Request Timeout and a 429 Too Many Requests are both 4xx codes that should be retried. The table below is the version that survives production. Note that most providers, including Stripe, treat 3xx redirects as non-retriable failures — point your webhook at the final URL, never a redirect.
| Response | Retry? | Reason | Recommended consumer action |
|---|---|---|---|
| 2xx | No | Acknowledged | Return 200/202 fast; do real work async |
| 3xx | No | Treated as failure by most providers | Register the final URL, never a redirect |
| 400 / 401 / 403 / 404 / 410 | No | Client error — retry will not help | Fix the endpoint or auth; alert, do not loop |
| 408 Request Timeout | Yes | Transient — exception to the 4xx rule | Allow retry with backoff |
| 429 Too Many Requests | Yes | Rate-limited — exception to the 4xx rule | Honor Retry-After header |
| 5xx | Yes | Server error — likely transient | Retry with exponential backoff + jitter |
| Connection / DNS failure | Yes | No response — transient by assumption | Retry; trip circuit breaker if sustained |
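The table translates directly into a small classifier for an outbound sender. This is a sketch: the function names are hypothetical, `None` stands in for a connection or DNS failure, and the Retry-After parser handles only the delay-in-seconds form, not the HTTP-date form.

```python
from typing import Optional

# The two 4xx codes that are exceptions to the "4xx means do not retry" rule.
RETRIABLE_4XX = {408, 429}


def should_retry(status: Optional[int]) -> bool:
    """Decide whether a delivery attempt should be retried.

    status is the HTTP status code, or None when no response was received.
    """
    if status is None:
        return True            # connection / DNS failure: transient by assumption
    if 200 <= status < 300:
        return False           # acknowledged
    if 300 <= status < 400:
        return False           # redirects are treated as failures, not followed
    if status in RETRIABLE_4XX:
        return True
    if 400 <= status < 500:
        return False           # client error: a retry will not help
    if 500 <= status < 600:
        return True            # server error: likely transient
    return False               # anything else: assumption, treat as non-retriable


def retry_after_seconds(header_value: Optional[str]) -> Optional[float]:
    """Parse a Retry-After header given as whole seconds; None if absent or a date."""
    if header_value is None:
        return None
    value = header_value.strip()
    return float(value) if value.isdigit() else None
```

When `retry_after_seconds` returns a value for a 429, a sender would normally wait at least that long before the next attempt instead of using its own computed backoff.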
For outbound senders, pair this tree with a per-endpoint circuit breaker: open the breaker when, say, 5 of the last 10 requests fail, hold it open for a cooldown of roughly 30 to 120 seconds, then half-open to test recovery before resuming full traffic. Per-endpoint scope matters in multi-tenant systems — one customer's broken endpoint should never throttle deliveries to everyone else.
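A per-endpoint breaker with those parameters might look like the following. It is a simplified sketch: the class and the `breaker_for` registry are hypothetical, the 60-second cooldown is one point in the 30 to 120 second range, and the half-open state here allows every request through until a result is recorded, where a stricter design would admit a single probe.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, window: int = 10, failure_threshold: int = 5,
                 cooldown_seconds: float = 60.0, clock=time.monotonic):
        self._results = deque(maxlen=window)  # True = success, False = failure
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._opened_at = None                # None means the breaker is closed

    def allow_request(self) -> bool:
        """Closed: always allow. Open: allow only once the cooldown has elapsed."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self._cooldown

    def record(self, success: bool) -> None:
        if self._opened_at is not None:
            # Result of a half-open probe: close on success, restart cooldown on failure.
            if success:
                self._opened_at = None
                self._results.clear()
            else:
                self._opened_at = self._clock()
            return
        self._results.append(success)
        failures = sum(1 for ok in self._results if not ok)
        if failures >= self._threshold:
            self._opened_at = self._clock()


_breakers = {}


def breaker_for(endpoint_url: str) -> CircuitBreaker:
    """One breaker per endpoint, so one tenant's outage never blocks another's."""
    return _breakers.setdefault(endpoint_url, CircuitBreaker())
```

Keying the registry on the endpoint URL is what gives the per-endpoint scope; a single global breaker would reintroduce the cross-tenant throttling the paragraph warns about.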
06 — Dead-Letter Queues
Catch the poison events before they block the queue.
A dead-letter queue is where events go after exhausting their retries, so a single un-processable "poison" event does not wedge the main pipeline forever. In Amazon SQS the mechanism is a redrive policy with a maxReceiveCount: once a consumer has received a message that many times without deleting it, SQS moves the message to the DLQ. AWS guidance is to set maxReceiveCount high enough to permit genuine retries — at least 3 for standard queues — so transient errors are not misclassified as poison.
Two SQS subtleties bite teams in production. First, DLQ message expiration is computed from the original enqueue timestamp, not DLQ arrival — a message that spent a day in the source queue before failing has only the remaining retention left once it lands in the DLQ, so always set DLQ retention longer than source retention. Second, attaching a DLQ to a FIFO queue breaks strict ordering for the affected messages; decide whether ordering or poison-isolation matters more for that stream.
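As a configuration sketch, the redrive policy and the retention rule can be expressed as the attribute maps SQS expects, where every value is a string and the redrive policy is itself a JSON string. The helper names, the example ARN, the `maxReceiveCount` of 5, and the 4-day source retention are assumptions; the maps would be passed to whatever call sets queue attributes in your SDK (for example `set_queue_attributes` in boto3).

```python
import json

SQS_MAX_RETENTION_SECONDS = 14 * 24 * 3600  # 1,209,600 s, the SQS retention ceiling


def source_queue_attributes(dlq_arn: str, max_receive_count: int = 5,
                            retention_seconds: int = 4 * 24 * 3600) -> dict:
    """Attributes for the main queue: retention plus the redrive policy."""
    return {
        "MessageRetentionPeriod": str(retention_seconds),
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": str(max_receive_count),
        }),
    }


def dlq_attributes(source_retention_seconds: int,
                   retention_seconds: int = SQS_MAX_RETENTION_SECONDS) -> dict:
    """Attributes for the DLQ, refusing a retention that does not exceed the source's.

    Expiry is measured from the original enqueue time, so a DLQ with equal or
    shorter retention could drop a message soon after it arrives.
    """
    if retention_seconds <= source_retention_seconds:
        raise ValueError("DLQ retention must be longer than source retention")
    return {"MessageRetentionPeriod": str(retention_seconds)}


# Hypothetical ARN, for illustration only.
main_attrs = source_queue_attributes("arn:aws:sqs:us-east-1:123456789012:webhooks-dlq")
dead_attrs = dlq_attributes(int(main_attrs["MessageRetentionPeriod"]))
```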
DLQ depth threshold
Recommended practice is to alert when DLQ depth exceeds roughly 10 events — a small standing backlog is the early signal that a downstream dependency or handler bug is dropping events into the dead-letter path.
Oldest-event age
Alert when the oldest event in the DLQ has sat unreviewed for more than about an hour. Age catches the slow leak that depth alone misses — one event that never gets triaged is a silent data-loss risk.
DLQ retention
A 14-day DLQ retention is a sensible recommendation versus a much shorter main-queue retention; 14 days is also the maximum retention SQS allows. Where the dead-letter store is not bound by that cap, 30 days is a reasonable target for webhook DLQs, long enough to investigate, replay, and reconcile before events expire.
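The depth and age thresholds above reduce to a small check that a monitoring job could run on each poll. This is a sketch with hypothetical names; how the depth and oldest-event age are measured depends on the queue in use and is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DlqSnapshot:
    depth: int                       # number of events currently in the DLQ
    oldest_event_age_seconds: float  # age of the oldest unreviewed event


def dlq_alerts(snapshot: DlqSnapshot, max_depth: int = 10,
               max_age_seconds: float = 3600.0) -> List[str]:
    """Return the names of the thresholds the snapshot breaches."""
    alerts = []
    if snapshot.depth > max_depth:
        alerts.append("depth")
    # Age only matters when something is actually waiting in the queue.
    if snapshot.depth > 0 and snapshot.oldest_event_age_seconds > max_age_seconds:
        alerts.append("age")
    return alerts
```

Checking both conditions independently matters: a single event that has waited two hours trips the age alert even though the depth is far below its threshold.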
The DLQ is not a graveyard — it is a replay buffer. Pair it with a tool that lets an operator inspect a dead-lettered event, fix the underlying handler bug, and redrive the event back through processing. Because every event is keyed on its idempotent ID, replaying a DLQ event that was actually processed before failing is safe: the dedup layer absorbs the duplicate.
07 — Signature Verification
HMAC-SHA256 everywhere, but no two providers agree on the details.
Signature verification proves an event genuinely came from the provider and was not forged or tampered with. The algorithm is nearly universal — HMAC-SHA256 keyed on a shared secret — but the surrounding conventions are a babel. Stripe signs {timestamp}.{raw_body} and delivers it in Stripe-Signature as t=<ts>,v1=<sig>. Shopify signs the raw body, base64-encodes the digest, and sends it in X-Shopify-Hmac-SHA256 — no timestamp prefix. GitHub uses X-Hub-Signature-256; Slack uses X-Slack-Signature. There is no shared standard to code against, which is exactly why a per-provider matrix earns its keep.
Two rules are non-negotiable. First, always verify against the raw request body bytes, before any JSON parsing or framework deserialization re-serializes and changes the bytes — a single whitespace difference breaks the HMAC. Second, compare the computed and received signatures with a constant-time comparison function — hmac.compare_digest in Python, crypto.timingSafeEqual in Node.js. A naive == comparison returns faster on an earlier-mismatching byte, leaking timing information an attacker can use to guess the signature byte by byte.
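Both rules, and the difference between the two signing conventions, show up in a short sketch. The function names are hypothetical and this is not a replacement for a provider's official SDK; it assumes the header formats described above (a `t=<ts>,v1=<sig>` header with a hex digest for the Stripe style, a base64 digest of the raw body for the Shopify style) and a 300-second tolerance.

```python
import base64
import hashlib
import hmac
import time


def verify_stripe_style(raw_body: bytes, header: str, secret: str,
                        tolerance_seconds: int = 300, now=None) -> bool:
    """Verify a 't=<ts>,v1=<hex sig>' header over '<ts>.<raw body>'."""
    pairs = [p.strip().split("=", 1) for p in header.split(",") if "=" in p]
    timestamp = next((v for k, v in pairs if k == "t"), None)
    candidates = [v for k, v in pairs if k == "v1"]
    if timestamp is None or not candidates:
        return False
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    current = time.time() if now is None else now
    if abs(current - sent_at) > tolerance_seconds:
        return False  # stale or replayed delivery
    signed_payload = timestamp.encode() + b"." + raw_body
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes, never ==.
    return any(hmac.compare_digest(expected.encode(), c.encode()) for c in candidates)


def verify_shopify_style(raw_body: bytes, header_value: str, secret: str) -> bool:
    """Verify a base64-encoded HMAC-SHA256 digest of the raw body."""
    digest = hmac.new(secret.encode(), raw_body, hashlib.sha256).digest()
    expected = base64.b64encode(digest)
    return hmac.compare_digest(expected, header_value.encode())
```

Both functions take `raw_body` as bytes on purpose: the caller has to hand over the request body exactly as received, before any JSON parsing touches it.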
Shopify sends X-Shopify-Triggered-At so you can detect a stale payload during retries. If a provider signs a timestamp, check it.
08 — The Core Pattern
Accept fast, then queue: the move that fixes everything.
The single most dangerous failure mode in webhook handling is the timeout-induced duplicate. Your handler receives the event, does the real work — charges the card, ships the order — and then takes a beat too long to respond. The provider's timeout fires (often 5 to 15 seconds; Shopify is stricter still at a 1-second connection timeout and a 5-second full-request timeout), it concludes delivery failed, and it retries an operation that already completed. The duplicate is born not from a network problem but from your own slowness.
The fix is counter-intuitive for anyone trained on synchronous HTTP: do not do the work in the request. Verify the signature, deduplicate on the event ID, persist the raw event to a durable queue, and return 200 or 202 immediately. Then process asynchronously from the queue. This decoupling is what lets a system absorb thousands of events per second; a well-built ingestion gateway can keep added latency under a few seconds for the overwhelming majority of events while doing all of this.
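The four steps can be sketched as a receiver function plus a background worker. Everything here is a stand-in chosen to keep the example self-contained: `queue.Queue` and an in-memory set are not durable, the `verify` callable is supplied by the caller, and the event ID is read from a top-level `id` field, which is an assumption about the payload shape.

```python
import json
import queue
import threading


def make_receiver(verify, seen_ids: set, work_queue: "queue.Queue"):
    """Build the request handler: verify, deduplicate, enqueue, acknowledge."""
    def receive(raw_body: bytes, headers: dict) -> int:
        if not verify(raw_body, headers):
            return 401                      # forged or tampered: reject, nothing enqueued
        event_id = json.loads(raw_body)["id"]
        if event_id in seen_ids:
            return 200                      # duplicate: acknowledge and do nothing
        work_queue.put(raw_body)            # persist first, then mark as seen
        seen_ids.add(event_id)
        return 202                          # accepted; real work happens later
    return receive


def run_worker(work_queue: "queue.Queue", process) -> None:
    """Drain the queue until a None sentinel arrives."""
    while True:
        raw_body = work_queue.get()
        try:
            if raw_body is None:
                return
            process(json.loads(raw_body))
        except Exception:
            # Retry counting and DLQ routing are omitted from this sketch.
            pass
        finally:
            work_queue.task_done()


events = queue.Queue()
handled = []
receive = make_receiver(lambda body, headers: True, set(), events)
worker = threading.Thread(target=run_worker, args=(events, handled.append))
worker.start()
status_first = receive(b'{"id": "evt_1", "type": "order.created"}', {})
status_repeat = receive(b'{"id": "evt_1", "type": "order.created"}', {})
events.put(None)  # stop the worker
worker.join(timeout=5)
```

The handler never calls `process`, so its response time is bounded by the verify, lookup, and enqueue steps regardless of how slow the business logic is.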
"Accept the event immediately, push it onto a durable queue, and process it asynchronously." — Hookdeck, Webhook Infrastructure Guide
Once you adopt accept-then-queue, the other layers slot in cleanly. Signature verification happens at the edge before anything is enqueued. Idempotency is checked against the dedup store before the work runs. Failed processing increments a retry count and eventually lands in the DLQ. Each concern lives in one place. This is the same reliability layer that makes any event-driven integration production-safe — the kind of architecture we build into custom web and application development engagements and reuse when wiring real-time triggers in CRM automation workflows. If you are evaluating a build-versus-buy decision for the ingestion layer itself, our total-cost-of-ownership analysis for workflow automation walks through where webhook reliability lands on the ledger.
09 — Ordering & Reconciliation
Events arrive out of order — design for it.
Stripe states plainly that it does not guarantee events arrive in the order they were generated. A customer.subscription.updated can land before the created event that logically preceded it. Building state machines that assume ordered arrival is a quiet recipe for corrupted state. The robust pattern is to treat each webhook as a trigger, not a source of truth: on receipt, fetch the object's current state from the provider's API and act on that, rather than trusting the event payload to reflect the latest reality. The same reliability layer underpins event-driven tools we have written about elsewhere: the inbound deliveries in our Slack event-subscriptions tutorial and the dispatch path in our MCP server TypeScript walkthrough both depend on exactly these idempotency and retry guarantees.
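A minimal sketch of the trigger-not-truth pattern follows. The function name and the `fetch_subscription` callable are hypothetical; in practice the callable would wrap the provider's retrieve-object API call. The event shape assumed here nests the object under `data.object`, as Stripe events do.

```python
def apply_subscription_event(event: dict, fetch_subscription, local_store: dict) -> dict:
    """Use the event only to learn which object changed, then fetch its current state."""
    subscription_id = event["data"]["object"]["id"]
    current = fetch_subscription(subscription_id)  # authoritative state from the API
    local_store[subscription_id] = current         # payload contents are ignored
    return current


# A late-arriving event carrying an outdated status does no damage:
provider_state = {"sub_1": {"id": "sub_1", "status": "active"}}
local = {}
stale_event = {
    "type": "customer.subscription.updated",
    "data": {"object": {"id": "sub_1", "status": "incomplete"}},
}
apply_subscription_event(stale_event, provider_state.__getitem__, local)
```

Because every event for the same object converges on the same fetched state, the order in which the events arrive stops mattering.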
Looking ahead, the standards layer is slowly converging even as provider conventions stay fragmented. The CNCF's CloudEvents specification — version 1.0.2, released in 2022 and graduated as a CNCF project on January 25, 2024 — defines a common event envelope with fields such as id, source, type, time, and datacontenttype, and is adopted across Amazon EventBridge, Azure Event Grid, Google Cloud Eventarc, and Knative. That id field is, conveniently, the same stable identifier your dedup layer wants. As CloudEvents adoption widens, the per-provider header babel should narrow — though pragmatically, expect to keep a provider matrix for years yet.
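For illustration, here is how a dedup key could be derived from a CloudEvents envelope. The helper name and the example event are made up; the sketch relies on the specification's required attributes (`id`, `source`, `specversion`, `type`) and on its rule that `source` plus `id` is unique per event, which is why the key pairs the two rather than using `id` alone.

```python
REQUIRED_ATTRIBUTES = ("id", "source", "specversion", "type")


def cloudevent_dedup_key(event: dict) -> tuple:
    """Return (source, id) for a CloudEvents envelope, validating required attributes."""
    missing = [name for name in REQUIRED_ATTRIBUTES if name not in event]
    if missing:
        raise ValueError("missing CloudEvents attributes: " + ", ".join(missing))
    return (event["source"], event["id"])


# Hypothetical event, for illustration only.
example_event = {
    "specversion": "1.0",
    "id": "A234-1234-1234",
    "source": "https://example.com/orders",
    "type": "com.example.order.created",
    "time": "2024-01-25T12:00:00Z",
    "datacontenttype": "application/json",
    "data": {"order_id": "1001"},
}
dedup_key = cloudevent_dedup_key(example_event)
```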
Webhook as the sole trigger
Acceptable for low-stakes notifications — a Slack ping, a cache bust — where a missed event is survivable. Do not use this alone for money, inventory, or anything where a dropped event causes real harm.
Webhook plus scheduled reconciliation
Treat webhooks as best-effort notifications and add a periodic API poll (for example every few minutes) that reconciles state. The webhook gives you latency; the poll gives you a guarantee that nothing is permanently lost.
FIFO queue with care
When order genuinely matters, a FIFO queue preserves it — but remember that attaching a DLQ to a FIFO queue breaks strict ordering for affected messages, and the application must still be idempotent because FIFO exactly-once is a queue-layer property, not an end-to-end one.
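The reconciliation option can be sketched as a function a scheduler runs every few minutes. The names are hypothetical, and `list_remote` stands in for whatever paginated list call the provider's API offers; the sketch assumes each remote object is a dict with an `id` field.

```python
from typing import Callable, Dict, Iterable, List


def reconcile(local_state: Dict[str, dict],
              list_remote: Callable[[], Iterable[dict]]) -> List[str]:
    """Overwrite local records that are missing or differ from the provider's state.

    Returns the IDs that had to be repaired, which is a useful metric:
    a non-empty list means a webhook was missed or applied incorrectly.
    """
    repaired = []
    for remote_object in list_remote():
        object_id = remote_object["id"]
        if local_state.get(object_id) != remote_object:
            local_state[object_id] = remote_object
            repaired.append(object_id)
    return repaired


local_orders = {"ord_1": {"id": "ord_1", "status": "paid"}}
remote_orders = [
    {"id": "ord_1", "status": "paid"},
    {"id": "ord_2", "status": "paid"},  # its webhook never arrived
]
repaired_ids = reconcile(local_orders, lambda: remote_orders)
```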
10 — Conclusion
Reliability is four layers, built once.
Build for duplicates, acknowledge fast, reconcile always.
Webhook reliability is not a single trick but a small stack of mutually reinforcing layers, and the order they go in matters. Verify the signature against the raw body with a constant-time compare. Deduplicate on the stable event ID with a cache that outlives the provider's retry window. Acknowledge with a fast 200 or 202 and push the work onto a durable queue. Retry with exponential backoff and jitter, route exhausted events to a monitored dead-letter queue, and reconcile periodically against the provider's API so an event that never arrived still gets caught.
The mental model that holds it all together is the one from the theory: exactly-once is a processing guarantee, never a delivery guarantee. Once you stop trying to make the wire perfect and start making your handler indifferent to duplicates, every other decision becomes a tuning exercise — how long to retain, how aggressively to retry, when to alert — rather than a correctness gamble.
And because no two providers agree on the details, the durable artifact from all of this is the matrix: retry window, signature scheme, dedup header, and minimum cache TTL per provider, written down once so the next integration is a lookup rather than an archaeology project. Build the four layers once, parameterize them per provider, and webhook reliability stops being a recurring incident and becomes a solved problem.