Webhook reliability comes down to a single uncomfortable fact: every major provider delivers events at least once, never exactly once, which means your consumer will eventually receive the same event twice and must handle it without double-charging, double-shipping, or double-emailing. Idempotency is not a nice-to-have — it is the load-bearing wall of any production webhook integration.

The cost of getting this wrong is concrete. A payment event processed twice becomes a duplicate charge. An order-created event processed twice becomes a duplicate fulfillment. And the most common cause is not a network glitch you can see — it is your own handler finishing the work but answering the provider a few milliseconds too late, so the provider assumes failure and retries an operation that already succeeded.

This reference assembles the four reliability layers — deduplication, retry with backoff, dead-letter queues, and signature verification — into decision tables you can build against. We compare Stripe, Shopify, Svix, and Amazon SQS side by side, then give a status-code decision tree, recommended DLQ thresholds, and the accept-then-queue architecture that absorbs all of it.

Key takeaways

01
At-least-once delivery is universal; exactly-once is not. Stripe's own docs state endpoints 'might occasionally receive the same event more than once.' AWS SQS Standard queues say the same. Treating delivery as exactly-once is a category error; build for duplicates by default.
02
Idempotency is the consumer's job, keyed on a stable event ID. Every well-designed webhook carries a unique event ID that stays constant across retries: Stripe's event id, Shopify's X-Shopify-Webhook-Id, Svix's webhook-id. Store it; skip reprocessing on a repeat. The dedup cache TTL must outlive the provider's full retry window.
03
The timeout-induced duplicate is the most dangerous failure mode. If your handler completes the work but the HTTP response lands after the provider's timeout (often 5–15s), the provider retries and your business logic runs twice. The fix is counter-intuitive: acknowledge with 200/202 immediately, then process asynchronously.
04
Retry policy differs sharply by provider: read the matrix, not the headline. Stripe retries for up to 3 days in live mode; Shopify retries 8 times over 4 hours and may auto-delete Admin-API subscriptions after 8 straight failures; Svix runs roughly 8 attempts across about a day. Configure DLQ retention and dedup TTL to each provider, not to a single global number.
05
Signature verification needs constant-time comparison. HMAC-SHA256 is near-universal, but headers and signed-payload composition are not standardized: Stripe prepends a timestamp, Shopify does not. Always compare signatures with hmac.compare_digest or crypto.timingSafeEqual to avoid timing-attack leakage.

01 - The Delivery Guarantee

Why exactly-once delivery is impossible.

Before designing anything, internalize the constraint: no webhook provider can guarantee exactly-once delivery, because exactly-once delivery is a proven impossibility in distributed systems, not a feature waiting to be built. The result traces back to the Two Generals Problem and the FLP impossibility theorem published in 1985, which showed that no deterministic protocol can guarantee consensus in an asynchronous network where even one participant may fail.

The practical consequence is a clean distinction worth carrying into every design review: exactly-once is achievable as a processing guarantee, never as a delivery guarantee. The wire will sometimes deliver a duplicate. What you control is whether processing that duplicate produces a second side effect. That is the entire job of idempotency, and it sits on the consumer side — the receiver — not the sender.

"Exactly-once is not a delivery guarantee. It is a processing guarantee." (Hookdeck, At-Least-Once vs. Exactly-Once Webhook Delivery Guarantees)

Vendor documentation and the queueing layer agree on this explicitly. Stripe's webhook docs warn that an endpoint "might occasionally receive the same event more than once." Amazon SQS Standard queues describe the same behavior at the infrastructure level: because messages are stored redundantly across multiple servers, a copy can reappear if a server holding that message is unavailable when the message is received or deleted. AWS's guidance is blunt: design applications to be idempotent so that processing the same message more than once does no harm.

The reframe that ends the argument

When a product manager asks for "guaranteed delivery," the correct answer is that delivery cannot be guaranteed exactly once — but processing can be made exactly-once through idempotency plus reconciliation. Frame the requirement as a processing guarantee and the engineering becomes tractable. Frame it as a delivery guarantee and you are chasing a theorem that says no.

02 - Provider Decision Matrix

One table for four providers, every reliability column.

The single most useful artifact for a multi-vendor webhook integration is a matrix that puts retry policy, signature scheme, and deduplication semantics side by side. No two providers agree on header names, signed-payload composition, or retry windows, so the integration cost is real, and it compounds per provider. Truto, a multi-SaaS integration platform, reports in-house integration maintenance running upward of $50,000 per integration, with these inconsistencies as a hidden cost driver. Treat that figure as vendor-stated context rather than an independent benchmark, but the direction is right: heterogeneity is expensive.

Webhook provider reliability decision matrix · values current as of cited dates · verify against live provider docs before shipping

| Dimension | Stripe | Shopify | Svix | AWS SQS (Standard) |
| --- | --- | --- | --- | --- |
| Delivery guarantee | At-least-once | At-least-once | At-least-once | At-least-once |
| Retry window | Up to 3 days (live mode) | ~4 hours | ~1 day (≈27h) | Bounded by visibility timeout + maxReceiveCount |
| Retry count | Many (3 in sandbox) | 8 attempts | ~8 attempts | Per maxReceiveCount |
| Backoff | Exponential | Exponential | Stepped schedule | Visibility-timeout driven |
| Signature header | `Stripe-Signature` | `X-Shopify-Hmac-SHA256` | `svix-signature` | IAM / SQS auth (no HMAC header) |
| Signing algorithm | HMAC-SHA256 | HMAC-SHA256 (base64) | HMAC-SHA256 | AWS SigV4 (transport) |
| Signed payload | `timestamp.body` | raw body | `id.timestamp.body` | n/a |
| Replay tolerance | 5 minutes (default) | Use `X-Shopify-Triggered-At` | Timestamp in signature | n/a |
| Dedup ID header | Event `id` | `X-Shopify-Webhook-Id` | `webhook-id` (stable) | MessageId / dedup ID (FIFO) |
| Auto-disable | After sustained failure | After 8 fails (Admin-API subs) | Per endpoint config | DLQ via maxReceiveCount |
| Min dedup cache TTL | ≥ 3 days | ≥ 4 hours (24h safe) | ≥ 27 hours (full retry window) | ≥ retention window |

Read the footnotes, not just the cells

Three watch-outs hide inside the matrix. Stripe's 3-day window is live mode only — sandbox retries roughly three times over a few hours. Shopify's auto-deletion applies to Admin-API-created subscriptions, not necessarily every subscription type. And the Svix step schedule here is the representative timing documented by Hookdeck; confirm exact intervals against Svix's own delivery docs before relying on them.

03 - Consumer-Side Idempotency

Deduplicate on a stable event ID.

The mechanism is simple and the discipline is everything: every well-designed webhook carries a unique event ID that stays the same across retries. That identifier is precisely what distinguishes a retry from a genuinely new event. On receipt, you record the ID; if you have seen it before within your dedup window, you acknowledge and do nothing. The provider gets its 200, your business logic runs exactly once.

Two storage strategies cover almost every case. A database unique constraint on the event ID suits transactional operations — the insert either succeeds (first time) or violates the constraint (duplicate), and you branch on that atomically. A Redis key with a TTL suits high-throughput streams where a database round-trip per event is too costly. The choice is about throughput and consistency needs, not correctness — both work.
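The database-constraint strategy can be pictured with the minimal sketch below. It uses Python's built-in sqlite3 as a stand-in for a production database; the table name `processed_events` and the helper names are illustrative assumptions, not anything a provider prescribes.

```python
import sqlite3


def open_dedup_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the (hypothetical) processed_events table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events ("
        " event_id TEXT PRIMARY KEY,"
        " received_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def claim_event(conn: sqlite3.Connection, event_id: str) -> bool:
    """Return True the first time an event ID is seen, False on a repeat."""
    try:
        with conn:  # commits on success, rolls back if the insert fails
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event_id,),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary-key constraint rejected the insert: a duplicate delivery.
        return False
```

In a real handler the insert would typically share a transaction with the business write, so the claim and the side effect commit or roll back together.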

"Every well-designed webhook includes a unique event ID. This identifier stays the same across retries, which is what distinguishes a retry from a genuinely new event." (Svix Webhook University, Idempotency and Deduplication)

The TTL formula nobody states out loud

Your deduplication cache must persist at least as long as the provider's full retry window. If a provider retries for up to three days and your Redis TTL is one day, a day-two retry sails past an expired key and reprocesses as new. Set dedup TTL ≥ retry window per provider: three days for Stripe live mode, the full window of roughly 27 hours for Svix, several hours (a day to be safe) for Shopify.
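To make the TTL rule concrete, here is a small sketch of a per-provider dedup cache. It is an in-memory stand-in for what Redis would usually do with a single `SET key value NX EX ttl` command; the class name, the provider keys, and the padded TTL values are assumptions chosen for illustration.

```python
import time

# Dedup TTLs in seconds, each at least as long as the provider's retry window.
DEDUP_TTL_SECONDS = {
    "stripe": 3 * 24 * 3600,   # live-mode retries run for up to 3 days
    "shopify": 24 * 3600,      # ~4 hour window, padded to a day
    "svix": 2 * 24 * 3600,     # ~27 hour window, padded to 2 days (assumed margin)
}


class TtlDedupCache:
    """In-memory stand-in for a Redis-style 'set if absent, with expiry' check."""

    def __init__(self, clock=time.time):
        self._expiry = {}      # (provider, event_id) -> expiry timestamp
        self._clock = clock

    def first_seen(self, provider: str, event_id: str) -> bool:
        """Return True for a new event, False for a repeat inside the TTL."""
        now = self._clock()
        key = (provider, event_id)
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False       # key still live: this is a duplicate
        # New event, or the old key expired: record it with a fresh TTL.
        self._expiry[key] = now + DEDUP_TTL_SECONDS[provider]
        return True
```

Shortening the Stripe entry to one day reproduces the failure described above: a retry on day two would find the key expired and be treated as new.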

Sender-side idempotency keys are the mirror image and worth understanding even as a consumer. Stripe's API accepts an Idempotency-Key header (max 255 characters, a V4 UUID or other high-entropy string, never containing sensitive data such as an email). Stripe saves the resulting status code and body of the first request for that key — including a 500 — and replays it for repeat requests, with a 24-hour TTL after which the key becomes purgeable. The idempotency layer also compares incoming parameters to the original request and errors if they differ, preventing accidental key reuse across different operations.
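The behavior described above (save the first result, replay it, reject a key reused with different parameters) can be sketched as a toy store. This is not Stripe's implementation; the class and exception names are hypothetical, and the 24-hour expiry is left out to keep the sketch short.

```python
class IdempotencyConflict(Exception):
    """Raised when a key is reused with different request parameters."""


class IdempotencyStore:
    """Toy sender-side idempotency layer keyed on an Idempotency-Key value."""

    def __init__(self):
        self._saved = {}  # key -> (params, status_code, body)

    def execute(self, key: str, params: dict, operation):
        if len(key) > 255:
            raise ValueError("idempotency key longer than 255 characters")
        if key in self._saved:
            saved_params, status, body = self._saved[key]
            if saved_params != params:
                # Same key, different request: refuse rather than guess.
                raise IdempotencyConflict(key)
            return status, body  # replay the saved result; do not re-run
        status, body = operation(params)
        # The first result is saved whatever its status, including a 500.
        self._saved[key] = (dict(params), status, body)
        return status, body
```

A caller would pass the same key on every retry of one logical operation, so a retried charge returns the saved response instead of charging again.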

04 - Retry, Backoff & Jitter

Exponential backoff, plus jitter to break the herd.

When a delivery fails, the sender retries on an increasing delay — exponential backoff — so a struggling consumer is not hammered. But naive exponential backoff has a failure mode of its own: if a thousand events fail at the same instant, they all retry at the same computed delay, producing a synchronized thundering herd that knocks the recovering endpoint over again. Jitter randomizes those delays so the load spreads out.

Strategy: Full jitter (recommended default)

random(0, exp_delay)

Picks a delay uniformly between zero and the full exponential value. This gives maximum spread and is the most common choice in production senders because it most effectively dissolves synchronized retries.

Strategy: Equal jitter (balanced spacing)

exp/2 + random(0, exp/2)

Guarantees at least half the exponential delay, then randomizes the rest. A compromise when you want some minimum spacing but still want to break synchronization.

Strategy: Decorrelated jitter (adaptive)

derived from previous delay

Each delay is computed from the previous one rather than the attempt number, producing a smoother growth curve. Useful when you want adaptive spacing without a hard exponential ceiling.
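The three jitter strategies can be written as short functions. This is a sketch under assumptions: the base delay of 1 second and cap of 1 hour are illustrative defaults, and the decorrelated variant uses one common formulation (draw the next delay between the base and three times the previous delay) rather than a formula given in this reference.

```python
import random


def exponential_delay(attempt: int, base: float = 1.0, cap: float = 3600.0) -> float:
    """Un-jittered delay: base * 2**attempt, capped. Attempts start at 0."""
    return min(cap, base * (2 ** attempt))


def full_jitter(attempt: int, base: float = 1.0, cap: float = 3600.0, rng=random) -> float:
    """Uniform between zero and the full exponential delay."""
    return rng.uniform(0.0, exponential_delay(attempt, base, cap))


def equal_jitter(attempt: int, base: float = 1.0, cap: float = 3600.0, rng=random) -> float:
    """Half the exponential delay is guaranteed; the other half is random."""
    half = exponential_delay(attempt, base, cap) / 2.0
    return half + rng.uniform(0.0, half)


def decorrelated_jitter(previous_delay: float, base: float = 1.0,
                        cap: float = 3600.0, rng=random) -> float:
    """Next delay depends on the previous delay, not the attempt number."""
    return min(cap, rng.uniform(base, previous_delay * 3.0))
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests while production code keeps the module-level generator.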

On the receiving end of a managed gateway, the configurable budget is generous. Production webhook systems typically cap individual retry intervals between 6 and 12 hours with a total window of 1 to 3 days, and disable an endpoint after roughly 3 to 5 days of sustained failure. As a point of reference for what is achievable, Hookdeck's gateway documents up to 50 delivery attempts over as long as a week — that is a product maximum, not an industry norm, and it is far more aggressive than what Stripe or Shopify do natively.

05 - Status-Code Decision Tree

Which response codes are retriable, and the two surprises.

Most guides collapse this into "2xx good, 4xx don't retry, 5xx retry." That heuristic is wrong in two important places: a 408 Request Timeout and a 429 Too Many Requests are both 4xx codes that should be retried. The table below is the version that survives production. Note that most providers, including Stripe, treat a 3xx redirect as a failed delivery rather than following it, so point your webhook at the final URL, never a redirect.

Webhook retry status-code decision tree · general provider behavior; confirm per-provider edge cases

| Response | Retry? | Reason | Recommended consumer action |
| --- | --- | --- | --- |
| 2xx | No | Acknowledged | Return 200/202 fast; do real work async |
| 3xx | No | Treated as failure by most providers | Register the final URL, never a redirect |
| 400 / 401 / 403 / 404 / 410 | No | Client error; retry will not help | Fix the endpoint or auth; alert, do not loop |
| 408 Request Timeout | Yes | Transient; exception to the 4xx rule | Allow retry with backoff |
| 429 Too Many Requests | Yes | Rate-limited; exception to the 4xx rule | Honor `Retry-After` header |
| 5xx | Yes | Server error; likely transient | Retry with exponential backoff + jitter |
| Connection / DNS failure | Yes | No response; transient by assumption | Retry; trip circuit breaker if sustained |
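For a sender, the decision tree reduces to a small function. The sketch below mirrors the table rows; the function names are illustrative, and the `Retry-After` handling covers only the delay-in-seconds form of the header, not the HTTP-date form.

```python
from typing import Optional

# 4xx codes that are exceptions to "client errors are not retried".
RETRIABLE_4XX = {408, 429}


def should_retry(status: Optional[int]) -> bool:
    """Decide whether a delivery outcome is worth retrying.

    status is None when no HTTP response arrived (connection or DNS failure).
    """
    if status is None:
        return True                  # no response: transient by assumption
    if 200 <= status < 400:
        return False                 # 2xx acknowledged; 3xx is a failure, not retried
    if status in RETRIABLE_4XX:
        return True                  # 408 and 429 are transient
    if 400 <= status < 500:
        return False                 # other client errors will not fix themselves
    return status >= 500             # server errors are likely transient


def retry_delay(status: Optional[int], retry_after: Optional[str],
                backoff_delay: float) -> float:
    """Prefer a numeric Retry-After on a 429; otherwise use the backoff delay."""
    if status == 429 and retry_after is not None:
        value = retry_after.strip()
        if value.isascii() and value.isdigit():
            return float(value)
    return backoff_delay
```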

For outbound senders, pair this tree with a per-endpoint circuit breaker: open the breaker when, say, 5 of the last 10 requests fail, hold it open for a cooldown of roughly 30 to 120 seconds, then half-open to test recovery before resuming full traffic. Per-endpoint scope matters in multi-tenant systems — one customer's broken endpoint should never throttle deliveries to everyone else.
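A per-endpoint breaker along those lines might look like the sketch below. It assumes the thresholds from the paragraph above (5 failures in the last 10 results, a 60-second cooldown picked from the 30 to 120 second range) and simplifies the half-open state: after the cooldown it lets requests through until the first result is recorded, rather than admitting exactly one probe.

```python
import time
from collections import deque


class CircuitBreaker:
    """Minimal breaker: closed -> open -> half-open probe -> closed."""

    def __init__(self, window: int = 10, failure_threshold: int = 5,
                 cooldown: float = 60.0, clock=time.monotonic):
        self._results = deque(maxlen=window)  # True marks a failure
        self._threshold = failure_threshold
        self._cooldown = cooldown
        self._clock = clock
        self._opened_at = None                # None means the breaker is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Open: block until the cooldown elapses, then allow a probe.
        return self._clock() - self._opened_at >= self._cooldown

    def record(self, success: bool) -> None:
        if self._opened_at is not None:
            if success:
                self._opened_at = None        # probe succeeded: close again
                self._results.clear()
            else:
                self._opened_at = self._clock()  # probe failed: restart cooldown
            return
        self._results.append(not success)
        if sum(self._results) >= self._threshold:
            self._opened_at = self._clock()   # too many recent failures: open
```

Keeping one instance per destination, for example in a dict keyed by endpoint URL, gives the per-endpoint scope that stops one tenant's outage from throttling the rest.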

06 - Dead-Letter Queues

Catch the poison events before they block the queue.

A dead-letter queue is where events go after exhausting their retries, so a single un-processable "poison" event does not wedge the main pipeline forever. In Amazon SQS the mechanism is a redrive policy with a maxReceiveCount: once a consumer has received a message that many times without deleting it, SQS moves the message to the DLQ. AWS guidance is to set maxReceiveCount high enough to permit genuine retries — at least 3 for standard queues — so transient errors are not misclassified as poison.

Two SQS subtleties bite teams in production. First, DLQ message expiration is computed from the original enqueue timestamp, not DLQ arrival — a message that spent a day in the source queue before failing has only the remaining retention left once it lands in the DLQ, so always set DLQ retention longer than source retention. Second, attaching a DLQ to a FIFO queue breaks strict ordering for the affected messages; decide whether ordering or poison-isolation matters more for that stream.
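Two pieces of this are easy to get wrong, so a short sketch may help: the redrive policy itself, and the retention arithmetic behind the first subtlety. The `deadLetterTargetArn` and `maxReceiveCount` keys are the fields of the SQS `RedrivePolicy` queue attribute; the ARN in the usage note and the helper names are made up for illustration.

```python
import json


def redrive_policy(dlq_arn: str, max_receive_count: int = 3) -> str:
    """Build the JSON string used as the source queue's RedrivePolicy attribute."""
    return json.dumps(
        {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    )


def remaining_dlq_seconds(dlq_retention_s: int, age_at_dead_letter_s: int) -> int:
    """Seconds a dead-lettered message has left before it expires.

    Expiry counts from the original enqueue time, so time already spent in
    the source queue is subtracted from the DLQ's retention period.
    """
    return max(0, dlq_retention_s - age_at_dead_letter_s)
```

With a 4-day DLQ retention, a message that failed after one day in the source queue leaves three days to investigate; raising DLQ retention to 14 days leaves thirteen. A call such as `redrive_policy("arn:aws:sqs:us-east-1:123456789012:webhooks-dlq")` yields the attribute value to set on the source queue.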

Alert on depth: DLQ depth threshold of 10 events (page on-call)

Recommended practice is to alert when DLQ depth exceeds roughly 10 events. A small standing backlog is the early signal that a downstream dependency or handler bug is dropping events into the dead-letter path.

Alert on age: oldest-event age of 1 hour (freshness SLO)

Alert when the oldest event in the DLQ has sat unreviewed for more than about an hour. Age catches the slow leak that depth alone misses: one event that never gets triaged is a silent data-loss risk.

Retention floor: DLQ retention of 14 days (vs ~4d main queue)

A 14-day DLQ retention, the maximum SQS allows, is a sensible floor against a much shorter main-queue retention (the SQS default is 4 days). Where the dead-letter store permits longer retention, around 30 days is a reasonable target for webhook DLQs: long enough to investigate, replay, and reconcile before events expire.
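The depth and age thresholds above translate into a very small check that a monitoring job could run. The names here are hypothetical, and how depth and oldest-event age are measured is left to whatever metrics the queue exposes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DlqThresholds:
    max_depth: int = 10            # alert above roughly 10 events
    max_oldest_age_s: int = 3600   # alert when the oldest event exceeds ~1 hour


def dlq_alerts(depth: int, oldest_age_s: int,
               thresholds: DlqThresholds = DlqThresholds()) -> List[str]:
    """Return the names of the alert conditions that are currently breached."""
    alerts = []
    if depth > thresholds.max_depth:
        alerts.append("depth")
    if oldest_age_s > thresholds.max_oldest_age_s:
        alerts.append("age")
    return alerts
```

A single event stuck for two hours trips only the age alert, which is the slow-leak case that a depth-only check would miss.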

The DLQ is not a graveyard — it is a replay buffer. Pair it with a tool that lets an operator inspect a dead-lettered event, fix the underlying handler bug, and redrive the event back through processing. Because every event is keyed on its idempotent ID, replaying a DLQ event that was actually processed before failing is safe: the dedup layer absorbs the duplicate. These same dead-letter and redrive mechanics sit at the heart of any background worker, which our background job queue patterns reference covers alongside choosing between BullMQ, Inngest, and Temporal.

07 - Signature Verification

HMAC-SHA256 everywhere, but no two providers agree on the details.

Signature verification proves an event genuinely came from the provider and was not forged or tampered with. The algorithm is nearly universal — HMAC-SHA256 keyed on a shared secret — but the surrounding conventions are a babel. Stripe signs {timestamp}.{raw_body} and delivers it in Stripe-Signature as t=<ts>,v1=<sig>. Shopify signs the raw body, base64-encodes the digest, and sends it in X-Shopify-Hmac-SHA256 — no timestamp prefix. GitHub uses X-Hub-Signature-256; Slack uses X-Slack-Signature. There is no shared standard to code against, which is exactly why a per-provider matrix earns its keep.

Two rules are non-negotiable. First, always verify against the raw request body bytes, before any JSON parsing or framework deserialization re-serializes and changes the bytes — a single whitespace difference breaks the HMAC. Second, compare the computed and received signatures with a constant-time comparison function — hmac.compare_digest in Python, crypto.timingSafeEqual in Node.js. A naive == comparison returns faster on an earlier-mismatching byte, leaking timing information an attacker can use to guess the signature byte by byte.
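Both rules fit in a few lines for the Shopify-style scheme described above (raw body, base64-encoded HMAC-SHA256 digest). This is a sketch under those stated conventions, with an illustrative function name; a provider's official library, where one exists, is usually the safer choice in production.

```python
import base64
import hashlib
import hmac


def verify_shopify_style(raw_body: bytes, header_value: str, secret: bytes) -> bool:
    """Check a base64 HMAC-SHA256 signature computed over the raw request body."""
    # Rule 1: sign the exact bytes received, before any JSON parsing.
    digest = hmac.new(secret, raw_body, hashlib.sha256).digest()
    expected = base64.b64encode(digest)
    # Rule 2: constant-time comparison instead of ==.
    return hmac.compare_digest(expected, header_value.encode("utf-8"))
```

The whitespace sensitivity is easy to demonstrate: verifying `{"id":1}` against a signature computed over `{"id": 1}` fails, which is why the body must be captured before any framework re-serializes it.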

Replay protection is half the signature

A valid signature only proves authenticity, not freshness — a captured valid request can be replayed. Stripe defends this with a timestamp in the signature and a default 5-minute tolerance: reject anything whose embedded timestamp is older than five minutes. Shopify exposes X-Shopify-Triggered-At so you can detect a stale payload during retries. If a provider signs a timestamp, check it.
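The timestamp check can be combined with signature verification in one function. The sketch below follows the Stripe-style layout described earlier (`t=<ts>,v1=<sig>` over `{timestamp}.{raw_body}`, hex-encoded) with a 300-second tolerance; the function name and the `now` parameter are illustrative, and the sketch is not a replacement for a provider's official library.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_stripe_style(raw_body: bytes, header_value: str, secret: bytes,
                        tolerance_s: int = 300, now: Optional[float] = None) -> bool:
    """Verify a 't=<ts>,v1=<sig>' header and reject stale timestamps."""
    timestamp = None
    candidates = []
    for part in header_value.split(","):
        name, _, value = part.strip().partition("=")
        if name == "t":
            timestamp = value
        elif name == "v1":
            candidates.append(value)
    if timestamp is None or not candidates:
        return False
    if not (timestamp.isascii() and timestamp.isdigit()):
        return False
    # The signed payload is the timestamp, a dot, then the raw body bytes.
    signed_payload = timestamp.encode("ascii") + b"." + raw_body
    expected = hmac.new(secret, signed_payload, hashlib.sha256).hexdigest().encode("ascii")
    signature_ok = any(
        hmac.compare_digest(expected, candidate.encode("utf-8"))
        for candidate in candidates
    )
    # Freshness: a valid signature on an old timestamp is treated as a replay.
    current = time.time() if now is None else now
    fresh = abs(current - int(timestamp)) <= tolerance_s
    return signature_ok and fresh
```

Because the timestamp is part of the signed payload, an attacker cannot refresh a captured request by editing `t` without invalidating the signature.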

08 - The Core Pattern

Accept fast, then queue: the move that fixes everything.

The single most dangerous failure mode in webhook handling is the timeout-induced duplicate. Your handler receives the event, does the real work — charges the card, ships the order — and then takes a beat too long to respond. The provider's timeout fires (often 5 to 15 seconds; Shopify is stricter still at a 1-second connection timeout and a 5-second full-request timeout), it concludes delivery failed, and it retries an operation that already completed. The duplicate is born not from a network problem but from your own slowness.

The fix is counter-intuitive for anyone trained on synchronous HTTP: do not do the work in the request. Verify the signature, deduplicate on the event ID, persist the raw event to a durable queue, and return 200 or 202 immediately. Then process asynchronously from the queue. This decoupling is what lets a system absorb thousands of events per second; a well-built ingestion gateway can keep added latency under a few seconds for the overwhelming majority of events while doing all of this.
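The sequence in the paragraph above (verify, deduplicate, enqueue, acknowledge) can be sketched without any web framework. Everything here is a stand-in chosen for illustration: `queue.Queue` plays the durable queue, a set plays the dedup store, and the signature result is passed in as a boolean rather than computed.

```python
import json
import queue
import threading

events = queue.Queue()     # stand-in for a durable queue (assumption)
seen_ids = set()           # stand-in for the dedup store (assumption)
seen_lock = threading.Lock()
processed = []             # records what the worker handled


def handle_webhook(raw_body: bytes, signature_valid: bool) -> int:
    """Do the minimum in the request path and return an HTTP status code."""
    if not signature_valid:
        return 401
    try:
        event_id = json.loads(raw_body)["id"]
    except (ValueError, KeyError, TypeError):
        return 400
    with seen_lock:
        if event_id in seen_ids:
            return 200     # duplicate: acknowledge and do nothing
        seen_ids.add(event_id)
    events.put(raw_body)   # persist the raw event; the real work happens later
    return 202


def worker() -> None:
    """Drain the queue outside the request path. None is the stop signal."""
    while True:
        raw = events.get()
        if raw is None:
            break
        processed.append(json.loads(raw)["id"])  # slow business logic goes here
```

One caveat with this ordering: if the enqueue fails after the ID is recorded, the event is lost to the dedup check, so a production version would make the two steps atomic or record the ID only once the enqueue succeeds.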

"Accept the event immediately, push it onto a durable queue, and process it asynchronously." (Hookdeck, Webhook Infrastructure Guide)

Once you adopt accept-then-queue, the other layers slot in cleanly. Signature verification happens at the edge before anything is enqueued. Idempotency is checked against the dedup store before the work runs. Failed processing increments a retry count and eventually lands in the DLQ. Each concern lives in one place. This is the same reliability layer that makes any event-driven integration production-safe — the kind of architecture we build into custom web and application development engagements and reuse when wiring real-time triggers in CRM automation workflows. If you are evaluating a build-versus-buy decision for the ingestion layer itself, our total-cost-of-ownership analysis for workflow automation walks through where webhook reliability lands on the ledger.

09 - Ordering & Reconciliation

Events arrive out of order: design for it.

Stripe states plainly that it does not guarantee events arrive in the order they were generated. A customer.subscription.updated can land before the created that logically preceded it. Building state machines that assume ordered arrival is a quiet recipe for corrupted state. The robust pattern is to treat each webhook as a trigger, not a source of truth: on receipt, fetch the object's current state from the provider's API and act on that, rather than trusting the event payload to reflect the latest reality. The same reliability layer underpins event-driven tools we have written about elsewhere — the inbound deliveries in our Slack event-subscriptions tutorial and the dispatch path in our MCP server TypeScript walkthrough both depend on exactly these idempotency and retry guarantees.
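The trigger-not-truth pattern is small enough to show in full. In this sketch the event shape (an `object_id` field) and the `fetch_current` callable are hypothetical; in practice `fetch_current` would wrap a call to the provider's API.

```python
from typing import Callable, Dict


def on_webhook(event: dict, fetch_current: Callable[[str], dict],
               local_state: Dict[str, dict]) -> None:
    """Use the event only to learn which object changed, then re-fetch it."""
    object_id = event["object_id"]
    # Ignore the payload's snapshot; store what the provider says right now.
    local_state[object_id] = fetch_current(object_id)
```

Because each delivery overwrites local state with the provider's current view, an `updated` event arriving before `created` converges to the same result as the reverse order.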

Looking ahead, the standards layer is slowly converging even as provider conventions stay fragmented. The CNCF's CloudEvents specification — version 1.0.2, released in 2022 and graduated as a CNCF project on January 25, 2024 — defines a common event envelope with fields such as id, source, type, time, and datacontenttype, and is adopted across Amazon EventBridge, Azure Event Grid, Google Cloud Eventarc, and Knative. That id field is, conveniently, the same stable identifier your dedup layer wants. As CloudEvents adoption widens, the per-provider header babel should narrow — though pragmatically, expect to keep a provider matrix for years yet.
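As a concrete picture of that envelope, the sketch below builds a dedup key from a CloudEvents-style dict. The specification requires `id`, `source`, `specversion`, and `type`, and asks producers to keep `source` plus `id` unique per event, so the pair is used as the key; the example values and the function name are invented for illustration.

```python
REQUIRED_ATTRIBUTES = ("specversion", "id", "source", "type")


def cloudevent_dedup_key(envelope: dict) -> tuple:
    """Return (source, id), the pair that identifies one distinct event."""
    missing = [name for name in REQUIRED_ATTRIBUTES if name not in envelope]
    if missing:
        raise ValueError("missing CloudEvents attributes: %s" % missing)
    return (envelope["source"], envelope["id"])


# Illustrative envelope; every value here is made up.
example_event = {
    "specversion": "1.0",
    "id": "evt-0001",
    "source": "/billing/invoices",
    "type": "com.example.invoice.paid",
    "time": "2024-01-25T12:00:00Z",
    "datacontenttype": "application/json",
    "data": {"invoice": "in_123"},
}
```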

Real-time only: webhook as the sole trigger (fine for low stakes)

Acceptable for low-stakes notifications (a Slack ping, a cache bust) where a missed event is survivable. Do not use this alone for money, inventory, or anything where a dropped event causes real harm.

Belt and braces: webhook plus scheduled reconciliation (use for anything that matters)

Treat webhooks as best-effort notifications and add a periodic API poll (for example every few minutes) that reconciles state. The webhook gives you latency; the poll gives you a guarantee that nothing is permanently lost.

Strict ordering: FIFO queue with care (only when order is required)

When order genuinely matters, a FIFO queue preserves it, but remember that attaching a DLQ to a FIFO queue breaks strict ordering for affected messages, and the application must still be idempotent because FIFO exactly-once is a queue-layer property, not an end-to-end one.
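The scheduled-reconciliation option reduces to a diff between what the provider reports and what is stored locally. The sketch below assumes both sides fit in plain dicts keyed by object ID; the function name and the data shapes are illustrative, and the remote snapshot would come from a periodic API poll.

```python
def reconcile(local: dict, remote: dict) -> dict:
    """Copy missing or stale entries from remote into local; return what changed."""
    drift = {
        object_id: state
        for object_id, state in remote.items()
        if object_id not in local or local[object_id] != state
    }
    local.update(drift)
    return drift
```

An empty result means the webhooks kept local state current; a non-empty one is both the repair and a useful metric for how many events were missed or processed out of order.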

10 - Conclusion

Reliability is four layers, built once.

The shape of a reliable webhook consumer

Build for duplicates, acknowledge fast, reconcile always.

Webhook reliability is not a single trick but a small stack of mutually-reinforcing layers, and the order they go in matters. Verify the signature against the raw body with a constant-time compare. Deduplicate on the stable event ID with a cache that outlives the provider's retry window. Acknowledge with a fast 200 or 202 and push the work onto a durable queue. Retry with exponential backoff and jitter, route exhausted events to a monitored dead-letter queue, and reconcile periodically against the provider's API so an event that never arrived still gets caught.

The mental model that holds it all together is the one from the theory: exactly-once is a processing guarantee, never a delivery guarantee. Once you stop trying to make the wire perfect and start making your handler indifferent to duplicates, every other decision becomes a tuning exercise — how long to retain, how aggressively to retry, when to alert — rather than a correctness gamble.

And because no two providers agree on the details, the durable artifact from all of this is the matrix: retry window, signature scheme, dedup header, and minimum cache TTL per provider, written down once so the next integration is a lookup rather than an archaeology project. Build the four layers once, parameterize them per provider, and webhook reliability stops being a recurring incident and becomes a solved problem.

Webhook Reliability: Idempotency & Retry Reference
