Background job and queue patterns share one uncomfortable premise: in practice, message delivery in a distributed system is at-least-once, which means every job you enqueue can, and eventually will, run more than once. The mature response is not to chase exactly-once at the broker. It is to make your consumers idempotent so a duplicate run changes nothing.
That single fact reshapes how you think about retries, dead-letter queues, and tool selection. A dead-letter queue is not a safety net that quietly absorbs failures; it is a diagnostic instrument whose depth is a leading indicator that your processing service-level objective is broken. And the serverless job platforms that remove Redis operations also introduce per-run pricing that can cut against high-volume workloads.
This guide is a working reference. It covers the delivery-guarantee decision tree, exponential backoff with concrete wait times, the transactional outbox pattern, durable execution and its exactly-once nuance, backpressure and queue backlogs, a seven-tool selection matrix, and the observability signals that tell you a pipeline is healthy. Every claim below is drawn from primary documentation and named engineering sources.
- 01 At-least-once is the architectural default, not a bug. The Sidekiq wiki states it plainly: jobs execute at least once, not exactly once, and even a completed job can re-run. Treat duplicate delivery as a given and design for it.
- 02 Idempotency is the only durable answer to duplicates. BullMQ's guidance is the test: the final state of the system should not differ whether a job succeeds on its first attempt or fails and succeeds on retry. Use an idempotency key plus a deduplication store.
- 03 Exponential backoff with jitter prevents retry storms. BullMQ's exponential strategy uses 2^(attempts-1) * delay, optionally multiplied by a 0–1 jitter float. Doubling spreads load; jitter breaks the thundering-herd synchronisation.
- 04 The DLQ is a monitor, not a dustbin. AWS SQS redrives a message after ApproximateReceiveCount exceeds maxReceiveCount (default 10). A growing DLQ is the leading signal that processing is failing, so alert on its depth.
- 05 Pick the tool by delivery semantics, not by popularity. BullMQ, Celery, and Sidekiq give you at-least-once with self-hosted infrastructure; Temporal and Inngest add durable, step-level execution. Match the guarantee to the workload before the language.
01 — Delivery Guarantees
At-least-once is the rule, not the exception.
Start every queue design from the guarantee, because it dictates everything downstream. Three delivery semantics exist in theory: at-most-once (fire and forget, drops on failure), at-least-once (redelivers until acknowledged, may duplicate), and exactly-once (each message processed precisely once). In practice, almost every production broker — Amazon SQS, Redis-backed BullMQ, Sidekiq, Celery — defaults to at-least-once, because it is the only one that survives network partitions and worker crashes without silently losing work.
The Sidekiq wiki frames this as an architectural given rather than a limitation: a job can be re-run even after it has completed, because the worker may crash after finishing the work but before acknowledging it. The redelivery that follows is correct behaviour: the broker has no way to know the side effect already happened.
"Sidekiq will execute your job at least once, not exactly once. Even a job which has completed can be re-run." — Sidekiq Wiki, Best Practices
This is why exactly-once delivery, in the strict end-to-end sense, is generally considered impractical to guarantee across a network: the acknowledgement that would confirm a single delivery can itself be lost. What systems like Temporal achieve is exactly-once execution of orchestration logic — covered in section 05 — built precisely on top of an at-least-once substrate. The pragmatic stance for application engineers is therefore not to fight at-least-once but to absorb it, which is exactly what idempotency does.
02 — Idempotency
The delivery-guarantee decision tree nobody publishes.
Most references stop at "make it idempotent." That advice is correct but incomplete, because not every operation can be made idempotent cheaply. The useful framing is a branching decision: how hard is it to make this operation safe to repeat, and what do you do when it is genuinely impossible? BullMQ's own idempotency guidance sets the bar — it should make no difference to the final state of the system whether a job completes on its first attempt or fails and succeeds on retry.
That standard, applied honestly, produces three distinct strategies. For a cheaply-idempotent operation (setting a record to a known state), an idempotency key plus a deduplication check is enough. For an operation with an external, hard-to-reverse side effect (charging a card, sending an email), you need a guaranteed-once publish via the transactional outbox — and still an idempotent consumer on the other side. For a multi-step process that spans services, you need a saga with compensation, which durable-execution engines model directly.
Cheaply idempotent operation
The work can be made safe to repeat at low cost — a state-set, an upsert, a key-scoped write. Attach an idempotency key, check a deduplication store before acting, and let at-least-once redelivery be harmless.
Irreversible side effect
The operation cannot be cheaply undone — charging a card, sending a notification. Publish the event through a transactional outbox for guaranteed-once emission, and still deduplicate on the consumer using the event ID.
Multi-service saga
The flow spans services and must roll back partial progress on failure. Model it as a saga with compensating transactions; durable-execution engines such as Temporal let the compensation live in a plain catch block.
Chasing exactly-once at the broker
Configuring the broker to never redeliver is the trap. End-to-end exactly-once delivery is impractical across a network; effort spent here is effort not spent on the consumer-side idempotency that actually closes the gap.
The deduplication store is the load-bearing detail teams skip. An idempotency key is only useful if a fast, durable store records "this key has been processed" before, or atomically with, the side effect. A common implementation puts the key in Redis or a unique-constrained database column with a sensible time-to-live, so the second delivery short-circuits on the duplicate check. The same discipline applies one layer up, to idempotency and retry strategies for webhooks, where inbound events arrive with the same at-least-once guarantee.
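The duplicate check described above can be sketched in a few lines. This is a minimal illustration, not a production store: `DedupStore` and `handle_once` are hypothetical names, and the in-memory dictionary stands in for Redis or a unique-constrained table. A real store would need the claim to be atomic across workers (for example Redis `SET` with the `NX` and `EX` options).

```python
import time
from typing import Callable, Dict


class DedupStore:
    """In-memory stand-in for Redis or a unique-constrained column (assumption)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expiry_by_key: Dict[str, float] = {}

    def claim(self, key: str) -> bool:
        """Return True for the first claim of a key, False for a live duplicate."""
        now = self._clock()
        expiry = self._expiry_by_key.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate delivery inside the TTL window
        self._expiry_by_key[key] = now + self._ttl
        return True

    def release(self, key: str) -> None:
        """Forget a key so a later retry is allowed to run."""
        self._expiry_by_key.pop(key, None)


def handle_once(store: DedupStore, idempotency_key: str,
                side_effect: Callable[[], None]) -> bool:
    """Run side_effect at most once per key; return True if it ran."""
    if not store.claim(idempotency_key):
        return False  # short-circuit: this key was already processed
    try:
        side_effect()
    except Exception:
        store.release(idempotency_key)  # failed work must stay retryable
        raise
    return True
```

Claiming before the side effect and releasing on failure is one design choice among several. It keeps a failed job retryable, but a crash between the claim and the side effect would leave the key claimed until the TTL expires, which is why the text prefers recording the key atomically with the effect where the datastore allows it.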
03 — Retries
Backoff: exponential with jitter, and when to stop.
Once you accept retries, the next decision is the wait curve between them. BullMQ ships two built-in strategies: a fixed interval and an exponential one that follows the formula 2^(attempts-1) * delay milliseconds. Both can be jittered with a 0–1 float multiplied against the computed delay. The exponential curve is the right default because it gives a transient dependency room to recover while bounding the total retry load, and jitter is what stops a fleet of workers from retrying in lockstep and re-creating the spike that caused the failure.
The wait times compound quickly. With a one-second base, the gaps roughly double each attempt — about one, two, four, eight, sixteen, thirty-two, sixty-four, and one hundred twenty-eight seconds across the first eight tries. The chart below makes the trade-off concrete: an aggressive base clears transient blips fast but risks hammering a struggling dependency, while a longer base is gentler but leaves work sitting in the queue.
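The arithmetic is easy to reproduce. The sketch below evaluates the 2^(attempts-1) × delay formula in Python for a one-second base. The function names are hypothetical, and the jitter scheme (randomising within the last `jitter` fraction of the computed delay) is an assumption chosen for illustration, not a statement of how BullMQ implements its jitter option.

```python
import random
from typing import List, Optional


def exponential_delay_ms(attempts_made: int, base_delay_ms: int) -> int:
    """Delay before the next retry: 2^(attempts - 1) * base delay."""
    if attempts_made < 1:
        raise ValueError("attempts_made starts at 1")
    return (2 ** (attempts_made - 1)) * base_delay_ms


def jittered_delay_ms(attempts_made: int, base_delay_ms: int, jitter: float,
                      rng: Optional[random.Random] = None) -> float:
    """Pick a delay between (1 - jitter) * delay and delay (assumed scheme)."""
    if not 0.0 <= jitter <= 1.0:
        raise ValueError("jitter must be between 0 and 1")
    rng = rng or random.Random()
    delay = exponential_delay_ms(attempts_made, base_delay_ms)
    return rng.uniform(delay * (1.0 - jitter), delay)


def schedule_ms(max_attempts: int, base_delay_ms: int) -> List[int]:
    """Un-jittered wait before each retry, for attempts 1..max_attempts."""
    return [exponential_delay_ms(n, base_delay_ms)
            for n in range(1, max_attempts + 1)]
```

For eight attempts at a 1000 ms base, `schedule_ms(8, 1000)` yields 1000, 2000, 4000, and so on up to 128000 ms, a cumulative wait of 255 seconds. That total is worth computing before choosing an attempt limit, because it is the minimum time a failing job occupies the retry path.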
Figure: Exponential backoff wait time · 1s base, attempts 1–8. Source: BullMQ exponential formula 2^(attempts-1) × delay
Backoff also needs an exit. BullMQ exposes a custom backoffStrategy hook where returning -1 moves the job straight to the failed state, bypassing further retries, while returning 0 sends it to the back of the waiting list. That control matters because not all failures deserve retries: a malformed payload or a 4xx-class permanent error should fail fast to the dead-letter queue rather than burn the full retry budget on a request that can never succeed.
SQS maxReceiveCount
Amazon SQS moves a message to the dead-letter queue once ApproximateReceiveCount exceeds maxReceiveCount. Setting it to 1 means a single transient failure routes straight to the DLQ — usually too aggressive.
BullMQ exponential
The 2^(attempts-1) × delay formula doubles the wait each attempt. Pair it with a 0–1 jitter float so a fleet of workers does not retry in synchronised waves.
Skip remaining retries
A backoffStrategy returning −1 sends the job directly to failed; returning 0 re-queues it at the back of the waiting list. Use −1 for permanent, non-retryable errors.
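The fail-fast rule can be expressed as a small decision function. The sketch below mirrors the return-value convention described above (−1 to skip remaining retries, otherwise a delay in milliseconds), but it is plain Python rather than BullMQ's hook. `PermanentError` and `error_for_status` are hypothetical, and treating 408 and 429 as retryable is an assumed policy.

```python
SKIP_RETRIES = -1  # convention from the text: move the job straight to failed


class PermanentError(Exception):
    """Hypothetical marker for failures that can never succeed on retry."""


def error_for_status(status: int) -> Exception:
    """Assumed policy: 4xx is permanent, except 408 and 429, which may clear."""
    if 400 <= status < 500 and status not in (408, 429):
        return PermanentError(f"HTTP {status}")
    return RuntimeError(f"HTTP {status}")


def backoff_strategy(attempts_made: int, error: Exception,
                     base_delay_ms: int = 1000) -> int:
    """Return -1 for permanent errors, else the exponential delay in ms."""
    if isinstance(error, PermanentError):
        return SKIP_RETRIES
    return (2 ** (attempts_made - 1)) * base_delay_ms
```

Classifying the error once, at the point where it is raised, keeps the retry policy itself trivial: the strategy only has to ask whether the failure is permanent.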
04 — Dead-Letter Queues
The DLQ is a monitor, not a dustbin.
The dead-letter channel is one of the oldest patterns in messaging. Gregor Hohpe and Bobby Woolf named it in Enterprise Integration Patterns back in 2003: when a system decides it cannot or should not deliver a message, it moves the message aside rather than dropping or endlessly redelivering it. The mistake teams make in 2026 is treating that side channel as a graveyard, somewhere failures go to be forgotten, when its real job is to be watched.
"When a messaging system determines that it cannot or should not deliver a message, it may elect to move the message to a Dead Letter Channel." — Gregor Hohpe & Bobby Woolf, Enterprise Integration Patterns
The mechanics are worth getting exactly right. In Amazon SQS, a message lands in the DLQ when its ApproximateReceiveCount exceeds the maxReceiveCount set in the redrive policy (ten by default). There are two configuration traps. First, on standard queues a message's expiration is measured from its original enqueue timestamp even after it moves to the dead-letter queue, so the DLQ's retention period must always exceed the source queue's retention, or dead-lettered messages can expire before anyone inspects them. Second, queue type must match: a standard queue cannot use a FIFO queue as its dead-letter target, and the reverse is equally rejected with InvalidParameterValue.
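Both traps are cheap to check before deployment. The sketch below models the rules from the paragraph above in plain Python. It makes no AWS calls, and `QueueConfig` and both function names are hypothetical helpers rather than SDK types.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QueueConfig:
    """Hypothetical local description of a queue, not an AWS SDK type."""
    name: str
    fifo: bool
    retention_seconds: int


def should_dead_letter(approximate_receive_count: int,
                       max_receive_count: int) -> bool:
    """True once the receive count exceeds (not merely reaches) the limit."""
    return approximate_receive_count > max_receive_count


def redrive_config_problems(source: QueueConfig, dlq: QueueConfig) -> List[str]:
    """List violations of the two configuration rules described in the text."""
    problems: List[str] = []
    if source.fifo != dlq.fifo:
        problems.append("queue types differ: standard and FIFO cannot be paired")
    if dlq.retention_seconds <= source.retention_seconds:
        problems.append("DLQ retention must exceed source queue retention")
    return problems
```

As a usage example, a standard source queue with four days of retention (345,600 seconds) paired with a standard DLQ at fourteen days (1,209,600 seconds) produces an empty problem list, while pointing the same source at a FIFO queue with equal retention reports both violations.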
The reframe that changes operations: a dead-letter queue is an SLO boundary. A non-zero, growing DLQ depth is the leading indicator that your processing objective is broken — not a backlog to clear quarterly, but an alert to fire now. Pair DLQ-depth monitoring with a triage runbook: classify each message as a transient failure to replay, a permanent error to discard, or a poison message that needs a code fix before any replay is safe.
Set maxReceiveCount high enough to absorb transient blips but low enough to surface real failures; three to five is a common starting point rather than the default ten. And alert on DLQ depth, because an unwatched dead-letter queue is just data loss with extra steps.
05 — Guaranteed Delivery
The transactional outbox and durable execution.
For the irreversible-side-effect branch of the decision tree, the transactional outbox is the canonical pattern. The problem it solves: you cannot atomically update your database and publish a message to a broker in a single transaction, so a crash between the two can leave them inconsistent. The outbox sidesteps this by writing the event to an outbox table inside the same database transaction as the business operation. A background relay then reads from the outbox, publishes the event, and marks the record processed — guaranteed delivery without a fragile two-phase commit across systems.
The modern implementation reads the outbox via change-data-capture rather than polling. Debezium is the widely-used CDC tool here, streaming committed rows to the broker shortly after the database commit. The trade-off to understand is that the outbox guarantees the event is published at least once — so the downstream consumer still has to be idempotent. The outbox solves the publish-atomicity problem; it does not eliminate the duplicate-processing problem.
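A polling relay is the simplest way to see the pattern end to end. The sketch below uses SQLite from the Python standard library. The table names, the `order.placed` event type, and the `publish` callback are all assumptions made for illustration, and a CDC-based relay would replace `relay_once` entirely.

```python
import json
import sqlite3
from typing import Any, Callable, Dict


def create_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            processed INTEGER NOT NULL DEFAULT 0
        );
    """)


def place_order(conn: sqlite3.Connection, order_id: str) -> None:
    """Write the business row and the event in ONE transaction."""
    with conn:  # commits both inserts together, or rolls both back on error
        conn.execute("INSERT INTO orders (id, status) VALUES (?, 'placed')",
                     (order_id,))
        conn.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                     ("order.placed", json.dumps({"order_id": order_id})))


def relay_once(conn: sqlite3.Connection,
               publish: Callable[[str, Dict[str, Any]], None]) -> int:
    """Publish every unprocessed event, then mark it; return the count sent."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE processed = 0 ORDER BY id").fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))  # if this raises, row stays pending
        with conn:
            conn.execute("UPDATE outbox SET processed = 1 WHERE id = ?",
                         (row_id,))
    return len(rows)
```

Note where the duplicate can arise: if the process dies after `publish` returns but before the `UPDATE` commits, the next relay pass publishes the same event again. That is the at-least-once publish the text describes, and it is the reason the consumer still deduplicates on the event ID.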
Transactional outbox
Write the event to an outbox table in the same transaction as the business change, then relay it to the broker. Guarantees at-least-once publish without distributed two-phase commit. CDC tools like Debezium stream the outbox shortly after commit.
Durable execution
Temporal and step-based platforms persist progress and replay completed steps after a crash, so a long workflow survives process restarts without repeating finished work. The orchestration logic gets exactly-once execution semantics.
Saga with compensation
For multi-step flows that must roll back partial progress, the saga defines a compensating action per step. In Temporal, the saga simplifies to a try-catch block where compensations are the rollback actions in the catch clause.
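The try-catch shape of a saga can be shown without any workflow engine. The sketch below is a generic Python illustration and not Temporal's SDK: `run_saga` is a hypothetical helper that runs each action, remembers its compensation, and on failure undoes the completed steps in reverse order.

```python
from typing import Callable, List, Tuple

# Each step pairs an action with the compensation that undoes it.
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> None:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Callable[[], None]] = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)  # registered only after success
    except Exception:
        for compensate in reversed(completed):
            compensate()
        raise  # surface the original failure to the caller
```

Registering the compensation only after its action succeeds means the failing step itself is never compensated, so each action is assumed to be atomic on its own. In a durable-execution engine the compensations would also be retried until they succeed; this sketch omits that.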
Backbone choice matters here too. BullMQ, the most common Node.js job queue, is built on Redis, which is why teams adopting it usually need to get the fundamentals of Redis as a queue backbone right before scaling: Redis durability settings directly affect whether enqueued jobs survive a restart.
06 — Backpressure
Insurmountable backlogs and how to avoid them.
Queues fail in a particular, predictable way. Amazon's engineering account "Avoiding Insurmountable Queue Backlogs" describes queue systems as bimodal: a fast mode where latency stays low because the backlog is clear, and a slow mode where latency grows continuously because work arrives faster than it drains. The unforgiving part is recovery — climbing out of a slow-mode event requires roughly double the processing capacity for the entire duration of the backlog, because you must drain the accumulated work while still keeping up with new arrivals.
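The "roughly double" figure falls out of simple rate arithmetic. The sketch below is a back-of-envelope model with hypothetical function names, assuming steady arrival and completion rates. A backlog only shrinks by the surplus of completions over arrivals, and the concurrency needed just to keep up is the arrival rate multiplied by the time each task takes.

```python
import math


def min_workers_to_keep_up(arrivals_per_minute: float,
                           minutes_per_task: float) -> int:
    """Concurrent workers needed so the queue does not grow (steady state)."""
    return math.ceil(arrivals_per_minute * minutes_per_task)


def drain_minutes(backlog: float, arrivals_per_minute: float,
                  completions_per_minute: float) -> float:
    """Minutes to clear a backlog while new work keeps arriving."""
    surplus = completions_per_minute - arrivals_per_minute
    if surplus <= 0:
        return math.inf  # at or below the arrival rate, the backlog never clears
    return backlog / surplus
```

With 10 arrivals per minute and a backlog of 600 jobs, doubling capacity to 20 completions per minute clears the backlog in 600 / (20 − 10) = 60 minutes. Running at exactly the arrival rate clears nothing, which is why a fleet sized only for steady state cannot recover on its own.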
Celery's documentation puts the same dynamic in plainer terms, and it is the line every engineer running workers should internalise.
"If a task takes 10 minutes to complete, and there are 10 new tasks coming in every minute, the queue will never be empty." — Celery Documentation, Optimizing
Two operational levers prevent and contain these events. The first is worker tuning: Celery's worker_prefetch_multiplier should be set to 1 for long-running tasks so each worker reserves only one task at a time, and raised to roughly 50–150 for short, high-throughput tasks; mixed workloads belong on separate worker nodes with distinct configurations. Process recycling via worker_max_tasks_per_child and worker_max_memory_per_child contains memory bloat, though setting them too low makes workers spend more time restarting than working.
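As a configuration sketch, the two worker profiles might look like the following. The setting names are Celery's. The recycling thresholds are illustrative placeholders rather than recommendations, and the dictionary names are hypothetical.

```python
# Profile for nodes that run long tasks: reserve one task at a time.
LONG_TASK_WORKER = {
    "worker_prefetch_multiplier": 1,
    "worker_max_tasks_per_child": 100,       # illustrative value, tune per workload
    "worker_max_memory_per_child": 200_000,  # kilobytes; illustrative value
}

# Profile for nodes that run short, high-throughput tasks: prefetch in bulk.
SHORT_TASK_WORKER = {
    "worker_prefetch_multiplier": 50,         # the text suggests roughly 50-150
    "worker_max_tasks_per_child": 10_000,     # illustrative value
}

# Each profile would be applied on its own worker node, for example with
# app.conf.update(LONG_TASK_WORKER), where app is that node's Celery application.
```

Keeping the profiles as separate dictionaries makes the "mixed workloads get separate nodes" rule explicit: a node loads exactly one profile and consumes only the queues that match it.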
The second lever is failure isolation. Shuffle-sharding routes each customer to a small, randomly-assigned subset of queues, so when one customer's queue backs up, its neighbours are statistically unaffected — failure isolation without dedicated per-customer infrastructure. This is the queue-layer cousin of API throttling; the same principles appear in our reference on API rate-limiting patterns. BullMQ's own global rate limiter applies the same idea at the queue level: a { max: 10, duration: 1000 } cap holds queue-wide regardless of worker count, so ten workers still process at most ten jobs per second across the whole queue.
Capacity to drain a backlog
Per AWS, recovering from a slow-mode backlog requires roughly double the processing capacity for the backlog's full duration — you drain the accumulated work while still serving new arrivals.
Celery prefetch multiplier
Set worker_prefetch_multiplier to 1 for long-running tasks so each worker reserves a single task at a time; raise it to ~50–150 for short, high-throughput jobs. Mixed workloads get separate nodes.
Shuffle-sharding subsets
Routing each customer to a small random subset of queues means one customer's backlog rarely touches its neighbours — failure isolation without per-customer infrastructure.
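A deterministic version of that assignment fits in a few lines. The sketch below is one possible scheme and not a description of any particular AWS implementation: it hashes a customer ID into a seed and samples a fixed-size subset of queue indexes, so every process computes the same shard without shared state. `shard_for` is a hypothetical name.

```python
import hashlib
import random
from typing import Tuple


def shard_for(customer_id: str, queue_count: int, shard_size: int) -> Tuple[int, ...]:
    """Map a customer to a stable pseudo-random subset of queue indexes."""
    if not 0 < shard_size <= queue_count:
        raise ValueError("shard_size must be between 1 and queue_count")
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")  # stable across processes
    rng = random.Random(seed)
    return tuple(sorted(rng.sample(range(queue_count), shard_size)))
```

With 8 queues and shards of size 2 there are 28 distinct shards, so two customers picked at random share their entire shard only occasionally. Larger queue counts and shard sizes make full overlap rarer still, which is where the isolation comes from.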
07 — Tool Selection
A seven-tool selection matrix for 2026.
The right tool is the one whose delivery semantics and operational model fit your workload — not the one that matches your primary language by reflex. The matrix below maps seven of the most common choices in 2026 against the dimensions that actually drive the decision: whether they are self-hostable, whether they offer durable step-level execution, and what kind of workload they suit. If your jobs are reactions to domain events rather than direct calls, read this alongside our reference on event-driven architecture and message queues, which frames the broader async picture this matrix sits inside.
| Tool | Model & hosting | Best for |
|---|---|---|
| BullMQ | At-least-once · Node.js · self-hosted on Redis | TypeScript and Node teams that already run Redis. Rich retry, rate-limit, and scheduling APIs; you own the infrastructure and the operations burden. |
| Celery | At-least-once · Python · self-hosted (Redis/RabbitMQ) | Python services needing mature worker tuning: prefetch control, process recycling, broad broker support. The default for Django and FastAPI background work. |
| Sidekiq | At-least-once · Ruby · self-hosted on Redis | Ruby and Rails applications. Thread-based, fast, and battle-tested; keep job arguments to simple JSON-serializable primitives. |
| Inngest | Durable steps · multi-language · managed (self-host option) | Serverless and event-driven teams that want step-level durability without running Redis. Completed steps replay from saved state on retry. |
| Trigger.dev | Long-running compute · TypeScript · managed (self-host option) | Jobs that need to run for minutes or hours. v3 runs on dedicated compute rather than serverless functions, removing the function timeout ceiling. |
| Temporal | Exactly-once workflow logic · multi-language · self-host or cloud | Complex, long-lived, multi-service orchestration and sagas. Exactly-once execution of workflow logic with at-least-once activities and built-in compensation. |
| Vercel Workflows | Durable steps · TypeScript · managed on Vercel | Vercel-hosted apps needing pause/resume that maintains state for minutes to months, beyond the function duration limits and without separate queue infrastructure. |
Read the matrix in two passes. First, the self-hosted tools (BullMQ, Celery, Sidekiq) all give you at-least-once delivery and demand that you run and monitor the broker, typically Redis. They are the economical default once you already operate that infrastructure. Second, the durable-execution tools (Inngest, Trigger.dev, Temporal, Vercel Workflows) trade that operational burden for step-level replay and, in Temporal's case, exactly-once orchestration. The cost is a different pricing and vendor-dependency profile, which the next section addresses directly.
08 — Serverless Platforms
When to reach for managed job platforms.
Serverless job platforms exist to delete the Redis-operations problem. Inngest uses a step-based durable execution model: each step.run() call is persisted after it succeeds, so on retry the function re-runs from the top, but completed steps are not executed again; their results are replayed from saved state and no finished work repeats. That step-level durability is the genuine differentiator over a plain queue, and it is confirmed directly in Inngest's own documentation. Trigger.dev v3 takes a different route: it runs jobs on dedicated long-running compute rather than serverless functions, which lifts the function timeout ceiling and lets a single job run for minutes or hours.
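The replay behaviour is easier to see in miniature. The sketch below is a toy model in Python and not Inngest's SDK: `StepRunner` is a hypothetical class that memoises each step's result under its ID, and a plain dictionary stands in for the platform's durable storage.

```python
from typing import Any, Callable, Dict


class StepRunner:
    """Toy model of step-level durability: results are saved by step ID."""

    def __init__(self, saved_state: Dict[str, Any]) -> None:
        self._state = saved_state  # stand-in for durable storage (assumption)

    def run(self, step_id: str, fn: Callable[[], Any]) -> Any:
        if step_id in self._state:
            return self._state[step_id]  # completed earlier: replay, do not re-run
        result = fn()                    # may raise; nothing is saved on failure
        self._state[step_id] = result
        return result


def run_with_retries(workflow: Callable[[StepRunner], Any],
                     saved_state: Dict[str, Any], max_attempts: int) -> Any:
    """Re-run the whole function from the top until it succeeds."""
    for attempt in range(1, max_attempts + 1):
        try:
            return workflow(StepRunner(saved_state))
        except Exception:
            if attempt == max_attempts:
                raise
```

If a workflow charges a card in one step and then fails while sending an email in the next, the retry re-enters the function from the top, returns the saved charge result immediately, and only the email step executes again. The code between steps still runs on every attempt, so it should stay free of side effects.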
On the platform side, Vercel Functions (Node.js on Fluid Compute) default to a 300-second maximum duration on all plans, configurable up to 800 seconds on Pro and Enterprise, with concurrency that auto-scales to 30,000 on Hobby and Pro. Payloads above 4.5 MB return an HTTP 413 error. For work that needs to outlast those limits, Vercel documents Workflows (the Workflow DevKit) as the durable option.
"For workloads that require unlimited execution time, use Vercel Workflows, which allow your code to pause, resume, and maintain state for minutes to months without duration limits." — Vercel Documentation, Functions Limitations
The pricing trade-off is the part to reason about before you commit. Both Inngest and Trigger.dev offer a free tier in the region of tens of thousands of runs per month, with paid entry plans that are modest for low-to-moderate volume; self-hosting BullMQ instead trades that per-run cost for the comparatively small monthly cost of a managed Redis instance plus your own operations time. The economics invert at scale: per-run pricing is attractive until volume climbs into the high tens or hundreds of thousands of jobs per day, at which point the fixed cost of self-hosted Redis amortises and managed per-run billing becomes the more expensive path. Treat published vendor pricing as a starting estimate and confirm current tiers on each provider's own pricing page before you model a budget.
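The breakpoint is straightforward to model once real prices are in hand. The sketch below assumes the simplest possible pricing shape, a free allowance followed by a flat per-run price, compared against a fixed monthly cost for self-hosting. Actual vendor tiers are rarely this simple, the function names are hypothetical, and every number fed to the functions should come from the providers' current pricing pages.

```python
def managed_monthly_cost(runs: int, free_runs: int, price_per_run: float) -> float:
    """Monthly bill under an assumed 'free allowance, then flat per-run' model."""
    return max(0, runs - free_runs) * price_per_run


def break_even_runs(fixed_monthly_cost: float, free_runs: int,
                    price_per_run: float) -> float:
    """Monthly run count at which managed billing equals the fixed cost."""
    if price_per_run <= 0:
        raise ValueError("price_per_run must be positive")
    return free_runs + fixed_monthly_cost / price_per_run
```

The fixed cost should include operations time as well as the Redis bill. Leaving it out makes self-hosting look cheaper than it is and pushes the apparent breakpoint too low.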
One platform caveat for 2026: Vercel Queues is a public beta offering at-least-once delivery, not a generally-available product. It is a reasonable thing to evaluate for Vercel-native pipelines, but do not architect a production system around beta delivery semantics without reading the current Vercel documentation for what it actually guarantees today.
09 — Observability
The signals that tell you a pipeline is healthy.
Background jobs are hard to observe precisely because they are asynchronous: the request that enqueued a job is long gone by the time the work runs, and the job may hop through several queues before completing. A practical set of service-level objective signals covers most of what matters — P95 and P99 job duration, failure rate, queue depth paired with message age, retry-spike rate correlated to dependency errors, and, for scheduled work, missed-run and scheduling-skew alerts.
One metric subtlety is worth absorbing. For queue-entry lag, AWS recommends monitoring AgeOfFirstAttempt — the time from enqueue to the first delivery attempt — rather than ApproximateAgeOfOldestMessage, because the latter folds in retry noise and does not reflect true entry lag. Watching the wrong metric makes a healthy queue with normal retries look like a backlog, and a real backlog look fine.
The cross-queue tracing problem got a clean answer at QCon London 2026, where a team described embedding the originating request's start timestamp into OpenTelemetry trace state. Because trace state propagates across queue hops, any downstream span — however many queues it sits behind — can compute total elapsed time since the original request, closing the async-observability gap that plain span-to-span tracing leaves open in job pipelines.
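The idea can be illustrated with plain string handling on a W3C-style tracestate header, which is a comma-separated list of key=value members. The sketch below does not use the OpenTelemetry SDK, and the `reqstart` key is a hypothetical name. It shows only the shape of the technique: stamp the start time once, then read it back at any later hop.

```python
from typing import Dict, Optional

START_KEY = "reqstart"  # hypothetical tracestate key holding epoch milliseconds


def parse_tracestate(header: str) -> Dict[str, str]:
    """Parse 'k1=v1,k2=v2' into an ordered dict, skipping malformed members."""
    entries: Dict[str, str] = {}
    for member in header.split(","):
        key, sep, value = member.strip().partition("=")
        if sep and key:
            entries[key] = value
    return entries


def with_request_start(header: str, start_ms: int) -> str:
    """Add the start timestamp unless an upstream hop already set it."""
    entries = parse_tracestate(header)
    if START_KEY in entries:
        return header  # keep the ORIGINAL request start across hops
    merged = {START_KEY: str(start_ms), **entries}  # new members go first
    return ",".join(f"{k}={v}" for k, v in merged.items())


def elapsed_ms(header: str, now_ms: int) -> Optional[int]:
    """Total time since the originating request, or None if never stamped."""
    raw = parse_tracestate(header).get(START_KEY)
    return None if raw is None else now_ms - int(raw)
```

Refusing to overwrite an existing stamp is the important design choice: a worker three queues downstream must measure from the first request, not from the hop that handed it the job.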
P95 / P99 job duration
Track the tail, not the average — slow jobs are where SLO breaches and timeout-induced redeliveries originate. Alert on tail regressions against a rolling baseline.
Queue depth × message age
Depth alone is ambiguous; pair it with age. Use AgeOfFirstAttempt for true entry lag rather than the oldest-message metric, which retry noise inflates.
DLQ depth + retry-spike rate
A growing dead-letter queue is the leading failure indicator. Correlate retry spikes with downstream dependency errors to separate transient blips from systemic faults.
Missed-run & skew alerts
For cron-style jobs, alert on missed runs and scheduling skew — a job that silently stops firing is invisible to duration and failure-rate metrics alone.
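A missed-run check needs only the last successful run time and the expected interval. The sketch below is a minimal illustration with hypothetical function names and a grace period as an assumed parameter. It counts how many scheduled runs are overdue and measures how late a run started.

```python
def missed_runs(last_run_s: float, now_s: float,
                interval_s: float, grace_s: float) -> int:
    """Count scheduled runs whose deadline (due time + grace) has passed."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    overdue = (now_s - last_run_s - grace_s) // interval_s
    return max(0, int(overdue))


def skew_seconds(scheduled_s: float, actual_start_s: float) -> float:
    """Positive when a run started late, negative when it started early."""
    return actual_start_s - scheduled_s
```

For a job that runs every 60 seconds with a 30-second grace period, a last run at time 0 reports no missed runs at 89 seconds, one at 100 seconds, and two at 150 seconds. Alerting when the count is at least one catches the job that silently stopped firing.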
10 — Conclusion
Design for the duplicate, monitor the dead letters.
At-least-once is the rule, so idempotency is the answer.
Reliable background processing in 2026 still rests on a premise that predates every tool in the matrix: delivery is at-least-once, so the system must be safe to repeat. Everything practical follows from taking that seriously: idempotent consumers, exponential backoff with jitter, dead-letter queues you actually watch, and a transactional outbox when an event must be published if, and only if, its data is committed.
Tool choice is downstream of the guarantee. Self-hosted queues like BullMQ, Celery, and Sidekiq give you at-least-once delivery and the operations bill that comes with running a broker. Durable-execution engines like Temporal and step-based platforms like Inngest add replayable, crash-safe orchestration — with Temporal's exactly-once guarantee correctly scoped to workflow logic, never to the activities it calls. The newest managed options trade Redis operations for per-run pricing that only pays off below a volume breakpoint you should model against real numbers.
The broader signal is that the hard part of background jobs was never the enqueue — it was the failure path. The teams that run async work well are the ones who treat the dead-letter queue as an alert, not an archive, and who can answer "what happens when this job runs twice" without hesitation. Build for the duplicate, instrument the backlog, and the queue stops being the part of the system you fear during an incident.