Development · Industry Guide · 12 min read · Published June 3, 2026

At-least-once delivery · idempotency as the answer · dead-letter queues as a diagnostic instrument

Background Jobs and Queues: 2026 Engineering Reference

Every distributed system runs on at-least-once delivery — so idempotency is not optional, it is the only safe response. This reference walks the delivery-guarantee decision tree, retry backoff, dead-letter queues, the transactional outbox, durable execution, and a seven-tool selection matrix for 2026.

Digital Applied Team
Senior engineers · Published June 3, 2026
Sources: 9 primary docs
Delivery guarantee: ≥1× (at-least-once is the default)
SQS default redrive: 10 (maxReceiveCount before DLQ)
Backoff doubling: 2^n (BullMQ exponential formula)
Tools compared: 7 (in the selection matrix)

Background job and queue patterns share one uncomfortable premise: in any distributed system, message delivery is at-least-once, which means every job you enqueue can and eventually will run more than once. The mature response is not to chase exactly-once at the broker — it is to make your consumers idempotent so a duplicate run changes nothing.

That single fact reshapes how you think about retries, dead-letter queues, and tool selection. A dead-letter queue is not a safety net that quietly absorbs failures; it is a diagnostic instrument whose depth is a leading indicator that your processing service-level objective is broken. And the serverless job platforms that remove Redis operations also introduce per-run pricing that can cut against high-volume workloads.

This guide is a working reference. It covers the delivery-guarantee decision tree, exponential backoff with concrete wait times, the transactional outbox pattern, durable execution and its exactly-once nuance, backpressure and queue backlogs, a seven-tool selection matrix, and the observability signals that tell you a pipeline is healthy. Every claim below is drawn from primary documentation and named engineering sources.

Key takeaways
  1. At-least-once is the architectural default, not a bug. The Sidekiq wiki states it plainly: jobs execute at least once, not exactly once, and even a completed job can re-run. Treat duplicate delivery as a given and design for it.
  2. Idempotency is the only durable answer to duplicates. BullMQ's guidance is the test: the final state of the system should not differ whether a job succeeds on its first attempt or fails and succeeds on retry. Use an idempotency key plus a deduplication store.
  3. Exponential backoff with jitter prevents retry storms. BullMQ's exponential strategy uses 2^(attempts-1) * delay, optionally multiplied by a 0–1 jitter float. Doubling spreads load; jitter breaks the thundering-herd synchronisation.
  4. The DLQ is a monitor, not a dustbin. AWS SQS redrives a message after ApproximateReceiveCount exceeds maxReceiveCount (default 10). A growing DLQ is the leading signal that processing is failing, so alert on its depth.
  5. Pick the tool by delivery semantics, not by popularity. BullMQ, Celery, and Sidekiq give you at-least-once with self-hosted infrastructure; Temporal and Inngest add durable, step-level execution. Match the guarantee to the workload before the language.

01 · Delivery Guarantees · At-least-once is the rule, not the exception.

Start every queue design from the guarantee, because it dictates everything downstream. Three delivery semantics exist in theory: at-most-once (fire and forget, drops on failure), at-least-once (redelivers until acknowledged, may duplicate), and exactly-once (each message processed precisely once). In practice, almost every production broker — Amazon SQS, Redis-backed BullMQ, Sidekiq, Celery — defaults to at-least-once, because it is the only one that survives network partitions and worker crashes without silently losing work.

The Sidekiq wiki frames this as an architectural given rather than a limitation: a job can be re-run even after it has completed, because the worker may crash after finishing the work but before acknowledging it. The redelivery that follows is correct behaviour — the broker has no way to know the side effect already happened.

"Sidekiq will execute your job at least once, not exactly once. Even a job which has completed can be re-run."— Sidekiq Wiki, Best Practices

This is why exactly-once delivery, in the strict end-to-end sense, is generally considered impractical to guarantee across a network: the acknowledgement that would confirm a single delivery can itself be lost. What systems like Temporal achieve is exactly-once execution of orchestration logic — covered in section 05 — built precisely on top of an at-least-once substrate. The pragmatic stance for application engineers is therefore not to fight at-least-once but to absorb it, which is exactly what idempotency does.

Why this matters first
If you remember one thing from this reference, make it this: design your consumer so a repeat delivery is a no-op, rather than trying to force the broker to deliver only once. The first is tractable engineering; the second is a distributed-systems impossibility you will lose time and money chasing.

02 · Idempotency · The delivery-guarantee decision tree nobody publishes.

Most references stop at "make it idempotent." That advice is correct but incomplete, because not every operation can be made idempotent cheaply. The useful framing is a branching decision: how hard is it to make this operation safe to repeat, and what do you do when it is genuinely impossible? BullMQ's own idempotency guidance sets the bar — it should make no difference to the final state of the system whether a job completes on its first attempt or fails and succeeds on retry.

That standard, applied honestly, produces three distinct strategies. For a cheaply-idempotent operation (setting a record to a known state), an idempotency key plus a deduplication check is enough. For an operation with an external, hard-to-reverse side effect (charging a card, sending an email), you need a guaranteed publish via the transactional outbox, which is at-least-once rather than exactly-once, and still an idempotent consumer on the other side. For a multi-step process that spans services, you need a saga with compensation, which durable-execution engines model directly.

Branch 1 · Cheaply idempotent operation

The work can be made safe to repeat at low cost: a state-set, an upsert, a key-scoped write. Attach an idempotency key, check a deduplication store before acting, and let at-least-once redelivery be harmless.
At-least-once + idempotency key

Branch 2 · Irreversible side effect

The operation cannot be cheaply undone: charging a card, sending a notification. Publish the event through a transactional outbox so that emission is guaranteed (at least once), and still deduplicate on the consumer using the event ID.
Transactional outbox + idempotent consumer

Branch 3 · Multi-service saga

The flow spans services and must roll back partial progress on failure. Model it as a saga with compensating transactions; durable-execution engines such as Temporal let the compensation live in a plain catch block.
Durable execution + compensations

Anti-pattern · Chasing exactly-once at the broker

Configuring the broker to never redeliver is the trap. End-to-end exactly-once delivery is impractical across a network; effort spent here is effort not spent on the consumer-side idempotency that actually closes the gap.
Avoid

The deduplication store is the load-bearing detail teams skip. An idempotency key is only useful if a fast, durable store records "this key has been processed" before (or atomically with) the side effect. A common implementation puts the key in Redis or a unique-constrained database column with a sensible time-to-live, so the second delivery short-circuits on the duplicate check. The same discipline applies one layer up, in idempotency and retry strategies for webhooks, where inbound events arrive with the same at-least-once guarantee.
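The duplicate check described above can be sketched in a few lines. This is a minimal illustration, assuming an in-memory dictionary as a stand-in for Redis or a unique-constrained column; `DedupStore`, `handle_credit`, and the job fields are hypothetical names, not part of any library.

```python
import time
from typing import Callable, Dict


class DedupStore:
    """In-memory stand-in for Redis or a unique-constrained table (hypothetical)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def claim(self, key: str) -> bool:
        """Return True only for the first caller to claim an unexpired key."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate delivery: the key is already recorded
        self._expires_at[key] = now + self._ttl
        return True


balances: Dict[str, int] = {"acct-1": 0}
store = DedupStore(ttl_seconds=24 * 3600)


def handle_credit(job: dict) -> str:
    """Apply a credit once per idempotency key; repeat deliveries are no-ops."""
    if not store.claim(job["idempotency_key"]):
        return "skipped"
    balances[job["account"]] += job["amount"]
    return "applied"


job = {"idempotency_key": "credit-42", "account": "acct-1", "amount": 500}
first = handle_credit(job)
second = handle_credit(job)  # simulated at-least-once redelivery
```

Note the gap this sketch leaves open: the key is claimed before the side effect, so a crash between the two would record the key without applying the credit. With a real database, writing the key and the balance change in one transaction closes that gap.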

03 · Retries · Backoff: exponential with jitter, and when to stop.

Once you accept retries, the next decision is the wait curve between them. BullMQ ships two built-in strategies: a fixed interval and an exponential one that follows the formula 2^(attempts-1) * delay milliseconds. Both can be jittered with a 0–1 float multiplied against the computed delay. The exponential curve is the right default because it gives a transient dependency room to recover while bounding the total retry load, and jitter is what stops a fleet of workers from retrying in lockstep and re-creating the spike that caused the failure.

The wait times compound quickly. With a one-second base, the gaps roughly double each attempt — about one, two, four, eight, sixteen, thirty-two, sixty-four, and one hundred twenty-eight seconds across the first eight tries. The chart below makes the trade-off concrete: an aggressive base clears transient blips fast but risks hammering a struggling dependency, while a longer base is gentler but leaves work sitting in the queue.

Exponential backoff wait time · 1s base, attempts 1–8

Source: BullMQ exponential formula 2^(attempts-1) × delay

Attempt 1 · 2^0 × 1s base · 1s
Attempt 2 · 2^1 × 1s base · 2s
Attempt 3 · 2^2 × 1s base · 4s
Attempt 4 · 2^3 × 1s base · 8s
Attempt 5 · 2^4 × 1s base · 16s
Attempt 6 · 2^5 × 1s base · 32s
Attempt 7 · 2^6 × 1s base · 64s
Attempt 8 · 2^7 × 1s base · 128s
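The wait times above follow directly from the formula, and a short sketch makes the arithmetic reproducible. The exponential function mirrors the 2^(attempts-1) × delay formula quoted in the text; the jitter function is one common scheme (randomising the top fraction of the delay) and is an assumption, not a statement of how BullMQ implements its jitter option.

```python
import random
from typing import List


def exponential_delay_ms(attempts_made: int, base_delay_ms: int) -> int:
    """Delay before the next retry: 2^(attempts - 1) * base delay."""
    if attempts_made < 1:
        raise ValueError("attempts_made starts at 1")
    return (2 ** (attempts_made - 1)) * base_delay_ms


def jittered_delay_ms(delay_ms: int, jitter: float, rng: random.Random) -> int:
    """Randomise the top `jitter` fraction of the delay (assumed scheme)."""
    if not 0.0 <= jitter <= 1.0:
        raise ValueError("jitter must be between 0 and 1")
    floor = delay_ms * (1.0 - jitter)
    return int(floor + rng.random() * delay_ms * jitter)


# One-second base, attempts 1 through 8.
schedule: List[int] = [exponential_delay_ms(n, 1000) for n in range(1, 9)]
total_wait_ms = sum(schedule)

# With jitter 0.5, each worker waits somewhere in the upper half of the delay.
spread = jittered_delay_ms(schedule[3], 0.5, random.Random(7))
```

The eight waits sum to 255 seconds, which is the retry budget a job consumes before it is declared failed under this configuration.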

Backoff also needs an exit. BullMQ exposes a custom backoffStrategy hook where returning -1 moves the job straight to the failed state, bypassing further retries, while returning 0 sends it to the back of the waiting list. That control matters because not all failures deserve retries: a malformed payload or a 4xx-class permanent error should fail fast to the dead-letter queue rather than burn the full retry budget on a request that can never succeed.
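The fail-fast decision can be sketched as a function that follows the same return-value convention (−1 to stop retrying, 0 to re-queue at once, a positive number as the delay). This is a language-neutral illustration in Python, not BullMQ's API; the error classes and the status-code classification are assumptions made for the example.

```python
class PermanentError(Exception):
    """A failure that no retry can fix, such as a malformed payload."""


class TransientError(Exception):
    """A failure that may clear on its own, such as a timeout."""


FAIL_NOW = -1  # skip the remaining retries


def classify_http_status(status: int) -> Exception:
    """Assumed policy: 4xx is permanent, except 408 and 429 which may clear."""
    if 400 <= status < 500 and status not in (408, 429):
        return PermanentError(f"HTTP {status}")
    return TransientError(f"HTTP {status}")


def backoff_strategy(attempts_made: int, err: Exception, base_delay_ms: int = 1000) -> int:
    """Return -1 for permanent errors, otherwise an exponential delay in ms."""
    if isinstance(err, PermanentError):
        return FAIL_NOW
    return (2 ** (attempts_made - 1)) * base_delay_ms


bad_request = backoff_strategy(1, classify_http_status(422))
throttled = backoff_strategy(3, classify_http_status(429))
```

Keeping the classification separate from the delay calculation makes it easy to review which errors are treated as non-retryable.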

Default redrive · SQS maxReceiveCount · 10

Amazon SQS moves a message to the dead-letter queue once ApproximateReceiveCount exceeds maxReceiveCount. Setting it to 1 means a single transient failure routes straight to the DLQ, which is usually too aggressive.
Configurable per redrive policy

Backoff base · BullMQ exponential · 2^n

The 2^(attempts-1) × delay formula doubles the wait each attempt. Pair it with a 0–1 jitter float so a fleet of workers does not retry in synchronised waves.
Fixed strategy also available

Fail-fast hook · Skip remaining retries · −1

A backoffStrategy returning −1 sends the job directly to failed; returning 0 re-queues it at the back of the waiting list. Use −1 for permanent, non-retryable errors.
Custom strategy return value

04 · Dead-Letter Queues · The DLQ is a monitor, not a dustbin.

The dead-letter channel is one of the oldest patterns in messaging. Gregor Hohpe and Bobby Woolf named it in Enterprise Integration Patterns back in 2003: when a system decides it cannot or should not deliver a message, it moves the message aside rather than dropping or endlessly redelivering it. The mistake teams make in 2026 is treating that side channel as a graveyard — somewhere failures go to be forgotten — when its real job is to be watched.

"When a messaging system determines that it cannot or should not deliver a message, it may elect to move the message to a Dead Letter Channel."— Gregor Hohpe & Bobby Woolf, Enterprise Integration Patterns

The mechanics are worth getting exactly right. In Amazon SQS, a message lands in the DLQ when its ApproximateReceiveCount exceeds the maxReceiveCount set in the redrive policy (ten by default). There are two configuration traps. First, on standard queues a message's expiration is measured from its original enqueue timestamp even after it moves to the DLQ, so the DLQ's retention period must always exceed the source queue's retention, or dead-lettered messages can expire before anyone inspects them. Second, queue type must match: a standard queue cannot use a FIFO queue as its dead-letter target, and the reverse is equally rejected with InvalidParameterValue.

The reframe that changes operations: a dead-letter queue is an SLO boundary. A non-zero, growing DLQ depth is the leading indicator that your processing objective is broken — not a backlog to clear quarterly, but an alert to fire now. Pair DLQ-depth monitoring with a triage runbook: classify each message as a transient failure to replay, a permanent error to discard, or a poison message that needs a code fix before any replay is safe.

DLQ configuration checklist
DLQ retention must exceed the source queue's retention. Queue types must match (standard-to-standard, FIFO-to-FIFO). Set maxReceiveCount high enough to absorb transient blips but low enough to surface real failures; three to five is a common starting point rather than the default ten. And alert on DLQ depth, because an unwatched dead-letter queue is just data loss with extra steps.
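The checklist can be turned into a small pre-deployment check. The sketch below builds the JSON for the SQS RedrivePolicy queue attribute (its deadLetterTargetArn and maxReceiveCount keys are the documented ones) and refuses the two traps described above. `QueueSpec`, the helper names, and the ARNs are hypothetical placeholders.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueSpec:
    arn: str
    fifo: bool
    retention_seconds: int


def build_redrive_policy(source: QueueSpec, dlq: QueueSpec, max_receive_count: int) -> str:
    """Validate the checklist, then return the RedrivePolicy attribute as JSON."""
    if source.fifo != dlq.fifo:
        raise ValueError("queue types must match (standard-to-standard, FIFO-to-FIFO)")
    if dlq.retention_seconds <= source.retention_seconds:
        raise ValueError("DLQ retention must exceed the source queue's retention")
    if max_receive_count < 1:
        raise ValueError("maxReceiveCount must be at least 1")
    return json.dumps(
        {"deadLetterTargetArn": dlq.arn, "maxReceiveCount": max_receive_count}
    )


def should_dead_letter(approximate_receive_count: int, max_receive_count: int) -> bool:
    """A message moves to the DLQ once its receive count exceeds the limit."""
    return approximate_receive_count > max_receive_count


# Placeholder ARNs for illustration only.
orders = QueueSpec("arn:aws:sqs:us-east-1:123456789012:orders", False, 4 * 86400)
orders_dlq = QueueSpec("arn:aws:sqs:us-east-1:123456789012:orders-dlq", False, 14 * 86400)
policy = build_redrive_policy(orders, orders_dlq, max_receive_count=5)
```

Running a check like this in CI is one way to catch a retention or queue-type mismatch before it reaches production; the alert on DLQ depth still has to be configured separately.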

05 · Guaranteed Delivery · The transactional outbox and durable execution.

For the irreversible-side-effect branch of the decision tree, the transactional outbox is the canonical pattern. The problem it solves: you cannot atomically update your database and publish a message to a broker in a single transaction, so a crash between the two can leave them inconsistent. The outbox sidesteps this by writing the event to an outbox table inside the same database transaction as the business operation. A background relay then reads from the outbox, publishes the event, and marks the record processed — guaranteed delivery without a fragile two-phase commit across systems.

The modern implementation reads the outbox via change-data-capture rather than polling. Debezium is the widely-used CDC tool here, streaming committed rows to the broker shortly after the database commit. The trade-off to understand is that the outbox guarantees the event is published at least once — so the downstream consumer still has to be idempotent. The outbox solves the publish-atomicity problem; it does not eliminate the duplicate-processing problem.
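A minimal sketch of the outbox, using SQLite from the standard library and a polling relay for brevity (the CDC variant replaces the polling loop). Table names, column names, and the `publish` callback are assumptions made for illustration.

```python
import json
import sqlite3
from typing import Callable, List, Tuple

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL);
    CREATE TABLE outbox (
        event_id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
    """
)


def place_order(order_id: str, total_cents: int) -> None:
    """Write the business row and the event row in one transaction."""
    with conn:  # commits on success, rolls back if either insert fails
        conn.execute("INSERT INTO orders (id, total_cents) VALUES (?, ?)", (order_id, total_cents))
        event = {"type": "order_placed", "order_id": order_id}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay_once(publish: Callable[[int, dict], None]) -> int:
    """Publish unpublished events in order, marking each one afterwards."""
    rows = conn.execute(
        "SELECT event_id, payload FROM outbox WHERE published = 0 ORDER BY event_id"
    ).fetchall()
    for event_id, payload in rows:
        publish(event_id, json.loads(payload))  # a crash here means a re-publish later
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,))
    return len(rows)


sent: List[Tuple[int, str]] = []
place_order("o-1", 1200)
first_pass = relay_once(lambda eid, ev: sent.append((eid, ev["order_id"])))
second_pass = relay_once(lambda eid, ev: sent.append((eid, ev["order_id"])))
```

Because the relay marks a row only after publishing, a crash between the two steps republishes the event on the next pass. That is the at-least-once behaviour the text describes, and the reason the consumer should deduplicate on the event ID.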

Pattern · Transactional outbox · DB write + outbox row · one transaction

Write the event to an outbox table in the same transaction as the business change, then relay it to the broker. Guarantees at-least-once publish without distributed two-phase commit. CDC tools like Debezium stream the outbox shortly after commit.
Consumer must still be idempotent

Engine · Durable execution · code-as-workflow · automatic replay

Temporal and step-based platforms persist progress and replay completed steps after a crash, so a long workflow survives process restarts without repeating finished work. The orchestration logic gets exactly-once execution semantics.
Activities remain at-least-once

Composition · Saga with compensation · try / catch rollback

For multi-step flows that must roll back partial progress, the saga defines a compensating action per step. In Temporal, the saga simplifies to a try-catch block where compensations are the rollback actions in the catch clause.
Cross-service consistency

Preserve the nuance
Preserve the nuance
Temporal provides exactly-once execution for workflow logic and at-least-once for activities, the side-effecting steps a workflow calls. Collapsing that to "Temporal gives you exactly-once" is wrong: your activities can still run more than once, so they must be idempotent like everything else. The durability is in the orchestration, not in the side effects.
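The try/catch shape of a saga can be shown without any workflow engine. The sketch below is plain Python, not Temporal's SDK: each completed step registers its compensation, and a failure runs the registered compensations in reverse order. All step names are hypothetical.

```python
from typing import Callable, List

log: List[str] = []


def reserve_inventory() -> None:
    log.append("reserve")


def release_inventory() -> None:
    log.append("release")


def charge_card() -> None:
    log.append("charge")


def refund_card() -> None:
    log.append("refund")


def ship() -> None:
    raise RuntimeError("carrier unavailable")  # simulated failure in the last step


def place_order_saga() -> str:
    """Run steps in order; on failure, undo completed steps newest-first."""
    compensations: List[Callable[[], None]] = []
    try:
        reserve_inventory()
        compensations.append(release_inventory)
        charge_card()
        compensations.append(refund_card)
        ship()
    except Exception:
        for undo in reversed(compensations):
            undo()
        return "rolled_back"
    return "completed"


outcome = place_order_saga()
```

In a durable-execution engine the same structure survives process restarts; in this in-process sketch a crash mid-rollback would lose the compensation list, which is exactly the gap such engines exist to close. The compensations themselves should be idempotent, for the same at-least-once reasons as any other activity.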

Backbone choice matters here too. BullMQ, the most common Node.js job queue, is built on Redis, which is why teams adopting it usually need to get the fundamentals of Redis as a queue backbone right before scaling: Redis durability settings directly affect whether enqueued jobs survive a restart.

06 · Backpressure · Insurmountable backlogs and how to avoid them.

Queues fail in a particular, predictable way. Amazon's engineering account "Avoiding Insurmountable Queue Backlogs" describes queue systems as bimodal: a fast mode where latency stays low because the backlog is clear, and a slow mode where latency grows continuously because work arrives faster than it drains. The unforgiving part is recovery — climbing out of a slow-mode event requires roughly double the processing capacity for the entire duration of the backlog, because you must drain the accumulated work while still keeping up with new arrivals.
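The recovery arithmetic is worth working through once. The sketch below assumes steady arrival and processing rates, which real queues rarely have, so treat the numbers as an illustration rather than a capacity plan; the function names are hypothetical.

```python
import math


def concurrency_to_keep_up(arrivals_per_min: float, minutes_per_task: float) -> float:
    """Tasks that must be in flight at once just to match the arrival rate."""
    return arrivals_per_min * minutes_per_task


def drain_minutes(backlog: float, capacity_per_min: float, arrivals_per_min: float) -> float:
    """Time to clear a backlog while new work keeps arriving."""
    net_rate = capacity_per_min - arrivals_per_min
    if net_rate <= 0:
        return math.inf  # the queue never empties
    return backlog / net_rate


# A 30-minute outage at 10 arrivals per minute leaves 300 queued tasks.
backlog = 30 * 10
at_normal_capacity = drain_minutes(backlog, capacity_per_min=10, arrivals_per_min=10)
at_double_capacity = drain_minutes(backlog, capacity_per_min=20, arrivals_per_min=10)
needed_in_flight = concurrency_to_keep_up(arrivals_per_min=10, minutes_per_task=10)
```

With capacity equal to the arrival rate the backlog never drains; at double capacity it clears in the same 30 minutes it took to accumulate, which is where the "roughly double" figure comes from.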

Celery's documentation puts the same dynamic in plainer terms, and it is the line every engineer running workers should internalise.

"If a task takes 10 minutes to complete, and there are 10 new tasks coming in every minute, the queue will never be empty."— Celery Documentation, Optimizing

Two operational levers prevent and contain these events. The first is worker tuning: Celery's worker_prefetch_multiplier should be set to 1 for long-running tasks so each worker reserves only one task at a time, and raised to roughly 50–150 for short, high-throughput tasks; mixed workloads belong on separate worker nodes with distinct configurations. Process recycling via worker_max_tasks_per_child and worker_max_memory_per_child contains memory bloat, though setting them too low makes workers spend more time restarting than working.
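As a configuration sketch, the two worker profiles might look like the following. The setting names are Celery's; the recycling thresholds are illustrative assumptions to tune against your own memory profile, and the profile names are hypothetical.

```python
# Profile for workers that run long tasks: reserve one task at a time.
LONG_TASK_WORKER = {
    "worker_prefetch_multiplier": 1,
    # Recycle a worker process after this many tasks (assumed value).
    "worker_max_tasks_per_child": 100,
    # Recycle once resident memory passes this many kilobytes (assumed value).
    "worker_max_memory_per_child": 200_000,
}

# Profile for workers that run short, high-throughput tasks.
SHORT_TASK_WORKER = {
    "worker_prefetch_multiplier": 100,  # within the 50-150 range from the text
}
```

Each profile would typically be applied with `app.conf.update(...)` on the Celery app that a given worker node runs, with long and short tasks routed to separate queues so the two node types never compete for the same work.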

The second lever is failure isolation. Shuffle-sharding routes each customer to a small, randomly-assigned subset of queues, so when one customer's queue backs up, its neighbours are statistically unaffected — failure isolation without dedicated per-customer infrastructure. This is the queue-layer cousin of API throttling; the same principles appear in our reference on API rate-limiting patterns. BullMQ's own global rate limiter applies the same idea at the queue level: a { max: 10, duration: 1000 } cap holds queue-wide regardless of worker count, so ten workers still process at most ten jobs per second across the whole queue.
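A shuffle-shard assignment can be sketched as a deterministic pick of a few queues per customer. This is one possible construction, assuming a hash-seeded random sample; the function name and parameters are hypothetical, and production systems often add rebalancing logic that is omitted here.

```python
import hashlib
import random
from typing import List


def shard_for(customer_id: str, queue_count: int, shard_size: int) -> List[int]:
    """Pick `shard_size` of `queue_count` queues, stable for a given customer."""
    if not 0 < shard_size <= queue_count:
        raise ValueError("shard_size must be between 1 and queue_count")
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    return sorted(random.Random(seed).sample(range(queue_count), shard_size))


def route(customer_id: str, job_number: int, queue_count: int = 8, shard_size: int = 2) -> int:
    """Spread a customer's jobs across only that customer's shard."""
    shard = shard_for(customer_id, queue_count, shard_size)
    return shard[job_number % shard_size]


shard_a = shard_for("customer-a", queue_count=8, shard_size=2)
```

With 8 queues and shards of 2 there are 28 possible shards, so two customers share both queues only occasionally; a single noisy customer can saturate at most its own two queues.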

Recovery cost · Capacity to drain a backlog · 2×

Per AWS, recovering from a slow-mode backlog requires roughly double the processing capacity for the backlog's full duration: you drain the accumulated work while still serving new arrivals.
Bimodal queue behaviour

Long tasks · Celery prefetch multiplier · 1

Set worker_prefetch_multiplier to 1 for long-running tasks so each worker reserves a single task at a time; raise it to ~50–150 for short, high-throughput jobs. Mixed workloads get separate nodes.
Short tasks: 50–150

Isolation · Shuffle-sharding subsets · N

Routing each customer to a small random subset of queues means one customer's backlog rarely touches its neighbours, giving failure isolation without per-customer infrastructure.
Noisy-neighbour defence

07 · Tool Selection · A seven-tool selection matrix for 2026.

The right tool is the one whose delivery semantics and operational model fit your workload — not the one that matches your primary language by reflex. The matrix below maps seven of the most common choices in 2026 against the dimensions that actually drive the decision: whether they are self-hostable, whether they offer durable step-level execution, and what kind of workload they suit. If your jobs are reactions to domain events rather than direct calls, read this alongside our reference on event-driven architecture and message queues, which frames the broader async picture this matrix sits inside.

Tool: BullMQ
Model & hosting: At-least-once · Node.js · self-hosted on Redis
Best for: TypeScript and Node teams that already run Redis. Rich retry, rate-limit, and scheduling APIs; you own the infrastructure and the operations burden.

Tool: Celery
Model & hosting: At-least-once · Python · self-hosted (Redis/RabbitMQ)
Best for: Python services needing mature worker tuning, including prefetch control, process recycling, and broad broker support. The default for Django and FastAPI background work.

Tool: Sidekiq
Model & hosting: At-least-once · Ruby · self-hosted on Redis
Best for: Ruby and Rails applications. Thread-based, fast, and battle-tested; keep job arguments to simple JSON-serializable primitives.

Tool: Inngest
Model & hosting: Durable steps · multi-language · managed (self-host option)
Best for: Serverless and event-driven teams that want step-level durability without running Redis. Completed steps replay from saved state on retry.

Tool: Trigger.dev
Model & hosting: Long-running compute · TypeScript · managed (self-host option)
Best for: Jobs that need to run for minutes or hours. v3 runs on dedicated compute rather than serverless functions, removing the function timeout ceiling.

Tool: Temporal
Model & hosting: Exactly-once workflow logic · multi-language · self-host or cloud
Best for: Complex, long-lived, multi-service orchestration and sagas. Exactly-once execution of workflow logic with at-least-once activities and built-in compensation.

Tool: Vercel Workflows
Model & hosting: Durable steps · TypeScript · managed on Vercel
Best for: Vercel-hosted apps needing pause/resume that maintains state for minutes to months, beyond the function duration limits and without separate queue infrastructure.

Read the matrix in two passes. First, the self-hosted group (BullMQ, Celery, Sidekiq): all three give you at-least-once delivery and demand that you run and monitor the broker, typically Redis. They are the economical default once you already operate that infrastructure. Second, the durable-execution group (Inngest, Trigger.dev, Temporal, Vercel Workflows): these trade that operational burden for step-level replay and, in Temporal's case, exactly-once orchestration. The cost is a different pricing and vendor-dependency profile, which the next section addresses directly.

08 · Serverless Platforms · When to reach for managed job platforms.

Serverless job platforms exist to delete the Redis-operations problem. Inngest uses a step-based durable execution model: each step.run() call is persisted after it succeeds, so on retry the completed steps are not executed again and their results are replayed from saved state. The function re-runs from the top, but no finished work repeats. That step-level durability is the genuine differentiator over a plain queue, and it is confirmed directly in Inngest's own documentation. Trigger.dev v3 takes a different route: it runs jobs on dedicated long-running compute rather than serverless functions, which lifts the function timeout ceiling and lets a single job run for minutes or hours.
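The replay behaviour is easier to see in miniature. The following is a toy model of step memoisation in plain Python, assuming a dictionary as the persisted state; it is not Inngest's SDK, and `StepRunner` and the step functions are hypothetical.

```python
from typing import Any, Callable, Dict, List


class StepRunner:
    """Runs each named step at most once per saved state (toy model)."""

    def __init__(self, saved: Dict[str, Any]) -> None:
        self._saved = saved

    def run(self, step_id: str, fn: Callable[[], Any]) -> Any:
        if step_id in self._saved:
            return self._saved[step_id]  # replay the stored result
        result = fn()
        self._saved[step_id] = result  # persist only after success
        return result


calls: List[str] = []
email_attempts = {"count": 0}


def charge() -> str:
    calls.append("charge")
    return "ch_1"


def send_email() -> str:
    calls.append("email")
    email_attempts["count"] += 1
    if email_attempts["count"] == 1:
        raise RuntimeError("mail provider unavailable")  # first attempt fails
    return "sent"


def workflow(step: StepRunner) -> str:
    charge_id = step.run("charge", charge)
    status = step.run("email", send_email)
    return f"{charge_id}:{status}"


state: Dict[str, Any] = {}
try:
    workflow(StepRunner(state))  # first run fails at the email step
except RuntimeError:
    pass
result = workflow(StepRunner(state))  # retry re-enters from the top
```

On the retry the function body runs from the top again, but the charge step returns its saved result instead of executing a second time. A step that fails partway through is still retried as a whole, so the work inside each step should remain idempotent.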

On the platform side, Vercel Functions (Node.js on Fluid Compute) default to a 300-second maximum duration on all plans, configurable up to 800 seconds on Pro and Enterprise, with concurrency that auto-scales to 30,000 on Hobby and Pro. Payloads above 4.5 MB return an HTTP 413 error. For work that needs to outlast those limits, Vercel documents Workflows (the Workflow DevKit) as the durable option.

"For workloads that require unlimited execution time, use Vercel Workflows, which allow your code to pause, resume, and maintain state for minutes to months without duration limits."— Vercel Documentation, Functions Limitations

The pricing trade-off is the part to reason about before you commit. Both Inngest and Trigger.dev offer a free tier in the region of tens of thousands of runs per month, with paid entry plans that are modest for low-to-moderate volume; self-hosting BullMQ instead trades that per-run cost for the comparatively small monthly cost of a managed Redis instance plus your own operations time. The economics invert at scale: per-run pricing is attractive until volume climbs into the high tens or hundreds of thousands of jobs per day, at which point the fixed cost of self-hosted Redis amortises and managed per-run billing becomes the more expensive path. Treat published vendor pricing as a starting estimate and confirm current tiers on each provider's own pricing page before you model a budget.

One platform caveat for 2026: Vercel Queues is a public beta offering at-least-once delivery, not a generally-available product. It is a reasonable thing to evaluate for Vercel-native pipelines, but do not architect a production system around beta delivery semantics without reading the current Vercel documentation for what it actually guarantees today.

The cost breakpoint
A useful rule of thumb that does not appear in any vendor's docs: managed per-run platforms win on operations simplicity at low and moderate volume, while self-hosted BullMQ on Redis wins on unit economics once sustained volume climbs into the high tens or hundreds of thousands of jobs per day. Model both against your real projected volume before defaulting to either.
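Modelling the breakpoint takes only a few lines. The figures used below are placeholders chosen for round arithmetic, not any vendor's published pricing; substitute current numbers from each provider's pricing page. The function names are hypothetical.

```python
import math


def managed_monthly_cost(runs: int, free_runs: int, price_per_1k_runs: float) -> float:
    """Per-run billing: pay only for runs beyond the free tier."""
    billable = max(0, runs - free_runs)
    return billable / 1000 * price_per_1k_runs


def breakeven_runs(self_hosted_monthly_cost: float, free_runs: int, price_per_1k_runs: float) -> float:
    """Monthly volume at which managed billing equals the self-hosted fixed cost."""
    return free_runs + self_hosted_monthly_cost / price_per_1k_runs * 1000


# Placeholder inputs: 100 per month for Redis plus operations time,
# 50,000 free runs, 0.50 per thousand runs after that.
breakeven = breakeven_runs(100.0, free_runs=50_000, price_per_1k_runs=0.5)
cost_at_breakeven = managed_monthly_cost(int(breakeven), 50_000, 0.5)
```

Under these placeholder inputs the crossover sits at 250,000 runs a month. The self-hosted figure should include an honest estimate of engineering time, which usually moves the breakpoint more than the Redis bill does.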

09 · Observability · The signals that tell you a pipeline is healthy.

Background jobs are hard to observe precisely because they are asynchronous: the request that enqueued a job is long gone by the time the work runs, and the job may hop through several queues before completing. A practical set of service-level objective signals covers most of what matters — P95 and P99 job duration, failure rate, queue depth paired with message age, retry-spike rate correlated to dependency errors, and, for scheduled work, missed-run and scheduling-skew alerts.

One metric subtlety is worth absorbing. For queue-entry lag, AWS recommends monitoring AgeOfFirstAttempt, the time from enqueue to the first delivery attempt, rather than ApproximateAgeOfOldestMessage, because the latter folds in retry noise and does not reflect true entry lag. AgeOfFirstAttempt is not a built-in SQS CloudWatch metric; the consumer has to emit it. Watching the wrong metric makes a healthy queue with normal retries look like a backlog, and a real backlog look fine.
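A consumer can derive this metric from the system attributes SQS attaches to each received message (SentTimestamp, ApproximateFirstReceiveTimestamp, and ApproximateReceiveCount, all delivered as strings of epoch milliseconds or counts). The sketch below assumes those attributes were requested on receive; the function name is hypothetical.

```python
from typing import Dict, Optional


def age_of_first_attempt_ms(attributes: Dict[str, str]) -> Optional[int]:
    """Enqueue-to-first-delivery lag, or None when this is a redelivery."""
    if int(attributes["ApproximateReceiveCount"]) != 1:
        return None  # retries are excluded so they do not inflate entry lag
    sent = int(attributes["SentTimestamp"])
    first_received = int(attributes["ApproximateFirstReceiveTimestamp"])
    return first_received - sent


first_delivery = {
    "SentTimestamp": "1750000000000",
    "ApproximateFirstReceiveTimestamp": "1750000004200",
    "ApproximateReceiveCount": "1",
}
redelivery = dict(first_delivery, ApproximateReceiveCount="3")

lag = age_of_first_attempt_ms(first_delivery)
skipped = age_of_first_attempt_ms(redelivery)
```

Emitting the value only on the first receive is what keeps the metric free of retry noise; redeliveries are better tracked through a separate retry-rate signal.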

The cross-queue tracing problem got a clean answer at QCon London 2026, where a team described embedding the originating request's start timestamp into OpenTelemetry trace state. Because trace state propagates across queue hops, any downstream span — however many queues it sits behind — can compute total elapsed time since the original request, closing the async-observability gap that plain span-to-span tracing leaves open in job pipelines.
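The idea can be sketched with plain string handling on a W3C tracestate header value, which is a comma-separated list of key=value members. The member key `origin` and both helper names are assumptions made for this illustration; a production implementation would go through the OpenTelemetry SDK's trace-state type rather than raw strings.

```python
from typing import Optional


def with_origin_start(tracestate: str, start_ms: int, key: str = "origin") -> str:
    """Add or replace the origin-start member, placing it first in the list."""
    members = [m.strip() for m in tracestate.split(",") if m.strip()]
    others = [m for m in members if m.partition("=")[0] != key]
    return ",".join([f"{key}={start_ms}"] + others)


def elapsed_since_origin_ms(tracestate: str, now_ms: int, key: str = "origin") -> Optional[int]:
    """Total time since the originating request, if the member is present."""
    for member in tracestate.split(","):
        name, sep, value = member.strip().partition("=")
        if sep and name == key:
            return now_ms - int(value)
    return None


# The enqueuing service stamps the start time once.
state_at_enqueue = with_origin_start("vendor=abc", 1_750_000_000_000)

# Any later hop can compute end-to-end latency without knowing the path taken.
elapsed = elapsed_since_origin_ms(state_at_enqueue, now_ms=1_750_000_012_500)
```

Because the value travels with the trace context, the third or fourth queue hop can report total elapsed time directly instead of summing span durations across services.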

Latency · P95 / P99 job duration

Track the tail, not the average: slow jobs are where SLO breaches and timeout-induced redeliveries originate. Alert on tail regressions against a rolling baseline.
Tail-latency SLO

Backlog · Queue depth × message age

Depth alone is ambiguous; pair it with age. Use AgeOfFirstAttempt for true entry lag rather than the oldest-message metric, which retry noise inflates.
Depth + AgeOfFirstAttempt

Failures · DLQ depth + retry-spike rate

A growing dead-letter queue is the leading failure indicator. Correlate retry spikes with downstream dependency errors to separate transient blips from systemic faults.
DLQ-depth alert

Schedules · Missed-run & skew alerts

For cron-style jobs, alert on missed runs and scheduling skew, because a job that silently stops firing is invisible to duration and failure-rate metrics alone.
Heartbeat per schedule

10 · Conclusion · Design for the duplicate, monitor the dead letters.

The shape of reliable async work, 2026

At-least-once is the rule, so idempotency is the answer.

Reliable background processing in 2026 still rests on a premise that predates every tool in the matrix: delivery is at-least-once, so the system must be safe to repeat. Everything practical follows from taking that seriously: idempotent consumers, exponential backoff with jitter, dead-letter queues you actually watch, and a transactional outbox when an event must be published if and only if its data is committed.

Tool choice is downstream of the guarantee. Self-hosted queues like BullMQ, Celery, and Sidekiq give you at-least-once delivery and the operations bill that comes with running a broker. Durable-execution engines like Temporal and step-based platforms like Inngest add replayable, crash-safe orchestration — with Temporal's exactly-once guarantee correctly scoped to workflow logic, never to the activities it calls. The newest managed options trade Redis operations for per-run pricing that only pays off below a volume breakpoint you should model against real numbers.

The broader signal is that the hard part of background jobs was never the enqueue — it was the failure path. The teams that run async work well are the ones who treat the dead-letter queue as an alert, not an archive, and who can answer "what happens when this job runs twice" without hesitation. Build for the duplicate, instrument the backlog, and the queue stops being the part of the system you fear during an incident.

FAQ · Background jobs & queues

The questions we get every week.

What is the difference between at-least-once and exactly-once delivery?

At-least-once delivery means a message is redelivered until it is acknowledged, so it may be processed more than once — this is the default for almost every production broker, including Amazon SQS, BullMQ on Redis, Sidekiq, and Celery, because it survives network partitions and worker crashes without losing work. Exactly-once delivery means each message is processed precisely once, which is generally impractical to guarantee end-to-end across a network because the acknowledgement that would confirm a single delivery can itself be lost. The practical engineering response is not to chase exactly-once at the broker but to make consumers idempotent, so a duplicate delivery changes nothing. Systems like Temporal achieve exactly-once execution of orchestration logic by building on top of an at-least-once substrate, not by eliminating it.