Event-driven architecture is the pattern where services communicate by emitting and reacting to events rather than calling each other directly. Done well, it buys loose coupling, asynchronous scale, and independent failure domains. Done carelessly, it scatters business logic across invisible event chains and trades synchronous bugs you can read for asynchronous ones you cannot. This reference is about getting the first outcome.
The hard parts of event-driven systems are rarely the broker. They are conceptual: distinguishing the four event patterns that get conflated, choosing between a queue, a pub/sub topic, and a durable stream, and accepting that "exactly-once" delivery is a myth at the network layer. Get those three right and the vendor choice becomes a relatively mechanical decision.
This guide separates Martin Fowler's four EDA patterns, maps the three messaging primitives to the services that implement them, explains where the delivery guarantee boundary actually sits, and walks the reliability patterns — transactional outbox, saga, dead letter queues — that keep production event systems honest. Every number is sourced; where a figure is configuration-specific, it is marked as such.
- 01 Events beat requests for loose coupling and async scale. Use event-driven design when producers and consumers should evolve independently, when load is spiky, and when one slow consumer must not block the rest. Keep synchronous request/response where you need an immediate answer in the same call.
- 02 Four EDA patterns get conflated — keep them distinct. Event notification, event-carried state transfer, event sourcing, and CQRS each carry different trade-offs. Martin Fowler named them in a January 2017 article and the taxonomy still holds; mixing them up is the root of most architectural regret.
- 03 Queue, pub/sub, and stream are three different primitives. A queue delivers each message to one consumer and deletes it. Pub/sub fans one message out to many active subscribers. An event stream is a durable, ordered, replayable log many consumer groups can re-read. Vendors implement different subsets.
- 04 Exactly-once delivery is a myth at the network layer. The Two Generals Problem makes true exactly-once delivery impossible across a network. The production answer is at-least-once delivery plus idempotent consumers — 'effectively exactly-once.' Kafka Streams is a narrow internal exception, not a network guarantee.
- 05 Kafka 4.0 removed ZooKeeper; pick brokers per workload. Kafka 4.0 (March 2025) made KRaft the only metadata mode, simplifying operations. Kafka, RabbitMQ, SQS, and EventBridge solve different problems; choosing the wrong one multiplies operational debt rather than reducing it.
01 — Why Events: When events beat requests.
Request/response is the default for a reason: it is synchronous, legible, and easy to debug. You call a service, you get an answer or an error, and the call stack tells the whole story. Event-driven architecture trades that legibility for three properties that synchronous calls struggle to provide at scale.
Loose coupling. A producer that emits an event does not know — and should not need to know — who consumes it. New consumers can be added without touching the producer, which is what lets teams ship independently. Asynchronous scale. A burst of work lands in a buffer and is drained at the consumer's own pace, so a traffic spike becomes a queue depth rather than a cascade of timeouts. Independent failure domains. If a downstream consumer is down, the events wait; the producer keeps running. In a synchronous chain, one slow dependency stalls everyone upstream.
The cost is real and worth naming. Asynchronous flows are harder to trace, eventual consistency replaces immediate consistency, and the failure modes shift from "the call errored" to "the event was processed twice" or "the event never arrived." Event-driven design is the right tool when the decoupling is worth that tax — not a default to reach for everywhere.
02 — The Four Patterns: Fowler's four event-driven patterns.
In a January 2017 article, Martin Fowler argued that "event-driven" is an umbrella term hiding four distinct patterns that are frequently conflated — and that confusing them is the source of avoidable architectural mistakes. The taxonomy is nearly a decade old and still the cleanest mental model available, so it remains the right starting point in 2026.
Event notification
The source emits a small event saying something changed, but does not carry the full payload. Recipients call back to fetch current state if they need it. It is the simplest pattern, and also the one in which it is easiest to lose sight of the larger flow.
Event-carried state transfer
Each event carries enough data for recipients to update their own local store, eliminating the callback. Gains resilience and reduces load on the source; the trade-off is duplicated data across services and higher receiver complexity.
Event sourcing
Every state change is recorded as an immutable event; current state is derived by replaying the log. Provides a strong audit trail. Git version control is the canonical real-world example of the idea in action.
CQRS
Command Query Responsibility Segregation splits the write model from the read model. Not inherently event-based, but frequently paired with event sourcing. Fowler cautions it is often misused — apply it only where read and write shapes genuinely diverge.
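The event sourcing card above can be made concrete with a minimal sketch. The `Event` class, the event names, and the bank-balance domain are all hypothetical choices for illustration; the point is only that current state is computed from the log rather than stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "deposited" or "withdrawn" (hypothetical event names)
    amount: int


def replay(events):
    """Derive the current balance by folding over the event log."""
    balance = 0
    for event in events:
        if event.kind == "deposited":
            balance += event.amount
        elif event.kind == "withdrawn":
            balance -= event.amount
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    return balance


log = [Event("deposited", 100), Event("withdrawn", 30), Event("deposited", 5)]
current_balance = replay(log)         # state is derived, never stored directly
balance_after_two = replay(log[:2])   # a past state is rebuilt from a log prefix
```

Replaying a prefix of the log is what gives the pattern its audit trail: any earlier state can be reconstructed without a separate history table.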
The practical takeaway is to choose deliberately. Event notification keeps coupling lowest but reintroduces a synchronous callback for anything that needs the data. Event-carried state transfer removes that callback at the cost of data duplication. Event sourcing is a commitment to an append-only log as your source of truth, not a messaging tactic. And CQRS is an orthogonal read/write split that people reach for too early — Fowler's own warning is that it is often applied where it adds complexity without paying for it.
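The difference between event notification and event-carried state transfer is easiest to see in the shape of the event itself. The sketch below is an assumption-laden illustration: the event classes, the `CUSTOMERS` store, and `fetch_customer` (a stand-in for a synchronous call back to the source service) are hypothetical names.

```python
from dataclasses import dataclass

# Hypothetical source-of-truth store owned by the producing service.
CUSTOMERS = {"c-1": {"email": "old@example.com"}}


@dataclass(frozen=True)
class CustomerChanged:           # event notification: identifier only
    customer_id: str


@dataclass(frozen=True)
class CustomerEmailChanged:      # event-carried state transfer: carries the data
    customer_id: str
    email: str


def fetch_customer(customer_id):
    """Stand-in for the synchronous callback to the source service."""
    return CUSTOMERS[customer_id]


def on_notification(event, local_cache):
    # The consumer must call back to learn what actually changed.
    local_cache[event.customer_id] = dict(fetch_customer(event.customer_id))


def on_state_transfer(event, local_cache):
    # No callback: the event alone is enough to update the local copy.
    local_cache.setdefault(event.customer_id, {})["email"] = event.email


cache_a, cache_b = {}, {}
CUSTOMERS["c-1"]["email"] = "new@example.com"
on_notification(CustomerChanged("c-1"), cache_a)
on_state_transfer(CustomerEmailChanged("c-1", "new@example.com"), cache_b)
```

Both caches end up with the same data, but only the first handler depends on the source service being reachable at processing time.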
"It's very easy to make nicely decoupled systems with event notification... without realizing that you're losing sight of that larger-scale flow, and thus ignoring a serious set of bugs that [can] appear." — Martin Fowler, Chief Scientist, Thoughtworks (martinfowler.com, January 2017)
03 — Three Primitives: Queue, pub/sub, and stream are not the same thing.
Most comparisons treat messaging as a binary — "queue versus stream." That misses a third primitive and obscures why vendors behave so differently. There are three distinct messaging primitives, and each service implements a specific subset.
A message queue delivers each message to exactly one consumer and then deletes it. That makes queues ideal for task distribution and command dispatch — work items that should be done once by whichever worker grabs them. A publish/subscribe topic fans a single message out to multiple active subscribers at the moment it is published; subscribers that are offline simply miss it unless a durable subscription is configured. An event stream — Kafka, Kinesis — is a durable, ordered, replayable log: multiple independent consumer groups can each read the full stream from any offset, and events are retained for a configurable period (the Kafka default is seven days) rather than deleted on consumption.
| Primitive | Delivery model | Best fit |
|---|---|---|
| Message queue | One message → one consumer, deleted on consume | Task distribution and command dispatch. Each work item is handled once by whichever worker picks it up. Examples: AWS SQS, RabbitMQ classic/quorum queues. |
| Pub/sub topic | One message → many active subscribers (fanout) | Broadcast notifications to several systems at publish time. Offline subscribers miss the message unless a durable subscription is configured. Examples: AWS SNS, Google Cloud Pub/Sub. |
| Event stream | Durable, ordered, replayable log; many consumer groups | Event sourcing, replay, multiple independent readers, retention beyond consumption. Each consumer group reads from its own offset. Examples: Apache Kafka, Amazon Kinesis. |
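The three delivery models in the table can be sketched as toy in-memory classes. This is a teaching model under simplifying assumptions (single process, no persistence, no acknowledgements), not how any broker is implemented; the class and method names are hypothetical.

```python
from collections import deque


class Queue:
    """Each message goes to exactly one consumer and is removed when taken."""

    def __init__(self):
        self._items = deque()

    def send(self, msg):
        self._items.append(msg)

    def receive(self):
        return self._items.popleft() if self._items else None


class Topic:
    """Each message is pushed to every subscriber active at publish time."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, msg):
        for callback in self._subscribers:
            callback(msg)


class Stream:
    """Append-only log; each consumer group tracks its own offset."""

    def __init__(self):
        self._log = []
        self._offsets = {}

    def append(self, msg):
        self._log.append(msg)

    def poll(self, group):
        offset = self._offsets.get(group, 0)
        batch = self._log[offset:]
        self._offsets[group] = len(self._log)
        return batch

    def seek(self, group, offset):
        self._offsets[group] = offset   # replay from any retained offset


queue = Queue()
queue.send("job-1")
first, second = queue.receive(), queue.receive()   # second worker finds nothing

topic = Topic()
early, late = [], []
topic.subscribe(early.append)
topic.publish("price-changed")
topic.subscribe(late.append)                        # subscribed too late, missed it

stream = Stream()
stream.append("e1")
stream.append("e2")
billing = stream.poll("billing")
analytics = stream.poll("analytics")                # independent group, full history
stream.seek("billing", 0)
replayed = stream.poll("billing")
```

The stream is the only one of the three where a consumer can go back and re-read, which is why replay and state rebuilding belong to that primitive.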
04 — Delivery Semantics: The exactly-once myth.
There are three delivery semantic levels. At-most-once: a message may be lost but is never duplicated. At-least-once: a message is guaranteed to be delivered but may be duplicated. Exactly-once: a message is processed precisely once, end to end. Almost everyone wants the third one. Almost no one can have it at the network layer.
The reason is the Two Generals Problem: in a distributed system, a sender can never be certain its message was received without an acknowledgement, and the acknowledgement is itself a message that can be lost. True exactly-once delivery is therefore not achievable across a network. What production systems actually implement is "effectively exactly-once": at-least-once delivery combined with idempotent consumer logic, so that processing the same message twice produces the same result as processing it once.
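A small simulation shows why retrying on a missing acknowledgement produces duplicates. It assumes a deterministic "network" in which the acknowledgement is dropped on chosen attempts; `deliver_at_least_once` and its parameters are hypothetical names for this sketch.

```python
def deliver_at_least_once(message, receiver, ack_lost_on_attempts, max_attempts=5):
    """Resend until an acknowledgement arrives. Returns the attempts used.

    ack_lost_on_attempts is a set of attempt numbers whose ack is dropped,
    a deterministic stand-in for an unreliable network.
    """
    for attempt in range(1, max_attempts + 1):
        receiver(message)                       # the message itself arrives
        if attempt not in ack_lost_on_attempts:
            return attempt                      # ack received, sender stops
        # Ack lost: the sender cannot tell "never arrived" from "ack dropped".
    raise TimeoutError("no acknowledgement received")


received = []
attempts = deliver_at_least_once("order-42-paid", received.append, {1, 2})
```

The receiver sees the same message three times even though nothing was ever lost, which is the situation idempotent consumers exist to absorb.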
| Delivery semantic | Where the guarantee lives | Risk & cost |
|---|---|---|
| At-most-once | Network / fire-and-forget | Risk: messages can be lost, never duplicated. Lowest cost, no retries. Acceptable only for disposable telemetry where a dropped event does not matter. |
| At-least-once | Broker (retries until ack) | Risk: duplicates, never loss. The production default. Requires consumers to tolerate replays. Implementation cost is moderate and concentrated in the consumer. |
| Effectively exactly-once | Application (idempotent consumer) | Risk: none if idempotency is correct. At-least-once delivery plus a dedup key or upsert. The realistic target for almost every system. Cost is design discipline, not infrastructure. |
| Exactly-once (Kafka Streams) | Stream processor (internal) | Achievable only inside a single Kafka Streams topology via idempotent producers plus transactions. A narrow, in-cluster exception — not a guarantee you can extend across an arbitrary network boundary. |
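The "effectively exactly-once" row can be sketched as an idempotent consumer that remembers which event IDs it has already applied. This is a minimal in-memory illustration; the class name and the `event_id` convention are assumptions, and a real system would need the ID record to be durable and updated together with the side effect.

```python
class IdempotentConsumer:
    """Applies each event at most once, keyed by a producer-assigned event ID."""

    def __init__(self):
        self.processed_ids = set()   # in-memory here; would need to be durable
        self.balance = 0

    def handle(self, event_id, amount):
        if event_id in self.processed_ids:
            return False             # duplicate delivery, safely ignored
        self.balance += amount
        self.processed_ids.add(event_id)
        return True


consumer = IdempotentConsumer()
results = [
    consumer.handle("evt-1", 50),
    consumer.handle("evt-1", 50),    # redelivery of the same event
    consumer.handle("evt-2", 25),
]
```

Delivery is still at-least-once; the duplicate simply has no effect on the result.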
Kafka is the one widely deployed system that achieves "effectively exactly-once" inside its own boundary, through two mechanisms. Idempotent producers attach a sequence number to each batch so the broker can deduplicate replayed batches; the same message sent multiple times is written to the log only once. Kafka transactions then allow atomic writes across multiple partitions, so a stream processor can read, process, and write as one all-or-nothing unit. Per Confluent, enabling this exactly-once configuration carries a modest overhead in its published test — roughly a 3% throughput decline on 1 KB messages with 100 ms transactions versus at-least-once. That figure is specific to that configuration; do not generalize it to "always 3%."
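The sequence-number idea behind idempotent producers can be modelled in a few lines. This is a deliberately simplified toy, not Kafka's implementation: it tracks one sequence number per producer on a single partition, and the class and return values are hypothetical.

```python
class PartitionLog:
    """Toy broker-side log that drops batches it has already written."""

    def __init__(self):
        self.records = []
        self._last_seq = {}   # producer_id -> highest sequence number written

    def append(self, producer_id, sequence, batch):
        last = self._last_seq.get(producer_id, -1)
        if sequence <= last:
            return "duplicate"        # a retry of a batch already in the log
        if sequence != last + 1:
            return "out_of_order"     # a gap: reject rather than reorder
        self.records.extend(batch)
        self._last_seq[producer_id] = sequence
        return "written"


log = PartitionLog()
outcomes = [
    log.append("producer-a", 0, ["m1", "m2"]),
    log.append("producer-a", 0, ["m1", "m2"]),   # retry after a lost ack
    log.append("producer-a", 1, ["m3"]),
]
```

The producer still sends the batch twice; the deduplication happens at the log, which is why the guarantee stops at the broker's boundary.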
The crucial caveat — flagged in our sources and worth repeating — is that AWS SQS FIFO's "exactly-once processing" is a different mechanism. SQS FIFO deduplicates using a deduplication ID within a five-minute window, as AWS documents it; that is not the same as Kafka's transactional guarantee, and the two should not be conflated. Wherever your messages cross a network boundary, the durable answer remains the same: assume duplicates and make consumers idempotent.
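A toy model of window-based deduplication makes the difference from a transactional guarantee visible. It assumes only what is stated above (a deduplication ID and a five-minute window); the class, its method, and the explicit `now` argument are hypothetical, and this is not the SQS API.

```python
DEDUP_WINDOW_SECONDS = 5 * 60   # the five-minute window described above


class FifoDedupModel:
    """Toy model: drop a message whose dedup ID was accepted within the window."""

    def __init__(self):
        self.accepted = []
        self._seen_at = {}   # dedup_id -> time it was last accepted

    def send(self, dedup_id, body, now):
        seen = self._seen_at.get(dedup_id)
        if seen is not None and now - seen < DEDUP_WINDOW_SECONDS:
            return False                  # treated as a duplicate
        self._seen_at[dedup_id] = now
        self.accepted.append(body)
        return True


fifo = FifoDedupModel()
sent = [
    fifo.send("pay-7", "charge card", now=0),
    fifo.send("pay-7", "charge card", now=120),   # inside the window
    fifo.send("pay-7", "charge card", now=400),   # window has expired
]
```

In this model a retry that arrives after the window is accepted again, which is one more reason to keep the consumer idempotent regardless of the queue type.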
"The same message—which is still sent by the producer multiple times—will only be written to the Kafka log on the broker once." — Confluent Engineering, on Kafka idempotent producers
The same idempotency discipline applies the moment your system receives events from the outside world. We cover the consumer side of this in depth in our reference on idempotency and retry patterns for production consumers, which pairs naturally with everything in this section.
05 — Kafka & KRaft: Kafka's KRaft shift, and what it bought.
Kafka is the reference event stream, and 2025 brought its biggest operational change in years. Apache Kafka 4.0, released March 18, 2025, removed ZooKeeper entirely — KRaft (Kafka Raft) is now the only supported metadata management mode. The path there was deliberate: KRaft was marked production-ready in Kafka 3.3.1, ZooKeeper was deprecated in 3.5, and it is no longer available from 4.0 onward. If you read older posts claiming KRaft was production-ready in "Kafka 3.0," that is incorrect.
The payoff is operational. Removing the separate ZooKeeper ensemble collapses Kafka to a single distributed system to run and reason about. One documented migration cut a 50-node cluster to 35 nodes — roughly a 15-to-30% infrastructure reduction for large clusters — and controller failover times fell from five-to-seven seconds under ZooKeeper to under one second under KRaft. For ops teams planning a 3.x to 4.x upgrade, that is the concrete ROI.
Controller failover: KRaft vs ZooKeeper
Failover dropped from 5–7 seconds under ZooKeeper to under one second under KRaft, per a documented migration. Faster failover means shorter windows of unavailability during broker churn.
Nodes removed in one case study
One documented migration dropped a 50-node cluster to 35 — about a 15–30% infrastructure reduction for large clusters, by eliminating the separate ZooKeeper ensemble.
Self-managed benchmark
A three-broker i3en.2xlarge cluster reached 605 MB/s with 100 partitions, 3× replication and 1 KB messages; p99 end-to-end latency was 5 ms at 200K msgs/s. Self-managed EC2 benchmark — not a Confluent Cloud SLA.
Two ordering facts trip teams up. First, Kafka guarantees ordering only within a single partition, never across partitions — so routing all events for one entity to the same partition (using a partition key such as order_id) is how you preserve per-entity order. Second, the throughput numbers above come from a self-managed benchmark on specific EC2 instance types; they are a useful reference point, not a guaranteed service level, and certainly not a Confluent Cloud SLA. As always, benchmark on your own message sizes and partition counts before committing capacity plans.
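Per-entity ordering through a partition key can be sketched as follows. The hash here is CRC32, chosen only because it is stable and in the standard library; it is an assumption for illustration and not the hash Kafka's default partitioner uses. The `partition_for` helper and the six-partition topic are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash.

    CRC32 is a stand-in; it is not the hash Kafka's default partitioner uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-1", "shipped"),
]

partitions = {p: [] for p in range(6)}
for order_id, kind in events:
    partitions[partition_for(order_id, 6)].append((order_id, kind))

order_1_partition = partition_for("order-1", 6)
order_1_events = [
    kind for oid, kind in partitions[order_1_partition] if oid == "order-1"
]
```

Because every event for `order-1` lands in the same partition, its events keep their relative order there; nothing is implied about ordering between different orders.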
06 — Managed Landscape: RabbitMQ, SQS, EventBridge, and Pub/Sub.
Beyond Kafka, four managed options cover most of the field, each with a distinct sweet spot. RabbitMQ made quorum queues (Raft-based replication) the production default in RabbitMQ 4.0 and removed classic mirrored queues; quorum queues support at-least-once dead-lettering and poison-message handling, which classic queues did not. RabbitMQ also offers a Dead Letter Exchange (DLX): messages that expire, exceed a max length, or are negatively acknowledged with requeue=false are routed automatically to a configured dead-letter exchange, isolating poison messages from healthy processing.
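As a rough illustration, the queue settings described above are usually expressed as optional arguments when the queue is declared. The sketch below only builds that argument dictionary so it runs without a broker; the helper name, the exchange name `orders.dlx`, and the limit of 5 are hypothetical values.

```python
def quorum_queue_arguments(dead_letter_exchange, delivery_limit):
    """Build the optional arguments for declaring a quorum queue with a DLX."""
    return {
        "x-queue-type": "quorum",
        "x-dead-letter-exchange": dead_letter_exchange,
        "x-delivery-limit": delivery_limit,   # redeliveries allowed before dead-lettering
    }


args = quorum_queue_arguments("orders.dlx", delivery_limit=5)
# With a client such as pika, this dict would be passed as the `arguments`
# parameter of queue_declare; that call is omitted so the sketch runs offline.
```

Check the argument names and defaults against the documentation for the RabbitMQ version you run before relying on them.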
On AWS, the three services divide cleanly. SQS Standard gives at-least-once delivery; SQS FIFO adds exactly-once processing with built-in deduplication inside a five-minute window and strict ordering within a message group. SNS is the pub/sub fanout layer. And EventBridge uses content-based event-pattern matching to route events and integrates natively with 100+ AWS services plus third-party SaaS providers such as PagerDuty, Datadog and New Relic — but it explicitly does not guarantee ordering; events may reach targets in arbitrary order. Google Cloud Pub/Sub is the serverless GCP-native option, priced at zero for the first 10 GB per month and $40 per terabyte after, with native ties into Dataflow, BigQuery and Cloud Functions.
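Content-based routing can be illustrated with a simplified matcher. It covers only exact-value matching, where a list means "any of these values" and a nested dictionary recurses into the event; it is a sketch of the idea, not the EventBridge matching engine, and the `matches` function and sample events are hypothetical.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every pattern field must match the event.

    A list means "any of these values"; a nested dict recurses into the event.
    """
    for field, expected in pattern.items():
        if field not in event:
            return False
        actual = event[field]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


pattern = {"source": ["shop.orders"], "detail": {"status": ["shipped", "delivered"]}}
shipped = {"source": "shop.orders", "detail": {"status": "shipped", "order_id": "o-1"}}
pending = {"source": "shop.orders", "detail": {"status": "pending", "order_id": "o-2"}}

shipped_matches = matches(pattern, shipped)
pending_matches = matches(pattern, pending)
```

Fields the pattern does not mention, such as `order_id`, are ignored, so producers can add data without breaking existing rules.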
Figure: EventBridge cost premium vs SNS+SQS, equivalent workloads. Source: sachith.co.uk event-bus cost analysis, Feb 2026.
The EventBridge premium is not waste — it buys a schema registry, archive and replay, advanced content filtering, and cross-account routing. The question is whether your workload needs those capabilities. If you are simply moving events from one service to another, SNS+SQS is cheaper and entirely sufficient; if you are building a cross-account event backbone with schema governance, the roughly 13-to-19% premium documented in a February 2026 analysis is usually worth it. The choice, as always in event-driven design, is about matching the primitive and the feature set to the actual requirement rather than defaulting to the most capable option.
07 — Reliability Patterns: Outbox, saga, and the dead letter queue.
Three patterns separate a demo from a production event system. The first solves the dual-write problem. When a service must both update its database and publish a message, doing the two as separate operations risks one succeeding and the other failing — a database change with no event, or an event with no change. The transactional outbox pattern fixes this without distributed transactions: the event is written to an outbox table inside the same database transaction as the domain change, and a separate process reads that table and publishes the event. Either both land or neither does.
The production-grade implementation of the outbox relay is Change Data Capture (CDC), typically via Debezium, which reads a PostgreSQL logical replication slot or a MySQL binlog and streams outbox rows to Kafka with sub-second latency and without polling overhead. For the read side of a CQRS setup that consumes these events, the same indexing discipline applies as anywhere else — see our reference on optimizing the read model in a CQRS setup.
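The outbox write path can be sketched with SQLite from the standard library. Two assumptions are worth stating loudly: the relay below polls the table, which is a simpler substitute for the CDC relay described above, and the table layout, function names, and `OrderPlaced` event are hypothetical.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
db.execute("""CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    published INTEGER NOT NULL DEFAULT 0)""")
db.commit()


def place_order(order_id, topic="orders"):
    """Write the domain change and its event in one local transaction."""
    with db:   # commits on success, rolls back if anything raises
        db.execute("INSERT INTO orders (id, status) VALUES (?, 'placed')", (order_id,))
        payload = json.dumps({"order_id": order_id, "type": "OrderPlaced"})
        db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)", (topic, payload))


def relay_once(publish):
    """Polling relay: publish unpublished rows in order, then mark them."""
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)   # a crash after this line causes a republish
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


published = []
place_order("o-1")
try:
    place_order("o-2", topic=None)   # outbox insert fails, so the order insert rolls back too
except sqlite3.IntegrityError:
    pass

order_ids = [row[0] for row in db.execute("SELECT id FROM orders ORDER BY id")]
first_run = relay_once(lambda topic, payload: published.append((topic, payload)))
second_run = relay_once(lambda topic, payload: published.append((topic, payload)))
```

Note that the relay publishes before it marks a row, so a crash between the two steps republishes the event. The outbox gives at-least-once publishing, and consumers still need to be idempotent.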
The second pattern manages distributed transactions across services. The saga pattern decomposes a long-running transaction into a chain of local transactions, each with a compensating action to undo it on failure. There are two implementations: choreography, where services react to each other's events with no central controller, and orchestration, where a central saga orchestrator directs each step. Neither is universally better — choreography keeps services decoupled but spreads the workflow across many event handlers, while orchestration centralizes the flow at the cost of a coordinator. The choice depends on your team structure and how much you value debuggability over decoupling. When you are drawing those event-driven microservice boundaries, the saga style you can operate should inform where the lines fall.
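The orchestration variant can be sketched as a loop over steps that runs compensations in reverse when one fails. This is a minimal in-process illustration under the assumption that every step and compensation is a plain function; the step names and `run_saga` are hypothetical, and a real orchestrator would also persist its progress.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    On failure, run the compensations of completed steps in reverse order.
    Returns (succeeded, log_of_what_ran).
    """
    log, completed = [], []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"done:{name}")
            completed.append((name, compensate))
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
    return True, log


def charge_card():
    raise RuntimeError("card declined")   # simulated failure in step three


steps = [
    ("reserve_stock", lambda: None, lambda: None),
    ("create_shipment", lambda: None, lambda: None),
    ("charge_card", charge_card, lambda: None),
]
succeeded, saga_log = run_saga(steps)
```

The log doubles as the debugging story: the whole workflow is visible in one place, which is the trade orchestration makes against choreography's looser coupling.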
The third pattern is the dead letter queue (DLQ): a side queue that receives messages a consumer has repeatedly failed to process. It is the cheapest reliability insurance in the entire stack and the most commonly neglected. Without one, a single malformed message can stall an otherwise healthy consumer indefinitely. With one — plus a monitored depth and a rate-limited redrive path back to the main queue once the bug is fixed — a poison message becomes an alert and a ticket rather than an outage. Decide early whether you want a DLQ per topic or a unified one; per-topic isolates blast radius, unified simplifies monitoring.
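A bounded-retry consumer with a dead letter queue can be sketched in memory. The retry limit of three, the `consume` function, and the digit-only handler are assumptions for illustration; managed brokers implement the same idea with their own redelivery counters.

```python
from collections import deque


def consume(main_queue, dead_letters, handler, max_attempts=3):
    """Process messages; move any that fail max_attempts times to the DLQ."""
    processed = []
    while main_queue:
        message = main_queue.popleft()
        for attempt in range(1, max_attempts + 1):
            try:
                handler(message)
                processed.append(message)
                break
            except ValueError:
                if attempt == max_attempts:
                    dead_letters.append(message)   # poison message isolated
    return processed


def handler(message):
    if not message.isdigit():
        raise ValueError(f"malformed payload: {message!r}")


main_queue = deque(["1", "oops", "2"])
dead_letters = deque()
processed = consume(main_queue, dead_letters, handler)
dlq_depth = len(dead_letters)   # the number to alert on
```

The malformed message is set aside after its retries and the healthy messages behind it are still processed, which is the stall the DLQ is there to prevent.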
"A poison message is one that will never succeed no matter how many times you retry it." — AlgoMaster.io, Dead Letter Queues — System Design
08 — How to Choose: Matching the primitive to the workload.
The vendor decision falls out of the primitive decision. Once you know whether the workload needs a queue, a pub/sub topic, or a durable stream — and how much ordering, retention, and replay it genuinely requires — the shortlist narrows itself.
Commands done once, by one worker
Background jobs, command dispatch, work that should be handled once by whichever worker grabs it. A managed queue with a DLQ covers this cleanly. SQS Standard for at-least-once; SQS FIFO when ordering inside a message group matters.
Durable, replayable log
Multiple independent consumer groups, retention beyond consumption, the ability to re-read history and rebuild state. This is the event-stream case. Kafka is the reference choice; remember ordering is per-partition only.
Content-based fanout with governance
Cross-account event routing, schema registry, archive/replay, and integrations with 100+ AWS services and SaaS targets. EventBridge earns its ~13–19% premium here — but it does not guarantee ordering, so do not use it where order matters.
Outbox + idempotency + DLQ
Independent of the broker: write events transactionally with the outbox pattern, make every consumer idempotent so at-least-once is safe, and put a monitored DLQ behind each consumer. These are not optional extras — they are the baseline.
Looking forward, the trend lines are clear. KRaft has made Kafka materially simpler to operate, lowering the bar for self-managed streaming and narrowing one of the historical reasons to default to a managed bus. At the same time, managed event routers like EventBridge keep absorbing more of the glue work — schema governance, replay, cross-account routing — that teams used to build by hand. The likely shape of the next few years is fewer hand-rolled brokers and more deliberate primitive selection, with idempotency and the outbox pattern treated as table stakes rather than advanced techniques. Teams that internalize the three-primitive model now will make fewer of the expensive, hard-to-reverse choices later.
09 — Conclusion: Get the concepts right and the vendor is easy.
The hard part of event-driven architecture is conceptual, not operational.
Event-driven architecture is worth its complexity when you need loose coupling, asynchronous scale, and independent failure domains — and a liability when you reach for it by reflex. The leverage is in the concepts: keep Fowler's four patterns distinct, choose deliberately among the three messaging primitives, and accept that exactly-once delivery does not exist at the network layer.
Once those are settled, the vendor choice is almost mechanical. A queue for task distribution, a stream for durable replay, a pub/sub topic or managed router for fanout — and behind all of them, the same non-negotiable trio: the transactional outbox for reliable publishing, idempotent consumers so at-least-once is safe, and a monitored dead letter queue so a poison message becomes an alert rather than an outage.
The broader signal of 2026 is consolidation toward simplicity. Kafka without ZooKeeper is one fewer system to run; managed routers keep absorbing the glue work. That makes deliberate primitive selection more valuable, not less — because the cost of choosing wrong is paid in operational debt for years. Get the model right first, and the tooling will follow.