Core insight: A broker makes transport promises. Product teams usually need business promises, and the gap between them is where duplication and loss live.
Diagram placeholder
Three Delivery Modes, Three Brokers, One Missing Boundary
Compare at-most-once, at-least-once, and narrow exactly-once across Kafka, SQS, and RabbitMQ without turning into vendor docs. The diagram should teach that the terms are only meaningful when the boundary is named.
Placement note: Place near How It Works or the exactness discussion.
A broker makes transport promises. Product teams usually need business promises.
Those are not the same thing.
Kafka can tell you where a consumer group last committed progress. SQS can tell you whether a message became visible again because it was not deleted in time. RabbitMQ can tell you whether the broker saw a publish confirm or a consumer ack. Those guarantees matter. They are also narrower than the language teams use once the message leaves the broker.
The broker can usually promise that it accepted a publish, stored it according to its own rules, delivered it, and later observed some completion signal such as an ack, delete, or offset commit. What it cannot promise is that your database write, email send, payment capture, inventory reservation, or partner API call happened once.
The consumer owns more than teams like to admit. It decides when to acknowledge. It decides whether the side effect happens before replay protection is durable. It decides whether the operation has a stable business identity or only a fresh delivery identity.
The database or external system owns its own boundary too. A database transaction can promise atomicity inside the database. A payment API may offer idempotency keys, or it may not. An email provider may accept the request and give you no useful way to prove whether the send already escaped.
Then there is the last layer, the one that causes incidents. Product teams often assume someone promised “the user sees this once.” In many systems, nobody actually promised that.
Teams also blur “processed” into a single event. It is not one event. It usually means one of four different things: published to the broker, delivered to a handler, side effect applied, or completion recorded back at the broker. The correctness break lives in the gaps between them.
Diagram placeholder
Where the Guarantee Actually Changes Shape
Show the full delivery path and make the guarantee boundary explicit at each layer: producer, broker, consumer, database, and external side effect. This diagram should teach that no single layer owns the full business guarantee.
Placement note: Place near The Mental Model or at the start of What This Primitive Actually Promises.
Publication accepted.
Delivery observed by the consumer.
Business effect committed.
Completion recorded back at the broker.
Most queue bugs come from compressing those four boundaries into one friendly lie called “done.”
At-most-once means completion is allowed to move ahead of the business effect. You tell the broker “consider this handled” before the effect is durably true. If the process dies in that window, the work is gone.
At-least-once means completion trails the business effect attempt. If the effect happened but the broker never learned that it happened, replay is valid. The broker is protecting against loss by accepting duplication risk.
Exactly-once means one of two harder things is true. Either the business effect and the completion signal share an atomic boundary, or replay across the business effect is made harmless by durable idempotency. Without one of those, exactly-once is usually careful phrasing wrapped around an at-least-once core.
That is the part people learn late. Exactly-once is often a statement about state movement inside a managed boundary, not about what happened in the world outside it.
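The ordering of effect and completion signal can be made concrete with a toy in-memory broker. This is a sketch, not any real client API: `ToyBroker`, `run`, and the crash flag are invented here purely to show the two dangerous windows.

```python
# Toy in-memory broker: redelivers anything it has not seen acked.
class ToyBroker:
    def __init__(self, messages):
        self.messages = list(messages)
        self.acked = set()

    def poll(self):
        for m in self.messages:
            if m not in self.acked:
                return m
        return None  # everything completed

    def ack(self, m):
        self.acked.add(m)


def run(broker, ack_first, crash_on):
    """Drain the broker. `crash_on` names messages whose handler dies once,
    inside the window between side effect and completion signal."""
    effects = []  # stands in for the business side effects
    while (msg := broker.poll()) is not None:
        crash = msg in crash_on
        crash_on.discard(msg)      # each message crashes the worker only once
        if ack_first:
            broker.ack(msg)        # at-most-once: completion moves ahead
            if crash:
                continue           # worker dies: the work is simply gone
            effects.append(msg)
        else:
            effects.append(msg)    # at-least-once: effect first
            if crash:
                continue           # worker dies: broker never learns, replays
            broker.ack(msg)
    return effects


# At-most-once loses "b"; at-least-once applies "b" twice.
assert run(ToyBroker(["a", "b"]), ack_first=True, crash_on={"b"}) == ["a"]
assert run(ToyBroker(["a", "b"]), ack_first=False, crash_on={"b"}) == ["a", "b", "b"]
```

The same handler code produces loss or duplication depending only on where the ack sits, which is why ack timing is a correctness decision rather than a client-library detail.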
Nobody notices the dangerous window when it is 40 milliseconds wide. They notice it when a deploy, a rebalance, or a slow dependency stretches it to 4 seconds.
The broker moves messages. The consumer turns messages into facts.
A useful test is simple. Point to the place where replay stops hurting. If you cannot point to it, you do not have a correctness story yet.
At-most-once is the easy contract to run and the easiest one to regret. Kafka does it when offsets are committed before the record is actually processed. RabbitMQ does it when the consumer uses automatic acknowledgements or manually acks too early. SQS does it when a message is deleted before the business effect is durably secured.
The failure mode is simple. If the worker dies after the broker thinks the work is done but before the external effect is real, the work is lost.
At-least-once is what most production systems actually run. The worker receives the message, does the work, and only then acks, deletes, or commits. That is the right bias for many business paths, but it shifts the burden from the broker to the consumer. If the business effect succeeds and the completion signal does not, replay is not a broker bug. It is the contract doing exactly what it said.
Exactly-once is where teams start speaking too broadly. Kafka can make read-process-write behave exactly-once inside a Kafka-controlled transactional boundary. SQS FIFO can suppress duplicate sends inside a limited dedup interval. RabbitMQ publisher confirms can tell the producer the broker accepted a message. None of that proves your consumer only wrote the row once, called the partner once, or charged the card once.
A retry-safe broker is not the same thing as a retry-safe business path.
Diagram placeholder
The Duplicate That the Broker Was Allowed to Create
Make the classic failure chain concrete: the consumer successfully performs the side effect, crashes before ack or commit, and the broker redelivers. This should teach what breaks first versus what the dashboard shows first.
Placement note: Place inside Where the Abstraction Leaks after the paragraph about one side succeeding and the receipt being lost.
The abstraction leaks at the moment queue state and business state stop moving together.
Take the most common failure shape. A consumer performs the side effect, then crashes before ack or commit. Kafka is still entitled to replay because the committed position did not advance. SQS is still entitled to redeliver because the message was not deleted before visibility expired. RabbitMQ is still entitled to requeue because the acknowledgement never safely made it back. Every layer can be telling the truth while the customer is already paying for the same action twice.
That leak is worth naming precisely.
What is true: the broker did not observe durable completion.
What can still be duplicated: any side effect already applied outside the broker.
What can still be lost: only the broker’s knowledge that the side effect happened.
What can be replayed safely: operations protected by durable idempotency, not by optimism.
The nasty incidents are not the ones where everything fails. They are the ones where one side succeeded and the receipt got lost.
Kafka makes this especially clean to reason about. The consumer position can move ahead as records are returned by poll, while the committed position remains behind until the client commits. Restart recovers to the committed position, not to the furthest record your code already touched. So a database write can be real while the consumer progress that would suppress replay is still absent. That is not an edge case. That is the normal gap you are operating inside.
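The two-position model above can be sketched directly. `OffsetTracker` is a toy invented for this illustration, not the real Kafka client; it only models how the consumer position races ahead of the committed position and what a restart rewinds to.

```python
class OffsetTracker:
    """Toy model of the two positions a consumer keeps per partition."""
    def __init__(self, records):
        self.records = records
        self.position = 0    # advances as poll() hands records to your code
        self.committed = 0   # advances only when you commit

    def poll(self):
        if self.position < len(self.records):
            rec = self.records[self.position]
            self.position += 1
            return rec
        return None

    def commit(self):
        self.committed = self.position

    def replay_set(self):
        # Records your code already touched, but whose progress is not durable.
        return self.records[self.committed:self.position]

    def restart(self):
        # Recovery rewinds to the committed position, not the furthest record.
        self.position = self.committed


t = OffsetTracker(["r0", "r1", "r2"])
t.poll(); t.commit()                     # handled r0, committed
t.poll(); t.poll()                       # touched r1 and r2, commit still pending
assert t.replay_set() == ["r1", "r2"]    # their side effects may already be real
t.restart()
assert t.poll() == "r1"                  # both come back after a crash
```

Everything in `replay_set()` is the normal gap the text describes: work whose external effects may exist while the progress that would suppress replay does not.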
SQS leaks through time instead of commits. A worker can still be inside a slow API call when the visibility timeout expires. At that point the system no longer has a crash problem. It has an overlap problem. Another worker can pick up the same logical operation while the first one is still in flight. Teams test crash recovery and miss this because test traffic usually does not stretch the tail long enough to expose it.
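A rough way to see the overlap problem is to count how many workers can hold the same message at once. This toy estimate assumes an expired message is re-received immediately and that nobody extends visibility, both simplifications; it is an upper-bound sketch, not SQS semantics in full.

```python
import math

def overlapping_deliveries(handler_seconds: float, visibility_seconds: float) -> int:
    """Rough upper bound on workers simultaneously holding the same message:
    each full visibility window that elapses before the first worker finishes
    makes the message visible to one more worker."""
    return 1 + math.floor(handler_seconds / visibility_seconds)

assert overlapping_deliveries(12, 30) == 1   # fast handler: no overlap
assert overlapping_deliveries(70, 30) == 3   # slow tail: three workers, one operation
```

The crash-recovery tests most teams write never exercise the second case, because test traffic rarely stretches the handler tail past the visibility window.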
RabbitMQ leaks through separation of concerns. Publisher confirms and consumer acknowledgements are solving different problems. A producer can be told the broker safely accepted a message, while a consumer later loses its channel with that delivery still unacked and the broker requeues it. Transport safety is intact. Business certainty is not.
The producer side leaks too. Duplicate production often enters the system before the consumer has made a mistake. A producer gets an ambiguous publish result, retries, and now the broker may hold two transport copies of the same logical operation. If the event identity changes across retries, downstream dedupe is already on the back foot.
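One way to keep event identity stable across publish retries is to derive it deterministically from the business operation instead of minting a fresh id per send. The namespace and naming below are hypothetical, invented for this sketch:

```python
import uuid

# Hypothetical namespace for this sketch; any stable UUID works.
PAYMENTS_NS = uuid.uuid5(uuid.NAMESPACE_URL, "queues.example/payments")

def operation_id(action: str, order_id: str) -> str:
    # Deterministic: the same logical operation always yields the same id,
    # so a publish retry carries an identity downstream dedupe can use.
    return str(uuid.uuid5(PAYMENTS_NS, f"{action}:{order_id}"))

first = operation_id("capture-payment", "order-91827")
retry = operation_id("capture-payment", "order-91827")
assert first == retry        # the retry is recognizably the same operation

fresh = str(uuid.uuid4())    # anti-pattern: fresh identity on every send
assert fresh != first        # a retry like this defeats downstream dedupe
```

With a stable identity, the broker may still hold two transport copies after an ambiguous publish, but the consumer at least has something durable to deduplicate on.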
A poison message creates a different kind of leak. The queue keeps faithfully redelivering it, which sounds safe until you look closer. The retries consume workers, stretch tail latency, and increase overlap on other messages. Teams see “nothing was dropped” and miss that the system has stopped making progress on the business lane that matters.
The dashboard usually lies by sequence, not by numbers. It shows lag, retry count, visibility extensions, redelivery flags, maybe a rebalance. What is actually broken first is the business boundary. The first incorrect event is not “offset commit failed.” It is “the customer was charged twice,” or “the same inventory was reserved twice.”
One end-to-end chain makes the leak plain. The producer publishes capture-payment(order-91827) and gets an ambiguous publish result. It retries. The broker now may contain two transport copies of the same logical operation. A consumer receives one copy, charges the card successfully, then crashes before ack or offset commit. The broker redelivers because completion was never recorded. A second consumer processes the retried copy or the redelivered copy and charges again. The producer was allowed to retry. The broker was allowed to redeliver. The business was still damaged.
The queue did not lose the message. Your system lost the ability to tell whether it had already acted on it.
Immediate containment in these situations is usually boring and correct. Stop replaying the damaged operation class. Quarantine poison messages instead of letting them churn. Stretch visibility or reduce in-flight concurrency if overlap is the issue. Pause the consumer group if the replay set is expanding faster than you can reason about it.
The durable fix is always the same kind of thing. Make the side effect replay-safe, or make the side effect and the replay ledger move atomically.
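When the side effect and the replay ledger live in the same transactional database, the atomic version can be sketched in a few lines. This is a minimal illustration using SQLite and invented table names; an external effect such as a card charge cannot share this transaction and needs the provider's idempotency key instead, with the ledger row written before the call.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ledger  (op_id TEXT PRIMARY KEY);                 -- replay ledger
    CREATE TABLE charges (op_id TEXT PRIMARY KEY, amount INTEGER); -- business effect
""")

def apply_once(op_id: str, amount: int) -> bool:
    """Apply the business effect and record completion in one atomic boundary.
    Returns False when this operation identity already crossed the boundary."""
    try:
        with conn:  # one transaction: both rows commit or neither does
            conn.execute("INSERT INTO ledger VALUES (?)", (op_id,))
            conn.execute("INSERT INTO charges VALUES (?, ?)", (op_id, amount))
        return True
    except sqlite3.IntegrityError:
        return False  # replay: harmless by construction

assert apply_once("capture-payment:order-91827", 4999) is True
assert apply_once("capture-payment:order-91827", 4999) is False  # redelivery is safe
```

This is the place you can point to where replay stops hurting: the primary-key collision on the ledger, inside the same transaction as the effect.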
In test, a payment-finalization worker handles 400 messages per second across 20 workers. Average handler time is 120 milliseconds. p99 is under 2 seconds. SQS visibility timeout is 30 seconds. Retry budget is 3 attempts. The dedupe table keeps operation keys for 24 hours. The system looks effectively exactly-once because the ambiguity window between “side effect happened” and “broker knows it happened” is usually tiny.
Now put the same code under sale traffic at 18,000 messages per second. The payment provider slows. Average handler time rises to 900 milliseconds. p99 reaches 11 seconds. A small tail now crosses 30 seconds once provider retries, database contention, and visibility-extension jitter stack on each other. If only 0.4 percent of messages outlive visibility before delete, that is 72 messages per second becoming eligible for redelivery. If those average 2.3 attempts, you are suddenly spending about 165 handler executions per second on operations already in an ambiguous state.
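The arithmetic above is worth doing explicitly, because the inputs are small enough to look ignorable:

```python
rate = 18_000               # messages per second under sale traffic
overrun_fraction = 0.004    # share of messages that outlive visibility before delete
avg_attempts = 2.3          # average attempts for those ambiguous operations

redelivery_eligible = rate * overrun_fraction              # ~72 msgs/sec
ambiguous_executions = redelivery_eligible * avg_attempts  # ~165.6 handler runs/sec

assert abs(redelivery_eligible - 72.0) < 1e-9
assert abs(ambiguous_executions - 165.6) < 1e-6
```

A 0.4 percent tail sounds like rounding error until the rate multiplies it into a steady stream of operations that are already in an ambiguous state.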
The queue depth may still look survivable. CPU may still look fine. But correctness risk is already rising because the dangerous window is widening. More work is now sitting between “the business effect may already be real” and “the transport still believes it is unfinished.” At 10x scale, rare ambiguity stops being rare enough to ignore.
Kafka shows the same failure shape with different vocabulary. At low volume, the distance between side effect and offset commit may be milliseconds. Under lag, batch processing, and rebalance churn, that distance becomes seconds. The replay set after any crash gets larger. What looked like harmless commit lag in test becomes a pool of already-applied external effects waiting to be replayed.
Backlog does not merely increase latency. It increases the surface area of ambiguity.
What Most Content Gets Wrong
Most content explains queue guarantees as if the broker were the main character. It is not.
The first mistake is treating delivery language as product language. At-least-once is not a statement that the user sees one result. Exactly-once is not a statement that the business action happened one time. Those become true only if the consumer and the side-effect boundary make them true.
The second mistake is making acknowledgements sound like plumbing rather than correctness decisions. Ack timing is not a client-library detail. It is where you choose whether your system is willing to lose work, duplicate work, or pay the cost to make replay harmless. Early ack is a product decision. Late ack is a product decision. Commit interval is a product decision. Visibility timeout is a product decision.
The third mistake is using “exactly-once” without naming the boundary. Kafka can make that phrase true for Kafka-controlled state movement. SQS FIFO can make it narrowly true for duplicate suppression on send within a limited window. Neither statement survives an external API call just because the diagram still has arrows on it.
The fourth mistake is calling weak dedupe “idempotency.” Message UUIDs are not enough if the producer assigns a fresh one on every retry. Insert-if-not-exists is not enough if the side effect already escaped before the insert. A dedupe table is not enough if it remembers keys for 30 minutes and your replay horizon is 2 days. None of that deserves to be called exactly-once.
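The retention-versus-replay mismatch is easy to demonstrate. `DedupeStore` below is a toy invented for this sketch; the point is only that a ledger which forgets keys after 30 minutes treats a 2-day-old replay as brand new work.

```python
from datetime import datetime, timedelta

class DedupeStore:
    """Toy dedupe ledger with a retention horizon (invented for this sketch)."""
    def __init__(self, retention: timedelta):
        self.retention = retention
        self.seen = {}  # operation key -> first-seen time

    def first_time(self, key: str, now: datetime) -> bool:
        first_seen = self.seen.get(key)
        if first_seen is not None and now - first_seen <= self.retention:
            return False          # remembered: duplicate suppressed
        self.seen[key] = now      # forgotten or new: looks brand new
        return True

store = DedupeStore(retention=timedelta(minutes=30))
t0 = datetime(2024, 1, 1)
assert store.first_time("op-1", t0) is True
assert store.first_time("op-1", t0 + timedelta(minutes=10)) is False  # caught
assert store.first_time("op-1", t0 + timedelta(days=2)) is True       # replay outlives the horizon
```

The dedupe horizon has to be sized to the real replay horizon, which includes operator redrives and recovery replays, not just ordinary redelivery.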
Most duplicate-charge incidents are not retry bugs. They are success-recording bugs.
My strongest judgment here is simple. If money, inventory, entitlement, or user trust is involved, assume the path is at-least-once until you can explain, in one sentence, why replay across the business boundary is harmless. If that sentence comes out vague, the guarantee is still transport-deep.
How This Connects to Everything Else
Once you stop treating the queue as the correctness center, a lot of adjacent patterns stop looking optional.
Transactional outbox exists because “commit the row and also publish the event” is not atomic by default. Without it, the database can say the order exists while the rest of the system never hears about it.
Inbox tables, dedupe ledgers, or processed-operation stores exist because consumer replay is not hypothetical. They are the durable record that says this logical operation has already crossed the business boundary.
Sagas exist because rollback stops being a local transaction once queues connect multiple systems. A message can trigger a side effect. It cannot magically undo one.
Dead-letter queues matter, but less than people claim. They are containment tools, not correctness proofs. They stop poison work from consuming the hot path forever. They do not tell you whether the message already applied half a business effect before failing.
Observability changes too. Queue depth and lag are necessary. They are not enough. The better signals are duplicate-key hits in the idempotency store, the gap between external side-effect completion and broker completion, visibility-timeout extensions, redelivery concentration on hot keys, and replay age against the dedupe horizon.
The Production Reality
Production punishes imprecise guarantees faster than it punishes slow code.
Handlers run longer than the median. Providers succeed but the response arrives late enough to look like a failure. Channels close after the consumer thought it acked. Operators replay topics during recovery. Someone redrives the DLQ on a Friday. The expensive incidents are often the ones where the broker behaved correctly and the system still hurt users.
In production, retries are rarely isolated. They show up attached to slower databases, longer tail latencies, queue backlog, and operators changing the shape of the system mid-flight.
What saves you is not better vocabulary. It is scar tissue in the design. Stable operation IDs. Durable replay ledgers. Idempotency at the real side-effect boundary. Retry budgets that assume partial success is common enough to design around. Timeouts sized to the tail, not to the median.
There is also a phase where the dashboard looks healthy while correctness risk is already climbing. Throughput is holding. CPU is comfortable. Queue depth is flat enough. But visibility extensions are rising, the gap between side-effect completion and ack is widening, and dedupe hits are quietly increasing. Healthy transport graphs can coexist with a worsening business-correctness posture for longer than most teams expect.
The dedupe store deserves more respect than it usually gets. At small scale it looks like a correctness feature. Under sustained load it becomes a hot path. Hot keys, retention pressure, and regional failover can turn the replay ledger into the next ambiguity source if it is slow, inconsistent, or sized like an afterthought.
This is overkill unless duplicate or missing side effects are materially expensive. For rebuildable analytics, soft notifications, or derived caches, simpler at-least-once handling is often the right decision. For payment capture, shipment creation, entitlement changes, and compliance events, the heavier design is not overengineering. It is the first honest design.
Replays are cheap in the log and expensive in the world.
When You Need to Care vs When You Don’t
You need to care when the side effect is irreversible, externally visible, or expensive to compensate.
Money movement cares.
Inventory reservation cares.
Access grants and revocations care.
Emails and SMS often care more than teams admit, because duplicates erode trust even when they do not corrupt state.
You can care less when the effect is observational, rebuildable, or naturally convergent. Metrics, low-value notifications, derived search indexes, cache warms, and some enrichment pipelines can tolerate controlled loss or duplication better than they can tolerate the complexity bill of stronger replay safety.
What matters is the cost of ambiguity at the boundary where this message becomes real.
The guarantee statement that matters is the one that names boundaries correctly:
The broker promises at-least-once delivery to the consumer.
The consumer promises to record a durable operation identity before or with the business effect.
The database or downstream API promises that the same operation identity cannot be applied twice.
The product team does not get to assume “exactly-once” unless those statements line up.
Carry that standard into design reviews. Do not approve a guarantee statement unless it names the boundary where replay stops hurting, not merely the boundary where redelivery becomes unlikely.
That is the last useful test here. When a team says “the queue guarantees it,” ask one more question: which part of the system is still allowed to do the same thing twice? If nobody can answer that cleanly, the guarantee is not finished.