The real request path is longer than the diagram suggests.
A producer emits one event. The broker appends it to a partition leader. Replicas follow. Every subscriber group with an interest in that topic now has its own obligation to fetch, deserialize, process, and advance its position. After that, the event stops being “a message” and becomes whatever work each subscriber turns it into.
That is where the symmetry ends.
Small-scale example
Take an order.updated stream:
600 events per second
1.5 KB average payload
12 subscriber groups
replication factor 3
24 partitions
The producer writes about 900 KB per second.
The broker layer sees about 2.7 MB per second of replicated write traffic before overhead.
The subscriber side sees about 10.8 MB per second of logical reads.
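That arithmetic is worth making mechanical. A minimal back-of-envelope sketch in Python, using the example's numbers (the function name and units are mine, not a real API):

```python
# Hypothetical fan-out cost calculator. Inputs mirror the order.updated
# example above; output is bytes per second before protocol overhead.

def fanout_costs(events_per_sec, payload_bytes, subscriber_groups, replication):
    """Return (producer_write, replicated_write, logical_read) in bytes/sec."""
    producer_write = events_per_sec * payload_bytes
    replicated_write = producer_write * replication    # broker-side appends
    logical_read = producer_write * subscriber_groups  # one full read per group
    return producer_write, replicated_write, logical_read

write, replicated, reads = fanout_costs(
    events_per_sec=600, payload_bytes=1_500, subscriber_groups=12, replication=3
)
print(write / 1e3, "KB/s produced")         # 900.0 KB/s
print(replicated / 1e6, "MB/s replicated")  # 2.7 MB/s
print(reads / 1e6, "MB/s logical reads")    # 10.8 MB/s
```

The useful property of writing it down is that subscriber_groups multiplies the read side linearly while the producer side never moves.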
So far, still ordinary.
Now look at what the subscribers actually do:
notifications writes push intents
fraud reads features and writes decisions
search writes to an index
support writes timeline entries
analytics batches to object storage
compliance appends immutable records
The stream is shared. The work is not.
One subscriber may batch efficiently and stay cheap. Another may do two reads and one write per event against systems with very little burst tolerance. A topic that looks light at ingress can still be expensive to operate because the same event is being converted into very different kinds of downstream pressure.
Suppose fraud can sustain only 500 events per second during a downstream slowdown. It falls behind by 100 events per second.
That is 360,000 events of lag in one hour. In three hours, it is over one million. The payload volume is not the point. The point is that the fraud system is now making decisions on old state, and recovery will later require extra throughput while live traffic keeps arriving.
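Lag and recovery can be sketched the same way. The rates below are the fraud example's; the recovery ceiling of 700 events per second is an assumption for illustration:

```python
# Lag-as-time-debt sketch. "recovery_eps" is an assumed post-incident
# ceiling for the subscriber, not a measured number.

def backlog_after(ingress_eps, sustained_eps, seconds):
    """Events accumulated while the subscriber runs below ingress."""
    return max(0, ingress_eps - sustained_eps) * seconds

def catchup_seconds(backlog, recovery_eps, ingress_eps):
    """Time to drain a backlog while live traffic keeps arriving."""
    paydown = recovery_eps - ingress_eps
    if paydown <= 0:
        return float("inf")  # the subscriber never catches up
    return backlog / paydown

lag = backlog_after(ingress_eps=600, sustained_eps=500, seconds=3_600)
print(lag)  # 360000 events after one hour
print(catchup_seconds(lag, recovery_eps=700, ingress_eps=600))  # 3600.0 seconds
```

At these numbers, every hour spent behind costs another full hour of catch-up, and that is with 100 events per second of spare recovery capacity that the downstream systems must also tolerate.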
Pub/sub behaves politely right up until the first replay window lands during live traffic.
Large-scale example
Now take a customer activity stream:
20,000 events per second
4 KB average payload
150 subscriber groups
replication factor 3
256 partitions
18 cross-region subscriber groups
The producer writes about 80 MB per second.
The broker layer handles about 240 MB per second of replicated write traffic before overhead.
The subscriber side pulls about 12 GB per second of logical reads.
That is the number that changes the architecture. The producer is still doing one write per event. The platform is now supporting 150 independent delivery positions, 150 different processing profiles, and 150 separate ways to fall behind.
The useful way to read that topology is not “one stream with many consumers.” It is “one ingress surface feeding many different throughput ceilings.”
At that point, the first real question is not whether the broker can ingest 20,000 events per second. It is which subscriber class runs out of headroom first. Usually it is the one doing the most expensive side effect against the least forgiving dependency.
I have seen teams celebrate flat publish latency while search freshness was already slipping and replay math was already ugly. The producer was fine. The system was spending tomorrow’s headroom.
Under 100, 10,000, and 100,000 effective subscribers
Under 100 subscribers, most systems still have room to absorb bad habits. Broad topics, imperfect partitioning, and uneven consumers can survive longer than they should.
At 10,000 effective recipients, the unit of cost changes. A 2 KB event now implies roughly 20 MB of downstream payload movement. Ten such events per second implies about 200 MB per second of delivery load before retries and redelivery. Waste that was once background noise starts showing up in bills, lag, and recovery time.
At 100,000 effective recipients, the system is no longer doing direct fan-out in one stage, whether the design admits it or not. There will be derived topics, regional layers, filtered streams, or delivery services absorbing the next multiplication step. The question is no longer whether the broker accepted the write. It is where fan-out is allowed to materialize and where it must be cut down.
That is where pub/sub stops being a neat integration pattern and starts acting like distribution infrastructure.
Pub/sub usually scales cleanly just long enough to teach the wrong lesson.
A stream that works at 500 events per second with 10 subscribers often still works at 5,000. Teams conclude that the pattern scales well. What they usually missed is that rate growth is rarely the only growth. Subscriber count grows too. Consumer behavior diverges. The stream gets reused. Replay becomes normal. Lag stops being rare.
The shape of the system changes before the diagram does.
What changes at 10x
Start here:
500 events per second
10 subscriber groups
That is 5,000 group-deliveries per second.
Now move to this:
5,000 events per second
45 subscriber groups
That is 225,000 group-deliveries per second.
Traffic grew 10x. Delivery obligations grew 45x.
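The product that matters here is trivial to compute and easy to forget to plot:

```python
# Group-deliveries per second: the product that grows faster than the
# ingress graph. Numbers are the before/after scenario above.

def group_deliveries(events_per_sec, groups):
    return events_per_sec * groups

before = group_deliveries(500, 10)    # 5,000 deliveries/sec
after = group_deliveries(5_000, 45)   # 225,000 deliveries/sec
print(after / before)                 # 45.0 — from 10x traffic and 4.5x groups
```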
That is the transition that catches teams off guard. The producer graph rises by one order of magnitude. The system-wide work expands by much more because the stream has become more useful and more shared at the same time.
In pub/sub, 10x more ingress is often survivable. 10x more ingress plus 4.5x more subscribers is a different architecture.
This is where the first bottleneck starts to matter. It is usually not the broker. It is the first subscriber class whose per-event work is expensive and whose downstream dependency has no patience for burst or replay.
A fifteen-minute backlog is no longer a nuisance. It is stored recovery work. A replay is no longer admin cleanup. It is a second traffic phase.
By the time lag is obvious on the dashboard, the expensive part of the mistake has usually already happened.
Hot-topic skew
Skew is where average throughput planning stops being useful.
Suppose the 20,000 events per second topic has 256 partitions. Average load is about 78 events per second per partition.
That number is too comforting.
Now assume one live event, one merchant, or one region contributes 3,000 events per second concentrated on a few hot keys. If those keys land on 4 partitions, those partitions are now carrying around 750 events per second from that hot slice alone, before the rest of the stream is counted.
Every subscriber reading those partitions inherits that concentration. Lag appears there first. Freshness breaks there first. The business notices there first.
The partition that hurts first is rarely random.
This is why partition count is not throughput. It is potential parallelism, bounded by distribution. Extra partitions help only if the heat can actually spread. If ordering semantics pin hot keys to narrow lanes, the unused partitions are just empty road around the pileup.
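The skew arithmetic above fits in a few lines, and the gap between the average and the hot partition is the whole story:

```python
# Skew sketch for the numbers above: the average looks safe while the
# partitions holding the hot keys carry an order of magnitude more.

total_eps = 20_000
partitions = 256
hot_eps = 3_000      # one live event, merchant, or region
hot_partitions = 4   # where its keys happen to land

average = total_eps / partitions
hot_share = hot_eps / hot_partitions
baseline = (total_eps - hot_eps) / partitions  # rest of the stream, spread evenly

print(round(average))               # ~78 eps per partition "on average"
print(round(hot_share))             # 750 eps from the hot slice alone
print(round(hot_share + baseline))  # what a hot partition actually carries
```

The comforting 78 and the real ~816 are both true at the same time. Only one of them predicts where lag appears.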
Partition width and subscriber group sizing
Partition width and subscriber sizing have to be reasoned about together.
A topic with 128 partitions gives a consumer group at most 128 useful slots for concurrency. If the group runs 20 workers, most of that width is idle. If it runs 128 workers, it may fully consume the available partition parallelism but overwhelm the system it writes into. If it runs 300 workers, many of those workers are not helping.
That is the difference between broker-scale and system-scale in one sentence. The broker can expose concurrency. The rest of the architecture still has to survive it.
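A sketch of that joint constraint, with an assumed per-worker rate and an assumed downstream sink limit (both hypothetical numbers):

```python
# Effective concurrency sketch. Partition count caps useful workers;
# the downstream sink caps useful throughput. "sink_eps" is assumed.

def effective_parallelism(partitions, workers):
    """Workers beyond the partition count get no assignment."""
    return min(partitions, workers)

def effective_throughput(partitions, workers, per_worker_eps, sink_eps):
    """Group throughput is bounded by whichever limit binds first."""
    return min(effective_parallelism(partitions, workers) * per_worker_eps, sink_eps)

print(effective_parallelism(128, 20))   # 20  -> most partition width sits idle
print(effective_parallelism(128, 300))  # 128 -> 172 workers are not helping
print(effective_throughput(128, 128, per_worker_eps=50, sink_eps=4_000))  # 4000
```

In the last line the group could in principle process 6,400 events per second, but the sink it writes into binds first at 4,000. Adding workers past that point moves the bottleneck; it does not remove it.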
This is where many “we scaled it” stories go soft. Teams add partitions, then add subscriber workers, then discover that lag has moved from the broker-facing side to the database, search cluster, cache layer, or third-party dependency behind the subscriber. The system got wider. It did not necessarily get safer.
When one topic becomes infrastructure-critical
A topic becomes infrastructure-critical when too many systems quietly start assuming it will always be there, always be current enough, and always be replayable.
That can happen before raw throughput looks dramatic.
Once one stream feeds notifications, billing audit, search, ML features, analytics, compliance, and user-visible state, it is no longer just a topic. It is shared infrastructure with unequal consumers attached to it.
That changes the operating model.
Ownership hardens because someone now has to decide who may subscribe, how retention works, what replay is allowed to cost, and which schema changes are acceptable.
Isolation matters more because best-effort readers cannot be allowed to compete freely with freshness-critical readers on the same hot path.
Recovery stops being a team-local concern because replay on that stream is now a traffic event for a large part of the company.
The clean diagram still says “topic.” In production, it has become a platform artery.
Several scale questions get collapsed into one vague conversation about “pub/sub capacity.” They should not.
Publish acceptance is about whether the producer can write durably enough and fast enough.
Broker throughput is about whether the broker can append, replicate, retain, and serve.
Filtering is about whether irrelevant work happens at all.
Topic design is about which workloads are forced to coexist.
Partitioning is about where concentrated pain is allowed to land.
Subscriber throughput is about how much real work each consumer class can perform per second.
System-scale is about whether all of that can hold together under burst, skew, replay, and failure.
Filtering helps when it prevents real work from happening. It does not help much when every subscriber still reads the raw topic and discards most events locally.
Topic splitting helps when it separates materially different cost profiles or freshness contracts. It does not help when the same consumers still need all the new topics.
Partitioning helps when it creates usable concurrency. It does not help when the hottest keys still dominate or when the downstream sink saturates first.
The important question is always the same: did total work go down, or did the shape of the work merely move?
Pub/sub is often the right choice. It is just rarely the cheap choice people think it is.
The producer gets a cleaner path. Subscribers get independence. Replays become possible. Bursts can be buffered.
The cost is that coupling moves from code to capacity.
Broad topics maximize reuse and multiply fan-out.
Narrow topics improve isolation and increase operational sprawl.
More partitions improve concurrency and can worsen skew visibility, rebalance overhead, and downstream contention.
More consumers improve steady-state throughput until recovery arrives and proves the rest of the system was never sized for replay.
There is no free version of this pattern. There is only a choice about where multiplication occurs and who has to survive it.
Failure Modes
Pub/sub failures usually begin away from the publish path.
The write succeeds. The amplification fails somewhere downstream.
That is why the expensive failure shapes are almost always about concentration, lag, replay, retry, or bad payloads being multiplied across many readers.
One hot topic saturates a few partitions while other topics look fine
This is the failure shape that cluster-level dashboards hide best.
One topic goes hot. A few keys dominate. A small number of partitions start carrying much more than their share of the work. Every subscriber group assigned to those partitions inherits the hotspot while the rest of the cluster can still look ordinary.
Early signal. A handful of partitions show much higher append rate, fetch volume, or lag growth than the rest. Several subscriber groups begin lagging on the same partition range.
What the dashboard shows first. Broad cluster health often still looks fine. Producer success is clean. Overall ingress is acceptable. The first honest dashboard is partition-local: hottest partitions, bytes per partition, lag by partition, and fetch imbalance.
What is actually breaking first. Effective parallelism breaks first. On paper the topic may have 128 or 256 partitions. In practice the hot keys have collapsed the important work onto a few narrow lanes. The business sees the hottest slice degrade while average broker health still looks safe.
Immediate containment. Protect the critical subscribers first. Pause or slow non-critical consumers that are competing for the same hot partitions. If the system allows it, isolate the hottest tenant or event family rather than letting it keep poisoning shared headroom.
Durable fix. Revisit key choice, topic boundaries, and where fan-out is being materialized. Adding partitions alone is not enough if the same keys still pin the heat to the same places.
Longer-term prevention. Capacity planning for pub/sub has to ask where skew can appear and how ugly it can get. Average throughput planning is not enough.
One lagging subscriber turns backlog retention into a platform problem
A lagging subscriber looks local at first. It is not.
One important consumer falling behind changes retention pressure, recovery cost, and future replay traffic for the platform. What is being stored now is throughput the system will later have to serve.
Early signal. Lag grows faster than the subscriber can plausibly repay. Age of oldest unprocessed event rises. Estimated catch-up time stretches. Retention usage climbs even though ingress has not changed much.
What the dashboard shows first. The obvious graph is consumer lag. The useful ones are lag growth rate, oldest-event age, projected recovery duration, and retention consumption.
What is actually breaking first. Freshness breaks first. A lagging subscriber stops representing present state long before storage becomes critical. If that subscriber powers user-visible or correctness-sensitive behavior, the incident is already real.
Immediate containment. Decide whether that subscriber should still be taking live traffic. Sometimes the right move is to pause it and preserve backlog rather than let it keep spending platform headroom badly. Sometimes the right move is to degrade its work: fewer enrichments, fewer writes, less per-event cost.
Durable fix. Fix the downstream bottleneck. It is often a database write path, an enrichment dependency, or a state store that cannot absorb replay plus live load.
Longer-term prevention. Treat lag as time debt with a declared budget. Important subscribers need explicit freshness targets and replay budgets, not vague hopes that they will “catch up later.”
A subscriber deployment triggers replay or catch-up pressure
A deployment does not need to fully break a subscriber to hurt the platform. A partial drop in effective throughput is enough.
Commits slow down. Workers restart more. Lag starts to climb. Now normal traffic is being converted into backlog, and recovery later will have to process both backlog and live ingress at the same time.
Early signal. Restart count rises. Commit progress slows. Rebalance frequency increases. Processed-events-per-second drops without a corresponding change in topic ingress.
What the dashboard shows first. Teams often first notice churn: worker instability, commit slowdown, or throughput collapse after rollout. Lag follows.
What is actually breaking first. Headroom breaks first. The subscriber may still be processing some traffic, but it is no longer preserving enough spare capacity to survive replay.
Immediate containment. Roll back fast, then control catch-up. A recovered subscriber running flat out can create the second incident by hammering the same downstream systems that were already short on margin.
Durable fix. Subscriber rollout safety needs to be judged by throughput preservation and lag slope, not just crash rate or pod health.
Longer-term prevention. Hot-topic subscribers need canary rules that look at delivery behavior under real load. A subscriber attached to shared infrastructure is not just another stateless deployment.
Publishers look healthy while subscribers accumulate invisible debt
This is the most common lie in pub/sub operations.
The producer is healthy. The brokers are healthy. The stream is live. Several important subscribers are current only because nothing unusual has happened yet.
Early signal. Subscriber throughput is tracking ingress too closely. Commit latency stretches. Downstream dependency saturation rises. End-to-end freshness drifts upward before obvious lag appears.
What the dashboard shows first. The main dashboard often stays green. Publish success, append latency, and broker health do not move much. The first useful signals are subscriber headroom, oldest fully processed event age, and backlog paydown capacity.
What is actually breaking first. Recovery posture breaks first. A subscriber that is technically current but running with no spare capacity is already in a dangerous state.
Immediate containment. Reduce optional work in expensive subscribers. Pause low-priority readers if necessary. Protect the consumers with hard freshness requirements before visible lag forces the choice.
Durable fix. Capacity planning has to move from broker-scale to system-scale. Sustainable throughput for important subscriber classes has to be measured under live plus recovery conditions, not only under happy-path averages.
Longer-term prevention. Make headroom explicit. Alert on shrinking margin, not only on obvious lag.
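One way to make that margin explicit, sketched with an assumed measured ceiling for the subscriber class and an arbitrary 30% threshold:

```python
# Headroom sketch: alert on shrinking margin, not only on visible lag.
# "sustainable_eps" is an assumed measured ceiling; 0.3 is an example budget.

def headroom(ingress_eps, sustainable_eps):
    """Fraction of subscriber capacity left over after live traffic."""
    return 1 - ingress_eps / sustainable_eps

def margin_alert(ingress_eps, sustainable_eps, min_headroom=0.3):
    """A subscriber below min_headroom cannot absorb burst or replay."""
    return headroom(ingress_eps, sustainable_eps) < min_headroom

print(round(headroom(600, 750), 2))  # 0.2 -> "current", but already in danger
print(margin_alert(600, 750))        # True, long before lag is visible
```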
Retry or redelivery multiplies downstream load
Retry is where a local error turns into a traffic shape.
One transient failure becomes several attempts. Each attempt may reread the event, redo the same expensive work, and hit the same downstream dependency that was already struggling.
Early signal. Retry count rises. Redelivery rate rises. Downstream request rate grows faster than committed progress. The ratio between attempted work and completed work gets worse.
What the dashboard shows first. Subscriber errors or downstream saturation usually show up first. The more revealing number is attempted deliveries per committed event.
What is actually breaking first. Dependency headroom breaks first. Redelivery takes a system that was already in trouble and feeds it more of the same work.
Immediate containment. Stop multiplying failure. Back off harder. Add jitter. Trip circuit breakers. Move poison work out of the hot path.
Durable fix. Make sinks idempotent and consumers replay-aware. Repeated delivery should not recreate full downstream cost every time.
Longer-term prevention. Retry policy on a hot topic is traffic-shaping policy. It should be reviewed that way.
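A minimal version of the "back off harder, add jitter" move is exponential backoff with full jitter. The base and cap below are illustrative, not recommendations:

```python
# Full-jitter backoff sketch. Each retry waits a random amount in
# [0, min(cap, base * 2^attempt)], which spreads redelivery pressure
# instead of synchronizing it against a struggling dependency.
import random

def backoff_delay(attempt, base=0.2, cap=30.0):
    """Seconds to wait before redelivery attempt number `attempt`."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# A consumer loop would sleep backoff_delay(n) before attempt n, and trip
# a circuit breaker after a bounded number of attempts instead of looping.
for attempt in range(5):
    print(f"attempt {attempt}: sleep up to {min(30.0, 0.2 * 2 ** attempt):.1f}s")
```

The jitter is the point: without it, every failed consumer retries on the same schedule and the dependency sees its load arrive in synchronized waves.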
One malformed event fans out bad work across many consumers
This is the cleanest example of amplification doing damage.
A bad payload gets published once and then consumed many times. One subscriber fails deserialization. Another retries an impossible write. Another emits corrupted derived state. Another dead-letters correctly but still burns capacity doing so.
The same bad event now exists in many failure paths.
Early signal. Several unrelated consumers begin erroring on the same topic within a short interval. Dead-letter volume rises across multiple groups at once.
What the dashboard shows first. The clue is correlation across consumers, not broker distress. Publish success often stays high because the malformed event was stored successfully.
What is actually breaking first. Useful throughput breaks first. Consumer fleets start spending capacity proving the event is bad instead of processing good work.
Immediate containment. Quarantine the bad payload quickly. Stop new copies if the producer is still emitting them. Route around known-bad offset ranges if the business can tolerate it.
Durable fix. Strengthen producer validation and schema discipline. Transport-level validity is not enough.
Longer-term prevention. Assume bad data will be amplified like good data. Build schema enforcement and poison-message containment accordingly.
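A producer-side validation gate can be as simple as the sketch below. The schema and field names are hypothetical; the point is that one rejection at publish time replaces many failures after fan-out:

```python
# Minimal producer-side validation sketch. REQUIRED is a hypothetical
# schema; real systems would use a schema registry, but the shape holds.

REQUIRED = {"order_id": str, "status": str, "updated_at": str}

def validate(event: dict) -> list[str]:
    """Return a list of problems; an empty list means publishable."""
    problems = []
    for field, ftype in REQUIRED.items():
        if field not in event:
            problems.append(f"missing {field}")
        elif not isinstance(event[field], ftype):
            problems.append(f"bad type for {field}")
    return problems

good = {"order_id": "o-1", "status": "shipped", "updated_at": "2024-01-01T00:00:00Z"}
bad = {"order_id": 42}
print(validate(good))  # [] -> safe to publish
print(validate(bad))   # problems -> dead-letter once, before amplification
```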
The common pattern is simple. The publish succeeds. The system fails at the point where one logical event has already been multiplied into more work than one layer can absorb.
When To Use / When NOT To Use
Pub/sub is worth its price when the same fact has durable value to many independently owned systems and the organization is willing to own the lag, replay, and recovery surface that comes with that choice.
It is a poor trade when fan-out is mostly ceremonial, when the same few systems still define correctness synchronously, or when nobody is prepared to govern a hot stream once it becomes common infrastructure.
Not every consumer needs direct access to the raw firehose. Some should read derived topics. Some should read materialized state. Some should get batch exports. Direct subscription is a privilege. On the hottest streams, it is also a cost decision.
Senior engineers stop asking whether the broker can take the write and start asking what the write commits everyone else to doing.
They count deliveries, not publishes.
They model 10x transitions in delivery obligations, not only ingress.
They look for the first bottleneck, not the loudest component.
They assume skew will be ugly, retries will multiply pain, and replay will arrive at the least convenient time.
They know the difference between a stream that is merely popular and a stream that has become infrastructure.
Most of all, they stop being seduced by the clean diagram.
Because the clean diagram is true in only one narrow sense: the producer emitted one event.
Everything after that is cost, concentration, and recovery. That is the real shape of pub/sub at scale. Once a stream becomes important enough, you are no longer operating an elegant decoupling pattern. You are operating a distribution system with uneven subscribers, finite headroom, and very expensive mistakes.
That is when the question changes.
Not “did the publish succeed?”
“What did that publish just ask the rest of the system to survive?”