The real request path is longer than the diagram suggests.
A producer emits one event. The broker appends it to a partition leader. Replicas follow. Every subscriber group with an interest in that topic now has its own obligation to fetch, deserialize, process, and advance its position. After that, the event stops being “a message” and becomes whatever work each subscriber turns it into.
That is where the symmetry ends.
Small-scale example
Take an order.updated stream:
600 events per second
1.5 KB average payload
12 subscriber groups
replication factor 3
24 partitions
The producer writes about 900 KB per second.
The broker layer sees about 2.7 MB per second of replicated write traffic before overhead.
The subscriber side sees about 10.8 MB per second of logical reads.
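That arithmetic is worth making mechanical. A minimal back-of-envelope sketch in Python, using the example's numbers (the function name and units are mine, not a real API):

```python
# Hypothetical fan-out cost calculator. Inputs mirror the order.updated
# example above; output is bytes per second before protocol overhead.

def fanout_costs(events_per_sec, payload_bytes, subscriber_groups, replication):
    """Return (producer_write, replicated_write, logical_read) in bytes/sec."""
    producer_write = events_per_sec * payload_bytes
    replicated_write = producer_write * replication    # broker-side appends
    logical_read = producer_write * subscriber_groups  # one full read per group
    return producer_write, replicated_write, logical_read

write, replicated, reads = fanout_costs(
    events_per_sec=600, payload_bytes=1_500, subscriber_groups=12, replication=3
)
print(write / 1e3, "KB/s produced")         # 900.0 KB/s
print(replicated / 1e6, "MB/s replicated")  # 2.7 MB/s
print(reads / 1e6, "MB/s logical reads")    # 10.8 MB/s
```

The useful property of writing it down is that subscriber_groups multiplies the read side linearly while the producer side never moves.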
So far, still ordinary.
Now look at what the subscribers actually do:
notifications writes push intents
fraud reads features and writes decisions
search writes to an index
support writes timeline entries
analytics batches to object storage
compliance appends immutable records
The stream is shared. The work is not.
One subscriber may batch efficiently and stay cheap. Another may do two reads and one write per event against systems with very little burst tolerance. A topic that looks light at ingress can still be expensive to operate because the same event is being converted into very different kinds of downstream pressure.
Suppose fraud can sustain only 500 events per second during a downstream slowdown. It falls behind by 100 events per second.
That is 360,000 events of lag in one hour. In three hours, it is over one million. The payload volume is not the point. The point is that the fraud system is now making decisions on old state, and recovery will later require extra throughput while live traffic keeps arriving.
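Lag and recovery can be sketched the same way. The rates below are the fraud example's; the recovery ceiling of 700 events per second is an assumption for illustration:

```python
# Lag-as-time-debt sketch. "recovery_eps" is an assumed post-incident
# ceiling for the subscriber, not a measured number.

def backlog_after(ingress_eps, sustained_eps, seconds):
    """Events accumulated while the subscriber runs below ingress."""
    return max(0, ingress_eps - sustained_eps) * seconds

def catchup_seconds(backlog, recovery_eps, ingress_eps):
    """Time to drain a backlog while live traffic keeps arriving."""
    paydown = recovery_eps - ingress_eps
    if paydown <= 0:
        return float("inf")  # the subscriber never catches up
    return backlog / paydown

lag = backlog_after(ingress_eps=600, sustained_eps=500, seconds=3_600)
print(lag)  # 360000 events after one hour
print(catchup_seconds(lag, recovery_eps=700, ingress_eps=600))  # 3600.0 seconds
```

At these numbers, every hour spent behind costs another full hour of catch-up, and that is with 100 events per second of spare recovery capacity that the downstream systems must also tolerate.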
Pub/sub behaves politely right up until the first replay window lands during live traffic.
Large-scale example
Now take a customer activity stream:
20,000 events per second
4 KB average payload
150 subscriber groups
replication factor 3
256 partitions
18 cross-region subscriber groups
The producer writes about 80 MB per second.
The broker layer handles about 240 MB per second of replicated write traffic before overhead.
The subscriber side pulls about 12 GB per second of logical reads.
That is the number that changes the architecture. The producer is still doing one write per event. The platform is now supporting 150 independent delivery positions, 150 different processing profiles, and 150 separate ways to fall behind.
The useful way to read that topology is not “one stream with many consumers.” It is “one ingress surface feeding many different throughput ceilings.”
At that point, the first real question is not whether the broker can ingest 20,000 events per second. It is which subscriber class runs out of headroom first. Usually it is the one doing the most expensive side effect against the least forgiving dependency.
I have seen teams celebrate flat publish latency while search freshness was already slipping and replay math was already ugly. The producer was fine. The system was spending tomorrow’s headroom.
Under 100, 10,000, and 100,000 effective subscribers
Under 100 subscribers, most systems still have room to absorb bad habits. Broad topics, imperfect partitioning, and uneven consumers can survive longer than they should.
At 10,000 effective recipients, the unit of cost changes. A 2 KB event now implies roughly 20 MB of downstream payload movement. Ten such events per second implies about 200 MB per second of delivery load before retries and redelivery. Waste that was once background noise starts showing up in bills, lag, and recovery time.
At 100,000 effective recipients, the system is no longer doing direct fan-out in one stage, whether the design admits it or not. There will be derived topics, regional layers, filtered streams, or delivery services absorbing the next multiplication step. The question is no longer whether the broker accepted the write. It is where fan-out is allowed to materialize and where it must be cut down.
That is where pub/sub stops being a neat integration pattern and starts acting like distribution infrastructure.
Pub/sub usually scales cleanly just long enough to teach the wrong lesson.
A stream that works at 500 events per second with 10 subscribers often still works at 5,000. Teams conclude that the pattern scales well. What they usually missed is that rate growth is rarely the only growth. Subscriber count grows too. Consumer behavior diverges. The stream gets reused. Replay becomes normal. Lag stops being rare.
The shape of the system changes before the diagram does.
What changes at 10x
Start here:
500 events per second
10 subscriber groups
That is 5,000 group-deliveries per second.
Now move to this:
5,000 events per second
45 subscriber groups
That is 225,000 group-deliveries per second.
Traffic grew 10x. Delivery obligations grew 45x.
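The product that matters here is trivial to compute and easy to forget to plot:

```python
# Group-deliveries per second: the product that grows faster than the
# ingress graph. Numbers are the before/after scenario above.

def group_deliveries(events_per_sec, groups):
    return events_per_sec * groups

before = group_deliveries(500, 10)    # 5,000 deliveries/sec
after = group_deliveries(5_000, 45)   # 225,000 deliveries/sec
print(after / before)                 # 45.0 — from 10x traffic and 4.5x groups
```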
That is the transition that catches teams off guard. The producer graph rises by one order of magnitude. The system-wide work expands by much more because the stream has become more useful and more shared at the same time.
In pub/sub, 10x more ingress is often survivable. 10x more ingress plus 4.5x more subscribers is a different architecture.
This is where the first bottleneck starts to matter. It is usually not the broker. It is the first subscriber class whose per-event work is expensive and whose downstream dependency has no patience for burst or replay.
A fifteen-minute backlog is no longer a nuisance. It is stored recovery work. A replay is no longer admin cleanup. It is a second traffic phase.
By the time lag is obvious on the dashboard, the expensive part of the mistake has usually already happened.
Hot-topic skew
Skew is where average throughput planning stops being useful.
Suppose the 20,000 events per second topic has 256 partitions. Average load is about 78 events per second per partition.
That number is too comforting.
Now assume one live event, one merchant, or one region contributes 3,000 events per second concentrated on a few hot keys. If those keys land on 4 partitions, those partitions are now carrying around 750 events per second from that hot slice alone, before the rest of the stream is counted.
Every subscriber reading those partitions inherits that concentration. Lag appears there first. Freshness breaks there first. The business notices there first.
The partition that hurts first is rarely random.
This is why partition count is not throughput. It is potential parallelism, bounded by distribution. Extra partitions help only if the heat can actually spread. If ordering semantics pin hot keys to narrow lanes, the unused partitions are just empty road around the pileup.
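The skew arithmetic above fits in a few lines, and the gap between the average and the hot partition is the whole story:

```python
# Skew sketch for the numbers above: the average looks safe while the
# partitions holding the hot keys carry an order of magnitude more.

total_eps = 20_000
partitions = 256
hot_eps = 3_000      # one live event, merchant, or region
hot_partitions = 4   # where its keys happen to land

average = total_eps / partitions
hot_share = hot_eps / hot_partitions
baseline = (total_eps - hot_eps) / partitions  # rest of the stream, spread evenly

print(round(average))               # ~78 eps per partition "on average"
print(round(hot_share))             # 750 eps from the hot slice alone
print(round(hot_share + baseline))  # what a hot partition actually carries
```

The comforting 78 and the real ~816 are both true at the same time. Only one of them predicts where lag appears.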
Partition width and subscriber group sizing
Partition width and subscriber sizing have to be reasoned about together.
A topic with 128 partitions gives a consumer group at most 128 useful slots for concurrency. If the group runs 20 workers, most of that width is idle. If it runs 128 workers, it may fully consume the available partition parallelism but overwhelm the system it writes into. If it runs 300 workers, many of those workers are not helping.
That is the difference between broker-scale and system-scale in one sentence. The broker can expose concurrency. The rest of the architecture still has to survive it.
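A sketch of that joint constraint, with an assumed per-worker rate and an assumed downstream sink limit (both hypothetical numbers):

```python
# Effective concurrency sketch. Partition count caps useful workers;
# the downstream sink caps useful throughput. "sink_eps" is assumed.

def effective_parallelism(partitions, workers):
    """Workers beyond the partition count get no assignment."""
    return min(partitions, workers)

def effective_throughput(partitions, workers, per_worker_eps, sink_eps):
    """Group throughput is bounded by whichever limit binds first."""
    return min(effective_parallelism(partitions, workers) * per_worker_eps, sink_eps)

print(effective_parallelism(128, 20))   # 20  -> most partition width sits idle
print(effective_parallelism(128, 300))  # 128 -> 172 workers are not helping
print(effective_throughput(128, 128, per_worker_eps=50, sink_eps=4_000))  # 4000
```

In the last line the group could in principle process 6,400 events per second, but the sink it writes into binds first at 4,000. Adding workers past that point moves the bottleneck; it does not remove it.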
This is where many “we scaled it” stories go soft. Teams add partitions, then add subscriber workers, then discover that lag has moved from the broker-facing side to the database, search cluster, cache layer, or third-party dependency behind the subscriber. The system got wider. It did not necessarily get safer.
When one topic becomes infrastructure-critical
A topic becomes infrastructure-critical when too many systems quietly start assuming it will always be there, always be current enough, and always be replayable.
That can happen before raw throughput looks dramatic.
Once one stream feeds notifications, billing audit, search, ML features, analytics, compliance, and user-visible state, it is no longer just a topic. It is shared infrastructure with unequal consumers attached to it.
That changes the operating model.
Ownership hardens because someone now has to decide who may subscribe, how retention works, what replay is allowed to cost, and which schema changes are acceptable.
Isolation matters more because best-effort readers cannot be allowed to compete freely with freshness-critical readers on the same hot path.
Recovery stops being a team-local concern because replay on that stream is now a traffic event for a large part of the company.
The clean diagram still says “topic.” In production, it has become a platform artery.
Several scale questions get collapsed into one vague conversation about “pub/sub capacity.” They should not.
Publish acceptance is about whether the producer can write durably enough and fast enough.
Broker throughput is about whether the broker can append, replicate, retain, and serve.
Filtering is about whether irrelevant work happens at all.
Topic design is about which workloads are forced to coexist.
Partitioning is about where concentrated pain is allowed to land.
Subscriber throughput is about how much real work each consumer class can perform per second.
System-scale is about whether all of that can hold together under burst, skew, replay, and failure.
Filtering helps when it prevents real work from happening. It does not help much when every subscriber still reads the raw topic and discards most events locally.
Topic splitting helps when it separates materially different cost profiles or freshness contracts. It does not help when the same consumers still need all the new topics.
Partitioning helps when it creates usable concurrency. It does not help when the hottest keys still dominate or when the downstream sink saturates first.
The important question is always the same: did total work go down, or did the shape of the work merely move?
Pub/sub is often the right choice. It is just rarely the cheap choice people think it is.
The producer gets a cleaner path. Subscribers get independence. Replays become possible. Bursts can be buffered.
The cost is that coupling moves from code to capacity.
Broad topics maximize reuse and multiply fan-out.
Narrow topics improve isolation and increase operational sprawl.
More partitions improve concurrency and can worsen skew visibility, rebalance overhead, and downstream contention.
More consumers improve steady-state throughput until recovery arrives and proves the rest of the system was never sized for replay.
There is no free version of this pattern. There is only a choice about where multiplication occurs and who has to survive it.
Failure Modes
Pub/sub failures usually begin away from the publish path.
The write succeeds. The amplification fails somewhere downstream.
That is why the expensive failure shapes are almost always about concentration, lag, replay, retry, or bad payloads being multiplied across many readers.
One hot topic saturates a few partitions while other topics look fine
This is the failure shape that cluster-level dashboards hide best.
One topic goes hot. A few keys dominate. A small number of partitions start carrying much more than their share of the work. Every subscriber group assigned to those partitions inherits the hotspot while the rest of the cluster can still look ordinary.
Early signal. A handful of partitions show much higher append rate, fetch volume, or lag growth than the rest. Several subscriber groups begin lagging on the same partition range.
What the dashboard shows first. Broad cluster health often still looks fine. Producer success is clean. Overall ingress is acceptable. The first honest dashboard is partition-local: hottest partitions, bytes per partition, lag by partition, and fetch imbalance.
What is actually breaking first. Effective parallelism breaks first. On paper the topic may have 128 or 256 partitions. In practice the hot keys have collapsed the important work onto a few narrow lanes. The business sees the hottest slice degrade while average broker health still looks safe.
Immediate containment. Protect the critical subscribers first. Pause or slow non-critical consumers that are competing for the same hot partitions. If the system allows it, isolate the hottest tenant or event family rather than letting it keep poisoning shared headroom.
Durable fix. Revisit key choice, topic boundaries, and where fan-out is being materialized. Adding partitions alone is not enough if the same keys still pin the heat to the same places.
Longer-term prevention. Capacity planning for pub/sub has to ask where skew can appear and how ugly it can get. Average throughput planning is not enough.
One lagging subscriber turns backlog retention into a platform problem
A lagging subscriber looks local at first. It is not.
One important consumer falling behind changes retention pressure, recovery cost, and future replay traffic for the platform. What is being stored now is throughput the system will later have to serve.
Early signal. Lag grows faster than the subscriber can plausibly repay. Age of oldest unprocessed event rises. Estimated catch-up time stretches. Retention usage climbs even though ingress has not changed much.
What the dashboard shows first. The obvious graph is consumer lag. The useful ones are lag growth rate, oldest-event age, projected recovery duration, and retention consumption.
What is actually breaking first. Freshness breaks first. A lagging subscriber stops representing present state long before storage becomes critical. If that subscriber powers user-visible or correctness-sensitive behavior, the incident is already real.
Immediate containment. Decide whether that subscriber should still be taking live traffic. Sometimes the right move is to pause it and preserve backlog rather than let it keep spending platform headroom badly. Sometimes the right move is to degrade its work: fewer enrichments, fewer writes, less per-event cost.
Durable fix. Fix the downstream bottleneck. It is often a database write path, an enrichment dependency, or a state store that cannot absorb replay plus live load.
Longer-term prevention. Treat lag as time debt with a declared budget. Important subscribers need explicit freshness targets and replay budgets, not vague hopes that they will “catch up later.”
A subscriber deployment triggers replay or catch-up pressure
A deployment does not need to fully break a subscriber to hurt the platform. A partial drop in effective throughput is enough.
Commits slow down. Workers restart more. Lag starts to climb. Now normal traffic is being converted into backlog, and recovery later will have to process both backlog and live ingress at the same time.
Early signal. Restart count rises. Commit progress slows. Rebalance frequency increases. Processed-events-per-second drops without a corresponding change in topic ingress.
What the dashboard shows first. Teams often first notice churn: worker instability, commit slowdown, or throughput collapse after rollout. Lag follows.
What is actually breaking first. Headroom breaks first. The subscriber may still be processing some traffic, but it is no longer preserving enough spare capacity to survive replay.
Immediate containment. Roll back fast, then control catch-up. A recovered subscriber running flat out can create the second incident by hammering the same downstream systems that were already short on margin.
Durable fix. Subscriber rollout safety needs to be judged by throughput preservation and lag slope, not just crash rate or pod health.
Longer-term prevention. Hot-topic subscribers need canary rules that look at delivery behavior under real load. A subscriber attached to shared infrastructure is not just another stateless deployment.
Publishers look healthy while subscribers accumulate invisible debt
This is the most common lie in pub/sub operations.
The producer is healthy. The brokers are healthy. The stream is live. Several important subscribers are current only because nothing unusual has happened yet.
Early signal. Subscriber throughput is tracking ingress too closely. Commit latency stretches. Downstream dependency saturation rises. End-to-end freshness drifts upward before obvious lag appears.
What the dashboard shows first. The main dashboard often stays green. Publish success, append latency, and broker health do not move much. The first useful signals are subscriber headroom, oldest fully processed event age, and backlog paydown capacity.
What is actually breaking first. Recovery posture breaks first. A subscriber that is technically current but running with no spare capacity is already in a dangerous state.
Immediate containment. Reduce optional work in expensive subscribers. Pause low-priority readers if necessary. Protect the consumers with hard freshness requirements before visible lag forces the choice.
Durable fix. Capacity planning has to move from broker-scale to system-scale. Sustainable throughput for important subscriber classes has to be measured under live plus recovery conditions, not only under happy-path averages.
Longer-term prevention. Make headroom explicit. Alert on shrinking margin, not only on obvious lag.
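One way to make that margin explicit, sketched with an assumed measured ceiling for the subscriber class and an arbitrary 30% threshold:

```python
# Headroom sketch: alert on shrinking margin, not only on visible lag.
# "sustainable_eps" is an assumed measured ceiling; 0.3 is an example budget.

def headroom(ingress_eps, sustainable_eps):
    """Fraction of subscriber capacity left over after live traffic."""
    return 1 - ingress_eps / sustainable_eps

def margin_alert(ingress_eps, sustainable_eps, min_headroom=0.3):
    """A subscriber below min_headroom cannot absorb burst or replay."""
    return headroom(ingress_eps, sustainable_eps) < min_headroom

print(round(headroom(600, 750), 2))  # 0.2 -> "current", but already in danger
print(margin_alert(600, 750))        # True, long before lag is visible
```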
Retry or redelivery multiplies downstream load
Retry is where a local error turns into a traffic shape.
One transient failure becomes several attempts. Each attempt may reread the event, redo the same expensive work, and hit the same downstream dependency that was already struggling.
Early signal. Retry count rises. Redelivery rate rises. Downstream request rate grows faster than committed progress. The ratio between attempted work and completed work gets worse.
What the dashboard shows first. Subscriber errors or downstream saturation usually show up first. The more revealing number is attempted deliveries per committed event.
What is actually breaking first. Dependency headroom breaks first. Redelivery takes a system that was already in trouble and feeds it more of the same work.
Immediate containment. Stop multiplying failure. Back off harder. Add jitter. Trip circuit breakers. Move poison work out of the hot path.
Durable fix. Make sinks idempotent and consumers replay-aware. Repeated delivery should not recreate full downstream cost every time.
Longer-term prevention. Retry policy on a hot topic is traffic-shaping policy. It should be reviewed that way.
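A minimal version of the "back off harder, add jitter" move is exponential backoff with full jitter. The base and cap below are illustrative, not recommendations:

```python
# Full-jitter backoff sketch. Each retry waits a random amount in
# [0, min(cap, base * 2^attempt)], which spreads redelivery pressure
# instead of synchronizing it against a struggling dependency.
import random

def backoff_delay(attempt, base=0.2, cap=30.0):
    """Seconds to wait before redelivery attempt number `attempt`."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# A consumer loop would sleep backoff_delay(n) before attempt n, and trip
# a circuit breaker after a bounded number of attempts instead of looping.
for attempt in range(5):
    print(f"attempt {attempt}: sleep up to {min(30.0, 0.2 * 2 ** attempt):.1f}s")
```

The jitter is the point: without it, every failed consumer retries on the same schedule and the dependency sees its load arrive in synchronized waves.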
One malformed event fans out bad work across many consumers
This is the cleanest example of amplification doing damage.
A bad payload gets published once and then consumed many times. One subscriber fails deserialization. Another retries an impossible write. Another emits corrupted derived state. Another dead-letters correctly but still burns capacity doing so.
The same bad event now exists in many failure paths.
Early signal. Several unrelated consumers begin erroring on the same topic within a short interval. Dead-letter volume rises across multiple groups at once.
What the dashboard shows first. The clue is correlation across consumers, not broker distress. Publish success often stays high because the malformed event was stored successfully.
What is actually breaking first. Useful throughput breaks first. Consumer fleets start spending capacity proving the event is bad instead of processing good work.
Immediate containment. Quarantine the bad payload quickly. Stop new copies if the producer is still emitting them. Route around known-bad offset ranges if the business can tolerate it.
Durable fix. Strengthen producer validation and schema discipline. Transport-level validity is not enough.
Longer-term prevention. Assume bad data will be amplified like good data. Build schema enforcement and poison-message containment accordingly.
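A producer-side validation gate can be as simple as the sketch below. The schema and field names are hypothetical; the point is that one rejection at publish time replaces many failures after fan-out:

```python
# Minimal producer-side validation sketch. REQUIRED is a hypothetical
# schema; real systems would use a schema registry, but the shape holds.

REQUIRED = {"order_id": str, "status": str, "updated_at": str}

def validate(event: dict) -> list[str]:
    """Return a list of problems; an empty list means publishable."""
    problems = []
    for field, ftype in REQUIRED.items():
        if field not in event:
            problems.append(f"missing {field}")
        elif not isinstance(event[field], ftype):
            problems.append(f"bad type for {field}")
    return problems

good = {"order_id": "o-1", "status": "shipped", "updated_at": "2024-01-01T00:00:00Z"}
bad = {"order_id": 42}
print(validate(good))  # [] -> safe to publish
print(validate(bad))   # problems -> dead-letter once, before amplification
```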
The common pattern is simple. The publish succeeds. The system fails at the point where one logical event has already been multiplied into more work than one layer can absorb.
When To Use / When NOT To Use
Pub/sub is worth its price when the same fact has durable value to many independently owned systems and the organization is willing to own the lag, replay, and recovery surface that comes with that choice.
It is a poor trade when fan-out is mostly ceremonial, when the same few systems still define correctness synchronously, or when nobody is prepared to govern a hot stream once it becomes common infrastructure.
Not every consumer needs direct access to the raw firehose. Some should read derived topics. Some should read materialized state. Some should get batch exports. Direct subscription is a privilege. On the hottest streams, it is also a cost decision.
Senior engineers stop asking whether the broker can take the write and start asking what the write commits everyone else to doing.
They count deliveries, not publishes.
They model 10x transitions in delivery obligations, not only ingress.
They look for the first bottleneck, not the loudest component.
They assume skew will be ugly, retries will multiply pain, and replay will arrive at the least convenient time.
They know the difference between a stream that is merely popular and a stream that has become infrastructure.
Most of all, they stop being seduced by the clean diagram.
Because the clean diagram is true in only one narrow sense: the producer emitted one event.
Everything after that is cost, concentration, and recovery. That is the real shape of pub/sub at scale. Once a stream becomes important enough, you are no longer operating an elegant decoupling pattern. You are operating a distribution system with uneven subscribers, finite headroom, and very expensive mistakes.
That is when the question changes.
Not “did the publish succeed?”
“What did that publish just ask the rest of the system to survive?”