Outbox, Inbox, and the Truth-Boundary Decision That Defines Everything
Most reliability content spends too much time on tables, brokers, and retry loops, and not enough time on the real questions: what is durably true at each boundary, who pays for propagation lag when publication stalls, and what happens when replay collides with live traffic at the worst possible moment?
Core insight: Outbox and inbox are not primarily messaging patterns. They are truth-boundary patterns.
Outbox says that if a local transition is durably true, the intent to publish it must become durably true in the same commit. Inbox says that if a consumer acts on a message, the fact that it already acted must survive retries, crashes, and redelivery.
That is useful, and narrower than many teams want it to be.
The separation that matters is this: producer truth, publication truth, and consumer truth are different states. Outbox strengthens the first boundary and helps with the second. It does not certify the third.
Outbox reduces producer-side dual-write ambiguity. Inbox reduces consumer-side duplicate ambiguity. Neither makes the workflow atomic. Neither lets downstream systems stop thinking about idempotency, replay, ordering, or whether an old event still means the same thing under new code.
If the reader keeps one idea from this article, it should be this: the pattern is only as good as the boundary of truth it protects, and the recovery model behind that boundary.
Warning: outbox solves a producer problem. It does not certify downstream correctness.
This topic exists because local truth and propagated truth drift apart constantly.
An order service commits an order row, then crashes before publish. A payment service updates authorization state, then times out while notifying downstream systems. A signup flow commits the user, while email, entitlements, CRM, and fraud learn about it later, maybe late, maybe twice, maybe after partial failure.
Ad hoc retries do not solve that well. They help only while the process still has credible memory of what it was trying to do. Once the local commit succeeded and the publish step became uncertain, retries become guesswork. You either miss propagation or repeat it without a durable record of whether it already happened.
That is where outbox earns its place. It makes publication intent durable after local commit.
The consumer side has the inverse problem. Brokers redeliver. Workers crash after side effects but before ack. Replay jobs rerun old traffic. Handler behavior changes over time. "Basically idempotent" stops being good enough as soon as duplicates hit something expensive, irreversible, or user-visible.
That is where inbox earns its place, when transport-level duplicate memory is actually the right consumer boundary. It is not always the right boundary.
Getting the commit boundary right is only the opening move. The real pain usually starts later, when publication is delayed, duplicates arrive in a different envelope, or recovery replays technically valid messages into business effects that are no longer safe.
An order creation request returns success because the order service committed order state. That is one truth boundary. Everything after that is propagation, not part of the original commit, even if the product or support team mentally treats it as one business action.
That gap is the subject.
If inventory reservation, payment follow-up, entitlement grant, and customer notification all depend on the order-created event, the question is not "is the broker reliable?" The question is whether the system has a durable memory that publication must still happen, what it can honestly claim about publication progress, and whether downstream consumers can survive seeing that event more than once.
The same logic applies to signup. Creating the account may be authoritative in the identity service. Sending the welcome email, initializing preferences, creating the CRM contact, provisioning entitlements, and logging analytics are downstream interpretations of that local truth. Outbox helps the rest of the world eventually learn about the signup. Inbox can help some consumers decide whether they already acted.
The non-obvious part is that these patterns do not remove ambiguity. They move it into lag, backlog, replay, and repair. That is often a good trade because those problems are easier to observe and repair than silent dual-write loss.
Teams still monitor the wrong things. They watch whether the poller is up, whether the broker is taking publishes, and whether consumers are running. Users feel whether downstream truth is fresh and whether side effects happened once.
Producer Truth, Publication Truth, and Consumer Truth
Diagram: the producer transaction boundary, the relay and publication boundary, and the consumer duplicate-control boundary. The key visual point is that outbox and inbox strengthen different truths, and that producer truth, publication truth, and consumer truth are separate states.
The producer path is simple: commit domain state and an outbox row in the same local transaction, then let a separate relay publish that row to the broker.
The consumer path is only slightly more complicated, and only when it deserves to be. Receive a message, cross the right durable boundary for duplicate control, apply handler logic, then ack only after the consumer has durably advanced its own state. Sometimes that boundary is a dedicated inbox table. Sometimes it is business idempotency in the domain store. Treating inbox as automatic is one of the mistakes that bloats these systems.
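The producer path can be sketched in a few lines. This is a minimal illustration using SQLite with invented table and column names (orders, outbox), not a production schema; the only point it makes is that the domain row and the outbox row share one commit.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE outbox (
    event_id     TEXT PRIMARY KEY,
    event_type   TEXT NOT NULL,
    payload      TEXT NOT NULL,
    published_at REAL          -- NULL means pending: durable intent, not publication
);
""")

def accept_order(order_id: str) -> str:
    """Commit domain state and publication intent in the same local transaction."""
    event_id = str(uuid.uuid4())
    with conn:  # one transaction: both rows become durable together, or neither does
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, "accepted"))
        conn.execute(
            "INSERT INTO outbox (event_id, event_type, payload) VALUES (?, ?, ?)",
            (event_id, "OrderAccepted", json.dumps({"order_id": order_id})),
        )
    return event_id

accept_order("order-1")
pending = conn.execute(
    "SELECT COUNT(*) FROM outbox WHERE published_at IS NULL"
).fetchone()[0]
# One durable pending row: producer truth is settled, publication has not happened.
```

Note that success here means exactly one thing: the intent to publish survived the commit. Nothing downstream knows anything yet.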
At small scale, this is quiet. A service doing 50 requests per second, producing 50 to 100 events per second, can often run a straightforward outbox poller with a modest batch size and a clean index on unpublished rows. A consumer handling a few hundred messages per second can keep a dedup table for a sensible retention window without much trouble.
The mistake is assuming that because the pattern is quiet at 50 writes per second, it will stay quiet at 500. Producer write throughput and outbox drain throughput are different capacities. The first is local durability. The second is discover, lock, serialize, publish, mark, and clean up.
This is also where teams first lose track of what "published" means. Whether the relay polls the outbox table or tails the database change stream via CDC, the mental model is the same, but visibility, lag shape, and recovery behavior are not.
The request reaches the order service. Validation passes. The service writes the order row, perhaps a payment intent reference, and an outbox row like OrderAccepted. The transaction commits. At that point the order exists. Producer truth is settled. Inventory does not know yet. Notification does not know yet. Entitlements do not know yet. That is fine, provided the system is honest about what success means.
A separate publisher reads pending outbox rows and sends them to the broker. This is where publication truth gets its own failure boundary. Many teams blur "committed" and "published" because the happy path is fast. The trouble starts when those two clocks separate and nobody notices quickly enough.
The publisher then records publication progress. That step sounds routine until the first uncertain broker ack, retry, or relay restart. Outbox greatly improves the producer side. It does not make publication state magically obvious.
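A relay in this shape can be sketched as follows. The broker here is a stand-in list, and the schema is assumed; the part worth studying is the ordering of publish and mark, which is what makes the pipeline at-least-once rather than lossy.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE outbox (
    event_id TEXT PRIMARY KEY, payload TEXT NOT NULL, published_at REAL)""")

broker = []  # stands in for a real broker client

def broker_publish(payload: str) -> None:
    broker.append(payload)  # assume the ack for this can still be lost in flight

def drain_once(batch_size: int = 100) -> int:
    """Publish pending rows, marking progress only after each broker ack."""
    rows = conn.execute(
        "SELECT event_id, payload FROM outbox "
        "WHERE published_at IS NULL ORDER BY rowid LIMIT ?", (batch_size,),
    ).fetchall()
    for event_id, payload in rows:
        broker_publish(payload)
        # If the process dies between publish and this mark, the row stays
        # pending and is published again on restart: at-least-once, by design.
        with conn:
            conn.execute("UPDATE outbox SET published_at = ? WHERE event_id = ?",
                         (time.time(), event_id))
    return len(rows)

with conn:
    conn.executemany("INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
                     [(f"e{i}", "{}") for i in range(3)])

drained = drain_once()
```

The deliberate choice is to accept duplicate publication rather than risk loss: the uncertain-ack window between publish and mark is exactly why consumers must be built for duplicates.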
Now the paths diverge.
A notification consumer mainly cares about not sending the same email twice. An inbox table keyed by event ID or business dedup key can help. A ledger-adjacent consumer needs something stronger. It cannot merely suppress exact duplicate transport IDs. It needs business idempotency keyed to the economic action itself. If the same debit arrives later in a different envelope, transport dedup is irrelevant.
That distinction matters. Inbox helps with repeated delivery. It does not define financial truth.
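The difference can be made concrete. In this sketch (SQLite, invented names), the inbox table suppresses an exact redelivery, but only a uniqueness constraint on the business identity, debit_id here, catches the same debit arriving in a fresh envelope.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inbox  (event_id TEXT PRIMARY KEY);              -- transport memory
CREATE TABLE debits (debit_id TEXT PRIMARY KEY, amount INT);  -- business identity
""")

applied = []

def handle_debit(event_id: str, debit_id: str, amount: int) -> str:
    try:
        with conn:  # inbox record, business row, and effect commit together
            # Transport-level dedup: the same envelope twice stops here.
            conn.execute("INSERT INTO inbox VALUES (?)", (event_id,))
            # Business idempotency: the same economic action in a NEW envelope
            # stops here, where financial truth actually lives.
            conn.execute("INSERT INTO debits VALUES (?, ?)", (debit_id, amount))
            applied.append(debit_id)
        return "applied"
    except sqlite3.IntegrityError:
        return "duplicate"

first  = handle_debit("evt-1", "debit-42", 500)  # applied once
redup  = handle_debit("evt-1", "debit-42", 500)  # exact redelivery: inbox catches it
rewrap = handle_debit("evt-2", "debit-42", 500)  # new envelope: debit_id catches it
```

Drop the debits constraint and the third call would double-debit despite a perfectly healthy inbox, which is the whole argument in miniature.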
Now take signup. The identity service commits the user and writes UserRegistered to the outbox. Marketing, fraud, CRM, and entitlement consumers process it later. If the publisher lags by five minutes, the user exists but their welcome email, trial entitlement, and support visibility may not. The system is not down. It is half-right, which is often worse because humans trust it.
This is one of the places where the system can be commit-correct and operationally misleading at the same time. The source row is there. The request succeeded. Support may even see the user. Meanwhile the downstream surfaces the customer actually feels are already wrong.
At 2,000 signups per second with six downstream consumers, five minutes of producer lag is 600,000 undrained outbox rows and 3.6 million delayed downstream deliveries. The catch-up wave can still dominate the next hour.
One of the most misleading incidents in this shape is when the request path is healthy, the outbox is being written, the publisher process is running, and the broker is accepting messages, but a bad query plan or lock regression means drain rate is now below write rate. The dashboard says the pipeline is alive. What it is actually doing is losing freshness every minute.
At small scale, outbox and inbox feel clean. One database, one poller, one consumer group, enough headroom that lag is unusual.
At 10x scale, the pattern changes character.
The outbox stops behaving like a convenience table and starts behaving like a write-heavy queue inside your primary database. Pollers need disciplined batching. Indexes on unpublished rows get hot. Cleanup becomes part of capacity, not housekeeping. A short broker outage can turn into burst reads against the same database you are trying to preserve for user-facing traffic.
The first producer-side bottleneck usually appears before the broker is even stressed. It is often the pending-row query, the lock pattern among concurrent pollers, the update path that marks publication progress, or the cleanup path that keeps old rows from bloating storage and indexes.
That is why "the broker can do 50,000 messages per second" is usually the wrong comfort. The broker may be fine while the producer database is already paying queue costs it was never sized for.
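One common shape for that lock pattern, sketched here with an expiring lease column so it runs anywhere; on PostgreSQL this role is usually played by SELECT ... FOR UPDATE SKIP LOCKED instead. The schema, worker names, and 30-second lease are assumptions for illustration.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE outbox (
    event_id TEXT PRIMARY KEY, payload TEXT,
    published_at REAL, claimed_by TEXT, claim_expires REAL)""")

LEASE_SECONDS = 30

def claim_batch(worker_id: str, batch_size: int = 100) -> list:
    """Claim pending rows so concurrent pollers do not publish the same row twice.
    A crashed worker's claims expire, so its rows become claimable again."""
    now = time.time()
    with conn:
        rows = conn.execute(
            "SELECT event_id FROM outbox WHERE published_at IS NULL "
            "AND (claimed_by IS NULL OR claim_expires < ?) "
            "ORDER BY rowid LIMIT ?", (now, batch_size)).fetchall()
        ids = [r[0] for r in rows]
        conn.executemany(
            "UPDATE outbox SET claimed_by = ?, claim_expires = ? WHERE event_id = ?",
            [(worker_id, now + LEASE_SECONDS, i) for i in ids])
    return ids

with conn:
    conn.executemany("INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
                     [(f"e{i}", "{}") for i in range(3)])

first_worker  = claim_batch("poller-a")  # claims every pending row
second_worker = claim_batch("poller-b")  # finds nothing unleased
```

Notice what this buys and what it costs: double publication is bounded, but every claim is another write against the same primary database, which is exactly the queue cost described above.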
The inbox changes in parallel. At a few million records, dedup lookups are cheap. At hundreds of millions, retention and index layout become first-class capacity decisions. "Keep processed IDs for a while" is not an architecture. It is a storage policy someone forgot to design.
The first consumer-side bottleneck is often the dedup boundary, not the business logic. Unique index contention, hot partitions from poor dedup keys, write amplification from recording every processed message, and retention jobs fighting with live traffic show up before the handler looks obviously slow.
At this point replay stops being exceptional and starts becoming routine. That is also when weak idempotency assumptions start showing up in places that looked fine in steady state.
A useful small-scale example is an order platform doing 150 successful writes per second, with one outbox event per order and four downstream consumers: notification, fulfillment, analytics, and entitlement. The producer creates only 150 outbox rows per second. The estate is really absorbing 600 downstream deliveries per second before retries, redelivery, or replay.
Assume the outbox poller drains 300 rows per second in steady state. That looks safe. Then a deployment changes the outbox query plan, locking gets noisier, and effective drain falls to 120 rows per second for 30 minutes. The order service stays healthy. Backlog still grows by 30 rows per second, which becomes 54,000 undrained rows. When the issue is fixed, even a recovered drain rate of 360 rows per second gives only 210 rows per second of net catch-up because live traffic is still arriving at 150.
Those 54,000 producer rows then become 216,000 downstream deliveries. If the entitlement consumer safely handles only 250 messages per second because each message hits a dedup store and a transactional grant path, the producer may look recovered while the slowest consumer is still materially behind.
Now take a larger example. An account lifecycle service handles 4,000 writes per second during a campaign spike. Each write emits one durable event to six consumers. That is 24,000 downstream deliveries per second before retries. If producer-side outbox drain falls below write rate for 20 minutes, backlog is not "a bit behind." It is up to 4.8 million durable rows if drain fully stalls.
Suppose the relay recovers and can publish 10,000 outbox rows per second. Net catch-up is only 6,000 rows per second if live writes are still 4,000. Clearing 4.8 million rows takes about 13 minutes in the ideal producer view. But downstream consumers now face 28.8 million deliveries from that backlog. If two critical consumers write to inbox tables and can safely process only 3,000 messages per second each, their net catch-up over live traffic may be closer to 2,000. Their recovery time is now closer to 40 minutes, assuming no provider throttling, no replay interference, and no schema surprises.
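The arithmetic behind these numbers is worth holding onto as a formula, not a one-off. A sketch using the figures from the example above; the 2,000-per-second consumer net rate assumes live consumer load has dropped well below the 3,000 safe rate once the spike passes, as the example does.

```python
def catch_up_seconds(backlog: float, drain_rate: float, live_rate: float) -> float:
    """Time to clear a backlog while live traffic keeps arriving:
    only the net rate (drain minus live) actually reduces the backlog."""
    net = drain_rate - live_rate
    return backlog / net if net > 0 else float("inf")  # never catches up otherwise

# Producer view: a 20-minute full stall at 4,000 writes per second.
backlog_rows = 4_000 * 20 * 60                 # 4,800,000 durable outbox rows
producer_min = catch_up_seconds(backlog_rows, 10_000, 4_000) / 60   # ~13.3 minutes

# Fan-out view: six consumers each receive their own copy of the backlog.
downstream_deliveries = backlog_rows * 6       # 28,800,000 deliveries

# Consumer view: a net catch-up rate of 2,000 msg/s, as in the example.
consumer_min = backlog_rows / 2_000 / 60       # 40.0 minutes
```

The inf branch is not decoration. Whenever safe consumer rate falls to or below live arrival rate, the backlog never clears, which is the 10x failure mode in one line.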
That is the real lesson. Total event volume is not safe reliability capacity. The limiting factor is usually the slowest durable boundary in recovery, not nominal publish rate.
At 10x scale, the parts that usually stop scaling cleanly first are the outbox table acting like a queue in the producer database, the inbox dedup store when retention is vague or write volume is high, and replay throughput because safe replay rate is constrained by side effects and live traffic interference.
My view here is blunt: if your primary database is already paying visible outbox queue costs and you are still telling the scaling story in broker numbers, you are measuring the wrong thing.
The big win of outbox is that it solves the producer-side dual-write problem far better than "commit, then publish, then retry." That is a meaningful correctness gain. For order acceptance, payment-adjacent transitions, and account lifecycle events, I would take that over request-path neatness.
The cost is asynchronous truth. The request can succeed while the rest of the estate is behind. That is acceptable only if the product and operations model can tolerate the visibility gap.
Inbox has an equally honest trade. It gives you a durable way to suppress or bound repeated consumer execution. I would gladly pay for that in notification, entitlement, webhook, or ledger-adjacent consumers where duplicate effects are ugly. I would not pay for it automatically in every analytics consumer or low-stakes projection updater.
This is overkill unless the duplicate cost is materially worse than the machinery cost.
That caveat matters. Many systems need only producer outbox plus competent consumer idempotency. Some consumers need domain idempotency keyed by business identity and gain little from a separate inbox table. Full outbox plus inbox everywhere is usually architecture-by-anxiety.
The two trade-offs that matter most are replay safety and ordering dependence. These patterns make replay more feasible, but not automatically safe. Replaying old events through new code is often reinterpretation, not recovery. Replaying old events through code that also emits external side effects is often worse. And inbox helps with repeated handling, but it does little for consumers that quietly depend on seeing state transitions in order or only seeing the latest version once.
There is also a trade that gets worse with growth: fanout elegance versus recovery cost. One producer event delivered to eight consumers looks clean on a diagram. In operations, it creates eight lag surfaces, eight dedup policies, eight replay stories, and eight opportunities for one slow consumer to dominate recovery time.
Stronger freshness monitoring, longer inbox retention, and replay suppression modes all help. They also add real operational weight. In mature systems, replay tooling starts looking less like a script and more like a second control plane.
A common outbox failure starts with a correct producer transaction and a weak publication recovery model. The trigger is often ordinary: a slow query plan, a lock regression after adding poller concurrency, a poison row, or cleanup contending with live drain. The visible symptom is mild lag or a rising pending-row count. The hidden impact is stale downstream truth. Orders are accepted, users are created, request success looks fine, but notifications, entitlements, or support projections drift further behind. The failure spreads through every consumer that treats the event stream as timely. It is hard to debug because the producer is correct, the outbox is being written, and the broker may be healthy. Experienced teams monitor freshness, not just liveness, but freshness has to be consumer-specific and consequence-specific. A single global lag number is often another green graph that hides the wrong thing.
Another common failure is duplicate publication or duplicate consumption that the system technically tolerates in the wrong place. The trigger might be a publisher retry after uncertain broker ack, a consumer crash after side effects but before ack, or overlapping replay and live traffic. The visible symptom may be almost nothing. Consumer lag looks normal. Broker throughput looks healthy. The hidden impact is duplicate emails, duplicate entitlement grants, or duplicated downstream writes that later create bad compensation or misleading analytics. The real problem is usually not the broker retry. It is that authority and idempotency were assumed at different boundaries.
Inbox failures are quieter. The trigger is often growth, not crash. The table gets large enough that lookups slow down, cleanup falls behind, or retention silently shrinks to save storage. The visible symptom is consumer slowdown or occasional duplicates long after everyone thought the consumer was safe. The hidden impact is worse: operators stop trusting whether the inbox still remembers the right history. If your replay window is 14 days and your dedup memory is 3 days, you do not really have safe replay for that consumer.
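That mismatch is easy to simulate. A sketch with day-granular time and invented names: dedup memory of 3 days against a replay arriving on day 10.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inbox (event_id TEXT PRIMARY KEY, seen_day INTEGER)")

RETENTION_DAYS = 3
effects = []

def handle(event_id: str, today: int) -> None:
    try:
        with conn:
            # Retention sweep: forget dedup memory older than the window.
            conn.execute("DELETE FROM inbox WHERE seen_day < ?",
                         (today - RETENTION_DAYS,))
            conn.execute("INSERT INTO inbox VALUES (?, ?)", (event_id, today))
            effects.append(event_id)      # the user-visible side effect fires
    except sqlite3.IntegrityError:
        pass                              # remembered duplicate: suppressed

handle("evt-1", today=0)     # first delivery: effect fires
handle("evt-1", today=1)     # redelivery inside retention: suppressed
handle("evt-1", today=10)    # replay after memory expired: effect fires AGAIN
```

The consumer did nothing wrong on any individual day. The retention policy and the replay window were simply designed by different people.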
Replay side effects are where exactly-once language collapses fastest. The trigger is often a repair task that succeeds technically. Messages republish cleanly. Consumers stay up. Lag graphs improve. The hidden impact is that old webhooks fire again, old emails resend, or an entitlement handler grants access again because the inbox guarded only exact message IDs and the replay rewrapped the event. That is why replay often becomes the second incident.
Successful drain catch-up is not the same thing as reliable recovery. The backlog graph can look great while customers are still getting duplicate emails or while the wrong consumer state is being rebuilt faster than anyone can inspect it.
Schema evolution failures are slower and meaner than outright crashes. The trigger is a producer schema change followed by lag or replay. The visible symptom may be partial success. Messages publish. Some consumers stay current. Others silently misread fields, assume defaults, or ignore a version they were never tested against. The hidden impact is divergent truth across consumers that all look alive.
Two warnings deserve to stand out.
Warning: a healthy publisher can still be feeding stale truth.
Warning: replay that "works" can still be wrong in business terms.
One incident shape shows up constantly: the first alert is not "publication failed." It is a customer report that the order confirmation email did not arrive, while support already sees the order and finance already sees the charge intent. Request success is normal. The publisher process is healthy. The broker topic is receiving traffic. The missing metric is that notification and entitlement freshness are already fifteen minutes behind the order table.
Another one is more deceptive. Replay starts after a consumer bug fix. Throughput looks great. Lag falls. Then the email provider rate-limits, the dedup store gets hotter, and operators realize they are replaying both internal projections and external side effects through the same handler. The system is technically recovering while operationally getting worse.
Outbox reduces one blast radius and introduces another.
It reduces producer-side dual-write risk. The producer no longer has to settle publication success inline. That is good.
But it introduces propagation lag as a new failure surface. If the publisher is unhealthy, every downstream consumer sees stale reality at once. One stuck publisher can delay notifications, entitlements, search indexing, fraud scoring, and support visibility together. The failure propagates through time rather than immediate error.
Inbox contains duplicate side effects inside a consumer boundary, but it does not contain semantic confusion. If two events legitimately describe the same business action differently, or replay interleaves badly with live traffic, the inbox may suppress the wrong thing or allow the wrong thing. Technical dedup can limit mechanical duplication while leaving business ambiguity untouched.
The most dangerous blast radius is organizational. Dashboards show the source of truth updated. Everyone assumes the rest is seconds behind. Then backlog recovery, retries, and replay collide, and suddenly five systems are "correct" according to five different clocks.
At scale, blast radius is set by the slowest boundary, not the first failing one. The producer may recover quickly while a downstream entitlement or notification consumer remains 40 minutes behind because its dedup store, side-effect quota, or replay policy is tighter. Recovery objective is therefore not just a broker metric. It is the maximum lag across all authoritative or user-visible consumers.
There is also a subtler spread pattern. A correct producer can still feed misleading consumer outcomes. The order service publishes the right fact. The notification consumer dedups incorrectly and drops the only send that mattered. The analytics consumer counts both original and replay. The support projection is behind. Now three downstream truths disagree even though the source event was valid.
Implementation is not the expensive part. Stewardship is.
You need to know which outbox rows are pending, published, delayed, poison, or safe to archive. You need metrics that distinguish normal lag from accumulating danger. You need a replay policy that is explicit about rate, ordering expectations, side-effect suppression, and operator visibility.
Inbox demands even more discipline than teams expect. What is the dedup key? Exact message ID, business action ID, or a composite of tenant, entity, and version? How long is dedup memory retained? What happens when retention expires and a delayed duplicate arrives? Where is the lookup stored, and how does its write path behave under burst recovery?
This gets painful before it gets obviously broken. The first warning is often operational friction. On-call sees growing lag without user-facing errors yet. Repair jobs exist, but no one fully trusts them. Teams hesitate to replay because they are not sure which side effects are safe to repeat.
Backlog interference is another expensive truth. Recovery traffic is rarely isolated as well as the diagram implies. Catch-up publishing and consumer reprocessing compete with live traffic, shared databases, and external provider quotas. The system may be "recovering" while actually extending the incident.
High-level dashboards usually hide the wrong thing. Publisher liveness is not publication freshness. Consumer uptime is not consumer correctness. Broker throughput is not side-effect safety. Some of the worst incidents stay green for a long time if those are the only graphs you trust.
Experienced teams add guardrails, but every guardrail has a tax. Age-of-oldest-outbox-row helps, but now you need per-topic and sometimes per-event-type thresholds. Replay modes that suppress external side effects help, but now handlers need explicit split behavior and operators need to know which mode is safe. Dedup metrics help, but now you need to distinguish healthy suppression from pathological reprocessing.
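Age-of-oldest-pending-row is cheap to compute and catches exactly the "alive but losing freshness" failure described earlier. A sketch against an assumed schema; the per-event-type grouping is the point, since a global number hides the one topic that matters.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE outbox (
    event_id TEXT PRIMARY KEY, event_type TEXT, created_at REAL, published_at REAL)""")

now = time.time()
conn.executemany("INSERT INTO outbox VALUES (?, ?, ?, ?)", [
    ("e1", "OrderAccepted",  now - 12,  now - 10),  # published: not stale
    ("e2", "OrderAccepted",  now - 300, None),      # pending for 5 minutes
    ("e3", "UserRegistered", now - 5,   None),      # fresh pending row
])

def oldest_pending_age_by_type(as_of: float) -> dict:
    """Freshness, not liveness: age of the oldest unpublished row per event type."""
    q = conn.execute(
        "SELECT event_type, ? - MIN(created_at) FROM outbox "
        "WHERE published_at IS NULL GROUP BY event_type", (as_of,))
    return dict(q.fetchall())

ages = oldest_pending_age_by_type(now)
# OrderAccepted is ~300s behind even though the poller process is "up".
```

A process-liveness alert would stay green through this entire scenario; a per-type age threshold would not.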
In mature systems, recovery tooling, re-drive controls, suppression switches, and partial rebuild workflows stop being scripts and start becoming a second operational product. Most teams realize that only after they are already depending on it during an incident.
The Four Reliability Questions Across the Workflow
Diagram: producer truth, publication state, consumer duplicate tolerance, and recovery posture mapped onto one end-to-end flow, so operators can see where ambiguity remains. This should feel like an operator's reasoning diagram, not a broker topology diagram.
First, what is durably true at the producer boundary?
Not what you hope downstream will do. Not what the business workflow eventually wants. What has actually crossed a local commit boundary. In an order flow, "order accepted" may be durably true. "Fulfillment started" may not be.
Second, what exactly is true about publication?
This question is usually blurred into the first one and should not be. Is publication merely intended, durably queued for relay, durably handed to the broker, or known to have reached a consumer boundary you trust? Those are different states. Many incidents come from collapsing them into one vague idea of "sent."
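One cheap way to stop collapsing these states is to name them. An illustrative enum, not a standard taxonomy; the rule it encodes is that an operator may only claim the highest state for which every weaker state has actually been observed.

```python
from enum import IntEnum

class PublicationState(IntEnum):
    """Distinct claims usually blurred into 'sent'. Names are illustrative."""
    INTENDED = 1          # outbox row committed with the domain write
    QUEUED_FOR_RELAY = 2  # discovered by the relay, not yet handed to the broker
    BROKER_ACKED = 3      # the broker accepted it; no consumer has necessarily seen it
    CONSUMER_DURABLE = 4  # a trusted consumer boundary has durably advanced

def strongest_honest_claim(observed: set) -> PublicationState:
    """Walk up the ladder; stop at the first state not actually verified."""
    claim = PublicationState.INTENDED
    for state in PublicationState:
        if state in observed:
            claim = state
        else:
            break
    return claim

# A broker ack was observed, but the relay handoff was never verified:
# the honest end-to-end claim is still only INTENDED.
claim = strongest_honest_claim({PublicationState.INTENDED,
                                PublicationState.BROKER_ACKED})
```

The surprising output is deliberate: an isolated broker ack with an unverified step beneath it is evidence, not a publication claim, which is what "collapsing them into one vague idea of sent" gets wrong.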
Third, what duplicate patterns must the consumer tolerate?
Exact same message twice is the easy case. Replayed historical events, retried handlers after partial side effects, semantically equivalent actions wrapped in different envelopes, and out-of-order updates are harder. Inbox mainly helps with the easy case unless designed carefully.
Fourth, what does recovery look like when reality is already half-applied?
This is the question most teams skip. After a crash, restart, backlog, schema change, or replay, can the system tell which work is missing, which is duplicated, and which should never be retried automatically? If not, the design is less reliable than it appears.
Recovery speed must be expressed against live traffic, fanout, and side-effect limits, not raw replay rate. And recovery success has to be defined in business terms, not transport terms. "All messages were republished" is not success if customers got duplicate notifications or entitlements ended in the wrong state.
The biggest mistake is treating exactly-once as a system property instead of a local boundary claim. You can get exactly-once effects inside a narrow storage or transaction boundary. Outside that, you are managing duplicates and ambiguity, not abolishing them.
Another mistake is using inbox as a substitute for business idempotency. It is not. If a ledger entry must be unique by transaction identity, that uniqueness belongs in business semantics, not only in message-processing memory.
A third mistake is publishing events that describe desired downstream outcomes rather than committed local facts. That makes replay dangerous and compensation confusing. Publish what became true, not what you hope the workflow will accomplish later.
Another mature-system mistake is forcing outbox and inbox everywhere for pattern consistency. That feels disciplined and often wastes money and attention. A low-value analytics event may not justify transactional outbox. A projection consumer may be fine with ordinary idempotent upsert logic. Pattern purity is not architecture judgment.
The more expensive mistake is subtler. Teams often treat getting the producer commit boundary right as the hard part and treat consumer truth as follow-up detail. In many real systems that is backwards. The order of pain is usually producer truth, then publication lag, then consumer side effects. The incident bill is usually largest in the third category.
Teams also size for steady-state flow and forget catch-up. They know average event rate. They do not know how long it takes to drain 30 minutes of backlog while live traffic continues, or what one week of dedup memory costs per consumer.
And they ask "did the event publish?" when the better question is "which truth is stale, duplicated, or misleading right now?" Publication can succeed while the system is still wrong in every way users care about.
Use outbox when a local write must reliably drive downstream propagation and losing that propagation is materially harmful. Order acceptance, payment-adjacent transitions, user registration, subscription lifecycle changes, and inventory-affecting commits are strong candidates.
Use inbox when duplicate consumer execution is materially costly or user-visible, and when transport-level duplicate memory is actually the right control point. Notification sending, entitlement grants, webhook processing, and ledger-adjacent actions often deserve serious duplicate control, but not always through the same mechanism.
Use both when producer-side ambiguity and consumer-side duplicate cost are both high, and when you are willing to pay for lag monitoring, storage lifecycle, replay tooling, and repair clarity.
The higher the consumer fanout, the more selective you should be about which consumers truly need durable dedup. Fanout multiplies not only delivery count, but also recovery coordination, retention burden, and debugging cost.
Do not use full outbox plus inbox everywhere by default.
If the event is low stakes, such as analytics telemetry where some loss is acceptable, outbox may be unnecessary.
If the consumer can express idempotency naturally in its own domain store, a separate inbox may add little value.
If the real problem is poor event design, weak ownership boundaries, or unclear downstream authority, adding outbox and inbox only makes the confusion more durable.
And if the team is not prepared to operate backlog, replay, cleanup, and observability, the elegant design is the wrong design. Reliability patterns you cannot steward are mostly aesthetic.
One more caveat matters at scale. If the producer database is already capacity-constrained, adding a high-write outbox table there may worsen the real bottleneck. If a consumer already struggles with high-cardinality write paths, adding durable inbox writes to every message may turn a correctness improvement into a throughput regression you cannot hide.
A junior engineer sees a dual-write problem and reaches for transactional outbox. A slightly more experienced engineer adds inbox and feels the system is now safe.
A senior engineer asks different questions.
What is the authoritative transition here?
What is true only in the producer, what is true about publication, and what is true in each consumer?
What duplicate patterns are realistic, not theoretical?
What ordering assumptions are hiding inside these consumers?
What is the safe replay story six months from now, under changed code and changed schemas?
What grows without bound if traffic or lag increases by 10x?
Which failure stays invisible longest while users are already feeling it?
Senior engineers also separate producer correctness from consumer correctness. Outbox can be right even when inbox is unnecessary. Inbox can be essential even when the producer does not need formal outbox. The boundary decisions are local. The consequences are systemic.
Most importantly, senior engineers treat backlog as part of correctness, not just performance. A system that is eventually right but operationally unrecoverable is not reliable.
They also do capacity math differently. Not "can the broker carry this?" but "what is the slowest durable boundary in steady state, what is its safe catch-up rate, and how long can the source of truth run ahead before the business starts lying to itself?"
And they debug differently. They do not stop at "the publisher is healthy" or "the consumer retried." They ask which boundary claimed authority too early, which side effect escaped idempotency, which ordering assumption just broke, and which freshness signal the team should have been watching before the ticket queue filled up.
Outbox and inbox are worth respecting precisely because they solve narrower problems than people claim.
Outbox is good at making publication intent durable after a local commit. Inbox is good at making repeated consumer execution bounded and observable, when that is actually the right consumer boundary. Both reduce real failure risk. Neither creates end-to-end exactly-once truth.
The questions that matter are sharper than the diagrams suggest. What became durably true? What is true about publication? What duplicate and ordering behaviors must consumers survive? What do backlog and replay do to recovery, storage, side effects, and operator confidence?
Use these patterns where correctness justifies the machinery. Avoid them where simpler boundaries are enough. The mature move is not adopting them everywhere. It is knowing exactly which ambiguity you are buying down, which ambiguity you are merely relocating, and how much backlog you can survive before the source of truth outruns the rest of the system.