Core insight: Chat is an ordering and delivery problem with a real-time user interface on top.
Transport matters. It is just not the part that usually betrays you first. The harder part is deciding what the product is allowed to imply, then paying the coordination cost required to make that implication mostly true.
“Sent,” “delivered,” “read,” “online,” “last seen,” “typing,” and even the visual order of message bubbles are claims about underlying events. Most of those claims are not facts. They are interpretations over distributed state that may be delayed, partial, duplicated, reordered, or observed differently across devices.
That is where most chat writing goes soft. It talks about low latency, websocket scale, and persistence. The real problem is deciding when the system knows enough to speak with confidence.
A strong judgment: if your UI says “delivered,” you are no longer describing transport. You are making a correctness claim. At that point, the backend is not just forwarding bytes. It is deciding what reality is safe to show.
The naive design sounds simple. Open a persistent connection. Authenticate the user. Accept a message. Persist it. Fan it out to the other participants. Mark it read when the recipient opens the thread. Add push notifications for offline users. Done.
The trouble is that none of those verbs are single events.
“Accept” can mean client queued locally, gateway acknowledged, service validated, or durable write succeeded.
“Persist” can mean primary database commit, replicated commit, log append, or inbox materialization.
“Fan out” can mean publish to a queue, enqueue per recipient, deliver to one active device, or update every user’s unread state.
“Read” can mean thread opened, message rendered, viewport crossed, focus present, or last-read cursor advanced by another device.
“Online” can mean currently connected, recently heartbeating, foregrounded, or merely not yet expired from a lease table.
A small example makes the problem concrete.
Two users, one conversation, three devices:
Alice sends message M1 from her laptop at 10:00:00.100.
Bob’s phone is offline.
Bob’s desktop is online but slow.
Alice sends M2 at 10:00:01.200.
Bob’s desktop receives M2 before M1 because a delayed fan-out retry lands late.
Bob’s phone reconnects later and fetches both from storage in correct sequence.
Bob reads the conversation on phone first.
Alice sees “read” before Bob’s desktop has rendered M1.
Nothing above requires data loss. Only retries, reconnects, and multiple observation points.
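The desktop's eventual repair can be sketched directly: buffer whatever the transport happens to deliver, render by authoritative sequence, and treat any hole in the sequence as a fetch obligation. A minimal illustrative sketch (the `render_order` name and message shape are assumptions, not any real client API):

```python
# Hypothetical sketch: render by authoritative sequence, not arrival order.
# Message shape and names are illustrative.

def render_order(arrivals):
    """Return messages in authoritative order, plus any sequence gaps.

    `arrivals` is whatever order the transport delivered, which retries
    and replays can scramble. `seq` is the server-assigned
    conversation-local position.
    """
    ordered = sorted(arrivals, key=lambda m: m["seq"])
    seqs = [m["seq"] for m in ordered]
    # A gap means a message is still in flight, or needs a gap-sync fetch.
    gaps = [s for s in range(seqs[0], seqs[-1]) if s not in seqs]
    return ordered, gaps

# Bob's desktop receives M2 before M1 because a fan-out retry lands late.
arrivals = [
    {"seq": 2, "body": "M2 sent 10:00:01.200"},
    {"seq": 1, "body": "M1 sent 10:00:00.100"},
]
ordered, gaps = render_order(arrivals)
assert [m["seq"] for m in ordered] == [1, 2]
assert gaps == []  # both messages present, nothing left to fetch
```

The point of the sketch is the division of labor: the server owns sequence, and the client owns nothing but buffering and repair.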
The difficulty is not message storage. It is making concurrent, delayed, and partial observations behave like one conversation.
Non-obvious observation number one: the first user-visible correctness break in chat often appears in reconciliation, not persistence. The ledger can be fine while the experience still feels false.
Diagram placeholder
Where “Sent,” “Delivered,” and “Read” Are Actually Produced
Show the delivery-guarantee spectrum explicitly: each label maps to a different boundary with a different coordination cost.
The defining decision is this:
What semantics do you promise, and at which boundary do you stop paying for stronger truth?
Most chat systems are accidentally designed at exactly this layer. Product chooses a few clean labels. Engineering implements something “close enough.” Six months later, support tickets reveal that the UI has been claiming more than the backend can know.
There are four boundaries that matter.
1. The message acceptance boundary
When is a message officially “sent”?
Possible answers:
the client enqueued it locally
the edge gateway accepted it
the message service validated it
the durable conversation log committed it
the log replicated to a quorum
the conversation sequence number was assigned
A senior answer is usually: do not call a message “sent” until it has a durable identity and authoritative position within the conversation.
Before that point, the client has intent. After that point, the system has shared state others can rely on.
This matters more than it sounds. Local echo is a UX device. Durable sequence assignment is a correctness event. Systems that blur those together create duplicate sends, ghost pending states, and support cases where a sender insists a message was sent because the bubble appeared.
At larger scale, this boundary gets stricter, not looser. Retries stop being edge cases. A one-to-one chat with one device per user may see occasional resend behavior. A product with three devices per user, mobile background wakeups, and intermittent network shifts sees duplicate-send pressure all day. The acceptance boundary must survive not just one retry, but many near-concurrent retries emitted by stale clients, reconnecting devices, and replay logic.
One ugly practical reality engineers learn late: duplicate-send bugs rarely announce themselves clearly. They show up as random chat weirdness in support until someone finally correlates them with reconnect paths.
2. The ordering boundary
What order is authoritative, and for whom?
This is where chat systems either become sane or expensive.
There are several possible meanings of order:
client creation time order
server receipt time order
durable append order
per-conversation sequence order
causal reply-aware order
globally consistent total order
Most products only need authoritative order within a conversation, and even that needs to be defined carefully.
A defensible strong judgment: conversation-local order is worth paying for; global order almost never is. Users care deeply that messages inside a conversation stay stable and intuitive. They do not care whether two unrelated rooms are globally sequenced.
Trying to impose stronger-than-needed ordering is one of the fastest ways to buy coordination cost for no real user value.
The right question is not “can we totally order all messages?” It is “what is the smallest scope in which users will notice order violations?”
For most systems, the answer is:
strong order within a conversation shard
no promise across conversations
limited guarantees during local pending state before server assignment
explicit tolerance for temporary mismatch during reconnect until authoritative ordering is reconciled
That last point matters. Ordering is not true everywhere at every moment. There is usually a transition window where local view and authoritative order differ. Good systems admit that and reconcile cleanly. Bad ones pretend the mismatch cannot happen.
Scale deepens this boundary in a way many teams miss. Direct chat and large-group chat do not merely differ in fan-out volume. They differ in how expensive stable order feels to preserve. In a one-to-one thread, sequencing one append stream and reconciling two users’ devices is usually affordable. In a 20,000-member room spread across regions, maintaining a clear authoritative sequence is still possible, but making every device observe that order quickly and consistently under lag, replay, and partial partitions is much harder. The order can be correct in storage while the observed arrival pattern is still messy.
That means the client must treat sequence as truth and transport arrival as suggestion.
Multi-region distribution adds another layer. If a conversation accepts writes in two regions at once without one authoritative ordering boundary, the system now has to merge concurrent writes in a way users can understand. That usually looks elegant on architecture diagrams and awkward in production. For most products, conversation authority should be region-pinned or leader-pinned, even if read paths are distributed broadly.
Arrival order is network history. Conversation order is product truth. Confusing the two is how clean systems turn into haunted ones.
3. The delivery boundary
What does “delivered” actually mean?
This is the most under-specified word in chat systems.
It might mean:
persisted for the recipient user
enqueued for recipient devices
pushed over active transport
received by at least one recipient device
acknowledged by all recipient devices
available in the recipient’s inbox fetch path
visible on at least one foregrounded device
These are completely different promises.
Suppose you run a one-to-one chat with multi-device sync. Bob has phone, tablet, and desktop. If “delivered” means “received by at least one Bob device,” then a message can be marked delivered while Bob’s primary phone has not seen it. If “delivered” means “written to Bob’s server-side inbox,” it may still not have touched any device. If it means “all Bob devices acked,” the system becomes slower, more fragile, and much more expensive.
Non-obvious observation number two: “delivered” is not a transport event. It is a product decision about how far through fan-out you are willing to assert success.
The more human-aligned your definition is, the more coordination you pay for.
At scale, the sent-delivered-read spectrum gets harder because the recipient is no longer one endpoint. It is a changing set of user devices, session leases, push paths, and fetch-based recovery paths. In direct chat, “delivered to at least one user device” may feel close enough to user intuition. In large-group chat, the same phrase becomes almost meaningless. Delivered to whom? One active device per user? Every online member? Every inbox?
Mature chat systems usually do not have one receipt model. They have a DM model and a channel model, because pretending those can share one meaning of “delivered” is how semantics become dishonest.
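Keeping two receipt models honest is easier when the policy is explicit in code rather than implied by UI state mapping. A hedged sketch of that idea, assuming hypothetical names like `is_delivered` and an invented per-recipient state shape:

```python
# Hypothetical sketch: "delivered" as an explicit, per-model policy.
# All names and the recipient-state shape are illustrative.

def is_delivered(conversation_type, recipient_states):
    """Decide whether a message may honestly be labeled "delivered".

    `recipient_states` maps user_id -> {"device_acks": int, "inbox": bool},
    where `inbox` means server-side inbox materialization succeeded.
    """
    if conversation_type == "dm":
        # DM model: every recipient user has at least one device ack.
        return all(s["device_acks"] >= 1 for s in recipient_states.values())
    if conversation_type == "channel":
        # Channel model: inbox materialization is the honest ceiling;
        # per-device acks across thousands of members are not tracked.
        return all(s["inbox"] for s in recipient_states.values())
    raise ValueError(f"unknown conversation type: {conversation_type}")

# Bob's inbox is written, but no device has acked yet.
states = {"bob": {"device_acks": 0, "inbox": True}}
assert not is_delivered("dm", states)      # DM model: not yet delivered
assert is_delivered("channel", states)     # channel model: delivered
```

The design choice being illustrated: the same downstream facts produce different labels depending on the product's receipt model, and that divergence should live in one named function, not be scattered across client rendering code.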
4. The visibility boundary
What does “read” mean?
This sounds straightforward until you have:
multiple devices
push previews
lock-screen notifications
background preload
conversation open but not focused
large media attachments
partial scroll positions
A good production definition is rarely pure. It is usually something like: a user’s read cursor may advance when an authenticated client has the conversation in focus and has rendered messages through sequence N.
That is not mathematically perfect. It is workable.
Meaningful caveat number one: if your product operates in regulated, audited, or high-dispute environments, approximate read semantics may be insufficient. “User opened the thread” and “user viewed message contents” are not the same thing.
Meaningful caveat number two: for large groups, per-message receipt precision is often far more expensive than the user value it creates. In a five-person support thread, rich receipt state may be worth it. In a 10,000-member channel, it often becomes an expensive status fiction.
The entire architecture flows from these decisions. Storage model, sequencing strategy, fan-out shape, receipt tracking, reconnect behavior, observability, incident response, and cost profile all follow from what you choose to mean.
A useful way to think about chat is to walk a message from user intent to user-visible truth and identify where correctness can split from appearance.
Consider a reasonably designed small-group chat system. Alice sends a message to a three-person conversation. Each participant has two devices on average.
Step 1: Client creates a provisional message
The client generates:
a stable client message ID
conversation ID
sender ID
local creation timestamp
optional reply-to metadata
attachment references or placeholders
This client-generated ID is not authoritative order, but it is crucial for idempotency. If the network drops after submission and the client retries, the backend must be able to treat the second attempt as the same logical message.
This is the first place junior designs get into trouble. Without stable client identity, every reconnect becomes a chance to duplicate visible messages.
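The idempotency contract can be sketched as a server-side accept path keyed by the client-generated ID. Storage here is an in-memory dict purely for illustration; a real system would enforce the same uniqueness in the durable conversation store. The `MessageService` name and return shape are assumptions:

```python
# Hypothetical sketch: idempotent accept keyed by client message ID.
# In-memory storage stands in for the durable conversation store.

import uuid

class MessageService:
    def __init__(self):
        self._by_client_id = {}   # (conversation_id, client_msg_id) -> seq
        self._next_seq = {}       # conversation_id -> next sequence number

    def accept(self, conversation_id, client_msg_id, body):
        """Return (seq, is_duplicate). A retry of the same logical message
        gets the original sequence back instead of a second visible message."""
        key = (conversation_id, client_msg_id)
        if key in self._by_client_id:
            return self._by_client_id[key], True
        seq = self._next_seq.get(conversation_id, 1)
        self._next_seq[conversation_id] = seq + 1
        self._by_client_id[key] = seq
        return seq, False

svc = MessageService()
msg_id = str(uuid.uuid4())                         # stable client message ID
seq1, dup1 = svc.accept("C123", msg_id, "hello")
seq2, dup2 = svc.accept("C123", msg_id, "hello")   # network retry after drop
assert seq1 == seq2 and not dup1 and dup2          # one logical message
```

The retry on the last line is exactly the reconnect-path resend described above: without the stable client ID, it would become a second visible message.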
Step 2: Local echo renders pending state
The sender UI usually displays the message immediately with a pending marker. Good UX. Also a correctness hazard if the system later fails the write, remaps the order, or rejects the payload.
Senior engineers treat local echo as speculative UI, not message truth.
That sounds obvious until the client crashes between local echo and backend ack, and support asks whether the message “was sent.”
Step 3: Gateway accepts and forwards
The edge service authenticates the session, enforces rate limits, checks conversation membership, and forwards the request to the message service.
The gateway should not assign authoritative order. That choice belongs near durable conversation state. Otherwise a message can get a visible sequence before the durable write path commits it.
Step 4: Authoritative message write and sequence assignment
This is the first real correctness anchor.
The message service writes the message to a durable conversation log or conversation-partitioned store and assigns a conversation-local sequence number. For example:
conversation C123
next sequence = 48192
Now the system has:
durable identity
authoritative order inside the conversation
a stable point from which to deduplicate retries
a unit for read cursors and gap detection
At this moment, the sender can reasonably be told “sent.”
A subtle but important point: do not use wall-clock timestamps as authoritative order. Use sequence numbers or a monotonic conversation position. Timestamps are display metadata. Once retries, offline sends, and clock skew show up, they are weak ordering instruments.
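A tiny worked example shows why. Suppose a message typed offline is finally sent after a message typed later while online. Its client timestamp is older, but the server commits it later, so it gets the higher sequence. Sorting by timestamp would silently reshuffle the conversation:

```python
# Hypothetical sketch: wall-clock time is display metadata, not order.
# The offline resend carries an old client timestamp but a later sequence.

messages = [
    {"seq": 48191, "client_ts": "10:00:05.000", "body": "typed online"},
    {"seq": 48192, "client_ts": "09:58:10.000", "body": "typed offline, sent late"},
]

by_timestamp = sorted(messages, key=lambda m: m["client_ts"])
by_sequence = sorted(messages, key=lambda m: m["seq"])

# Timestamp order puts the late-sent message first and reshuffles the
# conversation; sequence order matches what the server actually committed.
assert by_timestamp[0]["seq"] == 48192
assert by_sequence[0]["seq"] == 48191
```

Show the timestamp in the bubble if the product wants it, but sort by sequence.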
Step 5: Recipient resolution and inbox expansion
For a three-person conversation, fan-out is cheap enough that many systems materialize per-user inbox state:
Bob inbox: new message at seq 48192
Carol inbox: new message at seq 48192
Unread counters, thread previews, and notification candidates may also update here.
Notice what has happened: before any device receives the message, the backend may already know that it belongs in Bob’s unread set. If your product defines “delivered” at inbox materialization, this may be enough. If it defines delivered at device receipt, it is not.
The scale trap appears when teams assume this model extends cleanly to big rooms. It does not. In direct chat, eager recipient materialization is often cheap and semantically convenient. In large groups, eager per-user inbox expansion, per-device targeting, and per-message receipt tracking turn one append into thousands or millions of downstream mutations. The model still works. The economics do not.
Step 6: Device fan-out over active transports
Suppose Bob has:
desktop online
phone offline
Carol has:
phone online
tablet online but backgrounded
The fan-out system pushes the message to online sessions. Bob desktop acks transport receipt quickly. Carol phone acks. Carol tablet does not because the app is backgrounded and the connection has gone stale. Bob phone gets a push-notification path instead.
Which users count as delivered?
That depends on the delivery boundary you chose earlier. The backend now has multiple partial observations:
Bob as user has one active device receipt
Bob as device set is incomplete
Carol as user has at least one ack
Carol as device set is incomplete
This is where many systems start quietly lying. They surface a clean “delivered” state while internally holding a messy set of partial downstream events.
As device count per user grows, this gets more expensive faster than teams expect. The number that matters is not just messages per second. It is messages multiplied by recipient users multiplied by active devices multiplied by retries and reconnect-induced duplicates. A design that looks light at 1.2 devices per user behaves very differently at 2.8 devices per user, especially when background reconnect logic is aggressive.
Step 7: Offline path and reconnect handling
Diagram placeholder
Authoritative Message Path vs Observed Device Reality
Show that durable sequencing, live fan-out, offline replay, and receipt propagation create different truths at different times even when the core message path is correct.
Bob’s phone comes online 12 minutes later after moving from subway Wi-Fi to LTE. It missed direct transport delivery, so it performs a gap-sync fetch:
last known conversation sequence on device = 48187
server says latest = 48192
device fetches missing messages 48188 to 48192
If the offline path is not sequence-based, this is where order often breaks. Fetch-by-time or fetch-by-last-message-ID hacks work until they do not.
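The sequence-based version is small enough to sketch in full. The `gap_sync` and `fetch_range` names are illustrative, with a local function standing in for the server fetch path:

```python
# Hypothetical sketch of sequence-based gap sync on reconnect.
# `fetch_range` stands in for a server call; all names are illustrative.

def gap_sync(device_last_seq, server_latest_seq, fetch_range):
    """Fetch exactly the missing window, in order, on reconnect."""
    if device_last_seq >= server_latest_seq:
        return []                                   # nothing was missed
    return fetch_range(device_last_seq + 1, server_latest_seq)

def fetch_range(lo, hi):
    # Stand-in for the server fetch path; returns messages lo..hi inclusive.
    return [{"seq": s} for s in range(lo, hi + 1)]

# Bob's phone reconnects: device knows 48187, server is at 48192.
missing = gap_sync(48187, 48192, fetch_range)
assert [m["seq"] for m in missing] == [48188, 48189, 48190, 48191, 48192]
```

Because the request is expressed as a half-open window over sequence numbers, it is idempotent and safely retryable, which is exactly what a flaky reconnect path needs.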
Network transitions create the ugliest correctness edges:
websocket disconnect without clean close
client resends while the server is still processing the original send
stale device cursor from a previous app version
push arrives after direct transport already delivered
reconnect fetch races with live socket replay
Non-obvious observation number three: reconnect logic is often where a “real-time system” reveals whether it is actually a log-reconciliation system underneath. Healthy systems treat reconnect as gap detection and state repair, not vague refresh.
Offline duration changes system pressure in nonlinear ways. A device offline for 30 seconds usually needs a small replay window. A device offline for 8 hours may need backlog fetch, unread recomputation, push suppression cleanup, and receipt reconciliation. When many devices come back together after commute hours, flight landings, or regional mobile recovery, replay backlog becomes a first-class capacity problem. The system is no longer limited by steady websocket count. It is limited by catch-up throughput and metadata repair.
The bug report will say “messages out of order.” The root cause is often a reconnect path nobody tested honestly under lag, replay, and stale client state at the same time.
Step 8: Read progression
Carol opens the conversation on phone and reads through sequence 48192. The client emits read cursor = 48192. The backend stores:
Carol last-read seq in conversation C123 = 48192
This is usually better than writing per-message read rows. It is compact, monotonic, and easier to reason about.
Now consider Bob. He reads on desktop through 48191 but not 48192, then later opens phone and fully reads. The read path must reconcile across devices without regressing.
A good invariant is that user-level read cursor is monotonic. Devices can report their local read position, but user-visible read state should never move backward.
At larger scale, read progression stops being a simple correctness field and becomes a write-shaping decision. In a DM product, per-user read cursors are cheap enough. In a large group with heavy traffic, read propagation can dominate backend churn. The difference between “store one cursor per user per conversation” and “update receipt state for every message visible in viewport” is the difference between a manageable metadata system and a receipt storm.
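The monotonic-cursor invariant is one line of logic, which is part of why it is the right default. A sketch under the assumption that devices report their local read positions and a user-level merge takes the maximum (names are illustrative):

```python
# Hypothetical sketch: user-level read cursor as the monotonic maximum
# of per-device reports. Devices may lag locally; the user never regresses.

def merge_read_cursor(user_cursor, device_report):
    """Advance the user-level read cursor; never move it backward."""
    return max(user_cursor, device_report)

cursor = 0
cursor = merge_read_cursor(cursor, 48191)  # desktop read through 48191
cursor = merge_read_cursor(cursor, 48187)  # stale phone report arrives late
assert cursor == 48191                     # no regression from the stale report
cursor = merge_read_cursor(cursor, 48192)  # phone fully catches up
assert cursor == 48192
```

The stale-report line is the Bob scenario above: the late phone report must not unread messages the desktop already read.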
Step 9: Sender-side receipt rendering
Alice sees status progression:
pending
sent
delivered
read
Each status should be backed by a precise rule.
For example:
sent: durable write + sequence assigned
delivered: at least one recipient user has at least one device ack or inbox materialization, depending on product definition
read: every recipient user has advanced read cursor past the message sequence, for small groups only
That “for small groups only” matters. In large groups, exact read display becomes much more expensive and less meaningful.
Small-scale concrete example
Take a product with 80,000 daily active users, where most conversations are direct messages. Assume:
35 messages per active user per day
average recipients per message: 1.05
average active devices per user: 1.6
average reconnects per device per day: 9
That is roughly 2.8 million messages per day. Raw message storage is easy. The real work looks more like this:
about 2.8 million authoritative appends
about 3 million recipient inbox effects
roughly 4.8 million live device delivery attempts
millions of presence heartbeat writes or lease refreshes
tens of millions of replay cursor checks over the course of the day
read cursor writes that may outnumber sends in highly active users
At this size, the first real bottleneck is often not socket count. It is receipt propagation, presence churn, or replay lookups hitting hot users with many devices.
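The arithmetic behind those numbers is worth making explicit, because the lesson is in the multipliers, not the message count. A back-of-envelope check using the same rough inputs as above:

```python
# Back-of-envelope load shape for the 80,000-DAU example above.
# Inputs are the rough figures from the text; derived work dwarfs raw sends.

dau = 80_000
msgs_per_user = 35
recipients_per_msg = 1.05
devices_per_user = 1.6

messages = dau * msgs_per_user                    # authoritative appends
inbox_effects = messages * recipients_per_msg     # recipient inbox effects
delivery_attempts = inbox_effects * devices_per_user

assert messages == 2_800_000                      # ~2.8 million appends
assert round(inbox_effects) == 2_940_000          # "about 3 million"
assert round(delivery_attempts) == 4_704_000      # close to "roughly 4.8 million"
```

Note that reconnects, replay cursor checks, presence writes, and read cursors are all additional multipliers on top of this, which is why they dominate before storage does.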
Larger-scale concrete example
Now take a large collaborative product with mixed DMs and big channels:
6 million daily active users
110 messages per active user per day
660 million messages per day
average recipients per message across DMs and groups: 4.2
average active devices per recipient user: 2.3
12 percent of messages go to channels with more than 200 members
1 percent go to rooms with more than 5,000 members
One message into a 5,000-member room may create:
1 authoritative append
5,000 inbox impacts or lazy reference candidates
2,000 to 3,500 live fan-out attempts depending on online rate
thousands of unread updates
push scheduling for offline members
replay work later for devices that were absent
receipt aggregation work if the product insists on “delivered” or “seen” semantics
That is where chat stops resembling request-response messaging and starts resembling a distributed notification and state-reconciliation platform with a conversation UI on top.
The debt is rarely in the message table. It hides in derived state that sounded small when the feature was named.
Presence
Presence looks ephemeral. In production it is a low-grade lie that must be managed carefully.
Most systems do not know whether a user is online. They know whether a recent heartbeat exists, whether a connection lease is still valid, or whether a session has recently performed activity.
That means:
a user can appear online for 30 to 90 seconds after disappearing
a user can appear offline even though they just reconnected through a new path
mobile backgrounding can freeze heartbeats in awkward states
network handoffs can create duplicate overlapping sessions
aggressively fresh presence increases battery and connection churn
A common engineering mistake is treating presence as exact truth and then letting product wording become too literal. “Active now” is safer than “online.” The former admits approximation. The latter sounds binary.
Presence also gets much more expensive once the product becomes multi-device by default. Heartbeats, lease expirations, reconnection noise, and regional presence replication can create more write traffic than some teams’ message path during quiet periods. That is why presence is often the first “small feature” that turns into infrastructure.
Presence should also not be used as a strong routing or truth signal. A stale lease is enough to lie to the UI, and it is absolutely enough to make a routing decision look foolish.
Read receipts
Per-message read state feels intuitive early. It becomes debt later.
In a direct message thread with a few hundred messages, per-message rows are tolerable. In a high-traffic room with millions of messages and many members, per-message read tracking becomes write amplification disguised as UX polish.
The scalable model is usually cursor-based:
per user, per conversation, store highest read sequence
derive message read state relative to that cursor
This has limits. If your UI genuinely requires non-contiguous read semantics, partial-message analytics, or evidentiary “viewed” trails, cursors may be too lossy. But for most chat, cursors are the right default.
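Under the contiguous-read assumption, both per-message read state and the unread count become derived values rather than stored rows. A minimal sketch (function names are illustrative):

```python
# Hypothetical sketch: cursor-based read model. One integer per user per
# conversation replaces per-message read rows; read state is derived.

def message_is_read(message_seq, user_read_cursor):
    # Contiguous-read assumption: everything at or below the cursor is read.
    return message_seq <= user_read_cursor

def unread_count(latest_seq, user_read_cursor):
    # Derived, never stored, so it cannot drift from the cursor.
    return max(0, latest_seq - user_read_cursor)

carol_cursor = 48192
assert message_is_read(48190, carol_cursor)
assert not message_is_read(48193, carol_cursor)
assert unread_count(48195, carol_cursor) == 3
```

The write amplification disappears because a thousand rendered messages cost one cursor update, not a thousand receipt rows.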
What teams underestimate is not just storage volume. It is propagation cost. Once read must be reflected on sender devices, recipient devices, inbox previews, unread counters, analytics, and perhaps other participants in small groups, the receipt path can rival or exceed the original send path.
Receipts are also write paths in their own right. They need idempotency, monotonicity, replay discipline, and observability. Teams that treat receipts as decorative metadata usually discover too late that they built a second distributed state machine and forgot to operate it like one.
Unread counts
Unread counts are surprisingly fragile. They depend on read cursors, replay completeness after reconnect, membership state, and receipt idempotency all agreeing across devices at the same time.
Unread drift is one of the most common “everything is healthy but users are angry” problems in chat.
Push notifications
Push looks like a side channel. It becomes part of delivery semantics the moment the product relies on it to make offline chat feel live.
Push vendors are external distributed systems with their own throttling, collapse behavior, and delays. A notification accepted by APNs or FCM is not a delivery event to the user. Treating it as one is semantically sloppy.
Membership and fan-out policy
Group membership changes interact badly with message visibility.
What if a user is removed from a room while offline, and a reconnect fetch races with membership revocation?
What if a device cached unread state from before removal?
What if a push for a newly unauthorized message was already queued?
Debt appears wherever authorization, fan-out, and cached device state meet.
Scale in chat is rarely about raw storage first. It is about multiplicative metadata and skewed access patterns.
The naive mental model says more users means more sockets and more messages. The production model is harsher: more users means more devices, more partial sessions, more replay windows, more receipt churn, more presence updates, and more hot conversations that concentrate traffic into narrow shards.
Infrastructure cost rises. User-visible correctness pressure rises faster. People notice inconsistency in chat immediately. A feed can be stale and remain usable. A conversation that says “read” when it was not read, or renders messages in the wrong order for fifteen seconds, feels untrustworthy at once.
Small-scale example
Suppose a product has:
100,000 daily active users
40 messages per active user per day
4 million messages/day
average conversation size of 2.4 participants
average 1.8 active devices per reachable user
The raw message volume is manageable. The hidden load is not just 4 million writes. It is closer to:
4 million authoritative appends
9 to 12 million inbox updates
12 to 15 million device fan-out attempts
millions of read cursor writes
millions of presence lease updates
push notification scheduling for missed online deliveries
retries, reconnect replays, and duplicate suppression lookups
Even here, the system cost is mostly not “storing messages.” It is the state transitions around them.
In direct chat at this size, the system can still look healthy because fan-out is narrow and most pressure comes from multi-device behavior. The first bottleneck may be receipt propagation or replay cursor lookups, not the append path. That is why teams sometimes scale websocket fleets successfully while still getting support tickets about unread drift and duplicate sends.
Larger-scale example
Now consider:
5 million DAU
120 messages per active user per day
600 million messages/day
average recipients per message across DMs and groups: 3.8
average active devices per recipient user: 2.1
15 percent of messages land in high-fan-out groups where recipient count is 50+
The raw append path might still be architecturally clean. The fan-out and metadata path becomes the hard part.
A plausible daily shape:
600 million authoritative message writes
2 to 3 billion recipient-visible inbox effects
several billion device delivery attempts
receipt and read metadata volumes large enough to rival payload-path writes
massive hot-spotting around popular rooms, creator chats, support incidents, or company-wide channels
This is why strong semantics get expensive.
The direct-chat architecture that looked fine earlier becomes fragile here. Eager per-recipient inbox materialization, rich receipt propagation, and device-aware delivered semantics all multiply across large groups. What felt like a clean model for one-to-one chat becomes a cost amplifier when fan-out and reconnect rates rise together.
Hot partitions
One active conversation can dominate a shard. Human attention creates burst concentration that looks nothing like average traffic.
An incident room with 20,000 members can produce:
intense append bursts
huge subscriber fan-out
unread churn
high reconnect pressure if users are mobile
many clients sitting on one hot conversation partition waiting for live updates
You do not get to average that away.
Metadata can dominate payload
A short text message might be 200 bytes. The system work around it can be much larger:
per-device routing
per-user inbox deltas
unread recomputation
receipt updates
notification scheduling
presence joins and leaves
analytics and abuse checks
Non-obvious observation number four: in mature chat systems, metadata churn often becomes more expensive than the message payload path itself.
Ordering cost rises non-linearly
Per-conversation ordering is generally affordable because it scopes contention to one shard or logical sequencer. Stronger-than-conversation ordering expands the coordination domain, which increases tail latency and failure sensitivity.
The subtle scaling issue is not just assigning sequence. It is preserving the user’s belief that sequence is stable across devices and regions under replay. A message may be correctly ordered in storage but still appear temporarily misordered on some clients if live fan-out, backlog replay, and local pending state are not reconciled carefully.
At that point the ordering bottleneck is not the sequencer. It is the surrounding ordering metadata and recovery path.
Architecture diagrams hide this badly. They usually omit replay backlog as a scaling dimension, receipt propagation as a competing write path, presence as a lease system rather than a fact store, and the fan-out tail rather than the median as the semantic failure driver.
Failure chain 1: the log is correct, but devices observe the wrong order
This is the failure most teams discover late because the message ledger looks healthy.
A recipient phone goes offline on unstable transit connectivity. The sender sends M1, then M2. The server durably appends them in correct sequence. The recipient desktop gets M2 first because a live fan-out retry for M1 is delayed. Later the phone reconnects, fetches both from storage, and renders them in correct order. The desktop finally replays M1. Meanwhile the sender sees “sent,” and depending on the product definition may even see “delivered.”
The early signal is usually small: a rise in gap-fetch requests, resend suppression hits, and client-side reorder corrections. Support may hear “messages came in wrong order” before engineers see anything dramatic.
The dashboard often shows websocket reconnects rising, fan-out worker latency creeping upward, or replay queue depth inching higher. Nothing looks catastrophic.
What is actually broken first is ordering truth at the edge of observation. The log is correct. The device-visible sequence is not. If the client trusts arrival order, or advances receipts before replay catch-up completes, the UI starts telling a story the conversation log does not support.
Immediate containment is blunt and correct: clients that detect a sequence gap should stop trusting live arrival order, suppress optimistic delivery-state advancement until the gap is closed, and switch affected conversations into fetch-and-reconcile mode rather than continuous live rendering.
The durable fix is protocol-level. Reconnect needs explicit sequence-gap detection, ordered replay windows, and receipt gating that effectively says: do not claim delivered or read past a gap you have not closed.
Longer-term prevention means treating reconnect as a first-class correctness path, not a transport afterthought. Test network handoff, background wakeup, duplicate resend, and partial replay under load. The moment you have multi-device chat, replay correctness is part of the product contract.
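The receipt-gating rule can be sketched as a single client-side function: given the set of sequences a device has actually materialized, it may only claim delivery or read progress up to the first unclosed gap. This is an illustrative sketch, not a protocol specification, and `max_claimable_seq` is an invented name:

```python
# Hypothetical sketch: receipt gating during reconnect. The client may
# not advance delivered/read claims past a sequence gap it has not closed.

def max_claimable_seq(materialized_seqs):
    """Highest sequence the client may honestly claim, given the set of
    sequences it has actually materialized. Claims stop at the first gap."""
    if not materialized_seqs:
        return 0
    claim = 0
    expected = min(materialized_seqs)   # assume replay starts at the oldest held seq
    for s in sorted(materialized_seqs):
        if s != expected:
            break                       # gap found: stop claiming here
        claim = s
        expected += 1
    return claim

# Device holds 48188..48190 and 48192, but 48191 is still being replayed.
have = {48188, 48189, 48190, 48192}
assert max_claimable_seq(have) == 48190   # 48192 is visible but not claimable
```

In a real client the baseline would be the device's last durable cursor rather than the minimum of the held set, but the invariant is the same: no receipt crosses an open gap.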
Systems do not usually lose the message first. They lose the story around the message.
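The containment and protocol fix above can be sketched as a small client-side buffer. This is a minimal illustration, assuming the server assigns contiguous per-conversation sequence numbers; the class and method names are hypothetical, not from any specific protocol:

```python
# Hypothetical sketch: a client-side buffer that detects sequence gaps,
# renders only the contiguous prefix, and gates delivery acknowledgment
# until the gap is closed.

class ConversationBuffer:
    """Tracks the contiguous replay frontier for one conversation."""

    def __init__(self):
        self.frontier = 0   # highest contiguously rendered sequence
        self.pending = {}   # out-of-order arrivals stranded past a gap

    def on_live_message(self, seq, msg):
        """Returns (renderable, ack_frontier): messages now safe to render
        in order, and the highest sequence the client may acknowledge."""
        if seq <= self.frontier:
            return [], self.frontier        # duplicate resend: suppress
        self.pending[seq] = msg
        renderable = []
        # Advance only through a contiguous run; stop at the first gap.
        while self.frontier + 1 in self.pending:
            self.frontier += 1
            renderable.append(self.pending.pop(self.frontier))
        return renderable, self.frontier

    def has_gap(self):
        """True while arrivals are stranded past an unclosed gap, which is
        the signal to switch into fetch-and-reconcile mode."""
        return bool(self.pending)
```

In the desktop scenario above, M2 arriving first is held in `pending`, `has_gap()` flips the conversation into reconcile mode, and the acknowledged frontier stays behind the gap, so no receipt can outrun replay.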
Failure chain 2: sender sees “sent,” but the system cannot yet honestly say “delivered”
This failure is quieter and more corrosive.
A sender writes a message successfully. The durable append succeeds in 40 ms. The UI flips from pending to sent. Downstream fan-out is delayed because the recipient is offline, push handoff is slow, and inbox materialization for one shard is backlogged. Nothing justifies “delivered” yet. But product language or sloppy state mapping makes “sent” feel emotionally equivalent to “delivered.”
The early signal is divergence between durable-append latency and time-to-first-recipient-observation. Write latency stays healthy while first-device-ack latency or inbox materialization latency stretches badly.
The dashboard usually shows the send path green. That is the trap.
What is actually broken first is semantic progression, not transport. The system knows the message exists. It does not yet know that any recipient surface has observed it. If the UI blurs those two states, the user experiences a false promise.
Immediate containment is to narrow the claim. Preserve “sent,” but stop auto-promoting to “delivered” from weak signals like queue enqueue or push acceptance. In some products it is better to show less state than wrong state.
The durable fix is to define delivery from a real downstream boundary and instrument it separately. Sent means durable accept. Delivered means something that happened after recipient-side progress, not merely after sender-side success.
Longer-term prevention is product and backend alignment. If the company wants a delivered badge, engineering must define exactly what event makes it true and how long it can remain unknown without misleading users.
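The alignment above amounts to a small, explicit state machine: every promotion is tied to a named backend event, and signals that merely feel like progress cannot promote anything. A minimal sketch, with illustrative event names:

```python
# Hypothetical status model: each promotion requires a specific event that
# justifies the stronger product claim.

PROMOTIONS = {
    ("pending", "durable_append_ok"): "sent",        # sender-side success only
    ("sent", "recipient_device_ack"): "delivered",   # real recipient progress
    ("delivered", "read_cursor_advanced"): "read",
}

# Signals that feel like progress but prove nothing about the recipient side.
WEAK_SIGNALS = {"fanout_enqueued", "push_accepted"}

def promote(status, event):
    """Advance message status only on events that earn the stronger claim."""
    if event in WEAK_SIGNALS:
        return status                       # never promote on a weak signal
    return PROMOTIONS.get((status, event), status)
```

The point of the table is legibility during incidents: if “delivered” is wrong, there is exactly one event whose emission must be at fault.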
Failure chain 3: read receipts race ahead of replay or diverge across devices
This is the kind of failure that creates support tickets that sound impossible.
A user reads a thread on desktop through sequence 900. Their phone, which was offline, reconnects with stale state and begins replay from 870. The phone opens the conversation and emits a read cursor based on local render state or thread-open semantics. Meanwhile the desktop had already advanced the user-level read cursor. In another conversation, replay is still catching up while a read signal is sent prematurely from a device that has not yet materialized all intervening messages.
The early signal is read cursor regression attempts, read-before-delivery anomalies, or a rise in clients whose rendered max sequence lags their emitted read max sequence.
The dashboard often shows none of this unless the team built semantic observability. Transport metrics may look fine.
What is actually broken first is causal integrity of read state. The system is letting a weaker observation outrun a stronger one, or letting one device speak for visibility before it has reconciled backlog.
Immediate containment is monotonic enforcement. Never allow user-level read cursors to move backward. Reject or clamp read updates that exceed the device’s acknowledged replay frontier. In active incidents, it is often safer to delay read propagation than to publish incorrect read state.
The durable fix is to separate device-local render progress from user-visible read truth. A device can know what it rendered. The backend should only advance the shared read cursor when the device has proved it is caught up through a contiguous sequence frontier.
Longer-term prevention means making read rules explicit in the client protocol. Read is not “thread opened.” Read is “conversation in focus and contiguous render frontier advanced through N.” Anything weaker becomes a source of semantic debt.
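The two containment rules above, monotonic user cursors and clamping to the device's replay frontier, compose into one small function. A sketch, assuming sequence numbers are per-conversation integers:

```python
def advance_read_cursor(user_cursor, device_frontier, requested):
    """Apply a device's read update safely:
    - clamp to the device's contiguous replay frontier (a device cannot
      claim to have read messages it has not yet materialized), and
    - never let the user-level cursor move backward."""
    candidate = min(requested, device_frontier)
    return max(user_cursor, candidate)
```

In the scenario above, the stale phone (frontier 870) requesting a read through 880 is clamped to 870, and the user-level cursor stays at the desktop's 900 rather than regressing.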
Failure chain 4: presence becomes stale or misleading during network transitions
Presence is the feature most teams underestimate because it looks lightweight and feels secondary until users interpret it as truth.
A user moves from Wi-Fi to cellular. The old connection lingers in the presence system because the lease has not expired. The new connection is slow to register or lands in another region whose replication is delayed. For 20 to 60 seconds, the user may appear online on one device and offline on another, or appear online while direct message delivery still falls back to offline handling.
The early signal is a rising tail on presence lease expiry, duplicate session overlap, or disagreement between active transport count and user-visible online state.
The dashboard often shows healthy websocket counts and only mild presence store lag.
What is actually broken first is not connectivity. It is the mapping from partial liveness signals to the product word “online.” The system is turning a lease heuristic into a user-visible fact.
Immediate containment is to degrade wording and suppress overconfident state. “Active recently” is safer than “online now” when the presence backend is lagging. During incidents, lengthening presence smoothing windows can reduce visible flapping, though at the cost of freshness.
The durable fix is to model presence as expiring confidence, not exact binary truth. Separate session liveness, recent activity, and user-visible presence. Do not let one stale heartbeat dominate the entire experience.
Longer-term prevention means product discipline. Presence should not imply delivery availability or message visibility. Users make that leap on their own. Engineering should not reinforce it unless the backend can support it.
Presence is one of the easiest places to ship a lie with a green dashboard.
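Modeling presence as expiring confidence can be as simple as mapping heartbeat age to the strongest claim the backend can support. A sketch with illustrative thresholds:

```python
import time

# Hypothetical sketch: presence as expiring confidence, not a boolean.
# The 30 s / 300 s thresholds are illustrative tuning knobs.

def presence_label(last_heartbeat, now=None, fresh=30, recent=300):
    """Map heartbeat age to a user-visible claim the backend can defend."""
    now = time.time() if now is None else now
    age = now - last_heartbeat
    if age <= fresh:
        return "online"            # heartbeat young enough for a strong claim
    if age <= recent:
        return "active recently"   # degrade wording instead of guessing
    return "offline"
```

During an incident, widening `fresh` is the code-level version of “lengthen the smoothing window”: flapping drops, at the cost of freshness.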
Failure chain 5: group fan-out is mostly successful, but not uniformly successful
Large groups fail asymmetrically.
A message is posted to a 5,000-member room. The append succeeds. Ninety-two percent of currently online recipients get it within 300 ms. A minority on one overloaded fan-out worker or one slower region receive it 8 to 20 seconds later. Some offline recipients only see it on next fetch. During that window, part of the room is already discussing the message while another part has not seen it yet. If reactions or receipts are visible quickly, lagging users may see reactions to a message that has not appeared for them.
The early signal is widening fan-out latency percentiles by conversation size, region, or worker partition, not by global message-send rate.
The dashboard often shows acceptable average delivery times. The problem is in the tail.
What is actually broken first is shared conversational reality. The room no longer has one present tense. Most recipients have progressed. A minority are temporally behind, sometimes far enough behind that follow-up messages appear nonsensical.
A lagging minority is not just a tail-latency problem. Once replies, reactions, or seen state get ahead of message visibility, the room has split into different realities.
Immediate containment may mean throttling non-essential per-message side effects in large rooms, such as rich receipt propagation or real-time seen updates, to preserve core message fan-out. In severe cases, it is better to prioritize message delivery and defer secondary metadata.
The durable fix is architectural. Large-group fan-out should not be treated as “direct chat, but more recipients.” It needs different economics: tiered fan-out, lazy inbox materialization where possible, aggressive backpressure control, and explicit fairness so hot rooms do not starve everything else.
Longer-term prevention requires per-conversation-size strategies, not one universal pipeline. Small groups and big channels should diverge earlier in the architecture than many teams expect.
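Per-conversation-size strategies can be made explicit at the send path rather than implied by emergent behavior. A sketch of strategy selection, where the thresholds and strategy names are illustrative, not a standard:

```python
# Hypothetical sketch: pick a fan-out shape by conversation size, so small
# groups and big channels diverge early instead of sharing one pipeline.

def fanout_plan(member_count, online_subscribers):
    if member_count <= 8:
        # Small group: eager push to every member device, rich receipts OK.
        return {"strategy": "eager_per_device", "receipts": "per_message"}
    if member_count <= 500:
        # Mid-size: push to online members, materialize inboxes async.
        return {"strategy": "push_online_async_inbox", "receipts": "cursor_only"}
    # Large channel: append once to the channel log, push only to live
    # subscribers, let everyone else fetch on demand.
    return {
        "strategy": "channel_log_lazy_fetch",
        "push_targets": min(online_subscribers, 10_000),  # cap live push work
        "receipts": "aggregated",
    }
```

The cap on `push_targets` is one concrete form of the fairness rule above: a hot 5,000-member room gets bounded live-push work instead of starving everything else.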
The first real failure is rarely websocket count
Teams often spend early effort on socket fleets, connection pooling, and keepalive tuning. That work matters. It is usually not where trust breaks first.
The early signal of real trouble is more often:
replay backlog age rising
receipt propagation lag widening
gap-fetch frequency climbing
read anomalies appearing
presence lease disagreement increasing
fan-out tail latency separating from median
The dashboard may still look acceptable if it is built around connection health and send throughput.
What is actually broken first is user-visible coherence. Messages exist, but order is unstable. Reads are advancing too early. Presence is lying. Delivery semantics are more optimistic than the system’s actual knowledge.
Immediate containment is to narrow claims and simplify behavior under degradation. Preserve durable send. Preserve ordered replay. Delay or suppress secondary semantics when the system is uncertain.
The durable fix is observability and architecture aligned to the product contract, not just the transport path.
Longer-term prevention is cultural. Operators and product teams need to agree that a chat system can be “up” and still be broken if it is speaking falsely.
If the message log is right but the receipts are wrong, users will say chat is broken. They will be right.
The expensive part of chat is not sending a message. It is earning the right to describe what happened next.
Production realism: the painful chat incidents are often the ones where every individual subsystem is “mostly working.” The write path is healthy. Fan-out is degraded but progressing. Push is delayed, not dead. Presence is stale, not broken. From an infrastructure viewpoint, the platform looks wounded but available. From a user viewpoint, that is often the worst state because the system is confidently showing half-truths.
There is no free strong guarantee in chat. Every stronger promise buys some combination of extra coordination, write amplification, storage churn, battery impact, or operator burden.
Stronger ordering
Benefit: stable conversation experience, easier replay, cleaner unread and reply semantics
Cost: localized sequencing bottlenecks, potential shard hot spots, more careful retry logic
This trade is usually worth it.
Global or cross-room order
Benefit: conceptual simplicity in architecture diagrams
Cost: broad coordination domain, worse tail latency, much larger blast radius
This is usually a bad trade unless the product has a rare requirement for globally ordered event streams.
Per-message read receipts
Benefit: intuitive small-group UX and precise support or debug visibility
Cost: very high metadata writes in active or large conversations
This is overkill unless the product materially depends on message-level accountability or conversation sizes remain consistently small.
High-fidelity presence
Benefit: chat feels lively and immediate
Cost: heartbeat load, mobile battery drain, stale-state edge cases, user-visible false precision
Most systems should choose good-enough presence, not exact presence.
Synchronous deep fan-out
Benefit: cleaner end-to-end latency in small groups
Cost: send-path sensitivity to large groups, slow recipients, and downstream failures
For large-scale chat, this often becomes a trap. It keeps latency pretty until one hot room makes it ugly for everything.
Multi-region active-active semantics
Benefit: local write latency and regional resilience
Cost: harder ordering, harder replay, harder dedupe, uglier merge semantics
A strong defensible judgment: for most chat products, authoritative conversation sequencing should be pinned to one region or leader at a time. Elegant active-active symmetry looks attractive on whiteboards and becomes semantically awkward in the exact moments users need the system to feel stable.
At 10x, the architecture stops being about basic capability and starts being about scoping truth.
The sharpest change is not that socket count gets bigger. It is that replay, receipts, and fan-out tail start governing the product more than the happy-path send. A system that once behaved like a low-latency messaging service starts behaving like a distributed reconciliation system. Device count per user rises. Offline backlog matters more. Reconnect storms stop being rare. Direct messages remain relatively manageable, but large groups force different economics and different semantics. Preserving one coherent room matters more than preserving one fast path.
That is why mature systems compress rather than expand their promises at 10x. They keep ledger truth strict. They keep replay correctness strict. Then they get selective about what else deserves immediate precision.
1. Conversation becomes the natural partition
You stop asking whether chat data can be sharded and start asking whether your shard boundary matches the thing users expect to be ordered.
The answer is almost always the conversation or channel.
2. Read state becomes aggressively compact
Per-message receipt models become too expensive across the board. Systems move toward:
per-user read cursors
compaction-friendly unread derivation
summarized large-group visibility
selective richer semantics only for small groups or premium workflows
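The compaction win is concrete: a per-user read cursor turns unread state into arithmetic over two integers per conversation, instead of a receipt row per message. A minimal sketch:

```python
# Hypothetical sketch: unread derived from a per-user read cursor, making
# read state O(conversations) per user rather than O(messages).

def unread_counts(latest_seqs, read_cursors):
    """latest_seqs: conversation -> highest message sequence in the log.
    read_cursors: conversation -> this user's read cursor (default 0)."""
    return {
        conv: max(0, latest - read_cursors.get(conv, 0))
        for conv, latest in latest_seqs.items()
    }
```

The `max(0, ...)` guard also makes the derivation tolerant of a cursor that briefly runs ahead of a replica's view of the log, which is exactly the kind of transient skew this model has to absorb.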
3. Fan-out becomes tiered
Small groups can still receive relatively rich per-device handling. Large groups often require a different model:
append once to channel log
materialize references lazily
push live updates to currently subscribed members
let inactive members fetch on demand
aggregate “seen” or presence rather than track fully individualized device truth
This is one place where mature systems separate DM semantics from channel semantics instead of pretending one model fits both.
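The channel model above, append once, push only to live subscribers, fetch on demand for everyone else, fits in a few lines. An in-memory stand-in for a real log store, with hypothetical names:

```python
# Hypothetical sketch of the large-channel shape: one append to the channel
# log, live push to currently subscribed members, lazy catch-up for the rest.

class ChannelLog:
    def __init__(self):
        self.log = []     # append-once message ledger
        self.live = {}    # member -> callback for live push

    def append(self, msg):
        self.log.append(msg)
        seq = len(self.log)                # sequence = position in the log
        for deliver in self.live.values():
            deliver(seq, msg)              # push only to subscribed members
        return seq

    def fetch_since(self, cursor):
        """Inactive members catch up on demand from their last cursor."""
        return list(enumerate(self.log[cursor:], start=cursor + 1))
```

Note what is absent: no per-member inbox write on the send path. Membership size affects live push work only, and everyone else pays at read time.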
4. Reconnect becomes a first-class protocol
At small scale, reconnect is treated as a routine socket concern. At 10x, it becomes protocol design:
stable client message IDs
sequence gap detection
replay windows
idempotent resend semantics
explicit cursor reconciliation
server hints for fast catch-up
Healthy large systems treat reconnect as state repair, not best-effort continuation.
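State repair has a simple server-side core: the client presents its last contiguous cursor, the server replays the gap in order within a bounded window, and only then does live delivery resume. A sketch, with the window size as an illustrative knob:

```python
# Hypothetical sketch of reconnect as state repair rather than best-effort
# continuation of a socket.

def reconnect(server_log, client_cursor, window=100):
    """server_log: ordered list of (seq, msg). Returns (replay, new_cursor).
    Replays at most `window` messages per round so huge offline backlogs
    page through instead of flooding a just-reconnected device."""
    replay = [(seq, msg) for seq, msg in server_log if seq > client_cursor]
    replay = replay[:window]
    new_cursor = replay[-1][0] if replay else client_cursor
    return replay, new_cursor
```

The client loops this until the returned cursor reaches the server's head; receipt gating from the earlier failure chains means no read or delivered claim may pass the cursor while the loop is still running.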
5. Operational semantics matter more than elegance
At 10x, the best design is often the one that makes incidents understandable. Simpler semantics with honest UI language usually beat theoretically stronger guarantees that operators cannot reason about when half the fleet is degraded.
Real chat systems are constantly being perturbed by mobile backgrounding, NAT expiry, push delay, region-local backlog, and synchronized reconnect waves. The point is not that these things happen. The point is that they happen while users are staring at a conversation and assigning meaning to every badge and bubble.
That means observability has to be semantic, not merely infrastructural.
Useful signals include:
durable send acceptance latency
sequence assignment latency
time from durable write to first recipient device ack
conversation gap-fetch frequency
duplicate-send suppression rate
user-level read cursor regression attempts
stale presence lease distribution
mismatch rate between unread badge and server cursor
push accepted versus actually followed by inbox fetch
hot conversation skew metrics
replay backlog age for reconnecting devices
receipt propagation lag by conversation size
percentage of live arrivals that required client reorder correction
share of read events emitted before contiguous replay completion
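Most of these signals are cheap ratios over events the system already emits. As one example, the last signal in the list, the share of read events emitted before contiguous replay completed, might be computed like this (field names are illustrative):

```python
# Hypothetical semantic metric: fraction of read events where the device
# claimed a read position ahead of its contiguously replayed frontier.

def read_before_replay_rate(events):
    """events: dicts with 'read_seq' (position the device claimed to have
    read) and 'replay_frontier' (what it had contiguously replayed)."""
    if not events:
        return 0.0
    premature = sum(1 for e in events if e["read_seq"] > e["replay_frontier"])
    return premature / len(events)
```

A transport dashboard will never surface this number; it exists only if someone decided the read claim itself was worth instrumenting.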
A useful production habit is to ask three questions during every chat incident.
First, what is the earliest user-visible symptom?
Second, what metric saw it first?
Third, what truth claim did the product make that the backend could not currently support?
That discipline keeps teams from treating semantic incidents as generic infrastructure noise.
Operators also need a degradation order. Protect ledger truth first. Preserve replay correctness second. Degrade presence, typing, rich receipts, and large-room seen state before you sacrifice message ordering or backlog recovery. Teams that get this backwards often keep the UI lively while the conversation itself becomes incoherent.
You can keep p95 send latency green for hours while users are already comparing screenshots of contradictory chat state.
Another hard lesson: client behavior is part of the protocol whether the server team likes it or not. A buggy client that resends without stable IDs, emits read too early, or mishandles replay can manufacture what looks like a distributed backend incident.
The operator’s job is not merely to keep the pipeline alive. It is to keep user-visible coherence intact long enough for the system to deserve trust.
Using timestamps as the ordering source instead of as metadata
This is how messages that are correct in storage become wrong on screen. Clock time is useful for display and debugging. It is too weak to anchor conversation truth under retry, offline creation, resend, and replay.
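A two-message example makes the weakness concrete: an offline-created message with a skewed client clock inverts timestamp order while the server-assigned sequence stays authoritative, and the timestamp survives only as display metadata:

```python
# Illustrative data: client timestamps invert the true order (offline
# creation, clock skew); the server-assigned sequence does not.

messages = [
    {"seq": 2, "client_ts": 1000.0, "text": "reply"},     # skewed clock
    {"seq": 1, "client_ts": 1005.0, "text": "question"},  # created offline
]

by_timestamp = sorted(messages, key=lambda m: m["client_ts"])
by_sequence = sorted(messages, key=lambda m: m["seq"])

# Timestamp order shows the reply before the question it answers;
# sequence order is the conversation truth.
assert [m["text"] for m in by_timestamp] == ["reply", "question"]
assert [m["text"] for m in by_sequence] == ["question", "reply"]
```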
Letting “sent” collapse into “probably delivered”
This usually happens because the write path is clean and the downstream path is messy. Teams start with one status, add another, and quietly map backend convenience to stronger product language. Users then interpret a sender-side success event as recipient-side progress.
Using push acceptance as a quiet proxy for delivery truth
This is one of the most common semantic cheats in chat systems. APNs or FCM accepted your notification. Fine. That proves handoff to another distributed system. It does not prove that the recipient has a visible message anywhere that users would call delivered.
Allowing read to be emitted from UI state that is not replay-safe
“Thread opened,” “screen visible,” or “viewport mounted” are tempting shortcuts. They also create impossible-seeming bugs when read outruns contiguous replay or diverges across devices.
Treating multi-device as a sync feature instead of a correctness boundary
Once a user has phone, desktop, and tablet, read, delivered, unread, and presence all become cross-device claims. Teams that postpone this usually end up rewriting their receipt model under pressure.
Reusing DM semantics for large groups
A receipt model that is coherent and affordable in a direct message becomes expensive theater in a 5,000-member room. Large groups are not just bigger DMs. They are a different economic and semantic system.
Measuring connection health more carefully than semantic health
It is common to know socket counts, p95 send latency, and reconnect rates in detail while having almost no visibility into reorder correction rate, read-before-replay anomalies, or receipt lag by room size. That is monitoring the pipe and ignoring the product.
Letting large-room receipt work contend with core fan-out
Seen updates, presence propagation, rich receipts, typing indicators, and unread recomputation all feel small. Under pressure, they can steal throughput from the one thing users actually need first: the message. This is how a system preserves activity indicators while quietly losing the room.
Assuming “mostly delivered” is good enough for conversational truth
This is acceptable in some notification systems. In chat, a lagging minority can distort the room for everyone else. If one part of a group is responding to a message another part cannot yet see, the system has already started lying.
Build the full semantic machine when:
the product is truly conversational and message continuity matters
users regularly move across devices
offline delivery is a normal path, not an edge case
order violations are visible and harmful
read or delivery semantics affect user trust
groups are large enough that fan-out shape matters
the business value depends on users believing the system’s state transitions
It is especially appropriate for team collaboration, support messaging, marketplace communication, incident coordination, field operations, and products where delayed or misleading chat state produces real confusion or financial cost.
Do not build the full semantic machine if the product does not need it.
If the use case is closer to lightweight social commenting, low-stakes messaging, notification threads, or loosely ordered community chatter, you may be better off with weaker but cheaper guarantees:
approximate activity indicators instead of exact presence
“new activity” instead of strong delivered semantics
thread-level seen markers instead of per-message read
eventual conversation ordering after reconciliation rather than immediate precision
fetch-first catch-up rather than elaborate live fan-out guarantees
Senior teams are often better not because they add more guarantees, but because they know which guarantees the product can safely avoid.
Senior engineers start with the contract, not the socket.
They ask:
What exactly does each UI state imply?
Which event makes a message real?
Where is ordering authoritative, and where is it only approximate?
What can be monotonic?
What can be derived?
What can be delayed without violating user trust?
What breaks first during reconnect storms and fan-out lag?
Which semantics are worth paying for in large groups, and which are not?
They also separate three layers very clearly.
1. Ledger truth
The durable message record and authoritative conversation position.
2. Distribution progress
Which users, devices, queues, or external systems have observed that truth.
3. User-visible interpretation
The labels, badges, ordering, and presence indicators shown in the UI.
Those layers should not degrade the same way. Protect ledger truth first. Preserve replay correctness second. Let interpretation become less ambitious before the core conversation becomes less coherent.
That separation is where real judgment comes from. Teams that collapse those layers into one vague “message state” model usually build systems that demo well and fail strangely.
The senior instinct is not to maximize guarantees. It is to make guarantees legible, scoped, monotonic where possible, and cheap enough to survive real traffic and bad networks.
Chat is not a transport problem with some UI on top. It is a conversation-truth problem with a transport system underneath.
A chat system does not fail only when messages disappear. It also fails when the product speaks with more certainty than the backend possesses.
The hardest part of chat is not moving a message quickly. It is deciding which transitions are allowed to become truth in front of the user.
Chat architecture is easy to underestimate because the visible surface is simple. A few bubbles, a status icon, a badge, a green dot. Behind that surface sits a constant negotiation between order, delivery, visibility, retries, reconnects, offline paths, and fan-out scale.
A good chat system chooses a clear message acceptance boundary, an honest ordering scope, a defensible definition of delivered, a workable definition of read, and a presence model that does not pretend to be more exact than it is. It treats reconnect as state repair, multi-device as a first-class correctness problem, and large-group fan-out as a different system shape, not just a larger number.
Users do not experience your internal event graph. They experience what your product dares to claim. In chat, every badge and bubble is a wager that the backend knows enough to say it out loud.