This is where the design actually lives. Not in the cache topology. In what the system has promised by the time it returns success, and what it can still defend afterward.
Write-through
The usual flow is simple:
client sends update
service validates and writes origin
origin commit succeeds
service updates or invalidates cache
service returns 200
If origin commit takes 10 to 14 ms, invalidation adds 1 to 2 ms, and service overhead is another 2 to 3 ms, p50 may land around 13 to 19 ms. Under contention, p99 can easily move into the 80 to 150 ms range.
That is the visible cost.
What the pattern buys is narrower semantics. By the time the caller sees success, the source of record has accepted the write. If cache update fails afterward, you have stale-read risk, not acknowledged-write loss. Those are very different incidents.
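That distinction is visible in a minimal sketch. The `Origin` and `Cache` stand-ins and the handler shape below are illustrative, not any specific client API:

```python
# Minimal in-memory stand-ins for a durable origin store and a cache.
class Origin:
    def __init__(self):
        self.data = {}

    def commit(self, key, value):
        # In a real system this is the durable commit and raises on failure.
        self.data[key] = value


class Cache:
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def invalidate(self, key):
        self.data.pop(key, None)


def write_through(origin, cache, key, value):
    """Success is returned only after the durable commit.

    If origin.commit raises, no success is ever returned. If only the
    invalidation fails, the worst case is a bounded stale-read window,
    never acknowledged-write loss.
    """
    origin.commit(key, value)      # durable first
    try:
        cache.invalidate(key)      # best-effort, after commit
    except Exception:
        pass                       # stale-read risk only; log/alert in practice
    return {"status": 200}
```

The ordering is the whole point: the cache step sits strictly after the commit, so its failure mode is staleness, not loss.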
The first bottleneck is usually the synchronous origin path, not the cache:
lock contention on hot rows or partitions
WAL or redo flush pressure
replication acknowledgment delay
index maintenance
invalidation fanout when one logical write touches many keys
At 500 writes per second, those costs may be background noise. At 5,000, one "simple" update may imply primary commit, replication, CDC emission, index work, and several invalidations. Write-through often hits synchronous write amplification before it hits anything cache-specific.
For primary state, that is still usually the correct default. Money movement, entitlements, compliance-sensitive settings, order transitions, and records users will later cite as "I saved this" should usually pay foreground cost rather than hide it.
That is a judgment call, not a theorem. If your product contract really says "accepted for processing" rather than "saved," weaker semantics can be fine. But product wording is part of the consistency model. A button label, webhook, receipt email, or downstream event often commits you to semantics stronger than the storage team thinks it is providing.
The caveat is narrower than people think. Write-through is honest about origin commit, not necessarily about all readers. If reads come from lagging replicas, search indexes, other regions, or downstream materializations, the guarantee stops at the primary boundary unless the product contract says otherwise.
There is also a real operational cost people understate here. Teams fix correctness at the write boundary, then discover that synchronous invalidation fanout has become the next bottleneck. That is still usually a better problem than proving whether the write ever existed, but it is not free.
Write-behind
The common flow is:
client sends update
service validates
service updates cache, appends to a queue, or both
service returns 200
workers materialize later
Now p50 may drop to 2 to 5 ms.
This is where teams start confusing speed with completion.
The question is not whether the request was fast. The question is what happened before step 4.
There are three materially different answers.
Case 1: Cache updated, no durable handoff
This is the dangerous form.
The cache shows the new value. The caller sees success. No durable store has committed the write, and no durable queue has recorded intent.
If the process dies, the write can disappear.
That is not a weaker durability contract. It is speculative state presentation.
Warning: if success is returned after cache update but before durable handoff, the system is allowed to show users state it never durably possessed.
Some teams do this intentionally for low-risk pending state. That can be acceptable, but only if it is explicitly presented as pending and no downstream system treats it as committed. The mistake is not speculation. The mistake is speculation dressed up as completion.
Case 2: Durable queue append succeeded, origin flush deferred
This is the form that can be justified.
The service returns success only after appending a mutation to durable storage. Origin is updated later.
Now the contract is weaker, but coherent: "your intent is durably recorded and will be materialized."
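A sketch of that contract, with hypothetical `DurableQueue` and worker names: success is returned only after the append, and a `202 Accepted`-style response states the weaker meaning more honestly than a bare 200:

```python
import collections


class DurableQueue:
    """Stand-in for a replicated, durable log. A real append would block
    until the mutation is replicated and would raise on failure."""
    def __init__(self):
        self.log = collections.deque()

    def append(self, mutation):
        self.log.append(mutation)


def accept_write(queue, key, value):
    """Write-behind, justified form: ack only after durable intent is recorded."""
    queue.append({"key": key, "value": value})
    # 202 Accepted says "durably recorded, will be materialized" -- which is
    # exactly what this path can defend, and no more.
    return {"status": 202}


def drain_one(queue, origin):
    """Worker side: materialize one recorded mutation into the origin store."""
    m = queue.log.popleft()
    origin[m["key"]] = m["value"]
```

Note that between `accept_write` and `drain_one`, the origin holds nothing. That gap is the backlog the rest of this section prices out.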
That boundary is still narrower than many teams assume. Durable queue append only buys safety if the queue’s retention, acknowledgment, replication, and replay semantics are adequate for the invariant you care about. A durable queue that cannot preserve the order your invariant needs is still a weak truth boundary.
The important classification here is not feeds versus counters versus metadata. It is invariant type. Some state is rebuildable. Some is externally observable. Some is monotonic. Some is mergeable. Some is contractually binding. Those classes matter more than the table name.
The replay model should follow that invariant class too, not the storage subsystem alone. Append-friendly state, last-write-wins state, monotonic state, and versioned workflow state do not tolerate the same replay behavior.
The scale math matters fast.
Take a service at 4,000 writes per second:
queue append p50: 3 ms
API p50: 5 ms
normal flush target: 150 ms
steady drain rate: 4,300 writes per second
Then the origin slows and workers drain only 3,200 writes per second for ten minutes.
That creates 480,000 acknowledged but not yet materialized writes.
If safe recovery afterward is 4,500 writes per second, headroom above live traffic is only 500 writes per second. Catch-up takes about 16 minutes, assuming replay parallelism is available, per-key ordering is not the limiter, and no repair work competes with catch-up.
Push harder: at 12,000 writes per second, a 15 percent drain slowdown for 20 minutes creates 2.16 million delayed writes. With only 900 writes per second of safe headroom, catch-up is about 40 minutes.
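The arithmetic is mechanical enough to keep as a sketch, using the numbers from the two scenarios above:

```python
def backlog_after_slowdown(ingress, drain, minutes):
    """Acknowledged-but-unmaterialized writes accumulated during a slowdown."""
    return int((ingress - drain) * minutes * 60)


def catch_up_minutes(backlog, safe_recovery_rate, live_ingress):
    """Optimistic catch-up time: assumes replay parallelism is available,
    per-key ordering is not the limiter, and no repair work competes."""
    headroom = safe_recovery_rate - live_ingress
    return backlog / headroom / 60


# Scenario 1: 4,000 w/s ingress, drain drops to 3,200 w/s for 10 minutes.
b1 = backlog_after_slowdown(4_000, 3_200, 10)    # 480,000 delayed writes
t1 = catch_up_minutes(b1, 4_500, 4_000)          # 16 minutes

# Scenario 2: 12,000 w/s, 15 percent drain slowdown for 20 minutes.
b2 = backlog_after_slowdown(12_000, 12_000 * 0.85, 20)  # 2,160,000 delayed writes
t2 = catch_up_minutes(b2, 12_000 + 900, 12_000)         # 40 minutes
```

The useful habit is running this math before the incident, with your own ingress, safe drain, and headroom numbers.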
That is the real comparison. Not 5 ms versus 15 ms. It is foreground latency versus minutes of acknowledged state that has still not become true in the source of record.
For rebuildable projections, recommendation outputs, non-critical feed state, and some forms of bulk materialization, this trade can be right. But the queue has now joined the truth boundary. It is no longer plumbing.
That is where many designs become operationally ugly. The architecture still looks elegant. The runbook now needs backlog age, replay headroom, hot-partition diagnostics, selective suppression, targeted replay, and a human answer to "did we save it or not?"
A stronger judgment is warranted here: many organizations should not run write-behind for user-visible primary state. If you do not have per-key replay inspection, selective repair, backlog-age visibility, and a credible story for disagreement after failover, the design is usually less mature than the latency graph suggests.
Case 3: Cache updated and durable queue appended
Teams reach for this because they want two things at once: local read-your-write and later durable repair.
That can work. It is also where incidents get more confusing:
cache update succeeds, queue append fails
queue append succeeds, cache update fails
both succeed, flush partially succeeds
flush succeeds, stale refill later overwrites cache
replay lands after newer live traffic and clobbers correct state
This is the version that most often fools dashboards. Cache hit rate stays green. API latency stays green. Consumers look healthy. Then a failover or eviction strips away the masking layer and exposes that the origin has been behind for minutes.
That is the real lesson. Write-behind usually fails by creating multiple competing versions of truth, not by simply becoming slow.
A non-obvious point: local read-your-write is often the feature that persuades teams the design is sound. It can also be the feature that hides the problem the longest. The same warm cache path that makes demos look great can keep the team from seeing that other readers are already on older truth.
Read-through
Read-through standardizes the miss path:
request arrives
cache miss
service loads from origin
service populates cache
response returns
That is useful. It removes a lot of sloppy refill logic.
But it says almost nothing about mutation completion. If the write path updates origin and leaves cache alone, read-through eventually refills from durable state. If the write path updates cache first and origin later, read-through may preserve the illusion until expiry, then reintroduce older durable state.
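A sketch makes that concrete. Plain dicts stand in for the cache and origin; the refill pulls whatever origin currently holds, fresh or old, and a warm hit never consults origin at all:

```python
def read_through(cache, origin, key):
    """Centralized miss path: hit returns the cached value; miss loads from
    origin and populates the cache before returning."""
    value = cache.get(key)
    if value is not None:
        return value                 # hit: origin is never consulted
    value = origin.get(key)          # miss: refill from whatever origin holds now
    if value is not None:
        cache[key] = value           # refills durable state, whether fresh or stale
    return value
```

Nothing in this path knows whether origin is authoritative or lagging. That is the sense in which read-through is a refill policy, not a truth policy.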
Teams often over-credit read-through because it makes the read path look tidy. Tidy refill logic is not the same thing as a sound authority model.
What the caller thinks happened versus what is durably true
This is where the contract usually breaks.
When an API returns 200, the caller typically hears: "the system has my change."
Under write-through, that may be close to true at the primary-store boundary.
Under write-behind, it may mean any of these:
cache was updated
request was accepted in process memory
mutation was durably appended to a queue
a worker will attempt the write later
Those are not minor variations of one guarantee. They are different contracts.
If product language, UI behavior, and runbooks treat them as equivalent, the system is telling a prettier story than the write path can defend.
How It Evolves with Scale
What changes with scale is not just throughput. It is which metric stops telling the truth first.
At smaller scale
Imagine a billing admin tool at 200 writes per second. The primary can commit in 8 to 15 ms. Cache mostly exists to keep reads cheap.
Write-through is usually the better default. Saving 8 or 10 ms is rarely worth importing replay machinery, lag observability, divergence detection, and a second operational truth path.
Write-behind is overkill here unless write latency is already dominating the product experience or the origin path is measurably the bottleneck you cannot avoid another way.
At medium scale
Now take a consumer service at 3,000 writes per second. Each write hits a hot table. Invalidation touches multiple keys. Bursts push p99 above 120 ms.
The first thing bending is often the origin path:
commit latency rises
lock contention rises
log flush pressure rises
replication lag rises
invalidation fanout worsens tail behavior
retries amplify existing load
This is where write-behind starts to sound attractive.
Sometimes it should. But the moment acknowledgment moves ahead of materialization, queue lag stops being a backend detail. It becomes uncommitted truth inventory.
At 3,000 writes per second, 90 seconds of lag means 270,000 acknowledged mutations are still not true in origin. That is not merely worker slowness. It is correctness exposure.
At larger scale
Now take 40,000 writes per second across regions, with regional caches, sharded origin storage, and downstream consumers reading from streams or materializations.
At that point the question is no longer "can we ingest?" It is "can the system still maintain one credible history of truth under replay, failover, and uneven regional state?"
What changes at 10x scale:
- Order matters more than raw speed. A replay path that is fast but misordered is worse than a slower one that preserves semantics.
- Drain throughput becomes a hard recovery budget. If ingress is 40,000 writes per second and safe drain is 44,000, headroom is only 4,000. A short slowdown can create millions of delayed writes and turn catch-up into a live incident.
- Disagreement stops being local. A stale cache entry can now influence authorization, routing, limits, or other downstream decisions.
- Recovery window becomes a product question. A 45-minute replay window is a 45-minute period during which acknowledged writes are still becoming true.
- Observability gets more expensive. Queue depth is not enough. You need oldest unmaterialized write age, commit-to-materialization delay, replay headroom, divergence counts, and some way to estimate outstanding acknowledged-but-not-materialized state.
- Failover becomes a truth event. The cache layer that had been masking lag disappears, and the older state in origin becomes visible immediately.
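The observability point is worth one sketch. Track acknowledgment and materialization per partition and surface the age of the oldest pending mutation, not just depth; the names here are hypothetical:

```python
class BacklogMonitor:
    """Tracks acknowledged-but-unmaterialized mutations per partition.

    Queue depth alone can look healthy while one hot partition is minutes
    behind; oldest pending age is the metric that tells the truth first.
    """
    def __init__(self):
        self.pending = {}   # partition -> list of (ack_time, mutation_id)

    def acknowledged(self, partition, mutation_id, now):
        self.pending.setdefault(partition, []).append((now, mutation_id))

    def materialized(self, partition, mutation_id):
        self.pending[partition] = [
            (t, m) for (t, m) in self.pending[partition] if m != mutation_id
        ]

    def depth(self):
        """Total outstanding mutations. Necessary, not sufficient."""
        return sum(len(v) for v in self.pending.values())

    def oldest_pending_age(self, now):
        """Age of the oldest acknowledged write still not true in origin."""
        ages = [now - t for entries in self.pending.values() for (t, _) in entries]
        return max(ages, default=0.0)
```

In a real system the per-partition lists would live in the queue's own metadata; the point is which number reaches the front page of the dashboard.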
Write-through usually tells you it is hurting because latency rises. Write-behind can keep looking healthy while correctness gets worse.
What stops scaling cleanly first
Each pattern fails at a different boundary.
Write-through
Usually at synchronous write amplification in origin, and sometimes the next pain is foreground invalidation fanout.
Write-behind
Usually at drain lag, replay headroom, or replay semantics, not request latency.
Read-through
Usually at interpretability, when teams stop being able to tell whether cache is reflecting the authority model or hiding its weakness.
Batching changes the economics and the blast radius
At scale, write-behind usually means batching. Workers flush groups of 100 to 500 updates every 20 to 50 milliseconds.
That helps throughput. It also enlarges the failure unit:
one failed batch may represent meaningful user-visible state
retries duplicate groups, not single writes
partial batch success complicates idempotency
replay becomes burstier and harder to inspect
Batching is not a free gain. It trades efficiency for larger repair units.
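One common way to bound the repair unit is per-item idempotency keys inside the batch, so a retried batch re-applies only what did not land. A sketch, with plain dicts standing in for origin and a durable applied-id set:

```python
def flush_batch(batch, origin, applied_ids):
    """Flush a batch with per-item idempotency keys.

    A retried batch skips items that already materialized, so partial
    success does not turn into duplicate application. Returns the ids
    that failed this attempt and still need retry.
    """
    failed = []
    for item in batch:
        if item["id"] in applied_ids:
            continue                      # already landed in a prior attempt
        try:
            origin[item["key"]] = item["value"]
            applied_ids.add(item["id"])   # durable in a real system
        except Exception:
            failed.append(item["id"])     # retry only the remainder
    return failed
```

This does not make batching free. It makes the failure unit inspectable, which is what the runbook actually needs.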
One thing teams consistently underestimate is recovery pain after the store is already "back." The database graph looks normal again. The incident is still running, because replay is now competing with live traffic and every cautious decision slows restoration of truth.
Correctness-repair cost can dominate the steady-state win
This is the part teams discover late.
A write-through path may cost 15 ms. A write-behind path may cost 5 ms. That looks like a clear win until failure forces you to pay for:
replay ordering guarantees
version or compare-and-set protection
divergence detection
targeted repair tooling
support workflows for "saved but missing"
incident modes where the system is alive but not trustworthy
At enough scale, the cost of proving what really happened after failure can dominate the request-path savings that justified the design.
It helps to separate the things teams actually collapse into one decision.
Write-through
Acknowledged: origin commit
Durable: origin
Observed next: whatever readers see after cache invalidation, which may still lag elsewhere
Typical confusion: teams assume honest primary commit means globally fresh reads
Write-behind
Acknowledged: cache update, durable queue append, or sometimes only local acceptance
Durable: maybe the queue, maybe nothing credible yet
Observed next: often cache first, origin later
Typical confusion: teams collapse durable intent, visible state, and materialized state into one idea of "written"
Read-through
Acknowledged: not relevant as a write concept
Durable: whatever origin already was
Observed next: cache refill from origin on miss
Typical confusion: teams mistake centralized refill behavior for a source-of-truth policy
The common conflation is not "three cache patterns." It is three different layers of semantics:
what was acknowledged
what is durable
what future reads will observe
A design doc that says "read-through plus write-behind for performance" but does not define those three things is not unfinished around the edges. It is unfinished at the center.
Write-through trade-offs
You pay synchronous latency and direct origin pressure. If invalidation fanout is large, that foreground cost can become painful fast.
In return, the distance between "success" and "durable truth" stays short. That is usually worth a great deal for primary state.
Write-through also fails in a more bounded way. If cache update fails after commit, the repair story is usually invalidate, repopulate, or targeted repair. That is still work, but it is not replay archaeology.
Write-behind trade-offs
You get lower request latency, batching opportunities, and smoother origin load.
You also inherit:
durable queue semantics
replay ordering and idempotency
backlog monitoring
divergence detection
cache correction rules
tooling to inspect, replay, suppress, or repair specific mutations
explicit measurement of how late truth is
That is not a little more complexity. It is a second correctness system.
And the trade is not just operational weight. It is semantic weight. The moment you acknowledge before materialization, the system needs a stronger story about what success actually means.
A stronger engineering judgment here: many teams are too eager to pay this price. Unless the origin write path is clearly the bottleneck and the state is clearly repairable under the invariants that matter, write-behind is usually a more expensive choice than it first appears.
Read-through trade-offs
You get a cleaner miss path and less repeated refill code.
You also risk making authority look tidier than it is. A read-through refill can look like a harmless cache event when it is actually the moment older durable state re-enters cache and exposes the contract gap.
These incidents are hard because the first visible symptom is often not the real failure.
Failure mode 1: Success returned, cache updated, origin never got the write
Trigger
Cache is updated and success is returned before any durable commit or durable queue append exists. Then the process dies.
Visible symptom
The user sees the new value immediately. Later it reverts after expiry, eviction, or failover.
Hidden impact
The system acknowledged state it never durably had. Side effects may already have been triggered from that speculative view.
How it spreads
The hot path keeps looking correct because it keeps reading the same warm key. Another node, region, or service reads origin and sees older truth.
Why it is hard to debug
The first metrics lie. Hit rate is green. Success rate is green. Latency is green. Origin looks healthy because it never got the write. Teams often blame invalidation first.
One bad on-call move is clearing cache to fix staleness. Sometimes that is exactly what exposes how many acknowledged writes were living only in cache.
Mitigation
Do not acknowledge before durable commit or durable queue append. If speculative visibility exists, bound it explicitly and do not let other systems treat it as committed.
Operational cost
Higher write latency, or a queue that must now be operated as part of the truth boundary.
Failure mode 2: Durable queue exists, but backlog becomes correctness exposure
Trigger
Drain throughput falls below ingress because origin slows, a hot shard saturates, replay is throttled, or a schema change makes writes heavier.
Visible symptom
Very little changes at first. API latency stays low. Cache hit rate stays high. Consumers remain healthy.
Hidden impact
Acknowledged but not yet materialized writes accumulate. The system is promising more truth than it can currently produce.
How it spreads
Origin readers are stale first. Then cache evictions reveal the gap. Then failover or cross-service reads turn it into visible disagreement.
Why it is hard to debug
Teams often watch queue depth, not oldest unmaterialized write age. Consumer health can stay green while one hot partition is minutes behind.
A very familiar pattern is this: consumer CPU looked fine, autoscaling looked fine, cache looked fine. The only metric telling the truth was oldest pending mutation age, and it was not on the front page.
Mitigation
Track ingress, safe drain, oldest outstanding mutation age, replay headroom, and divergence rate. Decide in advance when to shed optional writes or downgrade weaker semantics.
Operational cost
Reserved capacity, more instrumentation, and harder operating policy.
Failure mode 3: Cache and origin disagree after failover
Write-behind is active, cache is warm, origin is behind, and a cache node or region is lost.
What shows up first is not a neat consistency bug. Same key, different answers, depending on which path was hit. Downstream systems may already be enforcing old state while front-end paths still reflect cached state.
Teams often start with replication lag or cold-cache diagnosis. Those may be present, but they are secondary. The real issue is that visible state outran durable state by contract.
The practical repair is expensive. Decide which source wins during the repair window. Force affected reads to origin if necessary. Invalidate speculative cache state. Pause dependent workflows where needed. Failover gets more expensive, origin read load spikes, and now you need targeted invalidation and per-key diagnostics just to get back to one believable answer.
Failure mode 4: Replay succeeds mechanically while semantics are wrong
Trigger
Recovery replays older or duplicate mutations. Delivery works. Ordering does not.
Visible symptom
Queue lag falls. Replay throughput looks good. Database writes succeed.
Hidden impact
Older state overwrites newer truth, or duplicate side effects fire again.
How it spreads
Wrong state lands in origin, gets cached, and becomes the basis for later decisions.
Why it is hard to debug
The green signals are the misleading ones. Replay is fast. Consumers are draining. Teams blame cache because bad values appear there first. Later they discover the cache is correctly reflecting an origin damaged by replay order.
Mitigation
Use per-entity versions, sequence IDs, compare-and-set semantics, and idempotency keys where the invariant requires them. Some invariants are append-friendly. Some are last-write-wins. Some are monotonic. Some are not safely replayable without a versioned state machine. "Replay-safe" is never one generic property.
Operational cost
More write metadata, more complicated apply logic, slower replay, and harder debugging.
Warning: a drained queue does not mean recovered state. It may only mean the wrong history was applied efficiently.
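The version-guard idea in the mitigation can be sketched as a compare-and-set apply for last-write-wins state. Names are hypothetical, and other invariant classes need different guards:

```python
def apply_if_newer(store, key, value, version):
    """Apply a mutation only if its per-entity version beats the stored one.

    Replayed, duplicated, or reordered mutations carrying stale versions
    are dropped instead of clobbering newer truth. Correct for
    last-write-wins state; monotonic or workflow state needs a stricter
    state machine than a single version comparison.
    """
    current = store.get(key)
    if current is not None and current["version"] >= version:
        return False                      # stale or duplicate replay: drop
    store[key] = {"value": value, "version": version}
    return True
```

The `>=` rather than `>` also makes exact duplicates no-ops, which is the idempotency half of the same mitigation.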
Failure mode 5: Read-through turns source-of-truth confusion into a debugging trap
Read-through sits on top of a weak write path. Cache entries expire while origin is still behind.
A miss refills from origin and the old value reappears. To the team, it looks routine. To the user, saved state has reverted.
The debugging trap is that the logs look ordinary: miss, read, fill. The refill is not the bug. The bug is that origin was still old when the product had already behaved as though the change was complete.
The mitigation is to make authority explicit. If origin is canonical, do not present cache-only writes as complete. If speculative state is allowed, isolate it or tag it. That costs elegance. It buys honesty.
Failure mode 6: Reconciliation is the first truthful signal, and it arrives late
Reconciliation compares queue, cache, and origin after lag, replay, or failure. By then the incident is already old.
It is useful as confirmation and repair input, not as the primary detector. If reconciliation is the first place you learn what was actually committed, the system is already too loose at the boundary that mattered.
What finally re-establishes trustworthy state
Trustworthy state usually comes back in a sequence:
stop issuing ambiguous acknowledgments on the affected path
declare which store is authoritative during repair
replay or drain with ordering protections on
invalidate or rebuild speculative or stale cache entries
reconcile against the chosen authority
only then restore dependent workflows fully
That is expensive operational work. The fast write path was borrowing against exactly this moment.
When To Use / When NOT To Use
Use write-through when
Use it for primary state where success should mean durable acceptance:
account balances
billing state
entitlements
order transitions
compliance-sensitive settings
any record users or support will later treat as binding
Do not use write-through when
Do not force it everywhere out of taste. Avoid it where write amplification is extreme, origin is clearly the bottleneck, and delayed materialization is acceptable.
Examples:
bulk telemetry rollups
some recommendation outputs
some feed fanout state
denormalized aggregates that can be rebuilt
That said, the real divider is not data label. It is invariant. Ask whether the state is rebuildable, externally visible, monotonic, mergeable, or contractually binding. Those questions are better design guides than "metadata" versus "business data."
Use write-behind when
Use it when the contract can honestly be weaker and the team is willing to own the recovery machinery.
Good candidates:
rebuildable projections
non-critical derived state
batch-friendly materializations
workflows where durable queue append is the real commitment boundary
Do not use write-behind when
Do not use it for data whose meaning depends on immediate durable truth.
Also do not use it when the team is not ready to own:
durable handoff guarantees
per-entity ordering or version protection
replay-safe application
lag observability
divergence detection
cache correction after failed flush
recovery-time planning, not just steady-state throughput planning
Be more skeptical than the industry usually is. Many systems adopt write-behind because the latency graph looks better, not because the correctness contract has been made explicit. That is usually a bad reason.
Use read-through when
Use it to centralize refill logic and reduce duplicated miss handling.
Do not use read-through as
Do not use it as a substitute for a write contract. It does not answer who owns truth after mutation, and it can make a weak answer look more organized than it really is.
Senior engineers usually start by writing the guarantee in plain language.
They ask:
What exactly does success mean?
Which system is authoritative in the first second after a write?
If cache and origin disagree, which one wins?
What is the maximum truth lag we will tolerate?
Can we observe that lag directly?
If the writer dies after success, what durable evidence remains?
Can replay duplicate, reorder, or overwrite newer truth?
How much recovery headroom exists above live traffic?
If backlog reaches one million writes, how long until truth is restored?
What will support say when a user says "I saved this"?
Is reconciliation a safety net, or core correctness machinery?
They also separate data classes. Mature systems rarely use one strategy everywhere. They use write-through where state is primary, write-behind where state is derived and repairable, and read-through where refill centralization is actually useful.
In some products, the right move is not stronger storage semantics. It is a more honest state model. "Accepted" and "materialized" may need to be separate product states rather than one overloaded status pretending to mean both.
One more habit matters. They resist calling cache a source of truth unless the organization truly operates it like one. Most teams do not back it up, audit it, repair it, or investigate it like authoritative storage. So when a design quietly makes cache authoritative for a period of time, it is borrowing truth from a system the organization does not actually govern as truth.
That is the durability illusion.
Cache write strategy is not a small performance knob. It is a contract about completion, authority, and recovery.
Write-through spends more on the foreground path and usually buys a narrower, more honest boundary.
Write-behind can be the right design, but only when success means durable handoff, not cache visibility, and only when replay, ordering, divergence detection, and repair are treated as first-class parts of correctness.
The final mistake to avoid is calling a system "fast writes" when what it really has is delayed truth plus fragile recovery semantics. That design can still be right. But it is not simpler, not cheaper to operate, and not safer just because the request returned quickly.