This is where the design actually lives. Not in the cache topology. In what the system has promised by the time it returns success, and what it can still defend afterward.
Write-through
The usual flow is simple:
client sends update
service validates and writes origin
origin commit succeeds
service updates or invalidates cache
service returns 200
If origin commit takes 10 to 14 ms, invalidation adds 1 to 2 ms, and service overhead is another 2 to 3 ms, p50 may land around 13 to 19 ms. Under contention, p99 can easily move into the 80 to 150 ms range.
That is the visible cost.
What the pattern buys is narrower semantics. By the time the caller sees success, the source of record has accepted the write. If cache update fails afterward, you have stale-read risk, not acknowledged-write loss. Those are very different incidents.
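That distinction is visible in a minimal sketch. The `Origin` and `Cache` stand-ins and the handler shape below are illustrative, not any specific client API:

```python
# Minimal in-memory stand-ins for a durable origin store and a cache.
class Origin:
    def __init__(self):
        self.data = {}

    def commit(self, key, value):
        # In a real system this is the durable commit and raises on failure.
        self.data[key] = value


class Cache:
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def invalidate(self, key):
        self.data.pop(key, None)


def write_through(origin, cache, key, value):
    """Success is returned only after the durable commit.

    If origin.commit raises, no success is ever returned. If only the
    invalidation fails, the worst case is a bounded stale-read window,
    never acknowledged-write loss.
    """
    origin.commit(key, value)      # durable first
    try:
        cache.invalidate(key)      # best-effort, after commit
    except Exception:
        pass                       # stale-read risk only; log/alert in practice
    return {"status": 200}
```

The ordering is the whole point: the cache step sits strictly after the commit, so its failure mode is staleness, not loss.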
The first bottleneck is usually the synchronous origin path, not the cache:
lock contention on hot rows or partitions
WAL or redo flush pressure
replication acknowledgment delay
index maintenance
invalidation fanout when one logical write touches many keys
At 500 writes per second, those costs may be background noise. At 5,000, one "simple" update may imply primary commit, replication, CDC emission, index work, and several invalidations. Write-through often hits synchronous write amplification before it hits anything cache-specific.
For primary state, that is still usually the correct default. Money movement, entitlements, compliance-sensitive settings, order transitions, and records users will later cite as "I saved this" should usually pay foreground cost rather than hide it.
That is a judgment call, not a theorem. If your product contract really says "accepted for processing" rather than "saved," weaker semantics can be fine. But product wording is part of the consistency model. A button label, webhook, receipt email, or downstream event often commits you to semantics stronger than the storage team thinks it is providing.
The caveat is narrower than people think. Write-through is honest about origin commit, not necessarily about all readers. If reads come from lagging replicas, search indexes, other regions, or downstream materializations, the guarantee stops at the primary boundary unless the product contract says otherwise.
There is also a real operational cost people understate here. Teams fix correctness at the write boundary, then discover that synchronous invalidation fanout has become the next bottleneck. That is still usually a better problem than proving whether the write ever existed, but it is not free.
Write-behind
The common flow is:
client sends update
service validates
service updates cache, appends to a queue, or both
service returns 200
workers materialize later
Now p50 may drop to 2 to 5 ms.
This is where teams start confusing speed with completion.
The question is not whether the request was fast. The question is what happened before step 4.
There are three materially different answers.
Case 1: Cache updated, no durable handoff
This is the dangerous form.
The cache shows the new value. The caller sees success. No durable store has committed the write, and no durable queue has recorded intent.
If the process dies, the write can disappear.
That is not a weaker durability contract. It is speculative state presentation.
Warning: if success is returned after cache update but before durable handoff, the system is allowed to show users state it never durably possessed.
Some teams do this intentionally for low-risk pending state. That can be acceptable, but only if it is explicitly presented as pending and no downstream system treats it as committed. The mistake is not speculation. The mistake is speculation dressed up as completion.
Case 2: Durable queue append succeeded, origin flush deferred
This is the form that can be justified.
The service returns success only after appending a mutation to durable storage. Origin is updated later.
Now the contract is weaker, but coherent: "your intent is durably recorded and will be materialized."
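A sketch of that contract, with hypothetical `DurableQueue` and worker names: success is returned only after the append, and a `202 Accepted`-style response states the weaker meaning more honestly than a bare 200:

```python
import collections


class DurableQueue:
    """Stand-in for a replicated, durable log. A real append would block
    until the mutation is replicated and would raise on failure."""
    def __init__(self):
        self.log = collections.deque()

    def append(self, mutation):
        self.log.append(mutation)


def accept_write(queue, key, value):
    """Write-behind, justified form: ack only after durable intent is recorded."""
    queue.append({"key": key, "value": value})
    # 202 Accepted says "durably recorded, will be materialized" -- which is
    # exactly what this path can defend, and no more.
    return {"status": 202}


def drain_one(queue, origin):
    """Worker side: materialize one recorded mutation into the origin store."""
    m = queue.log.popleft()
    origin[m["key"]] = m["value"]
```

Note that between `accept_write` and `drain_one`, the origin holds nothing. That gap is the backlog the rest of this section prices out.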
That boundary is still narrower than many teams assume. Durable queue append only buys safety if the queue’s retention, acknowledgment, replication, and replay semantics are adequate for the invariant you care about. A durable queue that cannot preserve the order your invariant needs is still a weak truth boundary.
The important classification here is not feeds versus counters versus metadata. It is invariant type. Some state is rebuildable. Some is externally observable. Some is monotonic. Some is mergeable. Some is contractually binding. Those classes matter more than the table name.
The replay model should follow that invariant class too, not the storage subsystem alone. Append-friendly state, last-write-wins state, monotonic state, and versioned workflow state do not tolerate the same replay behavior.
The scale math matters fast.
Take a service at 4,000 writes per second:
queue append p50: 3 ms
API p50: 5 ms
normal flush target: 150 ms
steady drain rate: 4,300 writes per second
Then the origin slows and workers drain only 3,200 writes per second for ten minutes.
That creates 480,000 acknowledged but not yet materialized writes.
If safe recovery afterward is 4,500 writes per second, headroom above live traffic is only 500 writes per second. Catch-up takes about 16 minutes, assuming replay parallelism is available, per-key ordering is not the limiter, and no repair work competes with catch-up.
Push harder: at 12,000 writes per second, a 15 percent drain slowdown for 20 minutes creates 2.16 million delayed writes. With only 900 writes per second of safe headroom, catch-up is about 40 minutes.
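The arithmetic is mechanical enough to keep as a sketch, using the numbers from the two scenarios above:

```python
def backlog_after_slowdown(ingress, drain, minutes):
    """Acknowledged-but-unmaterialized writes accumulated during a slowdown."""
    return int((ingress - drain) * minutes * 60)


def catch_up_minutes(backlog, safe_recovery_rate, live_ingress):
    """Optimistic catch-up time: assumes replay parallelism is available,
    per-key ordering is not the limiter, and no repair work competes."""
    headroom = safe_recovery_rate - live_ingress
    return backlog / headroom / 60


# Scenario 1: 4,000 w/s ingress, drain drops to 3,200 w/s for 10 minutes.
b1 = backlog_after_slowdown(4_000, 3_200, 10)    # 480,000 delayed writes
t1 = catch_up_minutes(b1, 4_500, 4_000)          # 16 minutes

# Scenario 2: 12,000 w/s, 15 percent drain slowdown for 20 minutes.
b2 = backlog_after_slowdown(12_000, 12_000 * 0.85, 20)  # 2,160,000 delayed writes
t2 = catch_up_minutes(b2, 12_000 + 900, 12_000)         # 40 minutes
```

The useful habit is running this math before the incident, with your own ingress, safe drain, and headroom numbers.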
That is the real comparison. Not 5 ms versus 15 ms. It is foreground latency versus minutes of acknowledged state that has still not become true in the source of record.
For rebuildable projections, recommendation outputs, non-critical feed state, and some forms of bulk materialization, this trade can be right. But the queue has now joined the truth boundary. It is no longer plumbing.
That is where many designs become operationally ugly. The architecture still looks elegant. The runbook now needs backlog age, replay headroom, hot-partition diagnostics, selective suppression, targeted replay, and a human answer to "did we save it or not?"
A stronger judgment is warranted here: many organizations should not run write-behind for user-visible primary state. If you do not have per-key replay inspection, selective repair, backlog-age visibility, and a credible story for disagreement after failover, the design is usually less mature than the latency graph suggests.
Case 3: Cache updated and durable queue appended
Teams reach for this because they want two things at once: local read-your-write and later durable repair.
That can work. It is also where incidents get more confusing:
cache update succeeds, queue append fails
queue append succeeds, cache update fails
both succeed, flush partially succeeds
flush succeeds, stale refill later overwrites cache
replay lands after newer live traffic and clobbers correct state
This is the version that most often fools dashboards. Cache hit rate stays green. API latency stays green. Consumers look healthy. Then a failover or eviction strips away the masking layer and exposes that the origin has been behind for minutes.
That is the real lesson. Write-behind usually fails by creating multiple competing versions of truth, not by simply becoming slow.
A non-obvious point: local read-your-write is often the feature that persuades teams the design is sound. It can also be the feature that hides the problem the longest. The same warm cache path that makes demos look great can keep the team from seeing that other readers are already on older truth.
Read-through
Read-through standardizes the miss path:
request arrives
cache miss
service loads from origin
service populates cache
response returns
That is useful. It removes a lot of sloppy refill logic.
But it says almost nothing about mutation completion. If the write path updates origin and leaves cache alone, read-through eventually refills from durable state. If the write path updates cache first and origin later, read-through may preserve the illusion until expiry, then reintroduce older durable state.
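A sketch makes that concrete. Plain dicts stand in for the cache and origin; the refill pulls whatever origin currently holds, fresh or old, and a warm hit never consults origin at all:

```python
def read_through(cache, origin, key):
    """Centralized miss path: hit returns the cached value; miss loads from
    origin and populates the cache before returning."""
    value = cache.get(key)
    if value is not None:
        return value                 # hit: origin is never consulted
    value = origin.get(key)          # miss: refill from whatever origin holds now
    if value is not None:
        cache[key] = value           # refills durable state, whether fresh or stale
    return value
```

Nothing in this path knows whether origin is authoritative or lagging. That is the sense in which read-through is a refill policy, not a truth policy.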
Teams often over-credit read-through because it makes the read path look tidy. Tidy refill logic is not the same thing as a sound authority model.
What the caller thinks happened versus what is durably true
This is where the contract usually breaks.
When an API returns 200, the caller typically hears: "the system has my change."
Under write-through, that may be close to true at the primary-store boundary.
Under write-behind, it may mean any of these:
cache was updated
request was accepted in process memory
mutation was durably appended to a queue
a worker will attempt the write later
Those are not minor variations of one guarantee. They are different contracts.
If product language, UI behavior, and runbooks treat them as equivalent, the system is telling a prettier story than the write path can defend.
How It Evolves with Scale
What changes with scale is not just throughput. It is which metric stops telling the truth first.
At smaller scale
Imagine a billing admin tool at 200 writes per second. The primary can commit in 8 to 15 ms. Cache mostly exists to keep reads cheap.
Write-through is usually the better default. Saving 8 or 10 ms is rarely worth importing replay machinery, lag observability, divergence detection, and a second operational truth path.
Write-behind is overkill here unless write latency is already dominating the product experience or the origin path is measurably the bottleneck you cannot avoid another way.
At medium scale
Now take a consumer service at 3,000 writes per second. Each write hits a hot table. Invalidation touches multiple keys. Bursts push p99 above 120 ms.
The first thing bending is often the origin path:
commit latency rises
lock contention rises
log flush pressure rises
replication lag rises
invalidation fanout worsens tail behavior
retries amplify existing load
This is where write-behind starts to sound attractive.
Sometimes it should. But the moment acknowledgment moves ahead of materialization, queue lag stops being a backend detail. It becomes uncommitted truth inventory.
At 3,000 writes per second, 90 seconds of lag means 270,000 acknowledged mutations are still not true in origin. That is not merely worker slowness. It is correctness exposure.
At larger scale
Now take 40,000 writes per second across regions, with regional caches, sharded origin storage, and downstream consumers reading from streams or materializations.
At that point the question is no longer "can we ingest?" It is "can the system still maintain one credible history of truth under replay, failover, and uneven regional state?"
What changes at 10x scale:
- Order matters more than raw speed. A replay path that is fast but misordered is worse than a slower one that preserves semantics.
- Drain throughput becomes a hard recovery budget. If ingress is 40,000 writes per second and safe drain is 44,000, headroom is only 4,000. A short slowdown can create millions of delayed writes and turn catch-up into a live incident.
- Disagreement stops being local. A stale cache entry can now influence authorization, routing, limits, or other downstream decisions.
- Recovery window becomes a product question. A 45-minute replay window is a 45-minute period during which acknowledged writes are still becoming true.
- Observability gets more expensive. Queue depth is not enough. You need oldest unmaterialized write age, commit-to-materialization delay, replay headroom, divergence counts, and some way to estimate outstanding acknowledged-but-not-materialized state.
- Failover becomes a truth event. The cache layer that had been masking lag disappears, and the older state in origin becomes visible immediately.
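The observability point is worth one sketch. Track acknowledgment and materialization per partition and surface the age of the oldest pending mutation, not just depth; the names here are hypothetical:

```python
class BacklogMonitor:
    """Tracks acknowledged-but-unmaterialized mutations per partition.

    Queue depth alone can look healthy while one hot partition is minutes
    behind; oldest pending age is the metric that tells the truth first.
    """
    def __init__(self):
        self.pending = {}   # partition -> list of (ack_time, mutation_id)

    def acknowledged(self, partition, mutation_id, now):
        self.pending.setdefault(partition, []).append((now, mutation_id))

    def materialized(self, partition, mutation_id):
        self.pending[partition] = [
            (t, m) for (t, m) in self.pending[partition] if m != mutation_id
        ]

    def depth(self):
        """Total outstanding mutations. Necessary, not sufficient."""
        return sum(len(v) for v in self.pending.values())

    def oldest_pending_age(self, now):
        """Age of the oldest acknowledged write still not true in origin."""
        ages = [now - t for entries in self.pending.values() for (t, _) in entries]
        return max(ages, default=0.0)
```

In a real system the per-partition lists would live in the queue's own metadata; the point is which number reaches the front page of the dashboard.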
Write-through usually tells you it is hurting because latency rises. Write-behind can keep looking healthy while correctness gets worse.
What stops scaling cleanly first
Each pattern fails at a different boundary.
Write-through
Usually at synchronous write amplification in origin, and sometimes the next pain is foreground invalidation fanout.
Write-behind
Usually at drain lag, replay headroom, or replay semantics, not request latency.
Read-through
Usually at interpretability, when teams stop being able to tell whether cache is reflecting the authority model or hiding its weakness.
Batching changes the economics and the blast radius
At scale, write-behind usually means batching. Workers flush groups of 100 to 500 updates every 20 to 50 milliseconds.
That helps throughput. It also enlarges the failure unit:
one failed batch may represent meaningful user-visible state
retries duplicate groups, not single writes
partial batch success complicates idempotency
replay becomes burstier and harder to inspect
Batching is not a free gain. It trades efficiency for larger repair units.
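One common way to bound the repair unit is per-item idempotency keys inside the batch, so a retried batch re-applies only what did not land. A sketch, with plain dicts standing in for origin and a durable applied-id set:

```python
def flush_batch(batch, origin, applied_ids):
    """Flush a batch with per-item idempotency keys.

    A retried batch skips items that already materialized, so partial
    success does not turn into duplicate application. Returns the ids
    that failed this attempt and still need retry.
    """
    failed = []
    for item in batch:
        if item["id"] in applied_ids:
            continue                      # already landed in a prior attempt
        try:
            origin[item["key"]] = item["value"]
            applied_ids.add(item["id"])   # durable in a real system
        except Exception:
            failed.append(item["id"])     # retry only the remainder
    return failed
```

This does not make batching free. It makes the failure unit inspectable, which is what the runbook actually needs.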
One thing teams consistently underestimate is recovery pain after the store is already "back." The database graph looks normal again. The incident is still running, because replay is now competing with live traffic and every cautious decision slows restoration of truth.
Correctness-repair cost can dominate the steady-state win
This is the part teams discover late.
A write-through path may cost 15 ms. A write-behind path may cost 5 ms. That looks like a clear win until failure forces you to pay for:
replay ordering guarantees
version or compare-and-set protection
divergence detection
targeted repair tooling
support workflows for "saved but missing"
incident modes where the system is alive but not trustworthy
At enough scale, the cost of proving what really happened after failure can dominate the request-path savings that justified the design.
It helps to separate the things teams actually collapse into one decision.
Write-through
Acknowledged: origin commit
Durable: origin
Observed next: whatever readers see after cache invalidation, which may still lag elsewhere
Typical confusion: teams assume honest primary commit means globally fresh reads
Write-behind
Acknowledged: cache update, durable queue append, or sometimes only local acceptance
Durable: maybe the queue, maybe nothing credible yet
Observed next: often cache first, origin later
Typical confusion: teams collapse durable intent, visible state, and materialized state into one idea of "written"
Read-through
Acknowledged: not relevant as a write concept
Durable: whatever origin already was
Observed next: cache refill from origin on miss
Typical confusion: teams mistake centralized refill behavior for a source-of-truth policy
The common conflation is not "three cache patterns." It is three different layers of semantics:
what was acknowledged
what is durable
what future reads will observe
A design doc that says "read-through plus write-behind for performance" but does not define those three things is not unfinished around the edges. It is unfinished at the center.
Write-through trade-offs
You pay synchronous latency and direct origin pressure. If invalidation fanout is large, that foreground cost can become painful fast.
In return, the distance between "success" and "durable truth" stays short. That is usually worth a great deal for primary state.
Write-through also fails in a more bounded way. If cache update fails after commit, the repair story is usually invalidate, repopulate, or targeted repair. That is still work, but it is not replay archaeology.
Write-behind trade-offs
You get lower request latency, batching opportunities, and smoother origin load.
You also inherit:
durable queue semantics
replay ordering and idempotency
backlog monitoring
divergence detection
cache correction rules
tooling to inspect, replay, suppress, or repair specific mutations
explicit measurement of how late truth is
That is not a little more complexity. It is a second correctness system.
And the trade is not just operational weight. It is semantic weight. The moment you acknowledge before materialization, the system needs a stronger story about what success actually means.
A stronger engineering judgment here: many teams are too eager to pay this price. Unless the origin write path is clearly the bottleneck and the state is clearly repairable under the invariants that matter, write-behind is usually a more expensive choice than it first appears.
Read-through trade-offs
You get a cleaner miss path and less repeated refill code.
You also risk making authority look tidier than it is. A read-through refill can look like a harmless cache event when it is actually the moment older durable state re-enters cache and exposes the contract gap.
These incidents are hard because the first visible symptom is often not the real failure.
Failure mode 1: Success returned, cache updated, origin never got the write
Trigger
Cache is updated and success is returned before any durable commit or durable queue append exists. Then the process dies.
Visible symptom
The user sees the new value immediately. Later it reverts after expiry, eviction, or failover.
Hidden impact
The system acknowledged state it never durably had. Side effects may already have been triggered from that speculative view.
How it spreads
The hot path keeps looking correct because it keeps reading the same warm key. Another node, region, or service reads origin and sees older truth.
Why it is hard to debug
The first metrics lie. Hit rate is green. Success rate is green. Latency is green. Origin looks healthy because it never got the write. Teams often blame invalidation first.
One bad on-call move is clearing cache to fix staleness. Sometimes that is exactly what exposes how many acknowledged writes were living only in cache.
Mitigation
Do not acknowledge before durable commit or durable queue append. If speculative visibility exists, bound it explicitly and do not let other systems treat it as committed.
Operational cost
Higher write latency, or a queue that must now be operated as part of the truth boundary.
Failure mode 2: Durable queue exists, but backlog becomes correctness exposure
Trigger
Drain throughput falls below ingress because origin slows, a hot shard saturates, replay is throttled, or a schema change makes writes heavier.
Visible symptom
Very little changes at first. API latency stays low. Cache hit rate stays high. Consumers remain healthy.
Hidden impact
Acknowledged but not yet materialized writes accumulate. The system is promising more truth than it can currently produce.
How it spreads
Origin readers are stale first. Then cache evictions reveal the gap. Then failover or cross-service reads turn it into visible disagreement.
Why it is hard to debug
Teams often watch queue depth, not oldest unmaterialized write age. Consumer health can stay green while one hot partition is minutes behind.
A very familiar pattern is this: consumer CPU looked fine, autoscaling looked fine, cache looked fine. The only metric telling the truth was oldest pending mutation age, and it was not on the front page.
Mitigation
Track ingress, safe drain, oldest outstanding mutation age, replay headroom, and divergence rate. Decide in advance when to shed optional writes or downgrade weaker semantics.
Operational cost
Reserved capacity, more instrumentation, and harder operating policy.
Failure mode 3: Cache and origin disagree after failover
Write-behind is active, cache is warm, origin is behind, and a cache node or region is lost.
What shows up first is not a neat consistency bug. Same key, different answers, depending on which path was hit. Downstream systems may already be enforcing old state while front-end paths still reflect cached state.
Teams often start with replication lag or cold-cache diagnosis. Those may be present, but they are secondary. The real issue is that visible state outran durable state by contract.
The practical repair is expensive. Decide which source wins during the repair window. Force affected reads to origin if necessary. Invalidate speculative cache state. Pause dependent workflows where needed. Failover gets more expensive, origin read load spikes, and now you need targeted invalidation and per-key diagnostics just to get back to one believable answer.
Failure mode 4: Replay succeeds mechanically while semantics are wrong
Trigger
Recovery replays older or duplicate mutations. Delivery works. Ordering does not.
Visible symptom
Queue lag falls. Replay throughput looks good. Database writes succeed.
Hidden impact
Older state overwrites newer truth, or duplicate side effects fire again.
How it spreads
Wrong state lands in origin, gets cached, and becomes the basis for later decisions.
Why it is hard to debug
The green signals are the misleading ones. Replay is fast. Consumers are draining. Teams blame cache because bad values appear there first. Later they discover the cache is correctly reflecting an origin damaged by replay order.
Mitigation
Use per-entity versions, sequence IDs, compare-and-set semantics, and idempotency keys where the invariant requires them. Some invariants are append-friendly. Some are last-write-wins. Some are monotonic. Some are not safely replayable without a versioned state machine. "Replay-safe" is never one generic property.
Operational cost
More write metadata, more complicated apply logic, slower replay, and harder debugging.
Warning: a drained queue does not mean recovered state. It may only mean the wrong history was applied efficiently.
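The version-guard idea in the mitigation can be sketched as a compare-and-set apply for last-write-wins state. Names are hypothetical, and other invariant classes need different guards:

```python
def apply_if_newer(store, key, value, version):
    """Apply a mutation only if its per-entity version beats the stored one.

    Replayed, duplicated, or reordered mutations carrying stale versions
    are dropped instead of clobbering newer truth. Correct for
    last-write-wins state; monotonic or workflow state needs a stricter
    state machine than a single version comparison.
    """
    current = store.get(key)
    if current is not None and current["version"] >= version:
        return False                      # stale or duplicate replay: drop
    store[key] = {"value": value, "version": version}
    return True
```

The `>=` rather than `>` also makes exact duplicates no-ops, which is the idempotency half of the same mitigation.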
Failure mode 5: Read-through turns source-of-truth confusion into a debugging trap
Read-through sits on top of a weak write path. Cache entries expire while origin is still behind.
A miss refills from origin and the old value reappears. To the team, it looks routine. To the user, saved state has reverted.
The debugging trap is that the logs look ordinary: miss, read, fill. The refill is not the bug. The bug is that origin was still old when the product had already behaved as though the change was complete.
The mitigation is to make authority explicit. If origin is canonical, do not present cache-only writes as complete. If speculative state is allowed, isolate it or tag it. That costs elegance. It buys honesty.
Failure mode 6: Reconciliation is the first truthful signal, and it arrives late
Reconciliation compares queue, cache, and origin after lag, replay, or failure. By then the incident is already old.
It is useful as confirmation and repair input, not as the primary detector. If reconciliation is the first place you learn what was actually committed, the system is already too loose at the boundary that mattered.
What finally re-establishes trustworthy state
Trustworthy state usually comes back in a sequence:
stop issuing ambiguous acknowledgments on the affected path
declare which store is authoritative during repair
replay or drain with ordering protections on
invalidate or rebuild speculative or stale cache entries
reconcile against the chosen authority
only then restore dependent workflows fully
That is expensive operational work. The fast write path was borrowing against exactly this moment.
When To Use / When NOT To Use
Use write-through when
Use it for primary state where success should mean durable acceptance:
account balances
billing state
entitlements
order transitions
compliance-sensitive settings
any record users or support will later treat as binding
Do not use write-through when
Do not force it everywhere out of taste. Avoid it where write amplification is extreme, origin is clearly the bottleneck, and delayed materialization is acceptable.
Examples:
bulk telemetry rollups
some recommendation outputs
some feed fanout state
denormalized aggregates that can be rebuilt
That said, the real divider is not data label. It is invariant. Ask whether the state is rebuildable, externally visible, monotonic, mergeable, or contractually binding. Those questions are better design guides than "metadata" versus "business data."
Use write-behind when
Use it when the contract can honestly be weaker and the team is willing to own the recovery machinery.
Good candidates:
rebuildable projections
non-critical derived state
batch-friendly materializations
workflows where durable queue append is the real commitment boundary
Do not use write-behind when
Do not use it for data whose meaning depends on immediate durable truth.
Also do not use it when the team is not ready to own:
durable handoff guarantees
per-entity ordering or version protection
replay-safe application
lag observability
divergence detection
cache correction after failed flush
recovery-time planning, not just steady-state throughput planning
Be more skeptical than the industry usually is. Many systems adopt write-behind because the latency graph looks better, not because the correctness contract has been made explicit. That is usually a bad reason.
Use read-through when
Use it to centralize refill logic and reduce duplicated miss handling.
Do not use read-through as
Do not use it as a substitute for a write contract. It does not answer who owns truth after mutation, and it can make a weak answer look more organized than it really is.
Senior engineers usually start by writing the guarantee in plain language.
They ask:
What exactly does success mean?
Which system is authoritative in the first second after a write?
If cache and origin disagree, which one wins?
What is the maximum truth lag we will tolerate?
Can we observe that lag directly?
If the writer dies after success, what durable evidence remains?
Can replay duplicate, reorder, or overwrite newer truth?
How much recovery headroom exists above live traffic?
If backlog reaches one million writes, how long until truth is restored?
What will support say when a user says "I saved this"?
Is reconciliation a safety net, or core correctness machinery?
They also separate data classes. Mature systems rarely use one strategy everywhere. They use write-through where state is primary, write-behind where state is derived and repairable, and read-through where refill centralization is actually useful.
In some products, the right move is not stronger storage semantics. It is a more honest state model. "Accepted" and "materialized" may need to be separate product states rather than one overloaded status pretending to mean both.
One more habit matters. They resist calling cache a source of truth unless the organization truly operates it like one. Most teams do not back it up, audit it, repair it, or investigate it like authoritative storage. So when a design quietly makes cache authoritative for a period of time, it is borrowing truth from a system the organization does not actually govern as truth.
That is the durability illusion.
Cache write strategy is not a small performance knob. It is a contract about completion, authority, and recovery.
Write-through spends more on the foreground path and usually buys a narrower, more honest boundary.
Write-behind can be the right design, but only when success means durable handoff, not cache visibility, and only when replay, ordering, divergence detection, and repair are treated as first-class parts of correctness.
The final mistake to avoid is calling a system "fast writes" when what it really has is delayed truth plus fragile recovery semantics. That design can still be right. But it is not simpler, not cheaper to operate, and not safer just because the request returned quickly.