Core insight: Dual write and shadow read are not safety rituals. They are control patterns for running a migration while authority is temporarily split.
These patterns exist because live systems rarely get a clean stop-the-world migration window.
A team wants to move from one database to another, replace a search backend, split a service, or change the storage engine under account state. Traffic keeps coming. Writes keep landing. Product work does not freeze. Support still needs the system available. Finance still needs balances right. Search still needs relevance good enough that nobody notices the move.
So the migration happens in motion.
That creates a different problem from ordinary data movement. Old and new systems both contain state. Some came from backfill. Some came from live writes. Some was transformed. Some was rejected on one side and accepted on the other. Read behavior may not even be comparable enough to validate with simple equality. At that point, this is not mainly a transfer problem. It is a correctness and recovery problem with temporary split authority.
What engineers usually get wrong is obsessing over the cutover switch. The expensive failures often start earlier, when backfill races live writes, comparison logic normalizes away the only mismatch class that matters, or the team declares rollback available after the old path has already stopped being a safe source of truth.
The cleanest way to think about dual write is not "two writes for safety." Think of it as deliberately creating two ways for truth to drift.
That sounds severe, but it is closer to how these migrations behave. Dual write preserves continuity while the new system warms up. It also creates more mutation outcomes, more retry paths, more ordering questions, and more repair work.
Shadow read has the opposite trap. It feels safe because users do not see it. But non-user-facing does not mean high-signal. A shadow read only helps if the comparison is semantically strong enough to matter. Comparing JSON blobs from two search systems is weak evidence. Comparing eligibility, filter behavior, rank movement on important queries, and latency distributions is stronger. Comparing balance snapshots without proving version order or invariant preservation is often worse than useless because it manufactures confidence.
This is where teams fool themselves. The migration looks calm because request success is high and the graphs are green. Meanwhile the repair queue is quietly becoming the real write path, or the comparator is dropping the only differences that would have stopped the rollout. You can spend days staring at healthy dashboards while the truth is already wrong.
That intuition is worth trusting. The danger is not mainly loud failure. It is quiet wrongness that gets harder to repair the longer it stays quiet.
Diagram placeholder
Migration Control Surfaces for Dual Write and Shadow Read
Show the migration as a control system, not just a duplicated request path, and make authority transition, evidence collection, repair, and rollback visible.
An old store currently owns writes and serves reads. A new store is being introduced for scale, cost, region strategy, query shape, or operational reasons. A backfill pipeline copies historical data from old to new. The application begins dual writing live mutations to both stores. A reconciliation path records missing writes, version conflicts, transform mismatches, and ordering anomalies. Reads stay on the old store at first, then move gradually, often with shadow reads still sampling against the old path.
What matters first is backfill discipline. The target needs a credible ordering primitive (a version number, a change sequence, or some other monotonic boundary) so that the historical copy cannot overwrite fresher live state. “Reject older state” sounds simple on a whiteboard. It is not free once records are transformed, merged, or re-keyed on the way over.
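A minimal sketch of that guard, assuming a per-record integer version. In a real target this would be a conditional write (compare-and-set) executed inside the store, not application-side logic; the names and the in-memory dict are illustrative only.

```python
def apply_backfill_record(store: dict, key: str, payload: dict, version: int) -> bool:
    """Apply a backfilled record only if nothing fresher has already landed
    via the live dual-write path. Returns True if the record was applied."""
    current = store.get(key)
    if current is not None and current["version"] >= version:
        return False  # a fresher (or equal) version already exists: reject
    store[key] = {"version": version, **payload}
    return True

target: dict = {}
# A live dual write lands version 9 first...
apply_backfill_record(target, "acct:42", {"email": "new@example.com"}, 9)
# ...then a backfill retry arrives carrying stale version 7 and is rejected.
applied = apply_backfill_record(target, "acct:42", {"email": "old@example.com"}, 7)
# applied is False; the target still holds version 9
```

The hard part in practice is keeping the version meaningful after transformation or re-keying, which is exactly where the whiteboard version of this rule falls apart.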
What matters next is partial success on the live write path. Dual write only helps if the system has an explicit answer for one-side success, retry, deduplication, and replay order. If those semantics are weak, the migration has not created safety. It has created a second write surface and outsourced the consequences to reconciliation.
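One shape that explicit answer can take: capture the one-side miss with an idempotency key and replay it exactly once. This is a toy sketch under the assumption that the old store stays authoritative; `FlakyStore`, the field names, and the in-memory queue are all hypothetical stand-ins.

```python
class FlakyStore:
    """Toy stand-in for the new store; raises while `healthy` is False."""
    def __init__(self):
        self.data, self.healthy = {}, True
    def put(self, key, value):
        if not self.healthy:
            raise TimeoutError("new store write timed out")
        self.data[key] = value

def dual_write(old_store: dict, new_store, repair_queue: list,
               key: str, value: dict, idem_key: str) -> None:
    """Write both sides; the old store defines success. A failed new-store
    write is captured for replay instead of being silently dropped."""
    old_store[key] = value                      # authoritative commit
    try:
        new_store.put(key, value)               # best-effort during migration
    except Exception:
        repair_queue.append({"idem_key": idem_key, "key": key, "value": value})

def replay(repair_queue: list, new_store, applied: set) -> int:
    """Replay captured misses exactly once, keyed by idempotency key."""
    count = 0
    for w in repair_queue:
        if w["idem_key"] in applied:
            continue                            # duplicate retry: skip
        new_store.put(w["key"], w["value"])
        applied.add(w["idem_key"])
        count += 1
    return count

old, new, queue, seen = {}, FlakyStore(), [], set()
new.healthy = False
dual_write(old, new, queue, "acct:7", {"plan": "pro"}, idem_key="w1")
new.healthy = True
replayed = replay(queue, new, seen)   # the one missed write is replayed once
```

What the sketch deliberately leaves out is ordering across keys and side-effect safety, which is where real replay semantics get expensive.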
Then there is the read side. Exact equality is rare. Semantic equivalence is harder and usually more important. Search makes this obvious. The old cluster still serves users, the new one is indexed in parallel, and queries are shadowed out of band. That can be an effective validation pattern, but it is also easy to lie to yourself with a weak comparator.
Ledger and payment-state migrations deserve a separate instinct. Full dual write is often the wrong first move there. When balances, credits, or money movement are involved, I would rather narrow authority, strengthen invariants, or use forward-only event capture than create two writable truth surfaces and tell myself reconciliation will sort it out later.
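What "strengthen invariants" can mean concretely: rather than comparing snapshots, prove that the migrated entry stream reproduces the stored balance. A minimal sketch with illustrative field names and a single invariant; real ledgers check many.

```python
def balance_invariant_holds(entries: list, balance_cents: int) -> bool:
    """Replaying the entry stream must reproduce the stored balance exactly.
    A snapshot that merely looks similar proves much less than this does."""
    return sum(e["amount_cents"] for e in entries) == balance_cents

entries = [{"amount_cents": 1_000}, {"amount_cents": -250}]
ok = balance_invariant_holds(entries, 750)    # invariant preserved
bad = balance_invariant_holds(entries, 800)   # a mismatch worth stopping for
```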
Diagram placeholder
Live Write and Read Flow During Migration
Show how a single live request behaves while the system is in migration mode, with old and new systems both active, but not equally authoritative.
Placement note: Place near Request Path Walkthrough.
Phase one is backfill while the old store remains authoritative. This is where teams relax too early. They think no user traffic hits the new store yet, so risk is low. That is false. Correctness debt is already being created. If the backfill copies row version 7 after a live dual write has already produced version 9 in the new store, you have manufactured a stale overwrite unless the target rejects older state or preserves a monotonic version rule.
Warning: backfill is not preload. It is live traffic against future truth.
A small migration can look fine right up until this overlap begins. Imagine 5 million rows, 30 live writes per second, and backfill at 2,000 rows per second. In isolation, that copy finishes in under an hour. In reality, the hottest 2 percent of rows may be touched repeatedly during that same hour. If the backfill retries by key range instead of by version-aware record repair, you have built a stale-write machine.
Phase two is live dual write, with the old store still authoritative. A request updates account preferences. The application writes to old and new. The old store commits. The new store times out. Now what? If the request returns success because the old system still defines truth, the migration has incurred repair debt. That can be valid, but only if the system captures, replays, orders, and audits that missed write safely. Otherwise it is not dual write. It is a hope-based buffer against embarrassment.
The first bottleneck usually appears here, and it is rarely the raw write fanout. It is the unresolved-write queue behind it. At 500 writes per second, a 0.2 percent partial-failure rate produces about one unresolved write per second. That is noisy but manageable if replay is clean. At 20,000 writes per second, the same rate produces 40 unresolved writes per second, or 144,000 per hour. That is no longer a correctness footnote. It is a capacity system with queue depth, replay bandwidth, operator triage, and audit retention attached to it.
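The arithmetic in that paragraph is worth having as a reusable check, because the interesting number is not the failure rate but the net backlog growth once replay capacity is subtracted:

```python
def unresolved_per_hour(writes_per_sec: float, partial_failure_rate: float,
                        replay_per_sec: float = 0.0) -> float:
    """Net unresolved-write backlog growth per hour, at constant rates.
    Anything above zero means the repair queue is part of the write path."""
    return max(0.0, writes_per_sec * partial_failure_rate - replay_per_sec) * 3600

small = unresolved_per_hour(500, 0.002)      # 3,600/hour: noisy but triageable
large = unresolved_per_hour(20_000, 0.002)   # 144,000/hour: a capacity system
```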
This is also where many migrations stop feeling neat. Both systems are live, both contain plausible state, and nobody is fully comfortable calling either one clean. The first pain is usually not an outage. It is triage. Which misses matter, which can be replayed safely, and whether the backlog is still evidence or already part of the write path.
Phase three introduces shadow reads. For a search migration, the user still gets results from the old path. The same query is sent to the new path asynchronously. Comparison now becomes the real product. If the new system returns slightly different scoring but preserves eligibility, filtering, and intent satisfaction, that may be acceptable. If the comparator only checks top 10 document IDs, it may miss a broken security filter. That kind of comparator does not reduce risk. It hides the only risk that mattered.
The first bottleneck on the shadow side is also rarely query execution. It is comparison usefulness under volume. At 2,000 read RPS, shadowing 5 percent adds 100 comparison candidates per second. A team can inspect outliers, refine the comparator, and learn. At 20,000 read RPS, the same 5 percent becomes 1,000 candidates per second. If 3 percent mismatch because ranking shifted, stale indexes differ, or schemas are not fully aligned, that produces 30 mismatches per second, or 108,000 per hour. Unless those mismatches are classified well, the signal collapses into noise.
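Classification is what keeps that stream readable. A sketch of a comparator that buckets diffs by what they mean rather than whether bytes match; the field names and the class priorities are illustrative assumptions, not a real search API.

```python
from collections import Counter

def classify_diff(old: dict, new: dict) -> str:
    """Bucket a shadow-read diff so thousands of raw mismatches per hour
    collapse into a handful of classes with different severities."""
    if old["eligible_ids"] != new["eligible_ids"]:
        return "eligibility"   # wrong documents visible: stop the line
    if old["filters_applied"] != new["filters_applied"]:
        return "filter"        # security-relevant filtering diverged
    if old["top_ids"] != new["top_ids"]:
        return "rank_shift"    # often tolerable; track the distribution
    return "benign"            # formatting or pagination noise

sample = [
    ({"eligible_ids": {1, 2}, "filters_applied": ["geo"], "top_ids": [1, 2]},
     {"eligible_ids": {1, 2}, "filters_applied": ["geo"], "top_ids": [2, 1]}),
    ({"eligible_ids": {1, 2}, "filters_applied": ["geo"], "top_ids": [1, 2]},
     {"eligible_ids": {1, 2, 3}, "filters_applied": ["geo"], "top_ids": [1, 2]}),
]
counts = Counter(classify_diff(old, new) for old, new in sample)
# one rank_shift (probably fine), one eligibility mismatch (probably not)
```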
Phase four is partial cutover. A subset of reads or tenants now use the new system. This is where clean migration diagrams usually stop being honest. The old system is no longer just the present. It is the reference path, the debugging crutch, and the supposed rollback target. If it is not being kept fresh enough to reassume authority, rollback is already decaying.
This phase is often more dangerous than steady-state dual running because correctness defects have now crossed the line from observed differences to live business behavior. Once real traffic hits the new path, mismatches stop being observational. They become user-visible defects, downstream side effects, or audit problems. If your reconcile backlog represents four hours of unresolved ambiguity, your rollback window is not four hours. It is smaller, because some of that traffic has already escaped into systems that will not roll back cleanly with the primary store.
A cutover can succeed technically and still leave you less correct than you were an hour earlier. That is one of the nastier parts of these migrations. Success at the routing layer does not mean the system has converged on safe truth.
Phase five is full cutover and cleanup. Teams often talk about this as the easy ending. It is frequently where the real truth shows up. Cleanup reveals which “temporary” reconciliation jobs were actually carrying correctness, which mismatches had been dismissed as noise, and whether the old path can truly be retired. The most revealing cleanup question is often whether anyone is willing to turn off the reconcile jobs. If not, the migration is not actually finished.
At small scale, dual write and shadow read can be almost boring.
A service doing 200 RPS, with 30 writes per second and 5 million rows, can often run a bounded backfill, attach version guards, log mismatches, and manually inspect outliers. Shadow reads on 5 percent of traffic mean 10 extra queries per second. A reconcile queue receiving one or two unresolved writes per minute is annoying, not existential. Safe migration capacity stays close enough to serving capacity that rough planning still works.
At 10x scale, the architecture stops being about write fanout and starts being about control of evidence.
Now the service is doing 2,000 RPS, 400 writes per second, and a backfill of 800 million records. Suppose the backfill runs at 10,000 rows per second. That is roughly 22 hours of copy time before throttling, retries, index maintenance, or compaction pressure. During those 22 hours, live writes are still hitting hot keys, the new store is paying write amplification from both backfill and dual write, and the reconcile path is accumulating exceptions. The migration is no longer limited by raw ingest speed. It is limited by whether freshness, comparator quality, and repair stay bounded while ingest continues.
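The overlap window in those numbers can be worked directly. The point of the calculation is the second output: how many live writes the guard machinery must coexist with before the copy even finishes.

```python
def overlap_window(rows: int, backfill_rows_per_sec: float,
                   live_writes_per_sec: float) -> tuple:
    """Backfill duration in hours, and the live writes that land inside it."""
    seconds = rows / backfill_rows_per_sec
    return seconds / 3600, int(live_writes_per_sec * seconds)

hours, live_writes = overlap_window(800_000_000, 10_000, 400)
# ~22.2 hours of copy time, with ~32 million live writes landing during it
```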
At larger scale again, say 15,000 read RPS, 3,000 writes per second, multi-region traffic, and cutover spread across tenant cohorts, two things usually get worse first. One is freshness control. Backfill lag, stream lag, and reconcile lag combine into an uncertainty window that gets harder to reason about than most teams admit. “Mostly caught up” is a dangerous phrase. The other is comparison fanout. A shadow read on 10 percent of 15,000 RPS is 1,500 extra queries per second, which can distort cache behavior, saturate the new cluster during warmup, or become expensive enough that the sample rate gets cut right when confidence matters most.
What stops scaling cleanly first is usually not serving throughput. It is divergence detection and repair, followed closely by the ability to backfill without damaging live traffic. That is the point where the migration stops feeling like a design exercise and starts feeling like queue management plus operator stamina.
This is one reason full dual write plus full shadow read is overkill unless the risk is both high-impact and hard to bound in a narrower way. Plenty of smaller migrations are better served by scoped dual write, staged tenant cutover, stronger invariants, and forward-only repair.
Capacity planning for these migrations should start with side effects, not steady-state averages.
A dual write doubles the destinations, but it does not simply double cost. It changes tail behavior. If the old store is 8 ms p95 and the new one is 15 ms p95, the application path inherits a more complicated latency shape. Even if the old store remains authoritative and the new write is asynchronous, you have created pressure on queues, retries, and replay backlogs.
Suppose a service writes 1,200 mutations per second, with an average serialized payload of 2 KB. Steady-state live mutation volume is about 2.4 MB per second. On paper, that looks harmless. Add migration load and the picture changes fast: dual write doubles the mutation path, backfill pushes 15,000 rows per second at 1.5 KB each, shadow comparison emits 200 result diffs per second, and audit trails persist request context for every partial failure. The exact numbers vary by system. The useful point is simpler. Safe migration capacity is governed by overlap load, evidence load, and repair load, not by steady-state serving averages.
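The same breakdown as a rough calculation, using the paragraph's numbers plus an assumed 10 KB diff record (the figure used later for mismatch storage). Decimal units throughout; the precise values matter less than seeing that the overlap terms dwarf the steady-state one.

```python
def migration_load_mb_per_sec(live_wps, live_kb, backfill_rps, backfill_kb,
                              diffs_per_sec, diff_kb):
    """Write-path bandwidth terms during the overlap, in MB/s."""
    return {
        "steady_state": live_wps * live_kb / 1000,
        "dual_write": 2 * live_wps * live_kb / 1000,  # both destinations
        "backfill": backfill_rps * backfill_kb / 1000,
        "evidence": diffs_per_sec * diff_kb / 1000,   # mismatch records
    }

load = migration_load_mb_per_sec(1_200, 2.0, 15_000, 1.5, 200, 10.0)
# steady state is 2.4 MB/s; the overlap terms total roughly 29 MB/s
```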
Live dual writes and historical backfill are different scaling problems. Live dual writes are latency-sensitive and correctness-sensitive. Backfill is throughput-sensitive and freshness-sensitive. Treating them as one aggregate ingest number is a planning mistake. A target system may handle 30,000 writes per second in synthetic testing and still fail this migration because 10,000 of those writes are backfill updates against cold ranges, 3,000 are live writes against hot ranges, and compaction or secondary-index maintenance makes the mixed workload far uglier than the benchmark.
Backfill duration matters because it stretches the overlap window. If 800 million rows take 22 hours to copy at 10,000 rows per second, then 10x data volume turns into about nine days before retries or throttling. That does not just extend the schedule. It extends the period where schema drift, code drift, replay backlog, and rollback assumptions can decay.
Shadow-read capacity has the same trap. Comparison throughput is not comparison usefulness. A team might proudly handle 3,000 shadow queries per second and store every diff. That proves the pipeline scales mechanically. It does not prove the evidence is good. If most diffs are benign ranking variation, pagination noise, or formatting mismatch, the pipeline is mostly manufacturing audit exhaust. Mismatch rate is not portable across domains either. A class of semantic difference that is tolerable in search can be a stop-the-line problem in a ledger.
That evidence pipeline is often sized too late. If each mismatch record stores request metadata, hashes, comparator output, snippets of old and new results, and replay context, 10 KB per record is not unusual. At 30 mismatches per second, that is about 26 GB per day. At 300 mismatches per second during a bad cutover hour, the audit trail alone can exceed 250 GB per day. Teams provision the target store carefully, then hand-wave the mismatch store and discover the real bottleneck is not serving capacity but the ability to retain and query migration evidence during the exact incident where they most need it.
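Those retention figures follow from a one-line calculation that is worth running before provisioning the mismatch store rather than after:

```python
def evidence_gb_per_day(mismatches_per_sec: float, record_kb: float) -> float:
    """Daily mismatch-store growth: records/sec * KB/record * 86,400 sec."""
    return mismatches_per_sec * record_kb * 86_400 / 1_000_000

normal = evidence_gb_per_day(30, 10)     # ~25.9 GB/day in steady running
bad_day = evidence_gb_per_day(300, 10)   # ~259 GB/day at a bad cutover rate
```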
Safe migration capacity usually fails first in evidence interpretation and repair, not at the primary datastore.
The first trade-off is continuity versus ambiguity.
Dual write buys continuity. The system keeps serving while the new path accumulates state. That benefit is real. The cost is more mutation outcomes, more retries, more repair work, and more ways for truth to drift. If the data is financially sensitive, externally committed, or order-dependent, I would rather give up some continuity than accept a dual-write design with weak replay semantics. A graceful migration is not worth much if the cleanup involves uncertain balances, duplicate commitments, or unclear final state.
The second trade-off is comparison breadth versus comparison meaning.
A wide shadow-read sample feels safer. Often it is not. I would take narrower traffic coverage with strong semantic comparison over broad coverage with weak equality checks. A 2 percent sample that proves invariants, filter correctness, and ranking sanity is more valuable than a 20 percent sample that mostly counts syntactic mismatches. Broad comparison with a weak comparator is not harmless. It trains the team to ignore the evidence stream.
The third trade-off is rollback flexibility versus prolonged coexistence.
Keeping the old path alive and fresh makes rollback more credible. It also extends the period where two systems must be reasoned about. Temporary schemas linger. Debugging requires looking at both worlds. Operators start memorizing which mismatches matter and which ones are “normal,” which is usually a bad sign. The longer coexistence lasts, the more likely rollback becomes a comfort story rather than a safe recovery path.
The fourth trade-off is migration completeness versus bounded scope.
Not every migration deserves the full pattern stack. A storage-engine replacement behind a clear service boundary may justify dual write with strong invariants and no shadow read. A search migration may justify shadow read with no dual write at all. A service decomposition may be better served by staged ownership transfer and event replay instead of universal dual mutation. Senior engineers are suspicious of playbook purity because more machinery also means more failure surface.
A final trade-off is elapsed time versus control. Teams often push hard to shorten the migration, then consume the spare capacity and operator attention rollback would have needed. Fast migration is not always safer migration.
Diagram placeholder
Silent Divergence, Cutover Success, and Recovery Amplification
Show one realistic failure chain where the migration remains traffic-healthy while truth drifts, then cutover or rollback amplifies the damage.
Placement note: Place near Common Failure Modes or Blast Radius and Failure Propagation.
The most important failure mode is silent divergence.
An old store accepts a write that the new store rejects because the new schema is stricter. The application returns success because the old path still defines truth. The reconcile queue records a mismatch, but it is categorized as transient. During cutover, the new system looks healthy for most traffic. Weeks later, a subset of accounts is missing fields that were silently dropped during migration.
That is the expensive version of this problem. The migration stays traffic-healthy while truth is already wrong. By the time support notices, the question is no longer “why did this write fail?” It is “which downstream systems already trusted the wrong state, and how much of that trust is now audit-relevant?”
Another common failure is backfill racing live writes. Historical row copied at time T, live update arrives at T plus one second, retry of the backfill chunk lands after the live update, fresher state overwritten on the target because version ordering was not enforced. This kind of bug looks impossible until you remember the migration is made of independent subsystems with independent retry behavior.
Shadow-read failure can be subtler. A search migration compares top-k overlap and latency. The numbers look good. After cutover, support finds filtered queries exposing documents that should have been suppressed. The comparator validated ranking similarity and missed authorization semantics. That is not just a search bug. It is a validation bug that made the migration look safer than it was.
There is also volume-induced blindness. A team can handle 200 mismatches per hour when each one is understandable. The same team becomes ineffective at 20,000 mismatches per hour, even if the percentage looks small. Once mismatch volume exceeds classification capacity, the pipeline stops being protective and starts being decorative.
Warning: rollback is fake unless the old path can still carry authoritative truth, not just traffic.
The old path is often declared ready as a fallback because the code still exists and read latency still looks fine. But during partial cutover, only the new system may have received a class of derived writes, or the old system’s indexes, caches, or secondary tables may no longer be maintained at production freshness. When rollback happens, availability returns and correctness drops. Technically, the rollback worked. Operationally, it restored the wrong system and created a second, less visible incident.
Recovery can be more dangerous than the original failure onset because it spreads confidence faster than it restores truth. Once people believe the system is back, fewer eyes stay on the places where the data is still drifting.
Cleanup-phase surprises deserve their own category. Teams discover that “temporary” repair jobs were still fixing live discrepancies, or that deleting the old compare path removes the only visibility into a class of write-order issues. The migration is finished on paper while truth is still being negotiated in cron jobs.
Migration failures propagate differently from ordinary outages.
A normal outage is time-bounded and obvious. A migration failure often spreads by contamination. One bad write-order decision creates incorrect state in the new store. Downstream indexes, caches, analytics views, and support tools consume that state. By the time divergence is detected, it is no longer just between old and new databases. It has spread into systems that already trust the new one.
Service decomposition makes this worse. Suppose account state moves from a monolith store into a dedicated account service. During migration, the monolith still serves some reads, the new service handles others, and an event bridge keeps them aligned. A missed mutation does not stay local. It affects authorization checks, billing views, support dashboards, maybe even email policy decisions. The blast radius expands because more systems start trusting the new authority before it has earned that trust.
Scale makes this nastier in a less obvious way. Large migrations rarely fail as one dramatic event. They fail as a widening interval of ambiguous truth. Backlog grows, compare lag grows, replay delay grows, and more downstream systems ingest state before repair catches up. Repair scope expands faster than the original divergence. What started as one missed write can become a multi-system data-fix campaign.
That is why narrow cutover beats broad heroics. I would rather migrate tenant cohorts or feature slices than switch a mixed workload globally, assuming the architecture actually isolates those cohorts. If the failure mode is global ordering, shared indexing, or shared side effects, tenant slicing buys less than teams hope.
The operational burden is usually underestimated by people who think detection is the hard part.
Detection is table stakes. Count mismatches. Sample shadow-read deltas. Alert on lag.
The expensive part is figuring out what the evidence means while the system is still live, and doing it fast enough that the migration does not outrun the people watching it.
You need a mismatch taxonomy. Missing write, stale version, transform difference, schema rejection, read-semantic mismatch, auth mismatch, index lag, compare bug. Without classification, all you have is a pile of red counters.
You need repair semantics. Which side wins? Can the write be replayed? Must replay preserve original order? Are side effects safe to repeat? Is the mismatch real state error or just representational drift?
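Taxonomy and repair semantics can be forced into code, which is one way to make the "explicit answer" requirement real. The classes and the policy below are placeholders; the point is that every class must map to a decision, including the uncomfortable ones.

```python
from enum import Enum

class Mismatch(Enum):
    MISSING_WRITE = "missing_write"
    STALE_VERSION = "stale_version"
    TRANSFORM_DIFF = "transform_diff"
    SCHEMA_REJECTION = "schema_rejection"
    REPRESENTATIONAL = "representational"

def repair_action(kind: Mismatch, side_effects_idempotent: bool) -> str:
    """Map a classified mismatch to an explicit repair decision."""
    if kind is Mismatch.REPRESENTATIONAL:
        return "suppress"            # drift in representation, not in state
    if kind is Mismatch.STALE_VERSION:
        return "reapply_from_old"    # the old store still defines truth
    if kind is Mismatch.MISSING_WRITE and side_effects_idempotent:
        return "replay"              # safe only when repeating is harmless
    return "manual_review"           # transform, schema, non-idempotent cases
```

A table like this also makes the scary cases visible: anything that resolves to `manual_review` is a human-throughput problem, not a pipeline problem.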
You need operator ergonomics. When on-call gets paged for a migration mismatch spike at 2:10 AM, the dashboard has to answer what changed, which cohorts are affected, whether user-visible correctness is at risk, and whether cutover should freeze. Most migration tooling answers easier questions.
You also need enough throughput in the human loop. Suppose a large migration produces 50,000 mismatch records per hour, but only 2 percent are actionable. If the tooling cannot cluster by root cause, suppress known-benign cases, and surface only the risky classes, operators lose the thread. The first thing that stops scaling is not storage. It is trust in the migration signal.
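A minimal sketch of that clustering step, assuming each mismatch record already carries a root-cause label from the comparator. The cause names are illustrative; the shape of the output is the point: operators see ranked classes, not raw counters.

```python
from collections import Counter

def triage(records: list, benign: set) -> list:
    """Collapse a mismatch stream into root-cause clusters, dropping classes
    already judged benign, so only the risky classes reach operators."""
    counts = Counter(r["cause"] for r in records if r["cause"] not in benign)
    return counts.most_common()

stream = (
    [{"cause": "rank_shift"}] * 960    # known-benign ranking noise
    + [{"cause": "filter_drop"}] * 30  # the class that actually matters
    + [{"cause": "index_lag"}] * 10
)
surfaced = triage(stream, benign={"rank_shift"})
# [('filter_drop', 30), ('index_lag', 10)] -- 1,000 records, two decisions
```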
This is the part that feels worst in real operations. Not because the graphs are red, but because they stop being persuasive. The team starts carrying a migration that technically still works but no longer has clean evidence attached to it. That is usually when people cut scope, reduce sampling, or push the cutover through because nobody wants to live in the ambiguity anymore.
The first question is authority. Which system defines truth in each phase? During backfill, during live dual write, during shadow read, during partial cutover, during rollback, and during cleanup. If this is not explicit, the migration is already less safe than it looks.
Authority also determines where capacity reserves must exist. If the old system is supposed to remain authoritative for rollback, it cannot be sized only for shrinking legacy traffic. It must be able to reassume real load.
The second question is comparability.
What evidence actually proves the new path is acceptable? Equality of rows, invariant preservation, acceptable staleness, ranking stability, filter correctness, authorization consistency, latency distribution, or business outcome parity. Different migrations need different proofs. Search comparability is not ledger comparability. Engineers often confuse “we can compare something” with “we can compare the thing that matters.”
The third question is reversibility.
Can the old path safely take authority again? Not just receive traffic. Safely define truth again. That means it still holds valid state, its derived structures are fresh enough, and the new path has not introduced irreversible semantics.
Reversibility also has a time dimension. As reconcile lag grows, schemas drift, and cutover cohorts widen, rollback gets less real even if the button still exists. The rollback viability window is often much shorter than the migration plan admits.
The fourth question is cleanup burden.
What temporary machinery did you create, and how hard is it to remove without losing safety? Reconcile jobs, dual-write adapters, comparison services, backfill guards, data-fix scripts, one-off dashboards, exception handling for known drift. A migration with low cutover risk and high cleanup burden can still be the wrong design if it leaves behind permanent correctness anxiety.
Junior engineers often focus on sequencing. Senior engineers focus on those four questions instead.
One common mistake is treating dual write like a universal sign of seriousness. It is not. Sometimes it is the right move. Sometimes it is the most expensive way to manufacture uncertainty.
Another is using shadow reads without honest comparison semantics, then treating mismatch counts like proof instead of raw input. If the comparator cannot detect the class of wrongness you care about, the exercise is theater.
Another is treating backfill like preparation instead of live correctness work. In many migrations, backfill causes more trouble than the cutover because it interacts badly with versioning, compaction, hot partitions, and schema evolution.
Another is believing rollback exists because the old code path still compiles. That is not rollback. That is nostalgia.
Another is sizing only for live traffic. Teams benchmark steady-state reads and writes, then discover the migration is limited by replay queues, compare workers, mismatch storage, or the ability to query audit history during an incident.
A subtler mistake is letting product change continue without migration-aware discipline. A schema change or behavior tweak that is harmless in a single-system world can make comparison meaningless or reversibility impossible in a migration world.
Another is keeping the migration machinery long after the move is supposedly done. Teams keep the compare jobs, replay scripts, and one-off dashboards around because nobody wants to lose the last remaining truth probes. Temporary safety tools quietly become permanent ambiguity tools.
The full pattern stack is overkill unless the migration has real correctness exposure, enough duration to accumulate divergence, or enough business impact that silent mismatch is unacceptable. For a low-value internal read model, a one-way backfill plus tenant canary may be the better engineering decision. Simpler is not immaturity. Sometimes simpler is better containment.
Use these patterns when the migration must happen live and correctness cannot be delegated to hope.
A database migration with continuous writes and no stop-the-world window is a classic case. Dual write may be justified if the write model is idempotent, ordering is enforceable, and repair is real.
A search or read-serving migration is a strong fit for shadow reads when the new system can be evaluated safely out of band and the comparison semantics are thoughtful.
A storage-engine replacement inside a service boundary can justify scoped dual write plus strong reconciliation when the service owns both representations and can bound side effects.
A payment or ledger migration may use parts of the pattern, but usually with narrower authority rules and stronger invariant focus. In those systems, I would optimize for unambiguous truth over migration convenience.
Use them when you can answer a harder capacity question honestly: not “can the new system serve production load?” but “can the whole migration system absorb live writes, backfill, compare traffic, replay, audit retention, and a bad hour of mismatch growth without losing control?”
Do not use full dual write plus shadow read just because the migration is important.
Do not use dual write when replay semantics are weak, ordering matters deeply, side effects are hard to deduplicate, or the target system cannot preserve a credible version or idempotency model. Money movement, irreversible workflow steps, and externally visible commitments turn sloppy dual write into expensive ambiguity.
Do not use shadow reads when the read paths are not meaningfully comparable. If one system computes fundamentally different aggregates or freshness windows, weak comparisons can create more confusion than safety.
Do not prolong coexistence only to preserve a comforting rollback story. If the old path is no longer being maintained as a valid authority, move forward with explicit repair strategy instead of pretending rollback is free. Forward-only repair is only a serious option when the business can afford losing rollback as the primary recovery path.
Do not use the full pattern stack for a modest migration where tenant canaries, one-way copy, strong invariants, and quick cutover achieve the same risk reduction with much less complexity.
Also avoid it when the migration support system would become larger than the thing being migrated, or when the architecture cannot isolate cohort-level failure even though the rollout plan pretends it can.
A junior engineer often sees a migration as a sequence. Backfill, dual write, shadow read, cut over, clean up.
A senior engineer sees it as temporary governance of truth.
That shift changes almost every design choice.
They ask where authority sits today and tomorrow. They ask whether the new system is being asked to prove too much too early. They ask what divergence the business can tolerate, for how long, and how it will be found before customers find it. They ask whether rollback is technically possible or actually safe. They ask whether the migration plan creates months of cleanup debt for the sake of a cleaner rollout story.
They also ask a less glamorous question that matters more than most architecture diagrams: what is the real limiting capacity of this migration? Not peak read QPS. Not synthetic write throughput. The real limit is usually the point where divergence arrives faster than it can be explained and repaired.
That is why senior engineers are more willing to choose an inelegant design if it reduces uncertainty. Narrow tenant cutover can beat a globally elegant migration. Strong invariant checks can beat broad comparison coverage. A forward-only repair strategy can beat an overstated rollback plan. A smaller, tighter migration is often the more senior choice.
Opinionated but defensible statement: most migration plans are too proud of their cutover design and not nearly suspicious enough of their coexistence period.
Dual write and shadow read are useful because they let systems change while traffic stays live.
They are dangerous for exactly the same reason.
The real work is deciding which path defines truth in each phase, what evidence is trustworthy, how divergence is repaired, and whether reversal would restore the right system or just restore traffic. At scale, the hardest limit is rarely serving throughput. It is the ability to backfill without damaging live traffic, detect divergence before it spreads, keep comparison useful under mismatch volume, and preserve a rollback path that still deserves to be trusted.
That is how senior engineers think about these migrations. Not as checklists. As controlled uncertainty with explicit truth boundaries.