Take a normal write.
An order service commits a transaction in Postgres. From the service team’s point of view, the work is done. The API returned 200. Latency looks fine. Nothing is on fire.
Inside Postgres, the change is durable in WAL. Debezium is not reading the table the way an application query does. It is reading logical changes derived from WAL. If it is using a logical replication slot, Postgres keeps the WAL needed by that slot until the connector advances.
This is where the clean CDC story starts lying by omission.
If Debezium is healthy, it reads the change, applies schema metadata, hands the record to Kafka Connect, and Connect publishes it to Kafka. Many teams mentally stop there because that is the boundary they built.
The business does not stop there.
Search still has to index it. Billing projections still have to absorb it. The cache invalidator still has to evict the old value. Analytics still has to land it somewhere people actually trust. A record in Kafka is not a customer visible state change. It is just one hop completed.
This is where debugging gets nasty. A user says the new order is not searchable. The connector dashboard is green. Kafka publish latency is fine. The database is healthy. One consumer group is drifting because the search cluster is slowing down under bulk update pressure. Nobody is fully wrong, which is exactly why these incidents burn time.
The first 30 minutes of a CDC incident are often spent proving the last thing that worked.
On a small system, that gap stays survivable for a while. Imagine 200 row changes per second, about 1 MB of WAL per second, one connector, and three modest consumers. If search falls behind for 15 minutes, people grumble and move on.
Now scale it up. Say the hot tables generate 20 MB of relevant WAL per second in steady traffic and 35 MB per second during peak. A connector starts falling behind for two hours because Kafka producer latency spikes and task restarts flap just enough to kill throughput. The primary is now retaining well over 100 GB of extra WAL. Search is 18 minutes stale. Billing projections are 25 minutes stale. Analytics is behind far enough that finance starts asking whether today’s numbers are real.
The write path is still green. The product is already in trouble.
This is the part teams learn late: “published” is a transport fact. “Current” is a business fact. Those are different contracts, and the second one is the one people yell about.
Nothing teaches respect for CDC like a healthy primary, a green API, and a support team staring at stale search results.
If the outage lasts long enough to create real backlog, recovery becomes its own event. Suppose two hours of peak traffic leaves you with 160 million queued changes. The connector recovering is not the finish line. Search may only be able to safely absorb 45,000 updates per second while live traffic is still producing 20,000. Net catchup is 25,000 per second on a good day. That is more than 100 minutes of replay even before the destination starts choking on the recovery load.
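The burn-down arithmetic above is worth making explicit, because the "net catchup" term is the one teams forget. A minimal sketch with the same illustrative numbers:

```python
def catchup_minutes(backlog_events: int, sink_capacity_per_s: int, live_rate_per_s: int) -> float:
    """Estimate replay time in minutes, assuming steady rates.

    Net burn-down is whatever sink capacity is left over after live traffic.
    """
    net = sink_capacity_per_s - live_rate_per_s
    if net <= 0:
        raise ValueError("sink cannot absorb live traffic; backlog grows forever")
    return backlog_events / net / 60

# The scenario above: 160M queued changes, 45k/s safe absorb, 20k/s live.
print(round(catchup_minutes(160_000_000, 45_000, 20_000)))  # ≈ 107 minutes
```

Note the failure branch: if the sink's safe absorb rate ever dips below the live write rate, there is no catchup time at all, only a growing backlog.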
Replay is usually where “eventually consistent” turns into “operationally expensive.”
CDC starts out feeling simple because the early version often is simple enough.
A few tables. One connector. One source database. A handful of consumers. Lag is visible but rarely scary. Restarting the connector fixes most things. Snapshots are annoying, not existential. Teams come away with the impression that CDC is a low friction way to externalize change.
Growth breaks that story in several specific ways.
Larger databases do not just mean more stored data. They usually mean more write throughput, more hot tables, more background maintenance, more migrations, and less forgiveness when a slot stops moving. Retained WAL pressure becomes sharper because the underlying change volume is sharper.
Capturing more tables increases more than payload size. It increases contract surface. Every new table brings schema assumptions, consumer expectations, migration risk, and one more place a rollout can fail in ways the application never sees directly.
More connectors are usually introduced for good reasons. Hot tables need isolation. Fragile tables need isolation. Different domains need different blast radii. That is the right move and it multiplies the operational surface area. More connectors means more offset state, more task assignment issues, more schema histories, more restart edge cases, more things to upgrade, and more places to misconfigure security or capacity.
Replay time gets much uglier with scale. At small volume, a 20 minute outage creates backlog you burn down quietly. At larger volume, the replay itself becomes a production event. Search clusters saturate under catchup writes. Warehouses ingest old and new load together. Side effecting consumers need gates because “replaying history” is not the same thing as “rebuilding state safely.”
What changes at 10x is not just graph height. The operating model changes. A system that was comfortable at 500 changes per second with loose alerts and informal recovery becomes operationally brittle at 5,000 if nothing else changes. Lag becomes debt faster. Retained WAL becomes dangerous faster. Schema mistakes hit more teams. Recovery goes from restart to campaign. Pager load shifts from occasional connector noise to a recurring platform tax.
The sentence “CDC is simple” is usually just a memory from before scale forced the hidden parts into the foreground.
Teams reason better about CDC once they stop flattening all of it into “database changes go to Kafka.”
Logical decoding is not publishing
Logical decoding is a source side mechanism. It lives in the world of slots, WAL retention, and database progress.
Publishing is a connector and transport mechanism. It lives in the world of Kafka Connect task state, producer retries, serializer behavior, topic configuration, and broker health.
Replication slots are not harmless cursors
A logical replication slot is also a retention contract. If the consumer does not advance, Postgres keeps WAL around. That is why a connector problem can become a source database problem even when the application write path is still serving traffic.
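That retention contract can be made visible and checked against an explicit headroom policy. A minimal sketch, assuming the standard Postgres catalog view and LSN functions; the thresholds and the classification scheme are illustrative assumptions, not a standard:

```python
# The SQL uses pg_replication_slots plus pg_wal_lsn_diff / pg_current_wal_lsn
# (Postgres 10+) to measure how much WAL each logical slot is pinning.
RETAINED_WAL_SQL = """
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';
"""

def slot_status(retained_bytes: int, headroom_bytes: int) -> str:
    """Classify a slot against a WAL headroom budget (thresholds are assumptions)."""
    if retained_bytes >= headroom_bytes:
        return "critical"   # the slot is now endangering the primary's disk
    if retained_bytes >= headroom_bytes // 2:
        return "warning"    # the consumer is falling behind; investigate now
    return "ok"

print(slot_status(120 * 2**30, 200 * 2**30))  # 120 GiB retained vs 200 GiB budget -> warning
```

The point of the budget is that "the slot stopped advancing" becomes a paged condition long before the disk-full condition does.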
Snapshotting and streaming are different systems wearing the same name
Streaming is steady state tailing of new changes.
Snapshotting is bulk capture of prior state.
They stress different resources, fail differently, and require different runbooks. Teams often operate the streaming path reasonably well and discover during recovery that their snapshot story is operational fiction.
Schema capture is not schema safety
Debezium can observe schema changes and serializers can enforce compatibility rules. Neither one guarantees that your downstream estate can actually survive the meaning of the change.
A migration can be perfectly valid for the database and deeply unsafe for the CDC estate.
Published is not current
A change in Kafka proves one boundary worked.
It says nothing about whether the sink applied it, whether the sink is fresh enough for the business, or whether users are already acting on stale derived state.
CDC is a good choice only when the organization is prepared to operate freshness, replay, and schema rollout as platform concerns.
It removes a class of brittle synchronous fanout from the write path. It creates a shared stream of committed change for many consumers. It can reduce application level correctness risk compared with ad hoc dual writes.
But the price is not theoretical. You are choosing to run a live change propagation subsystem with source side storage implications, rollout coordination burden, lag management, replay cost, and cross team debugging overhead.
That trade is worth it when the platform can support it.
Otherwise it is a delayed complexity purchase.
Real CDC incidents rarely arrive as “CDC is down.” They arrive as sideways damage. Search is stale. Billing is off. Analytics is behind. The database team is asking why WAL will not clear. The connector is technically alive. Everyone has one reassuring graph and one ugly graph.
That is why the failure modes matter. Not as taxonomy. As incident memory.
Connector falls behind and the slot retains WAL
Early signal
End to end lag starts drifting up. Source commit to Kafka publish goes from sub second to several seconds, then several minutes. Connector restarts may increase. Producer retries may climb. On hot systems, retained WAL starts growing almost immediately.
What the dashboard shows first
Usually a connector lag graph, task instability, or one freshness panel moving the wrong way. On weaker dashboards, the first visible signal is actually a downstream team complaining that something is stale.
What is actually broken first
Forward progress relative to source write volume. The connector does not have to be hard down. Slightly slower than the source is enough if it lasts long enough. Once that happens, lag turns into retained WAL, and retained WAL turns a downstream problem into a source problem.
Immediate containment
Protect the source first. Get the slot moving again. Recover Kafka if Kafka is the bottleneck. Isolate the hot table if one table is poisoning the connector. Slow any bulk source churn you can slow. This is not the moment to optimize elegance. This is the moment to stop turning disk into stress.
Durable fix
Connector isolation, throughput tuning, better partitioning, capacity that matches actual write rates, and explicit WAL headroom policy on the source. One large shared connector that felt tidy at design time often becomes the thing that spreads pain fastest in production.
Longer-term prevention
Monitor retained WAL growth rate, not just connector aliveness. Alert on source to destination freshness, not only source to Kafka delay. Rehearse what gets paused first when a slot stops advancing. Most teams do not have that answer ready until the first bad incident teaches it to them.
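Growth rate, not absolute retained bytes, is the early signal, because it tells you how long you have. A sketch of the alert math, with illustrative sample numbers:

```python
def wal_growth_rate(samples: list[tuple[float, int]]) -> float:
    """Bytes per second of retained-WAL growth across (unix_ts, retained_bytes) samples."""
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    return (b1 - b0) / (t1 - t0)

def seconds_until_full(disk_free_bytes: int, rate_bytes_per_s: float) -> float:
    """How long until retained WAL eats the remaining disk at the current rate."""
    if rate_bytes_per_s <= 0:
        return float("inf")  # retention is flat or shrinking; no deadline
    return disk_free_bytes / rate_bytes_per_s

# Two samples five minutes apart: retention grew by 6 GiB.
rate = wal_growth_rate([(0.0, 10 * 2**30), (300.0, 16 * 2**30)])
print(round(seconds_until_full(500 * 2**30, rate) / 3600, 1))  # ≈ 6.9 hours of disk left
```

An absolute-lag alert fires when things are already bad. A time-to-exhaustion alert fires while there is still room to act.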
A connector can be “mostly healthy” for an hour while the primary is quietly turning a streaming problem into a storage problem.
Schema change breaks downstream parsing
This one is common enough that teams still act surprised every time.
Early signal
Consumer parse failures start rising on one stream. Dead letter volume grows. One sink stops advancing while others stay fine. Sometimes the first real signal is a downstream discrepancy, not an infrastructure alarm.
What the dashboard shows first
Usually a sink side failure. Search rejects documents. Warehouse ingestion starts erroring. Billing projections stop advancing. The connector itself may still look green because it is successfully publishing the new shape.
What is actually broken first
The contract between source schema evolution and downstream expectations. The migration may be valid for Postgres. It may even be valid for the application. That does not mean it is safe for the CDC estate.
Captured tables accumulate hidden readers. Teams remember the obvious consumers and forget the internal job, the support projection, the one warehouse transform, the billing side table no one has touched in months. Then a migration lands, and the database change was “safe” right up until the system that mattered stopped parsing it.
Immediate containment
Stop the blast radius. Pause the broken sinks. Quarantine the stream if needed. Roll back the migration only if rollback is actually safer than forward repair. Sometimes it is not. Sometimes the least bad option is to patch the downstream parser and preserve backlog for controlled replay.
Durable fix
Treat schema changes on captured tables as distributed rollouts, not local database edits. Compatibility testing has to include real consumers or faithful contracts, not just the migration script and the app.
Longer-term prevention
Require CDC review for migrations on captured tables. Keep a live map of who reads what. Use canaries for schema rollout. Stop using “backward compatible” as a magic phrase when what you mean is “the app still works.”
The migration is never just a database change once the table is on CDC. It is a rollout across every team that quietly attached itself to that stream.
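The “live map of who reads what” does not have to be elaborate to be useful. A sketch, with hypothetical table and consumer names, of a checked-in registry that a migration review can query:

```python
# Hypothetical registry: which downstream consumers read each captured table.
# In practice this would live in version control next to the connector config,
# so adding a consumer and updating the map happen in the same review.
CDC_CONSUMERS: dict[str, list[str]] = {
    "orders": ["search-indexer", "billing-projection", "warehouse-load"],
    "customers": ["search-indexer", "support-projection"],
}

def migration_signoffs(table: str) -> list[str]:
    """Every registered consumer of a captured table must sign off before the migration ships."""
    return CDC_CONSUMERS.get(table, [])

print(migration_signoffs("orders"))
```

An empty answer for a captured table is itself a finding: it means the map is stale, not that the migration is safe.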
Snapshot or replay duplicates old data into consumers
This is where teams discover whether their consumers were actually designed for reality.
Early signal
Counts drift upward in one destination without matching source growth. Duplicate notifications appear. Billing side projections wobble. Search update rates spike in ways traffic does not explain.
What the dashboard shows first
Usually sink side anomalies. The connector may look busy but healthy. Kafka may look fine. The weirdness shows up where replay meets a consumer that was never truly safe to see old data again.
What is actually broken first
Not the fact that old records reappeared. Replay is allowed. What broke first is the consumer’s assumption that seeing the same source fact again is harmless when it was actually wired to trigger a side effect or double count.
Immediate containment
Pause side effecting consumers first. Let replay safe projections continue if possible. Do not let “we can always replay” turn into “we just emailed everyone twice” or “we just rebuilt billing wrong.”
Durable fix
Idempotency at the sink boundary. Durable dedupe keys. Clear separation between state rebuilding consumers and side effecting consumers. Replays should be boring for some consumers and gated for others. If everything is treated the same, recovery becomes guesswork.
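Idempotency at the sink boundary can be as simple as a durable dedupe key derived from the source change. A sketch, using an in-memory set as a stand-in for what a real sink would enforce with a unique index or keyed upsert; the record shape and the choice of key are assumptions:

```python
class IdempotentSink:
    """Applies each source change at most once, so replay is harmless."""

    def __init__(self) -> None:
        self.state: dict[str, dict] = {}
        self.applied: set[tuple[str, int]] = set()  # durable dedupe keys

    def apply(self, change: dict) -> bool:
        """Return False (and do nothing) when the change was already applied."""
        # Key on the source identity plus source position: stable across replays.
        key = (change["id"], change["source_lsn"])
        if key in self.applied:
            return False
        self.applied.add(key)
        self.state[change["id"]] = change["payload"]
        return True

sink = IdempotentSink()
change = {"id": "order-42", "source_lsn": 1001, "payload": {"status": "paid"}}
print(sink.apply(change), sink.apply(change))  # first apply True, replayed copy False
```

A state-rebuilding consumer built this way can sit through a replay unattended. A side effecting consumer still needs an operator gate on top, because "already applied" is not the same question as "safe to email the customer again."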
Longer-term prevention
Practice replay before you need it. Label consumers by replay safety. Know which ones can run through a backfill and which ones require operator control. Most teams only learn this after the first replay creates more fear than confidence.
The expensive part of replay is not that the data still exists. It is that the system has to survive seeing it again.
One downstream dependency is stale while others are current
This failure shape wastes a lot of time because it makes the middle of the stack look innocent.
Early signal
One freshness SLO drifts while connector and Kafka graphs stay green. Search gets slower to reflect changes. Billing projections age out. Analytics freshness slips.
What the dashboard shows first
A destination specific backlog. Usually only one. That is why teams keep saying “CDC is healthy” while the business is already dealing with stale systems.
What is actually broken first
The destination apply path. Not the connector. Not the database. Not Kafka. The broken thing is the path people actually use to observe or act on the data.
Immediate containment
Contain the business impact, not the architecture diagram. Route critical reads back to the source if you can. Pause low priority sink work. Mark dashboards degraded instead of letting stale numbers keep pretending to be fresh.
Durable fix
Per destination freshness monitoring and capacity planning. Search freshness is its own contract. Billing projection freshness is its own contract. Analytics freshness is its own contract. One generic CDC health panel is operationally lazy.
Longer-term prevention
Measure source commit time to destination visibility time for every sink that matters. That is the contract. Not “messages published.”
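Measuring that contract is mechanical once both timestamps exist: the source commit time (which Debezium-style records typically carry) and the time the record became visible in the sink. A sketch, with illustrative timestamps and SLOs:

```python
def freshness_seconds(source_commit_ts: float, destination_visible_ts: float) -> float:
    """Business-facing lag for one record: commit at the source to visibility at a sink."""
    return destination_visible_ts - source_commit_ts

def breached(lag_s: float, slo_s: float) -> bool:
    """One contract per destination; a single shared SLO hides the sink that matters."""
    return lag_s > slo_s

# Same record, two destinations with different contracts:
# hypothetical 60s SLO for search, 300s SLO for billing projections.
lag = freshness_seconds(1_700_000_000.0, 1_700_000_090.0)  # record is 90s behind
print(breached(lag, 60), breached(lag, 300))  # search breached, billing fine
```

The useful dashboard is the per-sink percentile of this number over time, not a count of messages published.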
The first dashboard that matters is often not the connector dashboard. It is the support queue.
Teams think CDC is optional until it degrades billing, search, or analytics freshness
This is the organizational version of technical debt maturing.
Early signal
More teams quietly start depending on CDC outputs. Support assumes search is near real time. Finance uses warehouse tables operationally, not just analytically. Billing workflows stop querying source of truth tables directly. No one updates SLOs or ownership to match.
What the dashboard shows first
Usually ordinary lag. Nothing about the graph itself says the system is now important. That is the trap.
What is actually broken first
The mental model. The organization kept treating CDC as optional after the business had already promoted it into required infrastructure.
Immediate containment
Be honest about degradation. If stale CDC is now blocking billing freshness or order search, declare those business functions degraded even if the main write path is still green.
Durable fix
Reclassify CDC dependent systems by business criticality. Fund the monitoring, on-call posture, and destination specific freshness objectives that classification implies.
Longer-term prevention
Review CDC consumers periodically. Ask which ones became operationally important while nobody was paying attention. This happens more often than teams want to admit.
Ownership ambiguity slows incident response
This is one of the most common reasons CDC incidents stay expensive.
Early signal
Confusion, not a graph. Database sees retained WAL. Platform sees connector flapping. Search sees stale results. Data sees warehouse lag. Product sees wrong numbers. Everyone is bringing valid local evidence into a room that has no global owner.
What the dashboard shows first
Depends who is looking. That is part of the problem.
What is actually broken first
The operating model. The technical issue might be small or large, but the incident gets longer because nobody owns source to destination freshness as a single contract.
Immediate containment
Pick one incident lead for the whole chain. Define the priorities clearly: protect source headroom, restore the most business critical freshness path, prevent unsafe replay. Do those in that order.
Durable fix
Explicit ownership for end to end CDC health. Not just “the connector team” and “the sink team.” Someone has to own freshness across the boundaries.
Longer-term prevention
Cross layer dashboards. Shared runbooks. Joint game days with database, platform, and major sink owners. CDC crosses boundaries by design. The response model has to cross them too.
Most CDC incidents are not mysterious. They are cross boundary. Cross boundary feels a lot like mysterious when nobody owns the whole path.
Blast Radius and Failure Propagation
CDC failures spread sideways before they spread downward.
That is why they are so annoying to debug.
A connector issue shows up as stale search. A schema change shows up as broken analytics. A replay problem shows up as duplicate customer communication. A lagging slot shows up as database disk pressure. The failure path crosses source, transport, destination, and business surfaces, and the first symptom the company notices is rarely the place the actual break began.
This is why the on-call burden is higher than teams expect. The system crosses more domains than the original architecture review admitted. One incident can require a database operator, a streaming operator, and an application owner before anyone can even agree on the shape of the problem.
CDC is not especially painful because any one component is impossible. It is painful because the debugging path is cross boundary by default.
Operational Complexity
This is where teams usually underbuild and overestimate themselves.
A serious CDC setup needs more than a connector that stays up and Kafka topics that keep moving. It needs an operating model.
At minimum, teams need to see:
slot lag in bytes and time
retained WAL bytes by slot
retained WAL growth rate
connector task state and restart frequency
source commit to Kafka publish delay
source commit to destination visibility delay
dead letter growth
schema rollout errors by consumer
backlog size by sink
replay burn down rate under live traffic
consumer classification by replay safety
Most teams do not monitor half of that.
They usually do not watch retained WAL growth rate early enough. They watch absolute lag and miss the moment it turns dangerous. They usually do not watch destination freshness. They watch connector health and Kafka offsets and miss that one sink the business actually cares about is already stale. They usually do not watch schema rollout health across consumers. They find out from parse failures after the change is live. They usually do not know replay time under live traffic. They know replay is theoretically possible, not operationally tolerable.
Many CDC incidents are prolonged by green middle layer dashboards. Connector up, Kafka moving, primary healthy, and the only graph that matters is the one nobody built: source commit to destination visibility.
This is why the on-call burden is heavier than the diagram suggests.
A CDC incident is rarely one question. It is several bad questions at once.
Is the source safe?
Is the connector actually advancing?
Is Kafka the choke point?
Which sink is stale first?
Can we replay safely?
Who owns saying the business is already degraded?
That is real operator load. Not just system load.
Schema rollout pain deserves special respect. Once a table is captured, a migration on that table is no longer a local change. It is a distributed rollout across every parser, transform, projection, and quiet internal dependency attached to the stream. The database migration can succeed in minutes and still buy the team two days of downstream cleanup. That is why CDC makes seemingly ordinary schema changes feel more expensive over time. The readers multiply. The memory of them does not.
Replay and backfill cost are usually underestimated in a more subtle way. Teams think of replay as data movement. It is really a production event. It consumes sink capacity that live traffic was already using. It forces judgment calls about throttling. It exposes weak idempotency. It raises ugly questions about whether you trust the old data, the old schema, and the old consumers enough to run them back through the system. A multi hour replay is not background work. It is a plan.
Teams usually do not fail because the connector dies. They fail because they never built the machinery around it.
No freshness SLO by sink.
No replay budget.
No source WAL headroom policy.
No consumer map for schema rollout.
No shared incident view from source commit to destination visibility.
Then the first serious incident arrives and the team discovers they built a pipeline but not an operating model.
The real cost is not just the outage. It is the permanent tax of tracking freshness, reviewing schema changes, planning replay, and keeping several teams aligned on a path nobody wanted to call critical.
Teams keep talking about CDC as an integration convenience after it has already become production machinery.
They watch connector health and not destination freshness.
They underweight retained WAL until it becomes a database event.
They let one connector span too much blast radius.
They treat schema changes as local application events instead of CDC estate rollouts.
They assume replay is cheap because the data still exists.
They do not classify consumers by replay safety and side effect risk.
They leave ownership fuzzy until the first real incident turns “shared responsibility” into “nobody is driving.”
Use CDC when the database is the real source of truth, multiple downstream systems need the same committed changes, and the organization is willing to operate the resulting machinery seriously.
It fits best when synchronous fanout would be too risky in the write path, polling is too blunt, and there is enough platform maturity to own lag budgets, rollout discipline, replay planning, and cross team response.
Do not use CDC because it sounds cleaner than talking to other teams.
Do not use it for a trivial one off integration that could be handled more explicitly.
Do not use it when no one owns source side safety, destination freshness, or replay discipline.
Do not use it when the organization has no appetite for schema coordination and hidden consumer management.
Do not use it when everyone likes the decoupling story and nobody wants the pager.
Senior engineers treat CDC the way they treat anything that can quietly become business critical.
They assume the easy phase is temporary.
They ask what must be monitored before they ask what can be built.
They ask what the replay plan is before they trust the happy path.
They ask which schema changes are operationally dangerous before someone ships one on Friday.
They ask who owns end to end freshness before the first incident makes the question urgent.
They do not let the conversation stay at “we publish database changes.” They pull it toward slot safety, rollout pain, replay economics, destination freshness, and on-call burden because those are the parts the business eventually pays for.
And they recognize the scar tissue pattern early.
The connector was “healthy” right up until the moment the primary started retaining enough WAL to scare the database team.
The migration was “backward compatible” right up until the sink that nobody remembered stopped parsing it.
The replay was “available” right up until someone realized billing could not safely consume it twice.
That is not cynicism. That is what production memory sounds like.
CDC is useful. Sometimes it is the right answer.
But it is not lightweight, and it is not free. It creates a long lived operational path between source truth and downstream behavior. That path has bottlenecks, failure modes, rollout pain, replay cost, and pager load of its own.
The mature operations view is simple.
Treat slot progress as source safety.
Treat schema rollout as distributed change management.
Treat replay as a production event, not a technical footnote.
Treat destination freshness as the contract that matters.
Treat ownership ambiguity as a failure mode, not an org annoyance.
CDC does not remove integration work. It converts it into ongoing operational work, then charges interest during lag, replay, and incident response.