Core insight: Retries are usually sold as resilience. In production they behave more like demand multiplication, and often at exactly the wrong moment.
The incident starts the way many bad incidents start: not with a hard-down dependency, but with a slow one.
An internal payments service normally returns in 80 to 120 milliseconds. During a storage failover, median latency stays under 200 milliseconds, but tail latency stretches past 2 seconds. Availability is degraded, not absent. Most of the fleet sees this as a transient problem and responds exactly as configured.
The API tier uses a 700 millisecond timeout and retries up to 3 times. A downstream order service also retries the payment call twice on timeout because its team added “extra resilience” months earlier. The client SDK in one mobile path retries once more on connection reset. A queue worker that handles delayed order-finalization messages also retries payment failures on a separate schedule because it was built to survive brief gateway flaps. None of those policies looked reckless in isolation.
That is how these incidents get purchased. Nobody approves “triple the load on the weakest dependency.” They approve a handful of locally reasonable retry decisions made months apart by different teams.
At 7,000 original payment attempts per second, the system is busy but healthy. Then the slowdown begins. A meaningful fraction of first attempts crosses the 700 millisecond timeout, so the API tier issues second attempts while the original attempts are still executing. Some of those second attempts also time out and produce third attempts. The order service, sitting above the same call path, interprets the API timeout as another transient and adds its own retries. Queue workers, seeing timeouts in the reconciliation path, begin replaying the same logical business intents from a different channel. In under a minute, the payment service is no longer handling 7,000 intents per second. It is seeing something closer to 18,000 to 24,000 attempts per second, much of it duplicate work.
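The jump from 7,000 to 18,000-24,000 is not mysterious: stacked retry ladders multiply. A back-of-envelope sketch, where the specific timeout probabilities are illustrative assumptions, not measurements:

```python
def expected_attempts(timeout_prob: float, max_attempts: int) -> float:
    """Expected attempts per logical call when each attempt independently
    times out with probability p and the caller stops after max_attempts:
    1 + p + p^2 + ... = (1 - p^max_attempts) / (1 - p)."""
    p = timeout_prob
    return (1 - p ** max_attempts) / (1 - p)

# Assumed, not measured: during the slowdown, half of payment attempts cross
# the 700 ms timeout, and the order layer times out on 40% of its attempts.
api_layer = expected_attempts(0.5, max_attempts=4)    # 3 retries -> ~1.88 attempts
order_layer = expected_attempts(0.4, max_attempts=3)  # 2 retries -> ~1.56 attempts

# Layers multiply: each order-layer attempt drives a full API-layer ladder.
amplification = api_layer * order_layer               # ~2.9x
offered = 7_000 * amplification                       # ~20,500 attempts per second
```

The exact figure depends on how timeout probability moves as the storm builds, but the shape is the point: two individually modest ladders multiply into roughly triple the offered load.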
Nothing about customer demand changed. The load increase is self-generated.
The scale detail that usually gets missed is how quickly the same retry policy changes character as the caller population grows. With 5 API instances sending 40 requests per second each, a brief slowdown may create an ugly but survivable burst. With 300 instances each sending 250 requests per second, the exact same policy becomes a fleet-wide amplifier. Retries stop being a local correction and become synchronized pressure from hundreds of workers that all observed the same latency event within the same second.
A smaller system makes the contrast clear. A dependency can comfortably handle 300 requests per second at 100 millisecond latency. Six caller instances send 30 requests per second each, so original load is 180 requests per second. If the dependency slows to 2 seconds, concurrency rises from about 18 in flight to about 360. That is ugly, but maybe survivable if pools and queues still have headroom. If callers retry once after 500 milliseconds, offered load may jump toward 250 or 300 attempts per second.
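The in-flight arithmetic here is just Little's law: average concurrency equals arrival rate times time in system. A minimal sketch:

```python
def in_flight(arrival_rate_rps: float, latency_s: float) -> float:
    """Little's law: average concurrency L = lambda * W."""
    return arrival_rate_rps * latency_s

# Six callers x 30 rps, at normal versus degraded latency:
normal = in_flight(180, 0.100)    # ~18 requests in flight
degraded = in_flight(180, 2.0)    # ~360 in flight, before a single retry fires
```

The same two-line formula produces every concurrency figure in this section; only the inputs change with scale.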
Now scale the same shape up. The larger system has 240 stateless API pods and 80 worker processes behind queues. Original intent volume is 8,000 requests per second. The payment service normally runs at 120 millisecond latency, so average in-flight work is around 960 requests. When latency stretches to 2 seconds, in-flight work for the same demand becomes about 16,000 requests before retries even start. Add two retrying layers and it is no longer difficult to drive 20,000 to 30,000 concurrent attempts through the same connection pools, database sessions, and worker queues.
At that point the incident is no longer “payment is slow.” The incident is that the system has chosen to spend scarce capacity on overlapping attempts whose answers may not matter anymore.
Connection pools fill first. Then worker queues lengthen. Then CPU rises, but not as dramatically as people expect because the hotter resource is not compute. It is occupancy. Requests are living too long. Database sessions stay open. Lock hold times stretch. Thread pools stop turning over. The queue consumer fleet keeps polling because from its point of view the problem looks like partial failure, not systemic overload. Autoscaling notices rising latency and queue depth and adds more pods, but the new pods start cold, open new connections, warm caches, and join the same retry policy. They arrive after the dependency has already crossed the point where more callers simply mean more duplicate pressure.
This is where incidents stop feeling technical and start feeling stubborn. Everything still looks partly alive.
The dependency that was merely slow is now effectively dead.
And then the incident gets worse in the familiar way. The dashboard shows payment latency exploding, but the order service starts failing too because its threads are now tied up waiting on payment retries. Checkout latency rises. Shopping cart abandonment looks like a frontend issue for ten minutes. A delayed email worker starts replaying “payment succeeded” and “payment failed” transitions based on ambiguous state. Support sees duplicate authorizations on a small fraction of orders because some timed-out payment calls had actually committed before the caller retried. Inventory for a few hot items looks reserved twice because the reservation service accepted a duplicate intent before dedupe state converged.
The original technical event was a brief storage failover. The customer-visible outage was retry amplification plus ambiguous completion.
This is the part most retry advice understates: retries do not merely repeat work. They reshape demand right when the system most needs demand to fall.
Retries fail this way because three things line up at once.
First, a slow dependency already needs more concurrency for the same demand. If latency stretches from 100 milliseconds to 2 seconds, each request lives 20 times longer. Even before retries, the system is burning scarce capacity faster than it can recycle it.
Second, retries are launched from local viewpoints, not fleet-wide ones. A timeout may be transient for one caller. Across hundreds of callers, the same decision becomes correlated extra load. When multiple layers make that choice independently, local prudence becomes aggregate panic.
Third, much of the extra work is low-value or zero-value. Some attempts overlap with still-running originals. Some arrive after the caller’s useful deadline is already gone. Some come from queue redelivery or worker replay and are not even counted as retries in normal reasoning. Capacity is spent, but the extra work does not buy proportional success.
That is why retry storms feel unfair. The dependency was weak, but the outage size was negotiated by a pile of independent “reasonable” choices, hidden defaults, and clocks that all happened to agree that now was the right time to help.
The useful taxonomy is not “safe retries” versus “unsafe retries.” That framing is too soft to help in an incident.
Recovery-window retries are the ones that usually earn their keep. The next attempt may genuinely see a healthier state because the fault is short-lived and structural: leader election, route convergence, a connection reset during deploy, a brief control-plane flap.
Slowdown retries are the dangerous middle. The dependency is not down. It is late. These retries compete with the original work for the same constrained resource. They do not wait for recovery. They add pressure to a system that is already failing by taking too long.
Deadline-dead retries are attempts launched after the caller’s useful budget is effectively gone. They may succeed technically and still be waste. Worse, they often steal capacity from fresher requests that still had a chance.
Cross-layer retries happen when different layers retry the same logical intent without a shared budget or a shared owner. These incidents are confusing because every team can point to a modest local policy and still the combined system behaves like a panic machine.
Replay retries come from queues, workers, redelivery, dead-letter drains, and compensators. Teams often do not count them as retries at all. In many systems the most damaging retry is not the HTTP retry in the hot path. It is the worker replay arriving minutes later against a dependency that never fully recovered.
Ambiguous-completion retries are where idempotency stops being a design virtue and becomes the only thing standing between a transient fault and business damage. The caller timed out, but the server may have committed.
One thing engineers usually get wrong is assuming idempotency solves the retry problem wholesale. It solves one part of the problem: duplicate side effects within a defined boundary. It does not solve overload, queue growth, pool exhaustion, or useful-deadline violation.
The first mistake is usually not “we configured retries.” The first mistake is “we configured them without a theory of failure.”
And real incidents rarely start with a clean timeout graph. They start with a messy mix: slightly elevated latency, a small fraction of 5xx responses, some connection resets, some cancellations, maybe a few throttles. Different client libraries classify these differently. One team retries on timeout only. Another retries on 502 and 503. Another retries connection reset but not read timeout. Another language SDK suppresses retries when request bodies are non-bufferable.
Real retry incidents are policy archaeology exercises.
The standard timeout math still matters, but mostly because it exposes how little thought often went into the full attempt lifecycle. A service has a 600 millisecond per-attempt timeout, 3 retries, and backoff intervals of 100 milliseconds, 200 milliseconds, and 400 milliseconds. On paper, this sounds bounded. But the caller itself has a 1.5 second end-to-end deadline. The third attempt was never going to be useful. During degradation it becomes worse than useless because it consumes downstream capacity for work whose answer cannot matter anymore.
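Walking that ladder in code makes the dead attempt visible. A sketch in milliseconds, assuming the worst case where every attempt runs all the way to its timeout:

```python
def useful_attempt_starts(per_attempt_timeout_ms: int,
                          backoffs_ms: list[int],
                          deadline_ms: int) -> list[int]:
    """Walk the worst-case ladder (every attempt times out) and keep only the
    attempts that still have end-to-end budget left when they would start."""
    starts, t = [], 0
    for i in range(len(backoffs_ms) + 1):   # attempts = retries + 1
        if deadline_ms - t <= 0:            # budget gone: this attempt is dead on arrival
            break
        starts.append(t)
        t += per_attempt_timeout_ms         # the attempt runs to its timeout...
        if i < len(backoffs_ms):
            t += backoffs_ms[i]             # ...then the caller backs off
    return starts

# 600 ms per attempt, backoffs of 100/200/400 ms, 1.5 s end-to-end deadline.
# The four attempts would start at 0, 700, 1500, and 2500 ms.
print(useful_attempt_starts(600, [100, 200, 400], 1500))  # [0, 700]
```

Only the first two attempts fit inside the caller's deadline; the third starts at the exact moment the answer stops mattering, and the fourth was never in play at all.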
Now add a server-side queue. The first attempt reaches the downstream and waits 450 milliseconds in queue before doing 250 milliseconds of work. The caller times out at 600 milliseconds and sends a second attempt, which enters the same queue while the first attempt is near execution. One logical intent has become two queued units of pressure before either has resolved.
At larger scale, this starts even earlier because multiple retrying actors join the same incident. The mobile SDK retries once on connection reset. The API service retries on timeout. The background compensator retries failed state transitions from a queue. A batch worker also retries because it saw a 503. None of these actors knows the others are helping.
The early signal here is rarely a dramatic spike in outright errors. More often it is a gentle rise in tail latency, connection acquisition time, queue age, or attempts per original request. What the dashboard usually shows first is “degraded but manageable.” What is actually broken first is turnover. Pools stop clearing at the rate the system assumes.
Once turnover breaks, retries stop being defensive and start becoming predatory.
Retry amplification spreads through occupancy first, then through dependency coupling, then through correctness.
The first propagation path is concurrency saturation. Slow requests hold sockets, threads, semaphores, database sessions, and memory buffers longer. Retries increase the number of in-flight attempts competing for the same finite pools. Once those pools saturate, even healthy requests are dragged into long waits. This is why retry storms destroy tail latency long before they necessarily spike average CPU.
The second propagation path is queue contagion. A queue that was useful during steady-state variance becomes toxic during incident amplification. Duplicate attempts crowd out fresh work. Old attempts whose callers have already given up still sit in the queue. New customer requests arrive behind work that is already economically dead. Tail latency rises, then timeout rate rises, which creates more retries, which deepens the queue further.
The third propagation path is upstream collapse. A service waiting on a weak dependency accumulates blocked worker time. Its own latency grows. Its callers now see it as the failing dependency and may start retrying it too. What began as a local slowdown turns into a multi-hop outage because every layer exports its waiting problem upward.
This propagation accelerates sharply at scale because the pools that matter are usually small relative to caller count. A payment service may be called by 300 pods but have a database pool of 120 connections per cluster shard and a worker pool of a few hundred active slots. Once retries push occupancy above those bounds, the service stops failing gracefully and starts queueing almost everything. The next layer up then sees a broad latency wall rather than a partial degradation, which triggers its own retry logic.
The blast radius widens not because the original fault grew, but because waiting became contagious across layers.
Here is the failure chain teams actually live through:
A storage failover adds 1.5 to 2 seconds to a dependency’s tail latency.
Callers with 700 millisecond timeouts retry while the original attempts are still running.
Connection pools on the dependency exhaust and thread pools stop turning over.
Queue age rises because worker completions slow, which triggers more redelivery and worker retries.
The upstream order service now sees its own latency rise and starts retrying payment calls from a higher layer.
Autoscaling adds more API and worker pods, which open more connections and join the same retry schedule.
The payment dependency begins returning more timeouts and partial 5xx responses.
Some original attempts still commit, but their callers have already retried.
Customers see spinning checkout flows, some duplicate charges, some “payment failed” messages after successful authorization, and intermittent inventory holds that do not match final order state.
If you have operated one of these systems long enough, you eventually learn an ugly rule: the queue is often where the outage goes to become patient.
What the dashboard shows first in this chain is rising p99 and maybe a modest increase in request rate. What is actually broken first is concurrency economics. The system has started spending capacity on overlapping attempts whose answers may no longer be useful.
This is the failure-propagation lesson teams usually learn the hard way: retries are not just additive at the point of failure. They teach every upstream service to wait longer, hold more resources, and send more work.
The final propagation path is correctness damage. A non-idempotent side effect, such as charging a card, sending an email, reserving inventory, or mutating workflow state, may have succeeded just before the timeout boundary. The caller does not know that, so it retries. Now the same customer intent may be executed multiple times. The outage is no longer just about latency or availability. It is about business truth.
Idempotency narrows that risk, but only if the boundary is real. “This endpoint is idempotent” is usually an overstatement. The real question is whether the same logical intent is recognized and deduplicated across every relevant side effect, within a defined time window, despite partial failure and asynchronous processing.
The early signal here is ambiguous state, not always high error rate. A support tool shows both “payment pending” and “payment failed” for the same order. An email system shows duplicate send attempts. Inventory reservations exceed confirmed orders by a small but rising margin. The dashboard may still make this look like an availability problem. What is actually broken first is business identity. The system no longer has a trustworthy way to decide whether a second attempt is a rescue or a duplicate side effect.
The word “idempotent” gets used casually in design reviews. Production does not accept casual definitions. Either the second attempt is safely recognized as the same business act across the boundary that matters, or it is not.
What the Dashboard Shows vs What Is Actually Happening
Dashboards usually tell the truth about symptoms and lie about causes.
You will see request rate rise and assume demand increased. In many retry incidents, business demand is flat. Attempt volume is what increased. If you do not separate original intents from retry attempts in telemetry, the graph tells a false story with great confidence.
You will see CPU at 65 percent and assume the service still has headroom. It may not. The bottleneck may be connection pool exhaustion, thread starvation, lock waits, or database concurrency caps. Retry storms often look surprisingly moderate on coarse CPU graphs because the system is spending more time waiting than computing.
You will see timeout rate rising and read that as “the downstream is failing.” That is only half the story. A timeout says the caller stopped waiting. It says nothing about whether the callee stopped working. Without deadline propagation and cancellation, the server may continue executing abandoned work, meaning the timeout graph is partially a graph of work duplication.
The most dangerous dashboard state is when the graphs still look manageable. Error rate is 2 percent. CPU is 58 percent. Autoscaling has added a few pods. p50 latency is only slightly elevated. A casual read says the system is degraded but under control.
But queue age is rising. Thread pools are pinned near saturation. Connection acquisition is starting to block. Attempts per original request has climbed from 1.0 to 1.6 in two minutes. The positive feedback loop has already started. The system has crossed into a regime where each extra timeout creates enough duplicate work to worsen the next minute.
This is the phase where incidents are still recoverable, and also the phase teams most often misread.
A lot of bad incident calls are made right here, when the dashboards are still polite.
The most useful graph in these incidents is often one teams do not have: attempts per original request, broken down by layer and reason. If original request volume is stable at 10,000 per minute and total attempt volume climbs to 28,000 per minute, the outage is no longer just dependency degradation. It is coordinated amplification.
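The graph is cheap to compute once original intents and retry attempts are counted separately. A sketch, with hypothetical layer names standing in for real counters:

```python
def attempts_per_original(original_count: int,
                          attempts_by_layer: dict[str, int]) -> float:
    """Amplification factor for one window: total attempts / original intents.
    1.0 means no retries anywhere; a climbing value is self-generated load."""
    return sum(attempts_by_layer.values()) / original_count

# Hypothetical one-minute window matching the numbers above:
factor = attempts_per_original(10_000, {
    "api_tier": 16_000,       # originals plus their timeout retries
    "order_service": 7_000,   # upper-layer retries of the same intents
    "queue_workers": 5_000,   # redelivery and replay, often not counted as retries
})
print(factor)  # 2.8
```

The per-layer breakdown matters as much as the total: it identifies which retry owner to switch off first during mitigation.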
Another operational view that matters is useful completions per unit of downstream work. During a retry storm, downstream effort can double while successful business outcomes barely move. When that ratio collapses, the system is spending capacity on panic, not progress.
A retry storm often looks healthy enough to ignore right up to the minute it becomes expensive enough to remember.
How Experienced Teams Respond
Experienced teams do not respond to a retry incident by asking how to preserve every request. They respond by asking how to preserve the dependency and the critical path.
The first move is often to cut amplification, not to improve success rate. That may mean reducing retries to zero for a period, raising backoff intervals sharply, disabling one retrying layer, or introducing a hard retry budget. A retry budget is one of the few tools that translates request-level optimism into fleet-level discipline. If the system allows retry traffic to consume at most 15 percent or 20 percent of original traffic over a rolling window, retry load can no longer explode just because everything is timing out.
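A retry budget can be sketched as a pair of counters per window. Production systems such as Finagle and linkerd implement variants of this idea with decaying windows, but a simple counter pair shows the mechanism:

```python
class RetryBudget:
    """Windowed retry budget (sketch): retries are permitted only while retry
    traffic stays under `ratio` of original traffic seen in the same window.
    Real implementations decay or slide the window rather than resetting it."""
    def __init__(self, ratio: float = 0.2):
        self.ratio = ratio
        self.originals = 0
        self.retries = 0

    def record_original(self) -> None:
        self.originals += 1

    def try_spend_retry(self) -> bool:
        """Atomically check and consume budget for one retry."""
        if self.retries + 1 <= self.ratio * self.originals:
            self.retries += 1
            return True
        return False
```

The property that matters: when everything is timing out, retry volume is capped as a fraction of real demand instead of scaling with the failure rate.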
This is also where exponential backoff with jitter earns its reputation, but with an important correction: backoff is not enough. Without jitter, the fleet retries in synchronized waves. Hundreds or thousands of clients observe the same failure at roughly the same time, then reappear in lockstep. Jitter spreads the retry pressure so the downstream sees a smeared demand pattern rather than a pulse train of collective impatience.
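The standard "full jitter" form draws the delay uniformly from zero up to the exponential ceiling, which is what smears the pulse train:

```python
import random

def full_jitter_delay(retry_number: int,
                      base_ms: float = 100,
                      cap_ms: float = 5_000) -> float:
    """Exponential backoff with full jitter: sleep a uniform random time in
    [0, min(cap, base * 2^retry_number)]. A fleet that observed the same
    failure reappears smeared across the interval instead of in one pulse."""
    ceiling = min(cap_ms, base_ms * (2 ** retry_number))
    return random.uniform(0, ceiling)
```

Plain exponential backoff only stretches the interval between pulses; the jitter is what turns a synchronized wave into spread-out pressure.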
A strong judgment, because it is deserved: no-jitter retries in a large fleet are operational negligence dressed up as resilience.
The next move is to honor overload signals correctly. If the dependency is returning explicit throttles, queue-full responses, or circuit-open signals, those should usually suppress retries rather than trigger them. A 503 is not always an invitation to try again. Sometimes it is the system telling you, as clearly as it can, to stop helping.
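In code, this becomes a classification step that runs before any retry decision. The signal sets below are illustrative groupings, not a standard taxonomy:

```python
# Illustrative classification (assumed signal names, not a library's):
# which signals plausibly justify another attempt, and which are the
# dependency asking callers to stop.
RETRY_MAYBE = {"connect_timeout", "connection_reset", "http_502"}
BACK_OFF_HARD = {"http_429", "http_503_retry_after", "queue_full", "circuit_open"}

def should_retry(signal: str,
                 budget_available: bool,
                 deadline_remaining_ms: int) -> bool:
    """Retry only transient-looking faults, only with budget, only with time left."""
    if signal in BACK_OFF_HARD:
        return False                 # explicit overload: retrying is piling on
    return (signal in RETRY_MAYBE
            and budget_available
            and deadline_remaining_ms > 0)
```

The ordering is deliberate: overload signals veto the retry before the budget or deadline is even consulted.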
Circuit breakers belong in this discussion too. They can help by cutting off obviously failing paths and forcing callers to stop piling on. But teams often talk about them as if installing one solves retry amplification by default. It does not. A half-open circuit probed by a large fleet can create its own mini-storm. And an open circuit at one layer does not mean retries actually stopped across all the others. A circuit breaker is useful when it is part of a shared retry policy. Alone, it is just another local opinion.
Experienced teams also look for stacked retries and remove one layer. In most systems there should be a clearly chosen retry owner for a given hop. If the client SDK retries, maybe the service should not. If the message consumer retries aggressively, maybe the downstream call within the handler should not. If a queue worker is already redelivering, the SDK inside that worker usually should not also run its full retry ladder.
They also get serious about admission. Good teams do not rely on retries and circuit breakers alone. They add client-side concurrency limits, token-bucket admission, or adaptive concurrency controls so a weak dependency sees less speculative demand instead of merely better-spaced demand.
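A minimal client-side admission control is a non-blocking semaphore: when the dependency slows and the slots stay occupied, new speculative work is rejected immediately instead of queueing behind it. A sketch:

```python
import threading

class ClientConcurrencyLimit:
    """Client-side admission (sketch): once `limit` calls to the dependency
    are in flight from this process, new work fails fast rather than
    joining a queue whose wait time is already past anyone's deadline."""
    def __init__(self, limit: int):
        self._slots = threading.BoundedSemaphore(limit)

    def try_acquire(self) -> bool:
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()
```

This is the distinction from backoff: a limiter reduces the demand a weak dependency sees, rather than merely spacing the same demand more politely.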
Then comes correctness containment. On any path with non-idempotent side effects, teams confirm the deduplication boundary immediately. Do retries reuse the same idempotency key across attempts? Is the dedupe record written before external side effects, after them, or as part of the same atomic unit? How long is the dedupe memory retained? If the answer is “we think the endpoint is idempotent because it usually is,” the incident has already changed category from reliability to data repair.
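A dedupe boundary that can answer those questions looks roughly like this sketch. The in-memory dict stands in for a durable store, and the names are hypothetical; the real design problem is making the dedupe record and the side effect commit atomically:

```python
class ChargeHandler:
    """Sketch of an intent-level dedupe boundary. The dict stands in for a
    durable store; in a real system the dedupe record and the side effect
    must commit atomically (or the record must be written first with a
    recovery path), or a retry can still duplicate the charge."""
    def __init__(self):
        self._by_intent: dict[str, str] = {}
        self.charges_executed = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        if idempotency_key in self._by_intent:
            return self._by_intent[idempotency_key]  # same intent: replay stored outcome
        self.charges_executed += 1                   # stand-in for the external side effect
        result = f"charged:{amount_cents}"
        self._by_intent[idempotency_key] = result
        return result
```

Note what this protects and what it does not: the second attempt cannot double-charge, but it still consumes a connection, a thread, and queue time, which is why idempotency alone does not stop a storm.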
What changes at 10x scale is that retry policy stops behaving like a local library choice and starts behaving like capacity policy. At small scale, a bad retry policy can hurt one service. At 10x scale, the same policy changes shard occupancy, autoscaling behavior, database connection churn, queue drain time, and the size of the recovery backlog. More importantly, the time between first slowdown and irreversible feedback gets shorter. A team that had five or ten minutes to notice and react at smaller scale may have less than one minute once the fleet is large and the caller population is broad.
Prevention vs Mitigation vs Recovery
These need to be separated because teams often blur them into one tidy resilience story and end up underprepared in all three phases.
Prevention is design-time discipline. It includes aligning end-to-end deadlines so impossible late retries are not attempted. It includes choosing a retry owner per layer. It includes retry budgets, client-side concurrency limits, overload-aware error handling, circuit breakers that are calibrated to fleet behavior rather than local optimism, and exponential backoff with jitter. It includes explicit idempotency contracts on side-effecting operations, with real semantics around intent identity and retention window. It also includes standardizing retry behavior across languages and SDKs enough that the fleet does not become a patchwork of hidden policies.
Mitigation is incident-time control. This is where you aggressively reduce amplification. You may disable retries on the hottest path, return fast failures for low-priority traffic, temporarily lower concurrency to protect the datastore, pause replaying workers, extend queue visibility instead of redelivering immediately, or shed non-critical operations.
Recovery is where many teams discover the real cost. After the dependency stabilizes, duplicate attempts, orphaned tasks, delayed queue drains, and ambiguous outcomes remain. Recovery may involve replaying only dedupe-safe operations, reconciling payments, reissuing missed notifications, cleaning up half-completed workflows, and auditing inventory holds that were created during ambiguous execution windows.
The incident timer may say 18 minutes. The operational recovery may take two days.
For payments, that usually means actual human work. Operations has to ask the processor which authorizations really settled, compare those results against internal ledger entries, match or recover idempotency keys, identify the cases where the customer saw failure but money moved anyway, and decide which holds should be voided versus captured. The painful part is never the category label “duplicate charge.” It is the set of edge cases where the processor, the ledger, the order state, and the customer-visible UI all disagree for slightly different reasons.
Two caveats matter here.
First, retries are not categorically bad. A single retry after a short randomized delay can materially improve success during brief failovers or connection churn. Treating “never retry” as a mature position is as shallow as treating “always retry” as one.
Second, idempotency is not a complete answer. It can protect business correctness while doing almost nothing for overload survival. A perfectly idempotent payment endpoint can still be taken down by retry amplification. Correctness protection and overload protection are related but separate disciplines.
Serious teams eventually build per-hop attempt accounting, retry-budget enforcement, explicit idempotency-key observability, and queue replay controls into their platform. That level of machinery is overkill unless the path is both high-volume and high-consequence. On a low-volume internal admin tool, it may be excessive. On checkout, inventory, billing, or workflow orchestration, it is not.
The Operational Cost Nobody Budgets For
Teams usually budget for steady-state traffic and maybe for peak traffic. They rarely budget for self-generated traffic during partial failure.
That cost shows up in infrastructure first. More connections, more queue depth, more datastore pressure, more wasted work per successful business outcome. But the deeper cost shows up in operations. Someone has to explain why customer demand looked flat while request volume tripled. Someone has to determine whether “timed out” meant “not executed” or “executed but not acknowledged.” Someone has to reconcile duplicate charges, duplicate emails, duplicate ledger writes, or conflicting workflow transitions.
The hidden cost is not just extra requests. It is ambiguity.
Ambiguity means support tickets. It means manual reconciliation. It means follow-up scripts that are dangerous to run because they may replay a side effect that was only believed to have failed. It means postmortems that consume two teams instead of one because the retry owner and the side-effect owner were different groups with different assumptions.
There is also a capacity planning cost teams tend to miss. Retry storms distort observability enough that they can trick organizations into buying capacity for the wrong problem. More pods may help a little, but if the real issue is duplicated work against a saturated datastore, scaling stateless tiers just accelerates the rate at which pointless demand reaches the bottleneck. Autoscaling is especially deceptive here because it often responds after queue depth and latency have already risen. New instances come online, warm caches, establish TLS sessions, open database connections, and immediately join the same retry policy.
Instead of relieving the downstream, they can worsen cold-start cost and connection pressure during the exact interval the dependency needed fewer callers, not more.
This is another lesson teams learn late: the recovery bill is often paid by people, not systems.
The most expensive retry is often the one that succeeded after it stopped being useful.
Systems do not usually drown in original demand first. Under stress, they drown in their own attempt to be helpful.
What This Changes About How You Design
The design consequence is smaller than teams think and stricter than they want.
You need one retry owner per hop, not several. You need retry budgets that are visible at fleet scope, not just max-attempt counters buried in clients. You need deadlines that compose, so late retries never get launched just to create more work. You need idempotency boundaries defined in business terms, not HTTP-method folklore. And you need overload controls that reduce demand when a dependency weakens instead of merely pacing the demand more politely.
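Deadlines that compose can be as small as one class, sketched here with an injectable clock for testability: a sub-call inherits the minimum of its own budget and the parent's remaining budget, so a retry that cannot matter upstream is never launched downstream.

```python
import time

class Deadline:
    """Composable deadline (sketch). A child call may never outlive its
    parent, so impossible late retries are refused before they are sent."""
    def __init__(self, budget_s: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def child(self, budget_s: float) -> "Deadline":
        # Inherit: a sub-call gets min(its own budget, parent's remainder).
        return Deadline(min(budget_s, self.remaining()), self._clock)

    def expired(self) -> bool:
        return self.remaining() == 0.0
```

In practice this is the same contract that RPC frameworks with deadline propagation enforce on the wire; the sketch just makes the composition rule explicit.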
If a design review cannot answer those questions concretely, the system does not have a retry strategy. It has a collection of optimistic defaults.
Retries are not primarily a request convenience. They are a load-multiplication mechanism that activates during failure.
Sometimes that multiplication is worth it. A brief failover, a transient reset, a short-lived routing flap can justify another attempt, especially with tight budgets and jittered backoff. But when a dependency is slow rather than absent, retries often do exactly the wrong thing. They increase arrival rate while service time is already stretched, create overlapping work, inflate queue depth, keep replay systems alive when they should back off, and export waiting pressure to every upstream layer.
At small scale, the same policy may only bruise the system. At large scale, it can become the main event. A 2 second slowdown that is survivable with a few callers can become fatal with hundreds of instances, stacked retries, queue redelivery, full connection pools, pinned thread pools, rising queue age, inconsistent SDK behavior, and autoscaling that joins too late to help.
Idempotency matters because ambiguous completion is common and non-idempotent side effects turn transient failure into business damage. But idempotency is not a blanket safety certificate. It does not protect the dependency from collapse, and it does not magically cover every downstream effect unless the boundary was designed that way explicitly.
The mature posture is not anti-retry. It is anti-naive retry.
Experienced engineers know that the real question is never just whether another attempt might succeed. The real question is whether helping this request is worth the load, delay, replay pressure, and correctness risk imposed on the rest of the system at the worst possible moment.