Follow a request from the edge inward and ask a hard question at each step: where is the first point where rejecting this request would still matter?
The request arrives at the gateway. If the limiter rejects here, you save everything behind it: auth CPU, routing work, cache lookups, database calls, downstream fanout. This is the cheapest place to reject. It is also the coarsest place to measure.
The request passes the limiter and hits authentication. In some systems that means local token verification. In others it means cache lookup, key fetch, or token introspection. Already, not all requests are equally cheap.
The request enters the service and goes through request shaping. Maybe it expands IDs, looks up tenant policy, or applies feature flags. Then it goes to cache. If there is a hit, the request may cost 3 milliseconds and almost no scarce resources. If there is a miss, it may fan out to a dependency that is currently slow.
This is where rate limiting gets misunderstood. The gateway limited “requests.” The system is really split into at least two work classes: cheap hits and expensive misses. When cache hit rate falls from 98 percent to 85 percent because of a hot deploy, eviction wave, or a cold region, the cost per admitted request changes radically while the limiter keeps enforcing the same number.
A small-scale example makes this concrete. Imagine an internal metadata service doing 400 RPS. Under normal conditions, 95 percent of requests hit cache and complete in 4 milliseconds. The remaining 5 percent miss and hit Postgres at 25 milliseconds. At that scale, a simple per-instance token bucket of 150 RPS across three instances is fine. Even if one instance sees a temporary burst, the database still has headroom. Now imagine a deploy bug drops cache hit rate to 75 percent. Traffic is still only 400 RPS. The limiter still looks generous. But database traffic has grown from 20 miss-path requests per second to 100. The edge looks calm. The database is the part that is now lying on the floor.
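The arithmetic behind that shift is worth making explicit. A minimal sketch, using the numbers from this example:

```python
def miss_path_rps(total_rps: float, hit_rate: float) -> float:
    """Requests per second that fall through the cache to the database."""
    return total_rps * (1.0 - hit_rate)

# Same 400 RPS at the edge, very different database load:
normal = miss_path_rps(400, 0.95)        # ~20 miss-path requests/s
after_deploy = miss_path_rps(400, 0.75)  # ~100 miss-path requests/s, a 5x jump
```

The limiter meters `total_rps`. The database experiences `miss_path_rps`. Those two numbers can diverge by 5x with no change in admitted traffic.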
Now take the same path through the lens of a degraded dependency. The downstream inventory service normally responds in 25 milliseconds. Today it is at 300 milliseconds because of disk pressure. Your service holds worker threads, memory, and sockets open 12 times longer for the same logical request. A gateway limiter sized for healthy days will now admit far more work than the degraded day can hold safely.
That is why rate limiting does not fix slow dependencies. Slow dependencies change occupancy. Rate limits regulate arrivals. Those are related. They are not interchangeable.
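Little's Law makes the arrivals-versus-occupancy distinction precise: average in-flight work equals arrival rate times residence time. A sketch, assuming an illustrative 1,000 RPS admitted rate:

```python
def in_flight(arrival_rps: float, latency_s: float) -> float:
    """Little's Law: average concurrency = arrival rate x residence time."""
    return arrival_rps * latency_s

healthy = in_flight(1000, 0.025)   # ~25 requests in flight at 25 ms
degraded = in_flight(1000, 0.300)  # ~300 in flight at 300 ms
# Same admitted rate, 12x the threads, sockets, and memory held.
```

A rate limit holds `arrival_rps` constant. A slow dependency multiplies `latency_s`, and occupancy grows with it regardless of what the limiter enforces.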
This is where teams start arguing with the graph instead of the system.
The dashboard usually does not begin with a clear “rate limiter misconfigured” graph. It begins with p99 rising on one endpoint, then a drop in cache hit rate, then maybe a jump in connection pool wait time, then retry traffic, then CPU growth in the app tier, then tail latency spreading broadly to unrelated endpoints because worker pools and event loops are shared. By the time someone says, “but traffic is only 3,200 RPS and the limit is 10,000,” the wrong mental model has already cost the team 20 minutes.
In production, the first graph that looks better is often the one least connected to the failure.
There is another subtle failure here. Suppose you place the limiter after authentication and tenant policy resolution because you want fairer per-tenant accounting. Reasonable. But if auth is CPU-heavy or calls a remote key service, you may be rejecting too late. Your accounting got better. Your protection got worse.
Late rejection can improve fairness while worsening survivability.
That is not theory. That is scar tissue.
At small scale, teams often start with one or two simple controls.
A local in-process token bucket per instance. Maybe a gateway-level per-API-key limit. Maybe a global cap to prevent obvious overload. This is often enough for a single-region service with moderate traffic, low fanout, and a relatively uniform cost profile.
Imagine an internal service doing 300 RPS across four instances. Each request is a fast cache lookup plus an occasional DB read. A local rate limit of 100 RPS per instance gives rough fairness, no shared limiter dependency, and low operational overhead. One instance may admit a bit more than its fair share during a short burst. That is acceptable because the service has headroom.
At 10x scale, the architecture changes even if the algorithm does not. At 3,000 RPS instead of 300, the same service is no longer mostly about protecting CPU on app nodes. It starts being about protecting the narrowest internal resource: database connections, miss-path concurrency, queue age, shard skew, and the percentage of requests that fall off the fast path.
At 300 RPS, a blunt global limit mostly guards against obvious accidents. At 3,000 RPS, that same blunt limit starts hiding which work is expensive, which tenant is dominant, and which endpoint is actually spending the bottleneck.
Now move to a larger-scale example. A public API is doing 40,000 RPS across three regions. Tenants differ wildly in size. One endpoint triggers expensive fanout to a search cluster. Another is mostly cacheable. Certain customers run nightly bulk sync jobs. Traffic is spiky around top-of-hour schedules. Some clients retry aggressively on 429s. At that point, a single limiter is not architecture. It is a comfort blanket.
The architecture usually evolves in layers:
At the edge, coarse global and per-client controls to stop gross excess demand.
At the service boundary, per-endpoint or weighted limits because request classes have different costs.
At fragile dependencies, concurrency caps because the safe variable is active work, not arrivals.
For multi-tenant fairness, per-tenant budgets so one customer cannot spend the whole fleet.
For critical traffic, reserved capacity by class so checkout, writes, or control-plane paths do not compete on equal terms with reports, exports, or feed refreshes.
One thing teams learn only at larger scale is that burst amplitude matters more than average load. A service that is comfortable at 20,000 flat RPS may still fail at 12,000 average RPS if that average is produced by 60,000 RPS spikes for two seconds every minute. Bursts blow through caches, synchronize misses, compress queueing delay into short windows, and concentrate expensive work in ways averages hide.
As systems scale, the limiting problem stops being “how much traffic can we take?” and becomes “which scarce thing are we willing to let traffic consume?”
That is the right question. It is also the one rate limiting is often used to avoid.
The algorithms matter, but mostly because of the failure they allow you to tolerate.
Token Bucket
Use token bucket when short bursts are acceptable and the downstream path has real headroom. It is the practical default when you want to enforce an average rate without making every micro-spike visible to callers.
Avoid it when short bursts are exactly what create the failure: cache stampedes, synchronized misses, lock contention, or queue spikes against a fragile dependency.
What token bucket is really saying is: we are willing to spend stored slack on short demand spikes.
The hidden design choice is the bucket size. Teams like discussing refill rate because it sounds like policy. Bucket size is usually the more dangerous parameter because it decides how much instantaneous shock you permit into the system.
If your system handles 2,000 steady RPS and folds under a 5-second 8,000 RPS burst because cache misses synchronize and downstream queues spike, the bucket is not a convenience setting. It is the outage.
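A minimal in-process token bucket makes the two knobs visible. `rate` is the refill policy everyone debates; `capacity` is the burst shock you are agreeing to admit. This is an illustrative sketch, not production-hardened code:

```python
import time

class TokenBucket:
    """Minimal in-process token bucket (illustrative sketch)."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second: the average you enforce
        self.capacity = capacity  # max stored tokens: the instantaneous burst you permit
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A bucket with `rate=2000, capacity=2000` and one with `rate=2000, capacity=40000` enforce the same average. Only the second one signs you up for a twenty-fold larger instantaneous shock.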
Sliding Window
Use sliding window when quota semantics and fairness perception are part of the product. It is valuable for external APIs, paid plans, and places where “100 requests per minute” needs to mean roughly what the customer thinks it means.
Avoid it when the real problem is protecting a hot dependency. A beautifully precise quota at the edge is still the wrong tool if the thing dying is a search pool, a shard, or a tenant-local hotspot inside the service.
Sliding window is very good at solving the accounting problem. Teams often mistake that for solving the protection problem. And exactness gets expensive faster than teams expect. Once quota checks sit on a cross-region or shared-store hot path, you are paying real latency and real fragility for policy precision.
Its weakness is operational. Exact distributed sliding windows push you toward shared state, coordination, and limiter-store dependencies. That is worth paying for when fairness precision is the point. It is often the wrong payment when survivability is the point.
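A common middle ground is the approximate sliding window: weight the previous fixed window's count by its remaining overlap instead of storing per-request timestamps. A single-node sketch of the idea, with an explicit `now` parameter for testability:

```python
import time

class ApproxSlidingWindow:
    """Approximate sliding-window counter (illustrative sketch)."""
    def __init__(self, limit: int, window_s: float):
        self.limit = limit
        self.window = window_s
        self.curr_start = 0.0
        self.curr = 0   # count in the current fixed window
        self.prev = 0   # count in the previous fixed window

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        if now - self.curr_start >= self.window:
            # Roll windows; if we skipped a whole window, prev is 0.
            self.prev = self.curr if now - self.curr_start < 2 * self.window else 0
            self.curr = 0
            self.curr_start = now - ((now - self.curr_start) % self.window)
        # Weight the previous window by how much of it still overlaps.
        overlap = 1.0 - (now - self.curr_start) / self.window
        estimated = self.prev * overlap + self.curr
        if estimated < self.limit:
            self.curr += 1
            return True
        return False
```

The approximation bounds error with two counters instead of per-request state, which is why it distributes far more cheaply than an exact window.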
Leaky Bucket
Use leaky bucket when delay is acceptable and explicit, especially for asynchronous workloads, background jobs, or notification pipelines where downstream systems prefer a steady rate over abrupt spikes.
Avoid it in synchronous latency-sensitive APIs unless you are willing to own the queueing semantics. Otherwise you are not protecting the user experience. You are translating overload into waiting.
Leaky bucket is about smoothing, not mercy.
At larger burst amplitudes, this becomes a policy choice about queue age, memory growth, stale work, and cancellation semantics. Smoothing 500 extra requests is one thing. Smoothing 50,000 extra requests is a statement about how much old work the system is willing to drag behind it.
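A sketch of a leaky bucket that treats queue bound and queue age as explicit policy rather than accidents, with hypothetical parameters:

```python
from collections import deque

class LeakyBucket:
    """Leaky bucket as a bounded FIFO drained at a fixed rate (illustrative sketch).
    max_queue and max_age_s are the policy choices: how much old work
    the system is willing to drag behind it."""
    def __init__(self, drain_rps: float, max_queue: int, max_age_s: float):
        self.drain_rps = drain_rps
        self.max_queue = max_queue
        self.max_age = max_age_s
        self.queue = deque()  # (enqueue_time, work_item)

    def offer(self, item, now: float) -> bool:
        if len(self.queue) >= self.max_queue:
            return False  # shed at the door instead of queueing forever
        self.queue.append((now, item))
        return True

    def drain(self, now: float, dt: float) -> list:
        """Release up to drain_rps * dt items, dropping any that aged out."""
        released = []
        budget = int(self.drain_rps * dt)
        while self.queue and budget > 0:
            t, item = self.queue.popleft()
            if now - t > self.max_age:
                continue  # stale work: cancel rather than run late
            released.append(item)
            budget -= 1
        return released
```

Every parameter here is a statement about old work: the bound decides who waits, the age decides who is quietly cancelled.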
This is the center of the article.
Global limits versus local truth.
A global limit is easy to reason about and easy to explain. It also ignores skew. A system can be under the global budget and still have one overloaded partition, one exhausted pool, or one starved tenant.
Per-user versus per-tenant fairness.
Per-user sounds fair until one tenant has 50,000 users and another has 200. Per-user limits protect against one user going wild. They do not necessarily protect shared infrastructure from one large customer. Per-tenant limits better match infrastructure impact in many SaaS systems, but they can punish healthy internal diversity within that tenant.
Per-endpoint limits versus policy sprawl.
Per-endpoint policies reflect real cost better, especially when one endpoint fans out to four services and another is mostly cacheable. But every new dimension multiplies tuning burden, observability needs, and incident complexity.
Burst tolerance versus downstream stability.
Burst tolerance is not generosity. It is a bet about what the internals can absorb. A larger token bucket may improve client experience on healthy days and worsen outage behavior on degraded ones.
Immediate rejection versus queued admission.
Immediate rejection feels harsh. Queuing feels graceful. In many latency-sensitive systems, queued admission is the more destructive choice because it hides overload until timeouts and retry storms spread it outward.
Precision versus survivability.
Highly consistent distributed limiting gives better fairness semantics. It also creates a critical dependency. When the limiter store is slow or unavailable, you now have to choose whether to fail open or fail closed.
Here is the opinionated judgment: if the goal is keeping a fragile dependency alive, I would rather have a rough limiter near the bottleneck plus a hard concurrency cap than a beautifully precise global quota at the edge.
Precision matters when fairness is the product. Near-bottleneck truth matters when survivability is the goal.
Adaptive limiting deserves a note here because it is the next thing a serious reader should ask about. Dynamic limits that respond to dependency health, queue depth, or latency can close part of the gap static rate limits leave open. They help when the safe boundary moves with system state. They are also fragile in their own way. If your feedback signal is noisy or delayed, adaptive limiting oscillates, chases yesterday’s failure, or tightens exactly when recovery needs room. Adaptive admission is powerful. It is not magic. It is a control loop, and bad control loops create their own incidents.
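The classic shape of such a control loop is additive-increase/multiplicative-decrease, the same feedback structure TCP congestion control uses. A sketch:

```python
class AIMDLimit:
    """Additive-increase / multiplicative-decrease admission limit
    (illustrative sketch of an adaptive control loop)."""
    def __init__(self, limit: float, floor: float = 1, ceiling: float = 1000):
        self.limit = limit
        self.floor = floor
        self.ceiling = ceiling

    def on_success(self) -> None:
        # Probe upward slowly while things look healthy.
        self.limit = min(self.ceiling, self.limit + 1)

    def on_overload(self) -> None:
        # Back off hard on any overload signal (timeouts, queue growth, brownout).
        self.limit = max(self.floor, self.limit * 0.5)
```

All the hard work lives outside this class, in deciding what counts as an overload signal. Feed it a noisy or delayed one and the loop oscillates or tightens at exactly the wrong moment.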
Two caveats matter.
In adversarial environments, identity precision and edge enforcement matter more because the system is defending against clients who are actively trying to exploit policy.
And some businesses genuinely need hard tenant quotas for billing or contract reasons. In those cases, exactness is not vanity. It is policy enforcement. Just do not confuse policy enforcement with runtime protection.
Failure Modes
The obvious failure mode is “limit set too high.” That is not the interesting one.
The interesting failures are the ones where the limiter appears to be doing its job.
Failure chain 1: inbound demand is controlled, downstream concurrency still collapses
A team sets the gateway limit correctly for normal demand. The edge stays under 8,000 RPS. The limiter returns clean 429s on overflow. Everyone feels safer.
What the dashboard shows first is modest improvement at the edge. Inbound RPS flattens. Rejection counts rise in a neat line. CPU at the gateway looks stable. The limiter graph looks healthy.
What is actually broken first is deeper in the path. The recommendation dependency goes from 40 millisecond p99 to 450 millisecond p99. Connection occupancy rises. In-flight requests pile up. Application worker slots stay busy longer. Queue age inside the service starts growing before global RPS looks alarming.
The early signal is not the edge. It is queue wait time, dependency concurrency, or rising pool acquisition latency on one downstream call path.
Immediate containment is not “tighten the edge limit and hope.” It is capping concurrency into the slow dependency, shedding the expensive endpoint class, or degrading that path to cached or partial results.
The durable fix is to separate ingress protection from dependency protection. Keep the gateway limiter for excess demand, but add a tighter admission boundary around the fragile dependency or expensive miss path.
Most outages do not start with a limiter set to nonsense. They start with a limiter set to a number that used to be true.
Longer-term prevention means measuring safe concurrency under degraded latency, not just normal throughput. If the dependency is safe at 500 concurrent calls on a good day and only 120 during a brownout, your production limits need to reflect the degraded state you intend to survive.
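That admission boundary is often just a hard cap on in-flight calls into the fragile dependency. A sketch using a non-blocking semaphore, sized to the degraded-state number from the paragraph above:

```python
import threading

class ConcurrencyCap:
    """Hard cap on in-flight calls into one dependency (illustrative sketch).
    Sized for the degraded state you intend to survive, not the healthy day."""
    def __init__(self, max_in_flight: int):
        self._sem = threading.BoundedSemaphore(max_in_flight)

    def try_acquire(self) -> bool:
        # Non-blocking: overflow is shed or degraded, never queued here.
        return self._sem.acquire(blocking=False)

    def release(self) -> None:
        self._sem.release()

# Brownout sizing from the text: 120 concurrent calls, not 500.
cap = ConcurrencyCap(120)
```

Callers wrap the dependency call in `try_acquire` / `release` (release in a `finally` block) and shed or degrade on `False`. Unlike an RPS limit, this bound tightens automatically as the dependency slows, because slower calls hold their slots longer.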
Retry amplification belongs here because it makes this failure shape worse in a hurry. Suppose 10% of requests are rejected or timed out, and 80% of those retry within one second. A nominal 8,000 RPS stream becomes 8,640 effective arrivals almost immediately. If the retry wave itself sees 10% failure and the same retry behavior, the next wave adds another 51 RPS. The exact numbers vary. The point does not. Once a slow path starts failing, “only a small percentage retried” is often enough to erase the margin your limiter was supposed to preserve.
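The amplification is a geometric series, and it is worth being able to reproduce the numbers. A sketch:

```python
def effective_arrivals(nominal_rps: float, failure_rate: float,
                       retry_fraction: float, waves: int = 10) -> float:
    """Total arrival rate when each wave of failures spawns a retry wave."""
    total = nominal_rps
    wave = nominal_rps
    for _ in range(waves):
        wave = wave * failure_rate * retry_fraction  # failures that come back
        total += wave
    return total

# 8,000 RPS nominal, 10% failing, 80% of failures retrying:
# the first wave adds 640 RPS, the next ~51, and so on.
```

With these numbers the series converges near nominal / (1 − 0.08) ≈ 8,696 RPS. The multiplier looks small until you remember the limiter's margin was sized without it.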
Failure chain 2: one noisy tenant stays “within policy” and still burns shared internals
The platform enforces per-user limits. One enterprise tenant has 20,000 active users and launches a bulk sync. No individual user violates quota. Total platform traffic remains below the global cap.
What the dashboard shows first is deceptive normality. Total RPS looks acceptable. Rejection rate stays low. Per-user quota enforcement looks clean.
What is actually broken first is tenant-local concentration. A single tenant shard, a per-tenant cache namespace, one rate-heavy index, or a shared fanout path saturates well before the whole fleet does.
The early signal is tenant-specific latency skew, shard imbalance, or one backend partition developing queue growth while global graphs still look calm.
Immediate containment is imposing a tenant budget, slowing or pausing the bulk workflow, and preserving separate capacity for interactive traffic.
The durable fix is aligning limits to infrastructure cost. If one tenant can consume 35 percent of internal search capacity while still being “fair” by user, then user identity is the wrong control surface.
This is usually the point where the policy still looks principled and the system already knows it is being cheated.
Longer-term prevention means per-tenant quotas, workload classes, and reserved priority bands so bulk exports do not compete directly with interactive reads.
Failure chain 3: the global limiter works, but priority traffic still loses
A team adds a global edge rate limit and declares overload protection done. The system caps total requests at 12,000 RPS. During the next surge, low-value traffic and high-value traffic are both admitted until the bucket empties.
What the dashboard shows first is clean policy compliance. Total traffic stays below 12,000. Rejected request counts rise as expected. Alerting says the limiter is active.
What is actually broken first is service value allocation. Checkout, writes, or control-plane actions are still competing with feed refreshes, exports, or analytics pulls. The limiter is protecting total volume, not useful work.
The early signal is user pain in high-value flows while the system-wide rate story still looks controlled.
Immediate containment is reserving capacity by traffic class and shedding low-value work first.
The durable fix is admission by priority class, not just by aggregate arrival count.
Longer-term prevention means deciding in advance which work deserves survival. Mature systems do not ask checkout and report generation to fail together in the name of fairness.
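Deciding in advance can be as simple as a reserved budget per class. A sketch of the admission rule, with made-up budget numbers:

```python
class PriorityBudget:
    """Admission by class (illustrative sketch): critical traffic draws on a
    reserved budget that bulk traffic can never touch."""
    def __init__(self, shared: int, reserved_critical: int):
        self.shared = shared        # slots any class may use
        self.reserved = reserved_critical  # slots only critical work may use

    def admit(self, critical: bool) -> bool:
        if self.shared > 0:
            self.shared -= 1
            return True
        if critical and self.reserved > 0:
            self.reserved -= 1
            return True
        return False
```

Under surge, bulk work exhausts the shared pool and is shed first, while checkout and control-plane traffic still clear through the reserve. The design choice is the ratio, and it is a product decision, not a tuning detail.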
Failure chain 4: rejection looks healthy while queue age and lock contention get worse
This is the most misleading graph pattern in the category.
The edge limiter rejects 15 percent of traffic. Gateway throughput is flat. Error rate from the limiter is tidy and attributable. On a superficial dashboard, it looks like the system is protecting itself exactly as designed.
Meanwhile, admitted work is still landing on a small set of locked rows, a hot cache key, or a shared worker pool. Lock wait time rises. Queue age rises. Workers get occupied by requests that already passed the limiter and are now stuck contending on the real bottleneck.
The early signal is not request rate. It is queue age, worker starvation, lock wait time, or queue completion lag on a specific internal subsystem.
Immediate containment is reducing admission for the contended work shape specifically. That may mean disabling one write-heavy endpoint, putting a guardrail on one object ID or tenant, or applying a tighter limiter around the lock-bearing path.
The durable fix is limiting the unit that actually contends. If lock contention is the failure surface, a generic front-door RPS limit is the wrong instrument.
Clean rejection graphs have ended more than one discussion that should have continued for another fifteen minutes.
Longer-term prevention means putting queue age and lock contention on the main dashboard. Teams trust the graphs they can see.
Failure chain 5: per-user limits are fair to users and unfair to infrastructure
Per-user limits are seductive because they look equitable. Every user gets the same quota. The problem is that equal request counts do not imply equal infrastructure cost.
One user fetches cached account metadata. Another triggers a report that scans partitions, fans out to a search tier, and writes to object storage. Both count as one request against policy.
What is actually broken first is cost asymmetry. One class of requests consumes 20x more CPU, memory, downstream calls, or queue time than another. The limiter meters identity, not work.
Immediate containment is separating expensive endpoints into their own budget or putting them behind weighted or concurrency-aware controls.
The durable fix is metering something closer to cost. Per-endpoint quotas, weighted tokens, separate work classes, or per-tenant budgets are all better answers than pretending uniform request count reflects real demand.
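Weighted metering is a small change to the admission interface: charge by estimated cost instead of by request. A sketch with hypothetical endpoints and weights:

```python
# Hypothetical cost table: a cached read versus an expensive export.
COST = {"GET /account": 1, "POST /search/export": 20}

def admit(bucket_tokens: float, endpoint: str):
    """Charge the bucket by estimated work, not by request count.
    Returns (admitted, remaining_tokens)."""
    cost = COST.get(endpoint, 1)  # unknown endpoints default to cheap
    if bucket_tokens >= cost:
        return True, bucket_tokens - cost
    return False, bucket_tokens
```

The weights will be wrong at first. Even rough weights beat pretending a cached read and a partition-scanning export are the same unit of demand.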
And there is a caller-side cost here that teams often ignore. A 429 is not free. It may be a failed checkout, a webhook that now needs replay, or a batch workflow that will retry later and create more load. Where you reject work determines who pays for protection. Good systems make that trade on purpose.
Failure chain 6: the front door is protected, internal fanout is not
This shows up in aggregator APIs all the time. The public edge has a strong limiter. A request gets admitted, then fans out to five downstream calls and also triggers asynchronous side work.
What the dashboard shows first is protected ingress. The API edge is calm. Client arrivals are capped. Rejection rate is understood.
What is actually broken first is internal amplification. One admitted request may still generate five search calls, two cache refreshes, and a background write. The limiter controlled front-door requests, not total work created behind the door.
The early signal is internal call volume diverging from ingress volume. One downstream sees 4x growth while edge traffic barely moves.
Immediate containment is bounding fanout, disabling secondary side effects, or cutting optional downstream calls from the admitted path.
The durable fix is limiting or capping fanout edges themselves, not just the customer-facing ingress path.
Cooperative clients help, but only up to a point. Well-behaved SDKs with backoff, client token budgets, or local circuit breakers reduce damage. They do not remove the need for server-side enforcement, because not all clients are cooperative and even good clients become dangerous when many of them follow the same retry script at once.
Failure chain 7: the limiter itself becomes fragile infrastructure
Distributed rate limiting fails in ways teams underestimate. A shared Redis or centralized quota service starts taking cross-region traffic. Under load or packet loss, quota checks become slower or inconsistent.
What the dashboard shows first is confusing behavior. Some instances reject more than others. One region is stricter. Another is permissive. Client complaints sound random because enforcement is no longer stable.
What is actually broken first is the control plane. The limiter store is now a hot-path dependency. Enforcement consistency degrades before the service itself is obviously overloaded.
The early signal is rising quota check latency, cache miss rate against the limiter store, or divergence in rejection patterns across nodes and regions.
Immediate containment depends on the failure policy. Teams either fail open and protect user experience while risking overload, or fail closed and protect internals while accepting higher visible rejection.
The durable fix is matching limiter architecture to the value of precision. For runtime protection, local or near-local approximation is often safer than globally coordinated exactness.
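Whichever policy a team picks, it should be explicit in code rather than an accident of exception handling. A sketch, where `check_quota` is a hypothetical stand-in for a call to a shared limiter store:

```python
def admit(check_quota, key: str, fail_open: bool) -> bool:
    """Admission check with an explicit policy for limiter-store failure."""
    try:
        return check_quota(key)  # normal path: ask the shared store
    except (TimeoutError, ConnectionError):
        # The limiter store is now the failing dependency.
        # fail_open=True:  protect user experience, risk overload.
        # fail_open=False: protect internals, accept visible rejection.
        return fail_open
```

The point is not the three lines of code. It is that someone chose the boolean on purpose, wrote it down, and can defend it during the incident.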
Here is the explicit failure chain teams routinely misread:
Inbound demand rises from 7,000 to 9,500 RPS during a product launch.
The edge limiter caps traffic at 8,000 RPS and cleanly rejects the rest.
The gateway dashboard looks disciplined. Rejections are clear. CPU is stable.
On-call believes the limit is protecting the system.
But the admitted mix has shifted. Cache misses doubled, one endpoint now dominates, and each admitted request holds search connections 6 times longer than normal.
Queue age rises inside the service. Workers starve. Retries from admitted requests increase.
Users see timeouts and stale results even though limiter behavior looks healthy.
The wrong boundary was protected. Ingress was controlled. The bottleneck was not.
Blast Radius and Failure Propagation
Rate limiting mistakes are dangerous because they distort where pain shows up.
A multi-tenant API serves 50,000 RPS across three regions. Most traffic is light reads, but one endpoint called /search/export is expensive and hits a shared search cluster plus object storage. The platform has a global 60,000 RPS limiter and a per-user cap. No per-endpoint or per-tenant controls.
At 09:00, one large tenant starts an export job from 8,000 distinct users. No per-user limit is violated. Global traffic rises only to 54,000 RPS, still under the ceiling. The search cluster saturates first. Query latency rises from 80 milliseconds to 1.2 seconds. Application workers begin holding search futures longer. Memory pressure rises from buffered responses. Garbage collection gets worse. Other search-backed endpoints slow even though their own traffic is unchanged. Timeouts trigger retries from mobile clients and SDKs. Gateway traffic rises, but the retry mix now includes previously healthy endpoints. Cache refresh jobs miss deadlines. Object storage write queues build because exports complete more slowly.
The early signal is not total request volume. It is export-path concurrency against search, rising object-storage queue delay, and tenant-local amplification from one workload shape.
Immediate containment is clamping or disabling export traffic, applying a tenant-specific cap, and preserving protected capacity for interactive reads. The durable fix is separate budgets by endpoint class and tenant, plus dependency-level concurrency limits on export fanout.
A useful limiter would have narrowed blast radius by identifying export traffic as a distinct class, capping its concurrency or per-tenant budget, and preserving capacity for interactive reads. The existing limiter preserved the appearance of order while allowing internal disorder to spread.
A bad limiter can increase blast radius indirectly by making operators trust the wrong graph.
Operational Complexity
You have to choose the enforcement boundary. Gateway, service, endpoint, dependency wrapper, queue consumer, or all of them.
You have to choose the identity model. IP, user, API key, tenant, endpoint, region, job type, or cost class.
You have to choose the action on rejection. Return 429 with a retry hint. Drop silently. Defer to a queue. Degrade to cached or partial results. Reserve a protected budget for critical traffic.
You have to set the number. That is usually where the model quietly breaks. Teams often derive limits from historical peak traffic. What they really need is safe admitted load under different dependency states. A service might safely handle 4,000 RPS when cache hit rate is 98 percent, 2,200 RPS when it is 92 percent, and only 900 RPS when one backend is in degraded mode. Static limits are blunt because capacity is state-dependent.
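In code, a state-dependent limit is just a function of observed state rather than a constant. A sketch using the illustrative numbers above:

```python
def safe_admission_rps(cache_hit_rate: float, backend_degraded: bool = False) -> int:
    """Illustrative state-dependent limit: capacity is a function of how much
    traffic falls off the fast path, not a single historical peak."""
    if backend_degraded:
        return 900   # one backend in degraded mode
    if cache_hit_rate >= 0.98:
        return 4000  # healthy fast path
    if cache_hit_rate >= 0.92:
        return 2200  # eroded fast path
    return 900       # mostly miss-path traffic
```

The thresholds here are made up. The shape is the point: a static 4,000 RPS limit is only correct in one of these states, and the system does not stay in that state.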
You also need observability that tells the truth. “Rejected requests per second” is not enough. You need to know:
what class of work was rejected
what class was admitted
what happened to dependency concurrency
whether tail latency for priority traffic improved
whether retries rose or fell
whether admitted traffic shifted toward more expensive paths
whether queue age, worker starvation, or lock wait time improved after limiting
whether one tenant or endpoint is consuming disproportionate internal budget
Without that, rate limiting becomes numerology.
At larger scale, the limiter itself becomes infrastructure. The moment you need globally shared state for quotas, cross-region coordination for fairness, or central storage for sliding windows, rate limiting starts consuming reliability budget of its own. A limiter checked 100,000 times per second is no longer policy code. It is a distributed system on your hot path.
Operationally, teams make a category error here. They monitor whether the limiter is enforcing policy. They do not monitor whether enforcement improved the health of the intended bottleneck. Those are different questions.
The common mistakes are not mysterious. They are substitutions engineers make under pressure.
Counting what is easy instead of what is expensive.
Putting the limiter where implementation is convenient instead of where rejection changes fate.
Using a global limit for mixed-cost traffic and acting surprised when cheap traffic gets rejected while expensive traffic still walks into the bottleneck.
Calling per-user quotas “fair” in systems where the real spender is the tenant, the endpoint, or the workflow.
Ignoring retries, internal fanout, and async side effects because they are not visible in the first ingress graph.
Treating clean rejection graphs as evidence that protection worked.
The uglier reality is that rate limiting often arrives in the design before anyone can name the first scarce resource with confidence.
That is usually the tell.
The pattern underneath all of them is the same: rate limiting gets used as a substitute for understanding the real bottleneck.
Use rate limiting when you need to control excess incoming demand before it enters expensive parts of the system.
Use token bucket when brief bursts are acceptable. Use sliding window when quota semantics are part of the product. Use leaky bucket when delay is acceptable and smoothing is the goal.
Use per-tenant or per-endpoint controls when infrastructure cost is not evenly distributed across callers or routes.
Use rate limiting as one layer among others when the system has multiple failure surfaces. In real systems, it is most useful when paired with concurrency control, queue bounds, and priority classes.
Do not use rate limiting as the primary fix for slow dependencies. Do not expect it to solve hot keys, shard hotspots, or lock contention. Do not rely on global limits for mixed-cost traffic when one endpoint or one tenant is clearly more expensive. Do not queue synchronous overflow just to avoid returning 429s if the likely result is timeout cascades and retry storms.
And do not comfort yourself with “we have a rate limiter” during incident analysis. That sentence has hidden more real bottlenecks than it has solved.
Senior engineers start by locating the first scarce resource, not by choosing an algorithm.
Then they separate goals. Abuse control is not fairness. Fairness is not downstream protection. Billing enforcement is not overload management. Once those goals are separated, the architecture gets clearer. One limiter might live at the edge for client-level quotas. Another might enforce per-tenant budgets. A third might cap concurrent requests into a recommendation backend. That is not redundancy. It is alignment.
They also think about ownership. Somebody has to own the limit, own the exceptions, own the degraded-mode behavior, and own the question “what happens when the limiter is wrong?” A surprising amount of bad rate limiting is not technical failure. It is organizational ambiguity wearing technical clothes.
Most importantly, senior engineers are willing to reject valuable work on purpose if that is what preserves essential work.
The point of admission control is not politeness. It is selective survival.
Rate limiting is useful, but it is narrow.
Find the first bottleneck.
Measure the real unit of work.
Reject early enough to change the system’s fate.
Protect the scarce thing, not the abstract idea of traffic.
That is where rate limiting stops being a checkbox and becomes architecture.