A junior engineer implements rate limiting: `if (count > limit) return 429;`
A Staff engineer asks: “What’s your rate? Per what window? Per user? Per tenant? And where do rejected requests go?”
Rate limiting and backpressure are the immune system of distributed systems. They prevent cascading failures by rejecting work early, propagating overload signals, and maintaining system stability under load.
1. Rate Limiting Algorithms
Token Bucket
The most commonly used algorithm. Think of it as a “burst allowance.”
How it works:
- A bucket holds up to N tokens.
- Tokens are added at a rate of R tokens per second.
- Each request consumes 1 token.
- If the bucket is empty, reject the request (HTTP 429).
Characteristics:
- Allows bursts of up to N requests.
- Smooths out to R requests/second over time.
- Used by: AWS API Gateway, Google Cloud Armor.
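A minimal sketch of the idea in TypeScript, assuming a lazily refilled bucket (tokens are topped up on each call rather than by a background timer); `TokenBucket` and `tryAcquire` are illustrative names, not a specific library:

```typescript
// A bucket of capacity N, refilled at R tokens/second.
// Refill happens lazily on each call instead of via a timer.
class TokenBucket {
  private tokens: number;
  private lastRefill: number; // ms timestamp of the last refill

  constructor(private capacity: number, private refillRate: number) {
    this.tokens = capacity; // start full, so an initial burst of N is allowed
    this.lastRefill = Date.now();
  }

  tryAcquire(): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1000;
    // Add R tokens per elapsed second, capped at capacity N.
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillRate);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1; // each request consumes one token
      return true;
    }
    return false; // bucket empty: respond with HTTP 429
  }
}

// Bursts of up to 10 requests, 5 requests/second sustained.
const bucket = new TokenBucket(10, 5);
```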
Leaky Bucket
Enforces a strict output rate, no matter how bursty the input.
How it works:
- Requests enter a fixed-size queue (the “bucket”).
- Requests are processed at a constant rate R.
- If the queue is full, new requests are dropped.
Characteristics:
- No bursts: Output rate is always exactly R.
- Smoother latency, but less flexible.
- Used by: Network traffic shapers.
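A sketch of the same idea in TypeScript, assuming a timer that drains one queued request every 1/R seconds; `LeakyBucket` and `enqueue` are illustrative names:

```typescript
type Job = () => void;

// A bounded queue drained at a constant rate R, regardless of input burstiness.
class LeakyBucket {
  private queue: Job[] = [];

  constructor(private maxQueueSize: number, drainRatePerSec: number) {
    // Process exactly one queued request every 1/R seconds.
    setInterval(() => {
      const job = this.queue.shift();
      if (job) job();
    }, 1000 / drainRatePerSec);
  }

  enqueue(job: Job): boolean {
    if (this.queue.length >= this.maxQueueSize) {
      return false; // bucket full: drop the request
    }
    this.queue.push(job);
    return true;
  }
}

// Queue up to 100 requests; drain at a steady 20/second.
const shaper = new LeakyBucket(100, 20);
const accepted = shaper.enqueue(() => console.log("processed"));
```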
Sliding Window
More accurate than fixed windows, but more complex.
Problem with Fixed Windows:
- If your limit is 100 requests/minute, a user can send 100 requests at 12:00:59 and another 100 at 12:01:00 (200 requests in 2 seconds).
Sliding Window Solution:
- Track requests in a time-based log.
- Count requests in the last N seconds (sliding).
- Trade-off: Higher memory usage.
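A sketch of the log variant in TypeScript, which makes the memory trade-off concrete: one stored timestamp per in-window request, with old entries evicted as the window slides.

```typescript
// Keep one timestamp per admitted request; count only those
// inside the last windowMs milliseconds.
class SlidingWindowLog {
  private timestamps: number[] = [];

  constructor(private limit: number, private windowMs: number) {}

  tryAcquire(now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    // Evict timestamps that have slid out of the window.
    while (this.timestamps.length > 0 && this.timestamps[0] <= cutoff) {
      this.timestamps.shift();
    }
    if (this.timestamps.length >= this.limit) {
      return false; // limit already reached within the sliding window
    }
    this.timestamps.push(now);
    return true;
  }
}

// 100 requests per sliding 60-second window. This closes the fixed-window
// loophole: the burst at 12:00:59 still counts against 12:01:00's check.
const limiter = new SlidingWindowLog(100, 60_000);
```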
2. Interactive: Rate Limiter Playground
Visualize how different algorithms behave under bursty traffic.
(Interactive demo panels: Tokens Available, Request Log.)
3. Adaptive Rate Limiting
Static rate limits (e.g., 100 req/s) don’t account for real-time system load. Adaptive rate limiting adjusts the limit based on current conditions.
Cost-Based Rate Limiting
Not all requests are equal. A “search all products” query costs 100x more than “get user profile.”
Solution: Assign a cost to each request type.
- Lightweight query: 1 unit
- Heavy aggregation: 100 units
- User quota: 1000 units/minute
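A sketch of per-user cost accounting in TypeScript, using the unit costs above; the reset-every-minute refill is an assumption (a production system would more likely refill continuously, token-bucket style):

```typescript
// Per-request costs matching the examples above.
const COSTS: Record<string, number> = {
  getUserProfile: 1,      // lightweight query
  searchAllProducts: 100, // heavy aggregation
};

class CostQuota {
  private used = 0;

  constructor(private quotaPerMinute: number) {
    setInterval(() => { this.used = 0; }, 60_000); // reset each minute (assumption)
  }

  tryCharge(requestType: string): boolean {
    const cost = COSTS[requestType] ?? 1; // unknown types charge 1 unit
    if (this.used + cost > this.quotaPerMinute) {
      return false; // quota exhausted: respond 429
    }
    this.used += cost;
    return true;
  }
}

// 1000 units/minute buys ~10 heavy searches or ~1000 profile reads.
const quota = new CostQuota(1000);
```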
Server-Based Adaptive Limiting
Monitor server health (CPU, queue depth, p99 latency). If the system starts slowing down, reduce the rate limit automatically.
Example: If p99 latency > 500ms, temporarily reduce limit to 50% until latency recovers.
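A sketch of that feedback loop in TypeScript; the 50% cut and the gradual 10%-per-tick recovery are illustrative choices, not a standard:

```typescript
// Halve the limit while p99 latency exceeds the threshold;
// recover gradually once it drops back below.
class AdaptiveLimiter {
  private currentLimit: number;

  constructor(private baseLimit: number, private p99ThresholdMs = 500) {
    this.currentLimit = baseLimit;
  }

  // Call periodically (e.g., every few seconds) with the latest observed p99.
  update(p99Ms: number): void {
    if (p99Ms > this.p99ThresholdMs) {
      this.currentLimit = Math.floor(this.baseLimit * 0.5); // shed load
    } else {
      // Step back up by 10% of base per tick until fully recovered.
      this.currentLimit = Math.min(
        this.baseLimit,
        this.currentLimit + Math.ceil(this.baseLimit * 0.1),
      );
    }
  }

  get limit(): number {
    return this.currentLimit;
  }
}
```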
4. Backpressure Propagation
Rate limiting rejects requests. Backpressure slows down the sender.
The Problem: Cascading Overload
Service A calls Service B. If B is slow, A’s thread pool fills up, then A starts rejecting requests, then the load balancer marks A unhealthy, then…
The Solution: Propagate Signals
- HTTP 429 + Retry-After: Tell the client when to retry (not just “no”).
- Queue Depth Metrics: Services expose metrics like “queue depth = 1000.” Callers can back off before hitting a hard limit.
- Reactive Streams (Backpressure): In streaming systems (Kafka, Akka Streams), consumers signal “slow down” to producers.
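On the client side, honoring Retry-After is what turns a rejection into a useful signal. A sketch using `fetch`, assuming the header carries seconds (it may also be an HTTP date, which would need parsing):

```typescript
// Retry on 429, waiting as long as the server's Retry-After header asks;
// fall back to exponential backoff when the header is absent.
async function callWithBackoff(url: string, maxRetries = 3): Promise<Response> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429) return res;
    // Assumes Retry-After carries seconds; Number(null) is 0, so a missing
    // header falls through to 2^attempt seconds of backoff.
    const retryAfterSec = Number(res.headers.get("Retry-After")) || 2 ** attempt;
    await new Promise((resolve) => setTimeout(resolve, retryAfterSec * 1000));
  }
  throw new Error("rate limited: retries exhausted");
}
```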
Integration with Circuit Breakers
- Rate Limit: “You’re sending too fast” (429).
- Circuit Breaker: “The service is unhealthy, stop calling it entirely” (503).
- Together: Protect both the caller and the callee.
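A sketch of the caller-side combination in TypeScript: 429s trigger backoff (as above), while 503s and timeouts feed a breaker that eventually stops calls entirely. The failure threshold and cooldown are illustrative:

```typescript
// A breaker that opens after `threshold` consecutive failures and stays
// open for `cooldownMs`. Do not record 429s here: being rate limited
// means "slow down", not "the service is unhealthy".
class CircuitBreaker {
  private failures = 0;
  private openUntil = 0; // ms timestamp; 0 means closed

  constructor(private threshold = 5, private cooldownMs = 30_000) {}

  allow(): boolean {
    return Date.now() >= this.openUntil; // closed, or the cooldown has expired
  }

  record(ok: boolean): void {
    if (ok) {
      this.failures = 0;
    } else if (++this.failures >= this.threshold) {
      this.openUntil = Date.now() + this.cooldownMs; // trip open
      this.failures = 0;
    }
  }
}
```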
Staff Takeaway
Rate limiting is not just about rejecting work—it’s about:
- Aligning limits with error budgets: If your SLO allows 0.1% errors, your rate limit should reject ~0.1% of requests under normal load, not 10%.
- Designing fair policies: Per-user? Per-tenant? Per-API-key? Cost-based?
- Propagating backpressure: Rejected requests should carry signals (“retry after 30s”) so the entire system can adapt.
Understanding token bucket math isn’t academic—it’s the difference between graceful degradation and cascading failure.