A junior engineer implements rate limiting: if (count > limit) return 429;
A Staff engineer asks: “What’s your rate? Per what window? Per user? Per tenant? And where do rejected requests go?”

Rate limiting and backpressure are the immune system of distributed systems. They prevent cascading failures by rejecting work early, propagating overload signals, and maintaining system stability under load.


1. Rate Limiting Algorithms

Token Bucket

The most commonly used algorithm. Think of it as a “burst allowance.”

How it works:

  • A bucket holds up to N tokens.
  • Tokens are added at rate R per second.
  • Each request consumes 1 token.
  • If the bucket is empty, reject the request (HTTP 429).

Characteristics:

  • Allows bursts up to N requests.
  • Smooths out to R requests/second over time.
  • Used by: AWS API Gateway, Google Cloud Armor.
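
Below is a minimal, single-process sketch of the algorithm (class and method names are illustrative; tokens are refilled lazily on each call rather than by a background timer):

    import time

    class TokenBucket:
        """Allow bursts up to `capacity`; refill at `rate` tokens per second."""

        def __init__(self, capacity: int, rate: float):
            self.capacity = capacity        # N: maximum burst size
            self.rate = rate                # R: tokens added per second
            self.tokens = float(capacity)   # start full
            self.last_refill = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Lazily credit tokens for the time elapsed since the last call.
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False   # caller responds with HTTP 429

    limiter = TokenBucket(capacity=10, rate=5)   # burst of 10, 5 req/s sustained

In a distributed deployment the token count usually lives in shared storage (e.g. Redis) so all instances enforce the same limit, but the accounting is the same.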

Leaky Bucket

Enforces a strict output rate, no matter how bursty the input.

How it works:

  • Requests enter a fixed-size queue (the “bucket”).
  • Requests are processed at a constant rate R.
  • If the queue is full, new requests are dropped.

Characteristics:

  • No bursts: the output rate never exceeds R.
  • Smoother latency, but less flexible.
  • Used by: Network traffic shapers.
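
A sketch along the same lines, modeling the bucket as a bounded queue drained by a single worker at a fixed rate (names and structure are illustrative, not a production implementation):

    import queue
    import threading
    import time

    class LeakyBucket:
        """Queue incoming work and drain it at a constant rate; drop when full."""

        def __init__(self, capacity: int, rate: float):
            self.q = queue.Queue(maxsize=capacity)   # the "bucket"
            self.interval = 1.0 / rate               # seconds between processed requests
            threading.Thread(target=self._drain, daemon=True).start()

        def submit(self, handler) -> bool:
            try:
                self.q.put_nowait(handler)
                return True            # queued; will be processed at rate R
            except queue.Full:
                return False           # bucket overflow: drop the request

        def _drain(self):
            while True:
                handler = self.q.get()     # blocks until work is available
                handler()                  # process one request...
                time.sleep(self.interval)  # ...then pause so output never exceeds R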

Sliding Window

More accurate than fixed windows, but more complex.

Problem with Fixed Windows:

  • If your limit is 100 requests/minute, a user can send 100 requests at 12:00:59 and another 100 at 12:01:00 (200 requests in 2 seconds).

Sliding Window Solution:

  • Track requests in a time-based log.
  • Count requests in the last N seconds (sliding).
  • Trade-off: Higher memory usage.
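
A sketch of the sliding-window-log variant described above, keeping one timestamp per accepted request, which is exactly where the extra memory goes (names are illustrative):

    import time
    from collections import deque

    class SlidingWindowLog:
        """Allow at most `limit` requests in any rolling `window` seconds."""

        def __init__(self, limit: int, window: float):
            self.limit = limit
            self.window = window
            self.log = deque()   # one timestamp per accepted request (the memory cost)

        def allow(self) -> bool:
            now = time.monotonic()
            # Evict timestamps that have slid out of the window.
            while self.log and now - self.log[0] >= self.window:
                self.log.popleft()
            if len(self.log) < self.limit:
                self.log.append(now)
                return True
            return False

    limiter = SlidingWindowLog(limit=100, window=60)   # 100 requests per rolling minute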

2. Interactive: Rate Limiter Playground

Visualize how different algorithms behave under bursty traffic.

[Interactive widget: fire bursty traffic at each algorithm and watch the tokens available, the request log, and the running accepted/rejected counts and success rate.]

3. Adaptive Rate Limiting

Static rate limits (100 req/s) don’t account for system load. Adaptive rate limiting adjusts the limit based on real-time conditions.

Cost-Based Rate Limiting

Not all requests are equal. A “search all products” query costs 100x more than “get user profile.”

Solution: Assign cost to each request type.

  • Lightweight query: 1 unit
  • Heavy aggregation: 100 units
  • User quota: 1000 units/minute
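
One way to sketch this is a token-bucket-style quota where each request consumes its cost in units instead of a single token; the cost table and quota below simply mirror the numbers above, and the endpoint names are illustrative:

    import time

    COSTS = {"get_user_profile": 1, "search_all_products": 100}   # illustrative cost table

    class CostBasedLimiter:
        """Token-bucket-style quota where each request consumes its cost in units."""

        def __init__(self, quota: float, window: float):
            self.capacity = quota            # e.g. 1000 units
            self.rate = quota / window       # units replenished per second
            self.units = quota
            self.last = time.monotonic()

        def allow(self, endpoint: str) -> bool:
            cost = COSTS.get(endpoint, 1)
            now = time.monotonic()
            self.units = min(self.capacity, self.units + (now - self.last) * self.rate)
            self.last = now
            if self.units >= cost:
                self.units -= cost
                return True
            return False

    limiter = CostBasedLimiter(quota=1000, window=60)   # 1000 units/minute per user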

Server-Based Adaptive Limiting

Monitor server health (CPU, queue depth, p99 latency). If the system starts slowing down, reduce the rate limit automatically.

Example: If p99 latency > 500ms, temporarily reduce limit to 50% until latency recovers.
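
A rough sketch of that rule; the 500ms threshold and the 50% reduction come from the example, while how p99 latency is sampled is left to the caller and is an assumption here:

    class AdaptiveLimit:
        """Scale the configured rate limit down while the system is overloaded."""

        def __init__(self, base_limit: float, p99_threshold_ms: float = 500, factor: float = 0.5):
            self.base_limit = base_limit
            self.p99_threshold_ms = p99_threshold_ms
            self.factor = factor

        def current_limit(self, p99_latency_ms: float) -> float:
            # Healthy: full limit. Slow: shed load until latency recovers.
            if p99_latency_ms > self.p99_threshold_ms:
                return self.base_limit * self.factor
            return self.base_limit

    limit = AdaptiveLimit(base_limit=100)      # 100 req/s when healthy
    limit.current_limit(p99_latency_ms=650)    # -> 50.0 while p99 is above 500 ms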


4. Backpressure Propagation

Rate limiting rejects requests. Backpressure slows down the sender.

The Problem: Cascading Overload

Service A calls Service B. If B is slow, A’s thread pool fills up, then A starts rejecting requests, then the load balancer marks A unhealthy, then…

The Solution: Propagate Signals

  1. HTTP 429 + Retry-After: Tell the client when to retry (not just “no”).
  2. Queue Depth Metrics: Services expose metrics like “queue depth = 1000.” Callers can back off before hitting a hard limit.
  3. Reactive Streams (Backpressure): In streaming systems (Kafka, Akka Streams), consumers signal “slow down” to producers.
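
For signal 1, here is a client-side sketch that honors Retry-After instead of retrying immediately (the URL, retry policy, and use of the requests library are illustrative; the header is assumed to carry seconds):

    import time
    import requests   # third-party HTTP client, used only for illustration

    def call_with_backoff(url: str, max_attempts: int = 5):
        for attempt in range(max_attempts):
            response = requests.get(url)
            if response.status_code != 429:
                return response
            # Respect the server's signal; assumes Retry-After carries seconds
            # (it may also be an HTTP date). Fall back to exponential backoff.
            retry_after = float(response.headers.get("Retry-After", 2 ** attempt))
            time.sleep(retry_after)
        raise RuntimeError("rate limited: retries exhausted")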

Integration with Circuit Breakers

  • Rate Limit: “You’re sending too fast” (429).
  • Circuit Breaker: “The service is unhealthy, stop calling it entirely” (503).
  • Together: Protect both the caller and the callee.
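
A sketch of how the two checks might sit together at a call site; limiter and breaker stand in for the simplified components described above and are illustrative, not any specific library's API:

    def guarded_call(limiter, breaker, send_request):
        # Callee protection: shed excess traffic before it does damage.
        if not limiter.allow():
            return 429, "too many requests, retry later"
        # Caller protection: stop calling a dependency that is already failing.
        if breaker.is_open():
            return 503, "dependency unhealthy, failing fast"
        return send_request()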

Staff Takeaway

Rate limiting is not just about rejecting work—it’s about:

  1. Aligning limits with error budgets: If your SLO allows 0.1% errors, your rate limit should reject ~0.1% of requests under normal load, not 10%.
  2. Designing fair policies: Per-user? Per-tenant? Per-API-key? Cost-based?
  3. Propagating backpressure: Rejected requests should carry signals (“retry after 30s”) so the entire system can adapt.

Understanding token bucket math isn’t academic—it’s the difference between graceful degradation and cascading failure.