Distributed API Rate Limiter
High-performance distributed sliding-window rate limiter for strict SLA enforcement across microservices. Uses an atomic Redis Lua script for correctness under concurrency, with local hot-key shielding and explicit fail-open behavior.
| Metric | Result |
|---|---|
| Added latency | p99 0.27ms |
| Throughput | 10k QPS (verified) |
| Correctness | No over-admissions (flood test) |
| Availability | Fail-open |
Problem & Constraints
A multi-tenant SaaS API experienced thundering herd behavior: misconfigured client retry loops saturated shared backend capacity, degrading service for healthy tenants.
Requirement: Enforce per-key rate limits with sub-5ms p99 added latency on the request path, with atomic correctness under concurrency and explicit behavior when Redis is slow/unavailable.
Approach: "Boring" Reliability
Prioritized simple, proven primitives to minimize failure modes.
Algorithm: Dual-Counter Sliding Window (Redis Strings)
Chosen over fixed windows to eliminate the window-edge burst vulnerability (up to 2× the limit admitted around a window boundary). Implemented sliding semantics using two counters (current + previous) with weighted interpolation: O(1) operations per request, write-heavy friendly, and TTL-bound.
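The weighted interpolation reduces to one line of arithmetic. A minimal sketch (function and variable names are illustrative, not from the production code):

```go
package main

import "fmt"

// slidingEstimate approximates the request count in the trailing window
// from two fixed-window counters: the previous window's total and the
// current window's running count. elapsedFrac is the fraction of the
// current window that has already passed (0.0 to 1.0).
func slidingEstimate(prev, curr int64, elapsedFrac float64) float64 {
	// Weight the previous window by the share of it still inside the
	// trailing window, then add everything counted so far this window.
	return float64(prev)*(1.0-elapsedFrac) + float64(curr)
}

func main() {
	// 30% into the current window: 70% of the previous window still counts.
	fmt.Println(slidingEstimate(100, 40, 0.3)) // 100*0.7 + 40 = 110
}
```

The estimate assumes requests in the previous window were evenly distributed; that approximation is what keeps the state at two O(1) counters instead of a per-request log.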
Atomicity: Redis Lua (Single Script)
Encapsulated the check + increment + TTL logic in one Lua script executed atomically on Redis, eliminating TOCTOU (time-of-check/time-of-use) races without distributed locks.
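Such a script might look roughly like the following sketch. Key layout, argument order, and the return convention are illustrative assumptions, not the production script:

```lua
-- Sketch: atomic check-and-increment for a dual-counter sliding window.
-- KEYS[1] = current-window counter, KEYS[2] = previous-window counter
-- (window rollover is implicit if the key names embed the window number).
-- ARGV[1] = limit, ARGV[2] = fraction of current window elapsed (0..1),
-- ARGV[3] = counter TTL in seconds (at least two window lengths).
local curr = tonumber(redis.call('GET', KEYS[1]) or '0')
local prev = tonumber(redis.call('GET', KEYS[2]) or '0')
local estimate = prev * (1 - tonumber(ARGV[2])) + curr
if estimate >= tonumber(ARGV[1]) then
  return 0 -- blocked; no state is mutated
end
-- Admit: increment and refresh TTL inside the same atomic execution.
redis.call('INCR', KEYS[1])
redis.call('EXPIRE', KEYS[1], ARGV[3])
return 1
```

Because Redis runs the whole script before serving any other command, no two concurrent requests can both read the same counter value and both increment past the limit.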
Three-Tier Decision Flow
- Tier 1: Local cache (Ristretto, 1s TTL) — caches BLOCKED keys to reject at the edge with 0ms and 0 Redis load
- Tier 2: Redis Lua — authoritative atomic decision
- Tier 3: Fail-open — if Redis unreachable, allow traffic to preserve availability (with circuit breaker + metrics)
Verified Load Test Results
10k QPS, single-key adversarial flood
| Metric | Value |
|---|---|
| p50 added latency | 53µs |
| p99 added latency | 274µs |
| Max added latency | 31ms (single outlier) |
| Sustained throughput | 10k QPS |
Behavior: With a single hot key and a strict per-minute limit, requests were admitted up to the configured limit and then consistently rejected with 429, with no over-limit admissions observed during the run.
Scale Boundaries
Explicitly defined performance cliffs to guide the next scaling step:
| Traffic Load | System Behavior |
|---|---|
| < 100k QPS | Design target: p99 added latency < 2ms. |
| 100k - 400k QPS | Strained. Redis network I/O saturation, latency jitter rises. |
| > 500k QPS | Failure Point. Single Redis shard CPU saturates (single-threaded). |
Next Scale Step: Reduce Redis round-trips via client-side token buffering, and/or shard Redis by API key hash (Redis Cluster).
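The token-buffering idea can be sketched in a few lines: one Redis call reserves a batch of tokens, and subsequent requests spend them locally. Everything below is a hypothetical illustration (`fetchBatch` stands in for a single atomic Redis call against the shared budget):

```go
package main

import "fmt"

// tokenBuffer amortizes Redis round-trips: one call reserves up to
// batchSize tokens from the shared budget; requests then spend them
// locally until the buffer runs dry.
type tokenBuffer struct {
	tokens     int
	batchSize  int
	exhausted  bool            // budget came back empty; stop re-fetching
	trips      int             // simulated Redis round-trips, for illustration
	fetchBatch func(n int) int // stands in for one atomic Redis call granting <= n tokens
}

func (b *tokenBuffer) allow() bool {
	if b.tokens == 0 && !b.exhausted {
		b.trips++
		b.tokens = b.fetchBatch(b.batchSize)
		if b.tokens == 0 {
			// Real code would retry after the window rolls over;
			// this sketch simply stops asking.
			b.exhausted = true
		}
	}
	if b.tokens == 0 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	remaining := 25 // tokens left in the Redis-side budget
	b := &tokenBuffer{batchSize: 10, fetchBatch: func(n int) int {
		if n > remaining {
			n = remaining
		}
		remaining -= n
		return n
	}}
	allowed := 0
	for i := 0; i < 40; i++ {
		if b.allow() {
			allowed++
		}
	}
	fmt.Println(allowed, b.trips) // 25 4
}
```

40 requests cost 4 round-trips instead of 40. The trade-off is strictness: tokens buffered across instances mean a key's effective limit can overshoot by up to batchSize per instance, which is usually acceptable for rate limiting.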
Postmortem: "Hot Key" Incident
Incident #L4-2024-03
What failed: A single tenant deployed a bug causing a tight loop (tens of thousands req/s) on one key. With no local protection, every request still hit Redis—even when Redis had already blocked the key—creating avoidable load and starving other tenants.
What changed: Added Tier-1 local blocked-key caching (Ristretto, 1s TTL):
- If Redis blocks a key, cache BLOCKED status locally.
- Subsequent requests are rejected at the edge (0ms latency, 0 Redis load).
- Hot keys stop amplifying Redis load and no longer crowd out healthy tenants.
Engineering Notes: Write-Heavy Workload
Rate limiting is inherently write-heavy (each check updates counters). The design treats rate-limit state as ephemeral and TTL-bound, optimizing for throughput and tail latency rather than durability.
What I'd Do Next
- Multi-region global rate limiting via CRDT/active-active (e.g., Redis Enterprise) for consistent enforcement across regions.
- Adaptive concurrency limits (Little's Law) driven by backend latency signals (complements rate limiting; different control loop).
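The Little's Law bullet reduces to L = λ × W: average in-flight requests equal arrival rate times mean time in system. A tiny numeric sketch with hypothetical numbers:

```go
package main

import "fmt"

// concurrencyLimit applies Little's Law (L = λ × W): cap in-flight
// requests near arrivalRate (req/s) times the backend's healthy target
// latency (s), so queueing beyond that point is shed rather than buffered.
func concurrencyLimit(arrivalRate, targetLatencySec float64) int {
	limit := int(arrivalRate * targetLatencySec)
	if limit < 1 {
		limit = 1 // always admit at least one request
	}
	return limit
}

func main() {
	// Hypothetical: 10k QPS with a 2ms healthy backend latency supports
	// about 20 requests in flight before queueing inflates latency.
	fmt.Println(concurrencyLimit(10000, 0.002)) // 20
}
```

Driving `targetLatencySec` from observed backend latency is what makes the limit adaptive; that feedback loop is deliberately separate from the per-key rate limiter.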