Distributed API Rate Limiter
High-performance distributed sliding-window rate limiter for strict SLA enforcement across microservices. Uses an atomic Redis Lua script for correctness under concurrency, with local hot-key shielding and explicit fail-open behavior.
| Metric | Result |
|---|---|
| Added latency | p99 0.27ms |
| Throughput | 10k QPS (verified) |
| Correctness | No over-admissions (flood test) |
| Availability | Fail-open |
Problem & Constraints
A multi-tenant SaaS API experienced thundering herd behavior: misconfigured client retry loops saturated shared backend capacity, degrading service for healthy tenants.
Requirement: Enforce per-key rate limits with sub-5ms p99 added latency on the request path, with atomic correctness under concurrency and explicit behavior when Redis is slow/unavailable.
Approach: "Boring" Reliability
Prioritized simple, proven primitives to minimize failure modes.
Algorithm: Dual-Counter Sliding Window (Redis Strings)
Chosen over fixed windows to eliminate the window-edge burst vulnerability (up to 2× the limit admitted around a window boundary). Implemented sliding semantics using two counters (current + previous) with weighted interpolation: O(1) operations per request, write-heavy friendly, and TTL-bound.
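The weighted interpolation reduces to one line of arithmetic. A minimal sketch (function and variable names are illustrative, not from the production code):

```go
package main

import "fmt"

// slidingEstimate approximates the request count in the trailing window
// from two fixed-window counters: the previous window's total and the
// current window's running count. elapsedFrac is the fraction of the
// current window that has already passed (0.0 to 1.0).
func slidingEstimate(prev, curr int64, elapsedFrac float64) float64 {
	// Weight the previous window by the share of it still inside the
	// trailing window, then add everything counted so far this window.
	return float64(prev)*(1.0-elapsedFrac) + float64(curr)
}

func main() {
	// 30% into the current window: 70% of the previous window still counts.
	fmt.Println(slidingEstimate(100, 40, 0.3)) // 100*0.7 + 40 = 110
}
```

The estimate assumes requests in the previous window were evenly distributed; that approximation is what keeps the state at two O(1) counters instead of a per-request log.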
Atomicity: Redis Lua (Single Script)
Encapsulated the check + increment + TTL logic in one Lua script executed atomically on Redis, eliminating TOCTOU (time-of-check/time-of-use) races without distributed locks.
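Such a script might look roughly like the following sketch. Key layout, argument order, and the return convention are illustrative assumptions, not the production script:

```lua
-- Sketch: atomic check-and-increment for a dual-counter sliding window.
-- KEYS[1] = current-window counter, KEYS[2] = previous-window counter
-- (window rollover is implicit if the key names embed the window number).
-- ARGV[1] = limit, ARGV[2] = fraction of current window elapsed (0..1),
-- ARGV[3] = counter TTL in seconds (at least two window lengths).
local curr = tonumber(redis.call('GET', KEYS[1]) or '0')
local prev = tonumber(redis.call('GET', KEYS[2]) or '0')
local estimate = prev * (1 - tonumber(ARGV[2])) + curr
if estimate >= tonumber(ARGV[1]) then
  return 0 -- blocked; no state is mutated
end
-- Admit: increment and refresh TTL inside the same atomic execution.
redis.call('INCR', KEYS[1])
redis.call('EXPIRE', KEYS[1], ARGV[3])
return 1
```

Because Redis runs the whole script before serving any other command, no two concurrent requests can both read the same counter value and both increment past the limit.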
Three-Tier Decision Flow
- Tier 1: Local cache (Ristretto, 1s TTL) — caches BLOCKED keys to reject at the edge with 0ms and 0 Redis load
- Tier 2: Redis Lua — authoritative atomic decision
- Tier 3: Fail-open — if Redis unreachable, allow traffic to preserve availability (with circuit breaker + metrics)
Verified Load Test Results
10k QPS, single-key adversarial flood
| Metric | Value |
|---|---|
| p50 added latency | 53µs |
| p99 added latency | 274µs |
| Max added latency | 31ms (single outlier) |
| Sustained throughput | 10k QPS |
Behavior: With a single hot key and a strict per-minute limit, requests were admitted up to the configured limit and then consistently rejected with 429, with no over-limit admissions observed during the run.
Scale Boundaries
Explicitly defined performance cliffs to guide the next scaling step:
| Traffic Load | System Behavior |
|---|---|
| < 100k QPS | Design target: p99 added latency < 2ms. |
| 100k - 400k QPS | Strained. Redis network I/O saturation, latency jitter rises. |
| > 500k QPS | Failure Point. Single Redis shard CPU saturates (single-threaded). |
Next Scale Step: Reduce Redis round-trips via client-side token buffering, and/or shard Redis by API key hash (Redis Cluster).
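The token-buffering idea can be sketched in a few lines: one Redis call reserves a batch of tokens, and subsequent requests spend them locally. Everything below is a hypothetical illustration (`fetchBatch` stands in for a single atomic Redis call against the shared budget):

```go
package main

import "fmt"

// tokenBuffer amortizes Redis round-trips: one call reserves up to
// batchSize tokens from the shared budget; requests then spend them
// locally until the buffer runs dry.
type tokenBuffer struct {
	tokens     int
	batchSize  int
	exhausted  bool            // budget came back empty; stop re-fetching
	trips      int             // simulated Redis round-trips, for illustration
	fetchBatch func(n int) int // stands in for one atomic Redis call granting <= n tokens
}

func (b *tokenBuffer) allow() bool {
	if b.tokens == 0 && !b.exhausted {
		b.trips++
		b.tokens = b.fetchBatch(b.batchSize)
		if b.tokens == 0 {
			// Real code would retry after the window rolls over;
			// this sketch simply stops asking.
			b.exhausted = true
		}
	}
	if b.tokens == 0 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	remaining := 25 // tokens left in the Redis-side budget
	b := &tokenBuffer{batchSize: 10, fetchBatch: func(n int) int {
		if n > remaining {
			n = remaining
		}
		remaining -= n
		return n
	}}
	allowed := 0
	for i := 0; i < 40; i++ {
		if b.allow() {
			allowed++
		}
	}
	fmt.Println(allowed, b.trips) // 25 4
}
```

40 requests cost 4 round-trips instead of 40. The trade-off is strictness: tokens buffered across instances mean a key's effective limit can overshoot by up to batchSize per instance, which is usually acceptable for rate limiting.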
Postmortem: "Hot Key" Incident
Incident #L4-2024-03
What failed: A single tenant deployed a bug causing a tight loop (tens of thousands req/s) on one key. With no local protection, every request still hit Redis—even when Redis had already blocked the key—creating avoidable load and starving other tenants.
What changed: Added Tier-1 local blocked-key caching (Ristretto, 1s TTL):
- If Redis blocks a key, cache BLOCKED status locally.
- Subsequent requests are rejected at the edge (0ms latency, 0 Redis load).
- Hot keys stop amplifying Redis load and no longer crowd out healthy tenants.
Engineering Notes: Write-Heavy Workload
Rate limiting is inherently write-heavy (each check updates counters). The design treats rate-limit state as ephemeral and TTL-bound, optimizing for throughput and tail latency rather than durability.
What I'd Do Next
- Multi-region global rate limiting via CRDT/active-active (e.g., Redis Enterprise) for consistent enforcement across regions.
- Adaptive concurrency limits (Little's Law) driven by backend latency signals (complements rate limiting; different control loop).
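The Little's Law bullet reduces to L = λ × W: average in-flight requests equal arrival rate times mean time in system. A tiny numeric sketch with hypothetical numbers:

```go
package main

import "fmt"

// concurrencyLimit applies Little's Law (L = λ × W): cap in-flight
// requests near arrivalRate (req/s) times the backend's healthy target
// latency (s), so queueing beyond that point is shed rather than buffered.
func concurrencyLimit(arrivalRate, targetLatencySec float64) int {
	limit := int(arrivalRate * targetLatencySec)
	if limit < 1 {
		limit = 1 // always admit at least one request
	}
	return limit
}

func main() {
	// Hypothetical: 10k QPS with a 2ms healthy backend latency supports
	// about 20 requests in flight before queueing inflates latency.
	fmt.Println(concurrencyLimit(10000, 0.002)) // 20
}
```

Driving `targetLatencySec` from observed backend latency is what makes the limit adaptive; that feedback loop is deliberately separate from the per-key rate limiter.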