You know the pattern. Your monitoring dashboard looks healthy: 99% cache hit ratio, database CPU at 15%. Suddenly, a single "hot" key—perhaps the global configuration object or the homepage personalization metadata—expires.
In the 200 milliseconds it takes to fetch the data from the database and repopulate Redis, 5,000 concurrent requests miss the cache. They all rush the database simultaneously. The database CPU spikes to 100%, connections time out, and the backend enters a crash loop because the database never recovers enough to serve the first query that would repopulate the cache.
This is the Cache Stampede (or Thundering Herd). This post details the two architectural patterns that solve it: Distributed Locking and Probabilistic Early Expiration (PER).
The Root Cause: The Latency Gap
The stampede occurs because there is a non-zero time gap ($T$) between detecting a cache miss and writing the recomputed value back to the cache.
If your system throughput is $R$ requests per second, and the database query takes $T$ seconds, the number of "leaked" requests ($L$) hitting your database during a hard expiration is:
$$ L = R \times T $$
If you process 10,000 req/s and your complex SQL query takes 300ms, a single key expiration leaks 3,000 queries to the DB instantly.
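The relationship is trivial to encode. The helper below is just the arithmetic from the example above (the numbers are illustrative, not measurements):

```typescript
// Leaked requests during a hard expiration: L = R * T
// R: throughput in requests per second, T: recompute time in seconds.
function leakedRequests(throughputRps: number, recomputeSeconds: number): number {
  return throughputRps * recomputeSeconds;
}

console.log(leakedRequests(10_000, 0.3)); // 3000 queries hit the DB at once
```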
Solution 1: The Distributed Lock (Strict Consistency)
The immediate intuition is to ensure only one process can rebuild the cache at a time. This is achieved via a distributed mutex (Lock).
How it works
- Request A misses cache.
- Request A acquires a Redis lock (`SET ... NX`).
- Request A fetches data, updates cache, releases lock.
- Request B misses cache, tries to acquire lock, fails.
- Request B waits (sleeps) and retries the cache read.
The Implementation
We use ioredis for robust connection handling and implement a spin-lock mechanism.
```typescript
import Redis from 'ioredis';
import { randomUUID } from 'node:crypto';
import { setTimeout as sleep } from 'node:timers/promises';

const redis = new Redis(process.env.REDIS_URL);

interface CacheOptions {
  ttlSeconds: number;
  lockTimeoutSeconds?: number;
}

export class LockedCacheService {
  /**
   * Retrieves data using a distributed lock to prevent stampedes.
   */
  async getOrSet<T>(
    key: string,
    fetcher: () => Promise<T>,
    options: CacheOptions
  ): Promise<T> {
    const cachedValue = await redis.get(key);
    if (cachedValue) {
      return JSON.parse(cachedValue) as T;
    }

    const lockKey = `lock:${key}`;
    const lockToken = randomUUID();
    const lockTTL = options.lockTimeoutSeconds ?? 5;

    // Try to acquire the lock
    const acquired = await redis.set(lockKey, lockToken, 'EX', lockTTL, 'NX');

    if (acquired === 'OK') {
      try {
        // Winner takes all: fetch and update
        const data = await fetcher();
        await redis.set(key, JSON.stringify(data), 'EX', options.ttlSeconds);
        return data;
      } finally {
        // Release the lock only if we still own it (Lua script for atomicity)
        const releaseScript = `
          if redis.call("get", KEYS[1]) == ARGV[1] then
            return redis.call("del", KEYS[1])
          else
            return 0
          end
        `;
        await redis.eval(releaseScript, 1, lockKey, lockToken);
      }
    } else {
      // Loser: wait and retry.
      // Under high concurrency, prefer exponential backoff over a fixed sleep.
      await sleep(100);
      return this.getOrSet(key, fetcher, options);
    }
  }
}
```
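Usage is a thin wrapper around whatever loader you already have. The `getProductFromDb` fetcher and the `product:*` key scheme below are placeholders for illustration:

```typescript
const cache = new LockedCacheService();

// Hypothetical fetcher: any async function that loads the value from the source of truth.
async function getProductFromDb(id: string): Promise<{ id: string; name: string }> {
  // ... expensive SQL query ...
  return { id, name: 'example' };
}

const product = await cache.getOrSet(
  'product:42',
  () => getProductFromDb('42'),
  { ttlSeconds: 60, lockTimeoutSeconds: 5 }
);
```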
The Trade-off
Locking provides strict consistency, ensuring only one DB call occurs. However, it shifts the "Thundering Herd" from the database to the application threads (blocking/sleeping) and Redis itself (polling the lock).
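One way to soften that shift is to bound the waiters: instead of a fixed 100 ms sleep and unbounded recursion, losing requests can back off exponentially with jitter and give up after a few attempts. A minimal sketch of such a wait loop, assuming the same `redis` client and key layout as above (the attempt count and base delay are illustrative tuning knobs):

```typescript
import { setTimeout as sleep } from 'node:timers/promises';

// Sketch: poll the cache with exponential backoff + jitter instead of a fixed sleep.
async function waitForCache<T>(key: string, maxAttempts = 5): Promise<T | null> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const raw = await redis.get(key);
    if (raw) return JSON.parse(raw) as T;

    const backoffMs = 50 * 2 ** attempt + Math.random() * 25; // 50, 100, 200, ... plus jitter
    await sleep(backoffMs);
  }
  return null; // Caller decides: fall through to the DB, or fail fast.
}
```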
Solution 2: Probabilistic Early Expiration (High Availability)
For high-throughput systems where slight staleness is acceptable, Probabilistic Early Expiration (PER)—often referenced as the X-Fetch algorithm—is superior. It eliminates locking entirely.
How it works
Instead of waiting for a hard expiration (TTL = 0), we recompute the value before it expires based on a probability curve. As the key gets closer to expiration, the probability of a re-fetch increases.
We recompute when the following check passes: $$ \text{CurrentTime} - \big(\text{TimeToCompute} \times \beta \times \ln(\text{Random}(0,1))\big) \ge \text{ExpiryTime} $$
- $\text{TimeToCompute}$: How long the last recomputation took (the "delta" stored alongside the value).
- $\beta$: A tuning factor (usually 1.0). Higher values trigger earlier recomputation.
- $\ln(\text{Random}(0,1))$: The probabilistic jitter. The logarithm of a uniform value in $(0,1)$ is negative, so subtracting it always shifts the comparison time forward by a random amount on the scale of $\beta \times \text{TimeToCompute}$.
This effectively "passes the baton" to a single random request to refresh the cache in the background while others continue to serve the nearly-stale data.
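The shape of that curve is easy to quantify. Because $\text{Random}(0,1)$ is uniform, the chance that any one request triggers a recompute when $\text{Remaining} = \text{ExpiryTime} - \text{CurrentTime}$ is left is: $$ P(\text{recompute}) = e^{-\text{Remaining} / (\beta \times \text{TimeToCompute})} $$ With a 300 ms recompute time and $\beta = 1$, a request arriving 3 s before expiry triggers a refresh with probability $e^{-10} \approx 0.005\%$, while one arriving 300 ms before expiry does so with probability $e^{-1} \approx 37\%$.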
The Implementation
We must store the delta (time it took to compute the value) inside the cache payload.
```typescript
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL);

interface PERCacheItem<T> {
  data: T;
  delta: number;  // Time taken to compute, in ms
  expiry: number; // Logical expiration timestamp (ms since epoch)
}

export class ProbabilisticCacheService {
  // Beta > 1 favors earlier recomputation. 1.0 is standard.
  private readonly BETA = 1.0;

  async getOrSet<T>(
    key: string,
    fetcher: () => Promise<T>,
    ttlSeconds: number
  ): Promise<T> {
    const raw = await redis.get(key);

    // 1. Hard miss: the value doesn't exist at all. Fetch immediately.
    if (!raw) {
      return this.refreshAndCache(key, fetcher, ttlSeconds);
    }

    const cached: PERCacheItem<T> = JSON.parse(raw);
    const now = Date.now();
    const remainingTTL = cached.expiry - now;

    // 2. Probabilistic recomputation check.
    // Math.log(Math.random()) is negative, so probabilityFactor is positive.
    // As remainingTTL shrinks, it becomes increasingly likely that
    // probabilityFactor exceeds it, triggering the refresh.
    const probabilityFactor = -cached.delta * this.BETA * Math.log(Math.random());

    if (probabilityFactor >= remainingTTL) {
      // 3. Trigger refresh.
      // Optimization: fire this in the background and return the stale data
      // immediately (non-blocking), or await it (blocking but safe).
      // Here we await to ensure freshness, relying on probability
      // to ensure only ~1 request does this.
      return this.refreshAndCache(key, fetcher, ttlSeconds);
    }

    // 4. Return "stale" (but still valid) data.
    return cached.data;
  }

  private async refreshAndCache<T>(
    key: string,
    fetcher: () => Promise<T>,
    ttlSeconds: number
  ): Promise<T> {
    const start = Date.now();
    const data = await fetcher();
    const delta = Date.now() - start;

    const entry: PERCacheItem<T> = {
      data,
      delta,
      // Logical expiry.
      // NOTE: the physical Redis TTL is set higher than this logical TTL
      // to leave a grace period for other requests while one refreshes.
      expiry: Date.now() + ttlSeconds * 1000,
    };

    // Set the Redis TTL to 120% of the logical TTL so the key is not hard-evicted
    // before the probabilistic algorithm catches it.
    const redisTTL = Math.ceil(ttlSeconds * 1.2);
    await redis.set(key, JSON.stringify(entry), 'EX', redisTTL);

    return data;
  }
}
```
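The blocking `await` above is the conservative choice. The non-blocking variant mentioned in the step-3 comment (serve the current value immediately and refresh out of band) only requires swapping that branch. A sketch, assuming it is acceptable for a refresh to be lost if the process exits mid-flight:

```typescript
// Inside getOrSet, replacing step 3: serve stale data now, refresh in the background.
if (probabilityFactor >= remainingTTL) {
  void this.refreshAndCache(key, fetcher, ttlSeconds).catch((err) => {
    // Swallow the error: the stale value is still being served, and another
    // request will retry the refresh on its own probabilistic trigger.
    console.error(`Background refresh failed for ${key}`, err);
  });
  return cached.data; // Still covered by the longer physical Redis TTL.
}
```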
Why PER Wins at Scale
- Non-Blocking: No threads sleep. No locks are contended.
- Graceful Degradation: By setting the physical Redis TTL higher than the logical expiration used in the formula, we ensure data is always available while the refresh happens.
- Optimal Distribution: The randomness in `Math.log(Math.random())` ensures that even if 100 servers hit the key at the same millisecond, it is statistically improbable that they all trigger the recompute condition (see the short simulation below).
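To put a number on that claim, here is a small standalone simulation of the check; the concurrency level, delta, and remaining TTL are arbitrary illustrative values:

```typescript
// Simulate 100 servers evaluating the X-Fetch condition at the same instant.
const BETA = 1.0;
const deltaMs = 300;      // time the recompute took last time
const remainingMs = 2000; // time left before logical expiry

let recomputes = 0;
for (let i = 0; i < 100; i++) {
  const probabilityFactor = -deltaMs * BETA * Math.log(Math.random());
  if (probabilityFactor >= remainingMs) recomputes++;
}
console.log(recomputes); // Usually 0 or 1; the expected value is 100 * e^(-2000/300) ≈ 0.13
```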
Conclusion
Use Distributed Locking when data consistency is paramount, and you cannot afford to serve data that is even milliseconds stale (e.g., financial transactions).
Use Probabilistic Early Expiration for virtually everything else (leaderboards, content, configurations). It allows your backend to maintain flat latency lines even during cache refresh cycles, effectively making cache misses invisible to your throughput metrics.