
Redis Caching Patterns: Implementing Probabilistic Early Expiration (Jitter) to Stop Thundering Herds

 

The Hook

It starts with a single alert. Your database CPU spikes from 15% to 100% in less than a second. Latency alarms scream, the connection pool is exhausted, and the application isn't just slow; it's down.

The culprit? A single, highly accessed cache key (like a configuration blob or a homepage feed) expired.

At $t=0$, the key existed. At $t+1ms$, it didn't. In that millisecond window, 5,000 incoming requests checked Redis, found nothing, and simultaneously hammered your primary database to recompute the same dataset. This is the Thundering Herd problem (also known as Cache Stampede), and standard TTLs are not enough to prevent it.

The Why: Anatomy of a Stampede

To solve this, we must understand the race condition. In a high-throughput system, a cache miss is expensive.

  1. Synchronization Gap: When a key expires physically in Redis (TTL hits 0), it vanishes atomically.
  2. Concurrent Reads: If your throughput is 10k RPS and your database query takes 200ms, a generic cache-aside pattern allows roughly 2,000 requests to pass through to the DB before the first request finishes writing the new cache value (see the naive sketch after this list).
  3. Resource Starvation: The database, designed for ACID compliance and complex queries, cannot handle the read volume of Redis. Threads lock, queues fill, and the system spirals into failure.
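
To make the race concrete, here is a minimal sketch of that generic cache-aside pattern (the fetchFromDb callback and the 600-second TTL are illustrative placeholders, not values from the implementation later in this post):

import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL || 'redis://localhost:6379');

// Naive cache-aside: every request that sees a miss goes to the database.
async function getNaive(
  key: string,
  fetchFromDb: () => Promise<string>
): Promise<string> {
  const cached = await redis.get(key);
  if (cached !== null) {
    return cached; // Hit: cheap and fast
  }

  // Miss: nothing stops thousands of concurrent callers from all reaching
  // this line during the time it takes the first one to recompute and write.
  const fresh = await fetchFromDb();
  await redis.set(key, fresh, 'EX', 600);
  return fresh;
}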

Standard solutions like "Mutex Locking" (distributed locks) introduce latency and complexity. A superior, non-blocking approach is Probabilistic Early Expiration (often referenced as the X-Fetch algorithm or PER).
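
For contrast, here is a rough sketch of what the locking approach tends to look like (reusing the ioredis client from the sketch above; the lock timeout, retry interval, and key names are assumptions for illustration):

// Mutex approach: only the lock holder recomputes; everyone else waits and
// retries, which adds latency and new failure modes (lock expiry, timeouts).
async function getWithLock(
  key: string,
  fetchFromDb: () => Promise<string>
): Promise<string> {
  const cached = await redis.get(key);
  if (cached !== null) return cached;

  // Try to acquire a short-lived lock for this key
  const acquired = await redis.set(`lock:${key}`, '1', 'PX', 5000, 'NX');
  if (acquired === 'OK') {
    const fresh = await fetchFromDb();
    await redis.set(key, fresh, 'EX', 600);
    await redis.del(`lock:${key}`);
    return fresh;
  }

  // Lost the race: poll until the winner repopulates the cache
  for (let attempt = 0; attempt < 50; attempt++) {
    await new Promise(resolve => setTimeout(resolve, 100));
    const repopulated = await redis.get(key);
    if (repopulated !== null) return repopulated;
  }
  throw new Error(`Timed out waiting for ${key} to be repopulated`);
}

Every losing request burns latency polling, and a crashed lock holder leaves everyone waiting until the lock expires. PER avoids both problems because no request ever blocks.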

The Fix: Probabilistic Early Recomputation

Instead of waiting for the key to completely expire, we want clients to probabilistically decide to recompute the value as the TTL approaches zero.

We use a randomization factor (Jitter) combined with the time remaining on the key. As the key gets closer to expiration, the probability of a client deciding to recompute it increases exponentially.

This ensures one request (statistically) triggers the refresh before the key is actually deleted, while all other concurrent requests continue to be served the slightly stale data from Redis.

The Algorithm

The decision to recompute is based on this inequality:

$$ (Now - CachedTime) \ge TTL + \beta \cdot \Delta \cdot \ln(random()) $$

Where:

  • $\Delta$ (Delta): Estimated time to recompute the value (e.g., 50ms).
  • $\beta$ (Beta): A scaling factor (usually 1.0). Higher values trigger earlier recomputation.
  • $\ln(random())$: The probabilistic jitter. random() is uniform on $(0, 1)$, so $\ln(random())$ is negative and its magnitude is exponentially distributed (a worked probability example follows this list).
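
Since $\ln(random())$ is negative, the inequality above can be rewritten as: recompute when $RemainingTTL < -\beta \cdot \Delta \cdot \ln(random())$, which is exactly the form used in the implementation below. Assuming random() is drawn uniformly from $(0, 1)$, the per-request probability of volunteering is then:

$$ P(recompute) = P\big(RemainingTTL < -\beta \cdot \Delta \cdot \ln(random())\big) = e^{-RemainingTTL / (\beta \cdot \Delta)} $$

For example, with $\Delta = 50ms$ and $\beta = 1.0$, a request seeing 500ms of TTL left volunteers with probability $e^{-10} \approx 0.005\%$, while one seeing only 50ms left volunteers with probability $e^{-1} \approx 37\%$. The probability stays negligible for most of the key's lifetime and ramps up sharply only in the final moments.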

Implementation

Below is a production-ready TypeScript implementation using ioredis. This implementation uses a Lua script to atomically fetch the value and its remaining TTL to ensure the calculation is accurate.

import Redis from 'ioredis';

// Configuration
const REDIS_URL = process.env.REDIS_URL || 'redis://localhost:6379';
const redis = new Redis(REDIS_URL);

/**
 * Custom error to signal that the upstream fetch failed.
 */
class UpstreamFetchError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'UpstreamFetchError';
  }
}

/**
 * Lua script to get value and PTTL (Time To Live in ms) atomically.
 * Returns: [value, ttl_in_ms]
 */
const GET_WITH_TTL_SCRIPT = `
  local val = redis.call("get", KEYS[1])
  local ttl = redis.call("pttl", KEYS[1])
  return {val, ttl}
`;

// Register the Lua script as a custom command (defined once at module load)
redis.defineCommand('getWithTtl', {
  numberOfKeys: 1,
  lua: GET_WITH_TTL_SCRIPT,
});

// Extend interface for TS support on custom command
declare module 'ioredis' {
  interface Redis {
    getWithTtl(key: string): Promise<[string | null, number]>;
  }
}

interface ProbabilisticOptions {
  ttlSeconds: number;
  beta?: number; // Scaling factor, default 1.0
  computationTimeMs?: number; // Expected time to fetch from DB (delta), default 100ms
}

/**
 * Fetches data using Probabilistic Early Recomputation.
 * 
 * @param key The Redis key
 * @param fetchFn The function to call if cache is missed or pre-expired
 * @param options Configuration for TTL and algorithm tuning
 */
export async function getOrSetProbabilistic<T>(
  key: string,
  fetchFn: () => Promise<T>,
  options: ProbabilisticOptions
): Promise<T> {
  const { ttlSeconds, beta = 1.0, computationTimeMs = 100 } = options;

  // 1. Attempt to get value and TTL from Redis
  const [cachedValue, remainingTtlMs] = await redis.getWithTtl(key);

  let shouldRecompute = false;

  // 2. Logic: If cache is missing, obviously recompute
  if (!cachedValue) {
    shouldRecompute = true;
  } else if (remainingTtlMs > 0) {
    // 3. Logic: Probabilistic Check
    // We recompute if the remaining TTL is dangerously close to 0
    // based on our random jitter.
    // Recompute when: remainingTtlMs < beta * delta * (-ln(random()))
    
    // -Math.log(Math.random()) is exponentially distributed (mean 1):
    // usually small, but occasionally large.
    const randomJitter = -Math.log(Math.random());
    const recomputeThreshold = computationTimeMs * beta * randomJitter;

    // If the time left is less than our threshold, we volunteer to refresh
    if (remainingTtlMs < recomputeThreshold) {
      shouldRecompute = true;
    }
  }

  // 4. Return cached value immediately if we aren't the "chosen one"
  if (!shouldRecompute && cachedValue) {
    try {
      return JSON.parse(cachedValue);
    } catch (e) {
      // Corruption safeguard
      shouldRecompute = true; 
    }
  }

  // 5. Recomputation Phase (The Winner)
  // We are either the first caller, or we won the probabilistic lottery.
  try {
    const start = Date.now();
    const newValue = await fetchFn();
    const duration = Date.now() - start;

    // Update Redis
    // We update the computationTimeMs estimate dynamically if needed, 
    // but for simplicity, we just set the value.
    const serialized = JSON.stringify(newValue);
    await redis.set(key, serialized, 'EX', ttlSeconds);

    return newValue;
  } catch (err) {
    console.error(`DB Fetch failed for ${key}`, err);
    
    // Fallback: If we have stale data, return it rather than crashing
    if (cachedValue) {
      try {
        return JSON.parse(cachedValue);
      } catch {
        // Stale value is itself corrupt; fall through to the error below
      }
    }
    throw new UpstreamFetchError(`Failed to fetch ${key}`);
  }
}

// --- Usage Example ---

interface UserProfile {
  id: string;
  name: string;
  heavyData: string;
}

async function getUserProfile(userId: string): Promise<UserProfile> {
  const cacheKey = `user:${userId}`;
  
  return await getOrSetProbabilistic(
    cacheKey,
    async () => {
      // Simulate slow DB call
      await new Promise(resolve => setTimeout(resolve, 200)); 
      return { 
        id: userId, 
        name: "Architect", 
        heavyData: "x".repeat(1000) 
      };
    },
    {
      ttlSeconds: 600, // 10 minutes
      beta: 1.0,       // Default aggression
      computationTimeMs: 200 // We expect DB to take 200ms
    }
  );
}
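
The comment in the recomputation phase above hints at updating the computationTimeMs estimate dynamically. One way to do that, closer to the original X-Fetch formulation, is to store the measured recompute duration alongside the value and use it as $\Delta$ on the next read. Here is a rough sketch of the write side (the CachedEnvelope shape and helper name are assumptions, not part of the implementation above):

// Envelope stored in Redis: the value plus the measured recompute duration.
interface CachedEnvelope<T> {
  value: T;
  deltaMs: number; // How long fetchFn took last time; used as delta on reads
}

async function setWithMeasuredDelta<T>(
  key: string,
  fetchFn: () => Promise<T>,
  ttlSeconds: number
): Promise<T> {
  const start = Date.now();
  const value = await fetchFn();
  const deltaMs = Date.now() - start;

  const envelope: CachedEnvelope<T> = { value, deltaMs };
  await redis.set(key, JSON.stringify(envelope), 'EX', ttlSeconds);
  return value;
}

On the read side you would parse the envelope and feed envelope.deltaMs into the jitter check in place of the static computationTimeMs option.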

The Explanation

Why does this math actually work?

1. The Gap

remainingTtlMs represents how close we are to the "cliff" (the cache expiring). When remainingTtlMs is large (e.g., 9 minutes left on a 10-minute key), the inequality is almost never true. We return the cache.

2. The Jitter

-Math.log(Math.random()) is the magic.

  • If Math.random() is 0.99, result is 0.01.
  • If Math.random() is 0.01, result is 4.6.

As remainingTtlMs shrinks, it becomes increasingly likely to fall below computationTimeMs * beta * random_factor.
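
If you want to see the ramp-up for yourself, a standalone snippet like this (not part of the implementation above; delta mirrors the 100ms default used earlier) counts how many of 1,000 independent requests would volunteer at various remaining TTLs:

// Simulate 1,000 independent requests at different remaining TTLs and count
// how many would pass the probabilistic check.
const delta = 100; // computationTimeMs
const beta = 1.0;

for (const remainingTtlMs of [2000, 1000, 500, 200, 100]) {
  let volunteers = 0;
  for (let i = 0; i < 1000; i++) {
    const jitter = -Math.log(Math.random());
    if (remainingTtlMs < delta * beta * jitter) volunteers++;
  }
  console.log(`${remainingTtlMs}ms left -> ~${volunteers} of 1000 would recompute`);
}

In a live system the counts at the low end are rarely reached: some earlier request has almost always volunteered and refreshed the key before the TTL gets that small.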

3. The Result

Imagine 1,000 requests hit this key 100ms before it expires.

  • Standard caching: All 1,000 see a hit. At 0ms, all 1,000 see a miss.
  • Probabilistic caching: As the TTL winds down, a few of those requests win the jitter check. The first winner recomputes and extends the cache validity (by overwriting the key with a fresh TTL); requests arriving after the refresh see a full TTL again, and the rest were happily served the slightly "old" data.

By the time the key would have physically expired, it has already been refreshed.

Conclusion

The Thundering Herd problem isn't solved by making your database larger; it's solved by making your clients smarter.

Using Probabilistic Early Expiration allows you to treat cache expiration as a gradient rather than a binary event. By forcing a single client to pay the "computation tax" slightly early, you protect the entire system from the catastrophic latency that occurs when a hot key vanishes.

Implement this pattern for any cache key where the cost of a miss is high and the concurrency is significant.