In production BEAM environments, few errors trigger paging alerts as consistently as the dreaded GenServer timeout:
** (exit) {
{:timeout, {GenServer, :call, [MyServer, :request, 5000]}},
[{GenServer, :call, 3, [file: 'lib/gen_server.ex', line: 1013]}, ...]
}
When this occurs under high load, it rarely happens in isolation. It usually signals a cascading failure: a "singleton" process has become a bottleneck, callers pile up behind it, memory usage spikes, and eventually the node restarts.
Increasing the default 5000ms timeout is almost never the correct solution. It only delays the inevitable crash. You need to diagnose the mailbox congestion and architecturally decouple the ingestion of requests from their processing.
The Root Cause: The Actor Model Bottleneck
To fix this, we must understand the mechanics of GenServer.call/3.
- Synchronous Block: The calling process sends a message to the target process and enters a receive loop, setting a timer (default 5 seconds).
- Single-Threaded Execution: The target GenServer is a single Erlang process. It processes its mailbox sequentially, one message at a time.
- The Accumulation: If the arrival rate of messages ($\lambda$) exceeds the service rate ($\mu$) of the handle_call/3 callback, the mailbox grows indefinitely.
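Under the hood, GenServer.call/3 is a monitored send followed by a selective receive with a deadline. A simplified sketch of that flow (CallSketch is a hypothetical module for illustration; modern OTP uses process aliases and GenServer wraps the exit reason, but the mechanics are the same):
defmodule CallSketch do
  def call(pid, request, timeout \\ 5_000) do
    ref = Process.monitor(pid)
    send(pid, {:"$gen_call", {self(), ref}, request})

    receive do
      {^ref, reply} ->
        Process.demonitor(ref, [:flush])
        reply

      {:DOWN, ^ref, :process, ^pid, reason} ->
        # The server died while we were waiting.
        exit(reason)
    after
      timeout ->
        Process.demonitor(ref, [:flush])
        # GenServer reports this as {:timeout, {GenServer, :call, [...]}}
        exit(:timeout)
    end
  end
end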
The Math of Failure: If your handle_call logic takes 10ms (e.g., a fast database write) and you receive 200 requests per second:
- Capacity: 100 req/sec.
- Load: 200 req/sec.
- Deficit: 100 messages accumulate per second.
After 5 seconds, 500 messages have accumulated, and a message at the back of the queue needs another 5 seconds just to reach the front of the line. By the time the GenServer picks it up, the caller has already exited with a timeout. The GenServer does the work anyway, sends a reply that no one is waiting for, and wastes CPU cycles churning through a backlog of "dead" requests.
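The same arithmetic, spelled out as a quick back-of-the-envelope calculation in IEx:
service_time_ms = 10                    # cost of one handle_call
arrival_rate = 200                      # incoming requests per second
capacity = div(1_000, service_time_ms)  #=> 100 req/sec the server can drain
deficit = arrival_rate - capacity       #=> 100 messages accumulate per second
queue_after_5s = deficit * 5            #=> 500 messages deep
queue_after_5s * service_time_ms        #=> 5_000 ms of queueing delay, i.e. the default timeout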
Phase 1: Diagnosis via Runtime Introspection
Before refactoring, confirm the mailbox is the culprit. In a live remote shell (IEx connected to production), inspect the suspect process.
# 1. Locate the process (by name or pid)
pid = Process.whereis(MyHighLoadServer)
# 2. Check the mailbox size and status
Process.info(pid, [:message_queue_len, :status, :current_function])
# Output: [message_queue_len: 15402, status: :running, current_function: {:gen_server, :loop, 7}]
If message_queue_len is high (thousands) or constantly growing, you have a throughput bottleneck.
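If you don't already know which process is backing up, a blunt scan of all local processes sorted by mailbox size usually surfaces the culprit. A minimal sketch (iterating every process has its own cost on large nodes, so use it sparingly):
# Top 10 local processes by mailbox size
Process.list()
|> Enum.map(fn pid ->
  case Process.info(pid, [:message_queue_len, :registered_name]) do
    [message_queue_len: len, registered_name: name] -> {len, name, pid}
    # The process exited between Process.list/0 and Process.info/2
    nil -> {0, nil, pid}
  end
end)
|> Enum.sort_by(fn {len, _name, _pid} -> len end, :desc)
|> Enum.take(10)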
To see what is clogging the mailbox, sample the messages:
# NOTE: :erlang.process_info(pid, :messages) does not consume messages,
# but it copies the entire mailbox into the calling process, which is
# expensive for a queue this large. Only do this for debugging.
# Peek at the first 5 messages in the mailbox:
:erlang.process_info(pid, :messages)
|> elem(1)
|> Enum.take(5)
Phase 2: The Solution
We will address a common scenario: A StatsAggregator that receives synchronous updates from thousands of web sockets.
The Anti-Pattern
This implementation blocks the GenServer while performing I/O (database writes).
defmodule MyApp.StatsAggregator do
  use GenServer

  def start_link(opts),
    do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def init(_opts), do: {:ok, %{}}

  # ❌ BAD: Blocking I/O inside handle_call on a high-throughput write path
  def handle_call({:record_stat, stat}, _from, state) do
    # Simulate 50ms of DB write latency
    Process.sleep(50)
    Repo.insert!(stat)
    {:reply, :ok, state}
  end
end
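To reproduce the failure mode in a shell, here is a hypothetical load sketch: fire 1,000 concurrent calls at the blocking singleton. At ~50ms per message the server drains only ~100 messages within the 5-second budget, so every caller queued behind those exits with the familiar timeout (and, because Task.async_stream links its tasks, the exit propagates to the shell):
# Assumes the MyApp.StatsAggregator above; drop the Repo.insert! line,
# or point Repo at a real Ecto repo, so the module runs standalone.
{:ok, _pid} = MyApp.StatsAggregator.start_link([])

1..1_000
|> Task.async_stream(
  fn i -> GenServer.call(MyApp.StatsAggregator, {:record_stat, %{id: i}}) end,
  max_concurrency: 1_000,
  # :infinity so the stream itself doesn't kill the tasks first;
  # the GenServer.call inside still uses the default 5_000ms.
  timeout: :infinity
)
|> Stream.run()
# Callers behind the first ~100 requests exit with
# {:timeout, {GenServer, :call, [MyApp.StatsAggregator, {:record_stat, ...}, 5000]}}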
The Fix: Partitioning and Asynchronous Offloading
To solve this, we must do two things:
- Remove blocking I/O from the critical path of the handle_call.
- Partition the bottleneck so we aren't limited to a single CPU core.
We will use Elixir 1.14's PartitionSupervisor to shard the GenServer, and Task.Supervisor to handle the I/O asynchronously.
1. The Partitioned Architecture
Instead of one StatsAggregator, we spin up N aggregators (one per scheduler/core by default).
application.ex (Supervisor Tree):
defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      {Task.Supervisor, name: MyApp.StatWriterTaskSup},
      # Starts one StatsAggregator per partition
      # (defaults to System.schedulers_online() partitions)
      {PartitionSupervisor,
       child_spec: MyApp.StatsAggregator.child_spec([]),
       name: MyApp.StatsAggregatorPartition}
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
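Once the tree is running, the partitioning can be sanity-checked from IEx. PartitionSupervisor.partitions/1 reports how many partitions were started, and resolving the via tuple shows which instance a given key routes to (keys are hashed with :erlang.phash2/2 by default). The output values shown here are illustrative:
PartitionSupervisor.partitions(MyApp.StatsAggregatorPartition)
#=> 8 (on a node with 8 online schedulers)

GenServer.whereis(
  {:via, PartitionSupervisor, {MyApp.StatsAggregatorPartition, "user_123"}}
)
#=> #PID<0.312.0>, the StatsAggregator instance that "user_123" maps to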
2. The Optimized GenServer
We change the interaction model. The handle_call should only update in-memory state (microseconds). The heavy lifting (persistence) moves to a background task or is batched.
Here, we use handle_continue to acknowledge the caller immediately, ensuring the timeout clock stops ticking, then process the write.
defmodule MyApp.StatsAggregator do
  use GenServer
  require Logger

  # Client API
  #
  # The PartitionSupervisor routes based on the `key` (the last element of
  # the via tuple). We use the user_id or resource_id so the same entity
  # always goes to the same partition (preserving order if needed).
  def record_stat(user_id, stat_data) do
    GenServer.call(
      {:via, PartitionSupervisor, {MyApp.StatsAggregatorPartition, user_id}},
      {:record_stat, stat_data}
    )
  end

  def start_link(opts) do
    # No name: each partition gets its own anonymous instance,
    # addressed through the PartitionSupervisor via tuple.
    GenServer.start_link(__MODULE__, opts)
  end

  def init(_) do
    # pending_writes is unused in this variant; see the batching sketch below.
    {:ok, %{pending_writes: []}}
  end

  # ✅ GOOD: Fast acknowledgment.
  # We reply :ok immediately, unblocking the caller, and defer the I/O.
  # (Any cheap, in-memory validation of the payload could happen here.)
  def handle_call({:record_stat, stat}, _from, state) do
    # Move the actual work to handle_continue
    {:reply, :ok, state, {:continue, {:persist, stat}}}
  end

  # This runs immediately after the reply, still inside this GenServer process.
  # It blocks the NEXT message, but the current caller is already free.
  def handle_continue({:persist, stat}, state) do
    write_async(stat)
    {:noreply, state}
  end

  # Alternative: if strict ordering isn't required, fire-and-forget to a
  # Task.Supervisor to get full parallelism for the writes themselves.
  defp write_async(stat) do
    Task.Supervisor.start_child(MyApp.StatWriterTaskSup, fn ->
      # In production this would be the real write, e.g. Repo.insert!(stat)
      Logger.info("Persisting stat: #{inspect(stat)}")
    end)
  end
end
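If per-stat tasks produce too many small writes, the pending_writes list can instead buffer stats in memory and flush them in bulk on a timer. A minimal sketch of that batching variant (the module name, the one-second flush interval, and the "stats" table for Repo.insert_all/2 are illustrative assumptions); it would be started under the same PartitionSupervisor in place of MyApp.StatsAggregator:
defmodule MyApp.BatchedStatsAggregator do
  use GenServer
  require Logger

  @flush_interval :timer.seconds(1)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def init(_opts) do
    schedule_flush()
    {:ok, %{pending_writes: []}}
  end

  # Still a fast acknowledgment: the stat only touches in-memory state.
  def handle_call({:record_stat, stat}, _from, state) do
    {:reply, :ok, %{state | pending_writes: [stat | state.pending_writes]}}
  end

  # Nothing buffered this interval: just reschedule.
  def handle_info(:flush, %{pending_writes: []} = state) do
    schedule_flush()
    {:noreply, state}
  end

  # One bulk write per interval instead of N individual inserts.
  def handle_info(:flush, %{pending_writes: pending} = state) do
    # In production: Repo.insert_all("stats", Enum.reverse(pending))
    Logger.info("Flushing #{length(pending)} stats")
    schedule_flush()
    {:noreply, %{state | pending_writes: []}}
  end

  defp schedule_flush do
    Process.send_after(self(), :flush, @flush_interval)
  end
end
The trade-off is durability: up to one flush interval of stats lives only in memory and is lost if that partition crashes.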
Why This Works
- Backpressure Release: By moving Repo.insert! out of handle_call, the time-to-reply drops from ~50ms to <1ms. The caller is released almost instantly, eliminating the timeout error.
- Horizontal Scalability: PartitionSupervisor creates multiple GenServer instances (defaulting to System.schedulers_online(); see the snippet after this list for setting the count explicitly). If you have 8 cores, you now have 8 mailboxes processing in parallel. A single busy queue won't block the entire system.
- Isolation: If one partition crashes due to a specific bad data payload, it only affects 1/N of your traffic. The rest of the system proceeds normally.
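The partition count can also be set explicitly when one-per-scheduler doesn't match the workload, for example when the writes are I/O-bound rather than CPU-bound. The PartitionSupervisor entry in application.ex simply gains a :partitions option:
{PartitionSupervisor,
 child_spec: MyApp.StatsAggregator.child_spec([]),
 name: MyApp.StatsAggregatorPartition,
 partitions: 16}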
Conclusion
When a GenServer times out, it is a symptom of flow control failure. Do not increase the timeout.
- Diagnose: Check process_info(pid, :message_queue_len).
- Optimize: Make handle_call return immediately. Use handle_continue for logic that must be sequential but doesn't require the client to wait.
- Scale: Use PartitionSupervisor to break singleton bottlenecks without introducing external infrastructure complexity.