The Silent Memory Killer
The most insidious failure mode on the BEAM (the Erlang VM) is not the loud exception that crashes a process and triggers a supervised restart. It is the silent accumulation of messages in a single GenServer's mailbox.
You have likely seen the symptoms:
- Latency Spikes: Requests involving a specific subsystem start timing out.
- Memory Bloat: Node memory usage climbs steadily (sawtooth pattern or linear) despite no apparent load increase.
- The Cliff: The node suddenly becomes unresponsive to heartbeats and gets killed by the orchestrator or OOM killer.
- No Stacktrace: Because the process didn't crash logic-wise—it just drowned—there is often no error log pointing to the culprit.
This occurs because standard GenServer.cast/2 is a "fire-and-forget" mechanism. If the producer sends messages faster than the consumer can process them, the mailbox (a linked list in memory) grows infinitely.
The Root Cause: Unbounded Mailboxes & GC Debt
In Erlang/Elixir, every process has a mailbox. When you send a message, it is appended to this list. The receiver process matches messages off the head of the list.
The failure mechanics are twofold:
- Processing Latency: If your handle_cast/2 takes 5ms, a single consumer has a hard ceiling of 200 messages per second. If traffic spikes to 250 messages per second, the queue grows by 50 messages every second and never recovers on its own (see the sketch after this list).
- Garbage Collection Death Spiral: As the mailbox grows into the millions of items, the Garbage Collector (GC) has to traverse this massive backlog. The process spends more time GC-ing the pending messages than processing them, which slows processing further, which makes the mailbox grow faster, which increases GC pressure: a catastrophic positive feedback loop.
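A minimal sketch you can paste into IEx to see that arithmetic in action. SlowConsumer and the numbers are illustrative, not part of the application code:

defmodule SlowConsumer do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  def init(state), do: {:ok, state}

  def handle_cast({:work, _i}, state) do
    # ~5ms of work per message puts a hard ceiling of ~200 messages/second
    Process.sleep(5)
    {:noreply, state}
  end
end

{:ok, pid} = SlowConsumer.start_link([])

# Fire 15_000 casts; every one of them returns :ok immediately, with no backpressure
for i <- 1..15_000, do: GenServer.cast(SlowConsumer, {:work, i})

# At ~200 msg/sec the consumer needs over a minute to drain the backlog,
# which sits in its mailbox in the meantime:
Process.info(pid, :message_queue_len)
# => {:message_queue_len, ...} (close to 15_000 right after the burst)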
Diagnosis: Pinpointing the Bottleneck
You cannot fix what you cannot find. In production, you rarely have the luxury of attaching a GUI observer. You need :recon, a library designed for safe production diagnostics.
Add :recon to your mix.exs, or ensure it is available in your release.
Connect to your remote console and run the following. This returns the top 5 processes sorted by message queue length (message_queue_len).
# Connect to your production node via remote shell
# iex --name app@host --cookie <cookie> --remsh app@target
# Find the top 5 processes with the largest mailboxes
:recon.proc_count(:message_queue_len, 5)
Output Example:
[
  {#PID<0.4502.0>, 1250400,
   [
     current_function: {MyApp.Analytics.Aggregator, :handle_cast, 2},
     registered_name: MyApp.Analytics.Aggregator
   ]},
  ...
]
Here, MyApp.Analytics.Aggregator has over 1.2 million pending messages. We have found our bottleneck.
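If :recon is not available in the running release, a rougher equivalent built only on Process.list/0 and Process.info/2 works from the same remote shell. A sketch; it walks every process on the node, so run it sparingly:

Process.list()
|> Enum.map(fn pid ->
  # Returns nil for processes that exited between list/0 and info/2
  {pid, Process.info(pid, [:message_queue_len, :registered_name, :current_function])}
end)
|> Enum.reject(fn {_pid, info} -> is_nil(info) end)
|> Enum.sort_by(fn {_pid, info} -> -info[:message_queue_len] end)
|> Enum.take(5)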
The Fix: Partitioning and Sharding
The most robust solution for a single-process CPU bottleneck in modern Elixir (1.14+) is the PartitionSupervisor.
It solves the problem by splitting the single GenServer into N processes (usually equal to the number of scheduler threads/cores), automatically routing messages based on a key (like user_id or device_id). This preserves message ordering per-entity while parallelizing the total workload.
Step 1: The Bottleneck (Before)
This is the code causing the crash. It serializes all analytics events through one process.
defmodule MyApp.Analytics.Aggregator do
  use GenServer

  # Assumed location of the Event schema used below
  alias MyApp.Event

  # CLIENT API

  def start_link(_opts) do
    GenServer.start_link(__MODULE__, [], name: __MODULE__)
  end

  # The Dangerous Cast: No backpressure, single consumer
  def track_event(user_id, event_data) do
    GenServer.cast(__MODULE__, {:track, user_id, event_data})
  end

  # SERVER CALLBACKS

  def init(state), do: {:ok, state}

  def handle_cast({:track, user_id, event_data}, state) do
    # Each event costs ~5ms: the sleep stands in for expensive computation
    # on top of the database write
    Process.sleep(5)
    MyApp.Repo.insert!(%Event{user_id: user_id, data: event_data})
    {:noreply, state}
  end
end
Step 2: The Solution (PartitionSupervisor)
We will replace the singleton start_link with a PartitionSupervisor and route calls using the via tuple.
1. Update the Supervisor tree (Application.ex)
defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      MyApp.Repo,
      # Instead of starting the worker directly, we start the PartitionSupervisor
      {PartitionSupervisor,
       child_spec: MyApp.Analytics.Aggregator.child_spec([]),
       name: MyApp.Analytics.Aggregator.PartitionSwarm}
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
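The partition count defaults to System.schedulers_online/0. If you want to pin it explicitly (for example, to keep it in line with your database pool size so every partition can write concurrently), PartitionSupervisor accepts a :partitions option. A sketch; the count of 8 is purely illustrative:

{PartitionSupervisor,
 child_spec: MyApp.Analytics.Aggregator.child_spec([]),
 name: MyApp.Analytics.Aggregator.PartitionSwarm,
 # Defaults to System.schedulers_online(); 8 is just an example
 partitions: 8}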
2. Refactor the Worker to be Partition-Aware
We change the track_event function to route messages to the correct partition based on user_id.
defmodule MyApp.Analytics.Aggregator do
  use GenServer

  # Assumed location of the Event schema used below
  alias MyApp.Event

  # Use the standard child_spec, but remove the fixed global name
  # so multiple instances can spawn under the PartitionSupervisor.
  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts)
  end

  # CLIENT API

  def track_event(user_id, event_data) do
    # ROUTING LOGIC:
    # We use the :via tuple. PartitionSupervisor hashes the key (user_id)
    # and sends the message to one of the managed child processes.
    GenServer.cast(
      {:via, PartitionSupervisor, {__MODULE__.PartitionSwarm, user_id}},
      {:track, user_id, event_data}
    )
  end

  # SERVER CALLBACKS

  def init(_opts) do
    # Trap exits here only if you need terminate/2 cleanup; otherwise keep it simple
    {:ok, %{}}
  end

  def handle_cast({:track, user_id, event_data}, state) do
    # Processing is now distributed across all partitions.
    # With N partitions, CPU-bound throughput capacity grows up to N-fold.
    do_heavy_processing(user_id, event_data)
    {:noreply, state}
  end

  defp do_heavy_processing(user_id, data) do
    # Your business logic here
    MyApp.Repo.insert!(%Event{user_id: user_id, data: data})
  end
end
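To sanity-check the routing from a remote IEx session, GenServer.whereis/1 resolves the :via tuple to the partition that owns a given key. A sketch; the PID and partition count in the comments are illustrative:

MyApp.Analytics.Aggregator.track_event(42, %{event: "login"})
# => :ok

# The :via tuple resolves to the partition that owns key 42...
GenServer.whereis({:via, PartitionSupervisor, {MyApp.Analytics.Aggregator.PartitionSwarm, 42}})
# => #PID<0.321.0> (for example)

# ...and the same key always resolves to the same partition
GenServer.whereis({:via, PartitionSupervisor, {MyApp.Analytics.Aggregator.PartitionSwarm, 42}})
# => #PID<0.321.0>

# How many partitions are running
PartitionSupervisor.partitions(MyApp.Analytics.Aggregator.PartitionSwarm)
# => 8 (System.schedulers_online() unless configured otherwise)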
Why This Works
- Concurrency: By default, PartitionSupervisor spawns a number of children equal to System.schedulers_online() (the number of logical cores). If your bottleneck was CPU processing speed, you have immediately multiplied your throughput ceiling by the core count.
- Ordering: Standard round-robin load balancing breaks sequential guarantees (User A's "Logout" might be processed before their "Login" if the two events land on different workers). PartitionSupervisor instead hashes the term you provide (e.g. user_id) deterministically, so all messages for User A always land on the same partition and their order is preserved.
- Isolation: If one specific partition gets overloaded by a "hot" key (a single user spamming the system), only that partition's mailbox overflows. The other partitions remain responsive, limiting the blast radius of the failure. The sketch below shows how to inspect per-partition mailbox depth.
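To see that isolation in practice, a small check (a sketch; assumes the supervision tree above is running) lists each partition's PID and current mailbox depth using PartitionSupervisor.which_children/1 and Process.info/2:

# Each entry is {partition_id, pid, type, modules}; keep the pid and its queue length
MyApp.Analytics.Aggregator.PartitionSwarm
|> PartitionSupervisor.which_children()
|> Enum.flat_map(fn
  {partition, pid, _type, _modules} when is_pid(pid) ->
    case Process.info(pid, :message_queue_len) do
      {:message_queue_len, len} -> [{partition, pid, len}]
      nil -> []
    end

  _restarting ->
    []
end)
|> Enum.sort_by(fn {_partition, _pid, len} -> -len end)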
Conclusion
When a GenServer becomes a bottleneck, standard scaling (adding more nodes) often fails to address the specific architectural choke point of a singleton process.
Before reaching for external message queues like Kafka or heavier libraries like Broadway, leverage what already ships with Elixir and OTP. PartitionSupervisor allows you to scale constrained processes vertically across cores with minimal code changes, transforming a single point of failure into a resilient, concurrent processing layer.