Elixir Production Pitfalls: Diagnosing GenServer Mailbox Overflows


The Silent Memory Killer

The most insidious failure mode in the BEAM (Erlang VM) is not the loud exception that crashes a process and restarts it. It is the silent accumulation of messages in a single GenServer's mailbox.

You have likely seen the symptoms:

  1. Latency Spikes: Requests involving a specific subsystem start timing out.
  2. Memory Bloat: Node memory usage climbs steadily (sawtooth pattern or linear) despite no apparent load increase.
  3. The Cliff: The node suddenly becomes unresponsive to heartbeats and gets killed by the orchestrator or OOM killer.
  4. No Stacktrace: The process never crashed; it simply drowned. There is often no error log pointing at the culprit.

This occurs because standard GenServer.cast/2 is a "fire-and-forget" mechanism: the caller gets :ok back immediately, no matter how far behind the receiver is. If producers send messages faster than the consumer can process them, the mailbox (an in-memory queue on the process) grows without bound.
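
To see the mechanics in isolation, here is a minimal sketch (the SlowConsumer module and the numbers are hypothetical) that floods a deliberately slow GenServer with casts and then inspects its mailbox with Process.info/2:

defmodule SlowConsumer do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(_args), do: {:ok, nil}

  @impl true
  def handle_cast(:work, state) do
    # ~5ms per message puts a hard ceiling of ~200 messages/sec on this process.
    Process.sleep(5)
    {:noreply, state}
  end
end

{:ok, pid} = SlowConsumer.start_link([])

# Every cast returns :ok instantly, no matter how far behind the consumer is.
for _ <- 1..10_000, do: GenServer.cast(SlowConsumer, :work)

# The backlog is immediately visible and will take roughly 50 seconds to drain.
Process.info(pid, :message_queue_len)
#=> {:message_queue_len, 9973}  (approximate)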

The Root Cause: Unbounded Mailboxes & GC Debt

In Erlang/Elixir, every process has a mailbox. When you send a message, it is appended to this list. The receiver process matches messages off the head of the list.

The failure mechanics are twofold:

  1. Processing Latency: If your handle_cast/2 takes 5ms, you have a hard ceiling of 200 messages per second. If traffic spikes to 250 req/sec, the queue grows by 50 messages every second.
  2. Garbage Collection Death Spiral: As the mailbox grows into the millions of items, the Garbage Collector (GC) has to traverse this massive queue. The process spends more time GC-ing pending messages than processing them, which slows consumption further, which grows the mailbox faster, which increases GC pressure: a catastrophic positive feedback loop. You can watch both the queue and the memory climb with Process.info/2, as sketched below.
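
If you already have a pid you suspect, Process.info/2 accepts a list of keys, so you can check the queue, the memory, and the GC settings of that process in one call. A quick probe from a remote shell might look like this (suspect_pid and the numbers shown are illustrative):

# `suspect_pid` is whatever process you are investigating.
Process.info(suspect_pid, [:message_queue_len, :memory, :garbage_collection])
#=> [
#=>   message_queue_len: 1250400,   # pending messages
#=>   memory: 412345678,            # total process memory in bytes
#=>   garbage_collection: [...]     # heap sizing and GC counters
#=> ]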

Diagnosis: Pinpointing the Bottleneck

You cannot fix what you cannot find. In production you rarely have the luxury of attaching a GUI observer, so you need :recon, a library designed to be safe for production diagnostics.

Add :recon to your mix.exs, or ensure it is available in your release.
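
A minimal dependency entry might look like this (the version requirement is indicative; pin whatever your project already uses):

# mix.exs
defp deps do
  [
    {:recon, "~> 2.5"}
  ]
end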

Connect to your remote console and run the following. This returns the top 5 processes sorted by message queue length (message_queue_len).

# Connect to your production node via remote shell
# iex --name app@host --cookie <cookie> --remsh app@target

# Find the top 5 processes with the largest mailboxes
:recon.proc_count(:message_queue_len, 5)

Output Example:

[
  {#PID<0.4502.0>, 1250400,
   [MyApp.Analytics.Aggregator,
    {:current_function, {MyApp.Analytics.Aggregator, :handle_cast, 2}}]},
  ...
]

Here, MyApp.Analytics.Aggregator has roughly 1.25 million pending messages. We have found our bottleneck.
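
A single snapshot can be misleading: a large queue might simply be draining after a momentary burst. recon can also sample over a time window, which shows which mailboxes actually grew during the sample (the 1-second window here is arbitrary):

# Report the 5 processes whose message queues grew the most over 1 second.
# A queue that keeps growing, rather than spiking and draining, confirms a real bottleneck.
:recon.proc_window(:message_queue_len, 5, 1_000)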

The Fix: Partitioning and Sharding

The most robust solution for a single-process CPU bottleneck in modern Elixir (1.14+) is the PartitionSupervisor.

It solves the problem by splitting the single GenServer into N processes (by default one per online scheduler, via System.schedulers_online/0) and routing each message to a partition based on a key you supply (such as user_id or device_id). This preserves message ordering per entity while parallelizing the total workload.
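
The routing itself is deterministic per key. The sketch below is only an illustration of that idea (the real work is done for you by the :via tuple shown in Step 2): the same key always maps to the same partition index.

partitions = System.schedulers_online()

# Illustrative approximation of the routing: integer keys by remainder,
# any other term by hash. Same key in, same partition index out.
route = fn
  key when is_integer(key) -> rem(abs(key), partitions)
  key -> :erlang.phash2(key, partitions)
end

route.(42)            # always the same index for user_id 42
route.("device-abc")  # stable index for the same binary key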

Step 1: The Bottleneck (Before)

This is the code causing the crash. It serializes all analytics events through one process.

defmodule MyApp.Analytics.Aggregator do
  use GenServer

  # Assumes MyApp.Analytics.Event is the Ecto schema for persisted events.
  alias MyApp.Analytics.Event

  # CLIENT API

  def start_link(_opts) do
    GenServer.start_link(__MODULE__, [], name: __MODULE__)
  end

  # The dangerous cast: no backpressure, and a single consumer for every caller
  def track_event(user_id, event_data) do
    GenServer.cast(__MODULE__, {:track, user_id, event_data})
  end

  # SERVER CALLBACKS

  @impl true
  def init(_args), do: {:ok, %{}}

  @impl true
  def handle_cast({:track, user_id, event_data}, state) do
    # Simulate a slow (~5ms) synchronous write; this caps throughput at ~200 events/sec.
    Process.sleep(5)
    MyApp.Repo.insert!(%Event{user_id: user_id, data: event_data})
    {:noreply, state}
  end
end

Step 2: The Solution (PartitionSupervisor)

We will replace the singleton start_link with a PartitionSupervisor and route calls using the via tuple.

1. Update the Supervisor tree (Application.ex)

defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      MyApp.Repo,
      # Instead of starting the worker directly, we start the PartitionSupervisor
      {PartitionSupervisor,
       child_spec: MyApp.Analytics.Aggregator,
       # Defaults to one partition per online scheduler; tune with the :partitions option.
       name: MyApp.Analytics.Aggregator.PartitionSwarm}
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

2. Refactor the Worker to be Partition-Aware

We change the track_event function to route messages to the correct partition based on user_id.

defmodule MyApp.Analytics.Aggregator do
  use GenServer

  # Assumes MyApp.Analytics.Event is the Ecto schema for persisted events.
  alias MyApp.Analytics.Event

  # CLIENT API

  # Keep the standard child_spec, but drop the fixed global name
  # so multiple instances can run under the PartitionSupervisor.
  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts)
  end

  def track_event(user_id, event_data) do
    # ROUTING LOGIC:
    # The :via tuple asks PartitionSupervisor to hash the key (user_id)
    # and deliver the message to the matching partition process.
    GenServer.cast(
      {:via, PartitionSupervisor, {__MODULE__.PartitionSwarm, user_id}},
      {:track, user_id, event_data}
    )
  end

  # SERVER CALLBACKS

  @impl true
  def init(_opts) do
    # Trap exits here only if you need cleanup on shutdown; otherwise keep init simple.
    {:ok, %{}}
  end

  @impl true
  def handle_cast({:track, user_id, event_data}, state) do
    # Processing is now spread across all partitions: with N schedulers you get
    # up to N concurrent consumers (subject to downstream limits such as the Repo pool).
    do_heavy_processing(user_id, event_data)
    {:noreply, state}
  end

  defp do_heavy_processing(user_id, data) do
    # Your business logic here
    MyApp.Repo.insert!(%Event{user_id: user_id, data: data})
  end
end
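
Call sites do not change: producers still call track_event/2, and the :via tuple picks the partition for them. The payloads below are just example data:

# Events for the same user stay in order on one partition;
# different users fan out across all partitions.
MyApp.Analytics.Aggregator.track_event(42, %{event: "page_view", path: "/home"})
MyApp.Analytics.Aggregator.track_event(42, %{event: "logout"})
MyApp.Analytics.Aggregator.track_event(7, %{event: "login"})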

Why This Works

  1. Concurrency: By default, PartitionSupervisor spawns a number of children equal to System.schedulers_online() (the number of logical cores). If your bottleneck was CPU processing speed, you have immediately multiplied your throughput by the core count.
  2. Ordering: Naive round-robin load balancing breaks per-user ordering (User A's "Logout" might be processed before their "Login" if the two casts land on different workers). PartitionSupervisor instead hashes the routing key deterministically (e.g. user_id), so every message for User A always goes to the same partition and order is preserved.
  3. Isolation: If one partition is overloaded by a "hot" key (a single user spamming the system), only that partition's mailbox backs up. The other partitions remain responsive, limiting the blast radius of the failure; you can check per-partition queue depths with the sketch below.
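
To verify the fix in production, you can inspect the partitions and their queue depths. A quick sketch, using the names from the examples above:

swarm = MyApp.Analytics.Aggregator.PartitionSwarm

# How many partitions are running? (Defaults to System.schedulers_online/0.)
PartitionSupervisor.partitions(swarm)

# Per-partition mailbox depth: healthy partitions should hover near zero.
for {partition, pid, _type, _modules} <- PartitionSupervisor.which_children(swarm) do
  {partition, Process.info(pid, :message_queue_len)}
end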

Conclusion

When a GenServer becomes a bottleneck, standard scaling (adding more nodes) often fails to address the specific architectural choke point of a singleton process.

Before reaching for external message queues like Kafka or heavier libraries like Broadway, leverage the built-in power of OTP. PartitionSupervisor lets you spread a constrained process across all available cores with minimal code changes, transforming a single point of failure into a resilient, concurrent processing layer.