Your metrics dashboard is screaming. Latency on the payment processing node has spiked to 5000ms. Memory usage is climbing vertically. The logs are paradoxically silent. You know a GenServer is stuck, or a message queue is overflowing, but you don't know which one.
In most runtimes, your only move is to capture a heap dump and restart the service, severing active connections and losing in-flight state.
The BEAM (Erlang VM) is different. It was designed for systems that cannot go down. You can surgically attach a remote shell to the running cluster, identify the rogue process, inspect its internal state, and even trace function calls in real-time—all without stopping the world.
Here is how to safely diagnose a zombie process in a high-throughput production environment.
The Root Cause: Mailboxes and Reductions
To fix a stuck BEAM node, you must understand how it breaks.
- Mailbox Overflow: Every process (Actor) has a mailbox. If a GenServer receives cast messages faster than it can process them, the mailbox grows indefinitely until the node runs out of RAM (OOM). This often happens when a synchronous call times out but the caller retries immediately, flooding the receiver (a minimal sketch of this failure mode follows the list).
- Reduction Exhaustion (CPU Hog): The BEAM uses preemptive scheduling based on "reductions" (roughly, function calls). A process usually yields after about 2,000 reductions. However, NIFs (Native Implemented Functions) or certain tight loops can monopolize a scheduler thread, causing other processes on that core to starve.
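As a hypothetical illustration of the first failure mode (the module names below are placeholders, not the app in this post): a consumer that needs ~50ms per message, flooded by casts that return :ok immediately and therefore apply no backpressure.

defmodule SlowConsumer do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok), do: {:ok, %{}}

  @impl true
  def handle_cast({:work, _payload}, state) do
    Process.sleep(50)   # simulate slow processing: ~20 messages/second capacity
    {:noreply, state}
  end
end

# Each cast returns :ok instantly, so nothing slows the sender down.
{:ok, _pid} = SlowConsumer.start_link([])
for i <- 1..100_000, do: GenServer.cast(SlowConsumer, {:work, i})

# The mailbox is now growing far faster than it drains:
Process.info(Process.whereis(SlowConsumer), :message_queue_len)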
Standard debugging tools like :sys.trace/2 or a scattering of IO.inspect calls are dangerous in production. If you trace a high-frequency process, the I/O overhead of printing every event can crash the node faster than the original bug would have.
We will use Recon, a library written by Fred Hebert specifically for safe production debugging. It enforces limits on trace outputs to prevent cascading failures.
The Fix: Surgical Introspection
Prerequisite: You must have access to the production network and the Erlang cookie.
1. Connect a Remote Shell
Do not SSH into the server and grep logs. Connect a local Erlang shell directly to the production node's VM.
# If you are using Elixir releases built with `mix release` (Elixir 1.9+)
./bin/my_app remote
# If you are using Distillery releases
./bin/my_app remote_console
# OR manually via IEx (requires network visibility)
# Replace 'cookie_secret' and 'live_node@ip' with actual values
iex --name debug@10.0.0.5 --cookie "cookie_secret" --remsh live_node@10.0.0.1
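Before running anything, confirm the shell really is attached to the node you expect; with --remsh, expressions evaluate on the remote node:

# Should print the live node's name, e.g. :"live_node@10.0.0.1"
Node.self()
# Other members of the cluster, if any
Node.list()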
2. Identify the Bottleneck (Safe Enumeration)
Once connected, do not run :observer.start() if you are over a slow connection, and never run Process.list() if you have millions of processes.
Use :recon to find the top consumers. If :recon is not in your release, you can technically load it dynamically, but for this guide, we assume it is installed (it should be in every prod mix file).
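For reference, adding it is a one-line dependency in mix.exs (the version constraint below is indicative; check Hex for the latest release):

# mix.exs
defp deps do
  [
    {:recon, "~> 2.5"}
  ]
end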
Scenario A: High Memory (The Queue Builder) Find the top 5 processes by message queue length.
# In the remote console:
:recon.proc_count(:message_queue_len, 5)
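The same call shape ranks processes by other attributes as well; two that pair nicely with a memory investigation:

# Top 5 by total process memory
:recon.proc_count(:memory, 5)
# Top 5 by refc binary memory held (handy for spotting binary leaks)
:recon.proc_count(:binary_memory, 5)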
Scenario B: High CPU (The Infinite Loop) Find the top 5 processes by reduction count (work done) over a sliding window of 1000ms.
:recon.proc_window(:reductions, 5, 1000)
Output Example:
[
  {#PID<0.4214.0>, 150000,
   [current_function: {MyApp.PaymentWorker, :process_transaction, 2},
    initial_call: {MyApp.PaymentWorker, :init, 1}]}
]
We found the culprit: #PID<0.4214.0>.
3. Inspect Internal State
Before tracing execution, look at the process metadata. This reveals what the process is currently running and the contents of its stack.
pid = :c.pid(0, 4214, 0) # Build the PID from its three integers (IEx's pid/3 helper works too)
# 1. Get targeted process info (ask for cheap keys only; :messages would copy the whole mailbox)
Process.info(pid, [:current_stacktrace, :message_queue_len, :status])
# 2. Inspect the GenServer State
# WARNING: If the state is massive, this can hang your shell.
# Set a timeout.
:sys.get_state(pid, 5000)
If the status is :waiting, it is blocked on a receive. If it is :running, it is churning CPU.
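Recon also bundles most of this into one production-safe call; it reports queue length and memory figures without dumping the raw mailbox or state (the grouping below follows Recon's docs):

# Grouped process summary: meta, signals, location, memory_used, work
:recon.info(pid)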
4. Hot-Tracing Function Calls
The process is running, but logic is failing. We need to see the arguments coming into the function. We will use :recon_trace because it allows us to set a hard limit (e.g., "show me 10 calls, then stop tracing"). This prevents the "firehose" effect that kills servers.
Target: We suspect MyApp.PaymentWorker.calculate_tax/2 is receiving bad data.
# Syntax: :recon_trace.calls({Module, Function, ArityOrMatchSpec}, MaxTraces, Options)
# Trace:
# 1. Module: MyApp.PaymentWorker
# 2. Function: :calculate_tax
# 3. Match spec: match any arguments and emit a return trace,
#    so we see what the function returns as well as what it was called with.
#    (Use a plain arity or :_ instead if you only care about the calls.)
:recon_trace.calls(
  {MyApp.PaymentWorker, :calculate_tax, [{:_, [], [{:return_trace}]}]},
  10,
  [scope: :local] # :local also traces calls made from inside the module (incl. private functions)
)
Output: The shell will print the next 10 calls to this function live as traffic hits the server.
14:23:01.442311 <0.4214.0> MyApp.PaymentWorker.calculate_tax(100.00, "US")
14:23:01.442500 <0.4214.0> MyApp.PaymentWorker.calculate_tax/2 --> {:ok, 5.00}
If you see the function entering but never returning (no --> arrow), you have located the precise freeze point.
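If the function is hot and most calls are healthy, you can narrow the trace with a match-spec guard so only the interesting calls are printed. The pattern below assumes the (amount, country) argument order shown in the output above.

# Only report calls whose second argument is "US", still emitting return traces
:recon_trace.calls(
  {MyApp.PaymentWorker, :calculate_tax, [{[:_, "US"], [], [{:return_trace}]}]},
  10,
  [scope: :local]
)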
5. Cleaning Up
Recon traces auto-terminate after the count (10) is reached. However, it is good practice to clear all trace patterns manually before disconnecting.
:recon_trace.clear()
The Explanation
Why Process.info isn't enough
Process.info(pid, :current_stacktrace) is a snapshot. In a distributed system, bugs often stem from the flow of data—race conditions where state changes A, B, and C happen in an unexpected order. Tracing captures the temporal dimension of the bug.
The Safety of recon vs :dbg
Erlang's built-in :dbg or :sys modules are powerful but dangerous. If you trace a function like Enum.map globally on a busy node, the VM attempts to send a trace message for every single execution to your shell. This floods the shell's group leader and the distribution port, and the node becomes unresponsive.
:recon_trace mitigates this via:
- Rate Limiting: It stops after N trace messages, or accepts a {Max, Milliseconds} tuple to cap the rate rather than the total (a sketch of that form follows below).
- Safety Valve: Every trace message flows through a dedicated tracer process; once the limit is hit (or that process dies), all trace patterns are cleared, so a runaway trace kills the trace, not the node.
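A minimal sketch of the rate-limited form, reusing the same target as above:

# At most 50 trace messages per second, instead of a fixed total count
:recon_trace.calls({MyApp.PaymentWorker, :calculate_tax, :_}, {50, 1000}, [scope: :local])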
Conclusion
The ability to attach a shell to a live cluster, identify a specific rogue actor among millions, and watch it process data in real time is a superpower unique to the BEAM. It turns "unexplainable" distributed system failures into solvable logic errors.
When your system is burning, don't restart it. Interrogate it.
Recommended Libraries for Production:
- Recon: github.com/ferd/recon
- Observer CLI: github.com/zhongwencool/observer_cli (a text-based UI similar to htop)