Hot-Debugging the BEAM: Tracing Erlang Processes in Live Production without Downtime

Your metrics dashboard is screaming. Latency on the payment processing node has spiked to 5000ms. Memory usage is climbing vertically. The logs are paradoxically silent. You know a GenServer is stuck, or a message queue is overflowing, but you don't know which one.

In most runtimes, your only move is to capture a heap dump and restart the service, severing active connections and losing in-flight state.

The BEAM (Erlang VM) is different. It was designed for systems that cannot go down. You can surgically attach a remote shell to the running cluster, identify the rogue process, inspect its internal state, and even trace function calls in real-time—all without stopping the world.

Here is how to safely diagnose a zombie process in a high-throughput production environment.

The Root Cause: Mailboxes and Reductions

To fix a stuck BEAM node, you must understand how it breaks.

  1. Mailbox Overflow: Every process (actor) has a mailbox. If a GenServer receives cast messages faster than it can process them, the mailbox grows without bound until the node runs out of RAM (OOM). This often happens when a synchronous call times out and the caller immediately retries, flooding the receiver (a pattern sketched just after this list).
  2. Reduction Exhaustion (CPU Hog): The BEAM uses preemptive scheduling based on "reductions" (roughly, function calls). A process is normally preempted after a budget of a few thousand reductions (the exact figure varies by OTP release). However, NIFs (Native Implemented Functions) that do not yield can monopolize a scheduler thread, and even an ordinary tight loop, while still preempted, will dominate the reduction counts and starve other work.
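
To make the retry-flood pattern from point 1 concrete, here is a minimal, hypothetical sketch (module and message names are invented for illustration): the caller's synchronous call times out, and its fire-and-forget retry lets the worker's mailbox grow without bound.

# Hypothetical illustration of the retry-flood anti-pattern.
defmodule MyApp.FlakyCaller do
  def charge(worker, order) do
    # Synchronous call with a short timeout...
    GenServer.call(worker, {:charge, order}, 1_000)
  catch
    :exit, {:timeout, _} ->
      # ...followed by an immediate fire-and-forget retry. Nothing applies
      # back-pressure, so the worker's message_queue_len keeps climbing.
      GenServer.cast(worker, {:charge, order})
      {:error, :retried}
  end
end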

Standard debugging tools like :sys.trace/2 or generic IO.inspect are dangerous in production. If you trace a high-frequency process, the I/O overhead of printing the logs will crash the node faster than the original bug.

We will use Recon, a library written by Fred Hebert specifically for safe production debugging. It enforces limits on trace outputs to prevent cascading failures.

The Fix: Surgical Introspection

Prerequisite: You must have access to the production network and the Erlang cookie.

1. Connect a Remote Shell

Do not SSH into the server and grep logs. Connect a local Erlang shell directly to the production node's VM.

# If you are using Elixir (mix) releases
./bin/my_app remote

# Older Distillery releases use `remote_console` instead

# OR manually via IEx (requires network visibility)
# Replace 'cookie_secret' and 'live_node@ip' with actual values
iex --name debug@10.0.0.5 --cookie "cookie_secret" --remsh live_node@10.0.0.1
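
Once attached, it is worth a quick sanity check that the shell is evaluating on the production node and can see the rest of the cluster (node names below are the illustrative ones from the command above):

# Everything typed in this shell runs on the remote node, not your laptop.
node()       # => :"live_node@10.0.0.1"
Node.list()  # => peers connected to the production node, including your debug node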

2. Identify the Bottleneck (Safe Enumeration)

Once connected, do not run :observer.start() over a slow connection, and avoid walking the full process list (for example, piping Process.list() into Process.info/1) on a node with millions of processes.

Use :recon to find the top consumers. If :recon is not in your release, you can technically load it dynamically, but for this guide, we assume it is installed (it should be in every prod mix file).
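
For reference, pulling in Recon is a one-line dependency; the version constraint below is illustrative, pin whatever is current:

# mix.exs
defp deps do
  [
    {:recon, "~> 2.5"}
  ]
end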

Scenario A: High Memory (The Queue Builder)

Find the top 5 processes by message queue length.

# In the remote console:
:recon.proc_count(:message_queue_len, 5)
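
The result is a list of {pid, value, info} triples, where the middle element is the metric you asked for (here, the queue length). An illustrative result with made-up numbers:

[
  { #PID<0.4214.0>, 832114,
    [current_function: {MyApp.PaymentWorker, :process_transaction, 2},
     initial_call: {MyApp.PaymentWorker, :init, 1}]}
]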

Scenario B: High CPU (The Infinite Loop)

Find the top 5 processes by reduction count (work done) over a sliding window of 1000ms.

:recon.proc_window(:reductions, 5, 1000)

Output Example:

[
  { #PID<0.4214.0>, 150000,
    [current_function: {MyApp.PaymentWorker, :process_transaction, 2},
     initial_call: {MyApp.PaymentWorker, :init, 1}]}
]

We found the culprit: #PID<0.4214.0>, which executed roughly 150,000 reductions during the one-second window.

3. Inspect Internal State

Before tracing execution, look at the process metadata. This reveals what the process is currently running and the contents of its stack.

pid = :c.pid(0, 4214, 0) # Build the PID from its three integers (in IEx you can also use pid(0, 4214, 0))

# 1. Get lightweight process metadata. Avoid asking for :messages here;
#    that copies the entire mailbox into your shell.
Process.info(pid, [:current_stacktrace, :message_queue_len, :status])

# 2. Inspect the GenServer State
# WARNING: If the state is massive, this can hang your shell. 
# Set a timeout.
:sys.get_state(pid, 5000) 

If the status is :waiting, it is blocked on a receive. If it is :running, it is churning CPU.
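
If you also want to see what is actually sitting in the mailbox, do it guardedly: Process.info(pid, :messages) copies every queued message into your shell, which can be enormous on a backed-up process. A cautious sketch (the 1_000 threshold is arbitrary):

# Only pull the mailbox contents if the queue is small enough to copy safely.
{:message_queue_len, len} = Process.info(pid, :message_queue_len)

if len <= 1_000 do
  {:messages, msgs} = Process.info(pid, :messages)
  Enum.take(msgs, 5)   # eyeball the first few queued messages
else
  {:queue_too_large, len}
end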

4. Hot-Tracing Function Calls

The process is running, but logic is failing. We need to see the arguments coming into the function. We will use :recon_trace because it allows us to set a hard limit (e.g., "show me 10 calls, then stop tracing"). This prevents the "firehose" effect that kills servers.

Target: We suspect MyApp.PaymentWorker.calculate_tax/2 is receiving bad data.

# Syntax: :recon_trace.calls({Module, Function, ArgsOrMatchSpec}, MaxTraces, Options)

# Trace spec:
# 1. Module:     MyApp.PaymentWorker
# 2. Function:   :calculate_tax
# 3. Match spec: match any arguments and emit a :return_trace so we also see
#    what the function returned (use :_ here if you only care about the calls)

:recon_trace.calls(
  {MyApp.PaymentWorker, :calculate_tax, [{:_, [], [{:return_trace}]}]},
  10,
  [scope: :local] # :local catches internal calls (incl. private functions); the default :global only sees external Mod.fun calls
)

Output: The shell will print the next 10 trace messages (calls and their returns) for this function live as traffic hits the server.

14:23:01.442311 <0.4214.0> MyApp.PaymentWorker.calculate_tax(100.00, "US")
14:23:01.442500 <0.4214.0> MyApp.PaymentWorker.calculate_tax/2 --> {:ok, 5.00}

If you see the function entering but never returning (no --> arrow), you have located the precise freeze point.

5. Cleaning Up

Recon traces auto-terminate after the count (10) is reached. However, it is good practice to clear all trace patterns manually before disconnecting.

:recon_trace.clear()

The Explanation

Why Process.info isn't enough

Process.info(pid, :current_stacktrace) is a snapshot. In a distributed system, bugs often stem from the flow of data—race conditions where state changes A, B, and C happen in an unexpected order. Tracing captures the temporal dimension of the bug.

The Safety of recon vs :dbg

Erlang's built-in :dbg and :sys modules are powerful but dangerous. If you trace a function like Enum.map globally on a busy node, the VM tries to deliver a trace message for every single call to your shell. That flood saturates the shell's I/O and the distribution port, and the node can become unresponsive.

:recon_trace mitigates this via:

  1. Hard limits: every trace stops after N messages, or you can pass a {N, milliseconds} rate instead of an absolute count (see the sketch below).
  2. A dedicated tracer process: trace messages flow through recon's own tracer and formatter processes, which shut tracing down once the limit is hit, so a flood takes out the tracer rather than your shell or the node.
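
For example, instead of an absolute count you can hand :recon_trace.calls/3 a {max, milliseconds} rate, so a heavier-than-expected trace shuts itself off (the target function is the same hypothetical one from earlier):

# At most 50 trace messages per second; recon stops tracing if that is exceeded.
:recon_trace.calls(
  {MyApp.PaymentWorker, :calculate_tax, :_},
  {50, 1000},
  [scope: :local]
)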

Conclusion

The ability to attach a shell to a live cluster, identify a specific rogue actor among millions of processes, and watch it handle data in real time is a superpower unique to the BEAM. It turns "unexplainable" distributed system failures into solvable logic errors.

When your system is burning, don't restart it. Interrogate it.


Recommended Libraries for Production: