
Diagnosing Haskell Space Leaks: A Practical Guide to ghc-debug

The most insidious failure mode in Haskell production systems is the slow-burning memory leak. Your application runs perfectly for days, but its Resident Set Size (RSS) creeps upward until the OOM killer terminates the process. Standard heap profiling (-hT) often changes the runtime characteristics enough to hide the bug (the "Heisenbug" effect), or requires restarting the process, destroying the state you need to inspect.

The modern solution is ghc-debug. This toolset allows you to connect to a running Haskell process, inspect the heap graph programmatically, and identify thunk buildup without stopping the world for extended periods or recompiling with heavy instrumentation.

The Root Cause: Thunks and WHNF

Haskell’s memory leaks are rarely "leaks" in the C/C++ sense (unfreed memory). They are almost always unwanted retention.

Because Haskell is lazy, an expression like acc + 1 is not evaluated immediately. It allocates a "thunk" (a closure representing the computation). If you store this thunk in a long-lived data structure (like a Map or State monad) without forcing it, the runtime builds a linked list of closures:

-- What you think you have:
10000

-- What you actually have in the heap:
((((0 + 1) + 1) + 1) ... + 1)

Data structures often only evaluate values to Weak Head Normal Form (WHNF). For a data constructor, WHNF means the constructor is known, but the fields inside might still be thunks. If your strictness annotations only force the outer layer, the inner thunks persist, retaining references to old data and growing the heap.
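The distinction is easy to demonstrate with seq, which evaluates exactly to WHNF. A minimal standalone sketch (not part of the service below; the error call stands in for an expensive unevaluated thunk):

```haskell
import Control.Exception (SomeException, evaluate, try)

-- A pair whose first component is a bottom (error) thunk.
pairWithThunk :: (Int, Int)
pairWithThunk = (error "unevaluated thunk", 42)

main :: IO ()
main = do
  -- 'seq' evaluates only to WHNF: the (,) constructor is forced,
  -- but the error hiding in the first field is NOT triggered.
  evaluate (pairWithThunk `seq` snd pairWithThunk) >>= print

  -- Forcing the field itself finally hits the thunk.
  r <- try (evaluate (fst pairWithThunk)) :: IO (Either SomeException Int)
  case r of
    Left _  -> putStrLn "fst was still a thunk (error fired only now)"
    Right _ -> putStrLn "unexpected"
```

Forcing the pair succeeds and prints 42; only touching the field detonates the error, proving the thunk survived the outer seq.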

The Setup: A Leaky Application

Let's create a realistic scenario: a long-running service aggregating metrics. We will implement a Metrics type that looks correct but contains a subtle laziness bug.

1. The Leaky Service (Main.hs)

This application listens on a socket (simulated here via a loop) and updates a map of user metrics.

Dependencies: ghc-debug-stub, containers, text

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE NumericUnderscores #-}

module Main where

import GHC.Debug.Stub (withGhcDebug)
import Control.Concurrent (threadDelay)
import qualified Data.Map.Strict as M
import Data.Text (Text)

-- THE BUG: Fields are lazy by default. 
-- Even though we use Data.Map.Strict, it only forces the 'Metrics' 
-- constructor to WHNF, leaving 'requestCount' as a growing thunk.
data Metrics = Metrics
  { requestCount :: Int
  , lastActive   :: Int
  } deriving (Show)

type State = M.Map Text Metrics

main :: IO ()
main = withGhcDebug $ do
    putStrLn "Starting server with ghc-debug enabled..."
    loop M.empty 0

loop :: State -> Int -> IO ()
loop state tick = do
    -- Simulate work: 10ms delay
    threadDelay 10_000 
    
    let user = "user_1"
    
    -- Update state
    let newState = M.alter (updateMetrics tick) user state
    
    -- Print stats every 1000 ticks so we know it's alive
    if tick `mod` 1000 == 0 
       then putStrLn $ "Tick: " <> show tick <> " | Heap growing..."
       else pure ()

    loop newState (tick + 1)

updateMetrics :: Int -> Maybe Metrics -> Maybe Metrics
updateMetrics tick Nothing = 
    Just $ Metrics 1 tick
updateMetrics tick (Just (Metrics count _)) = 
    -- This expression builds a thunk: (count + 1)
    Just $ Metrics (count + 1) tick

Run this application with the GHC_DEBUG_SOCKET environment variable set (for example, GHC_DEBUG_SOCKET=/tmp/ghc-debug) so the stub knows where to create its socket. It will begin consuming memory slowly but surely.

The Diagnosis: Inspecting with ghc-debug

Instead of guessing, we use ghc-debug-client to take a snapshot of the heap from a separate process. We will write a debugger script to find what is dominating memory.

2. The Debugger Script (Debugger.hs)

This script connects to the socket exposed by withGhcDebug, requests a heap graph, and performs a census.

Dependencies: ghc-debug-client, ghc-debug-common

module Main where

import GHC.Debug.Client
import GHC.Debug.Retainers
import GHC.Debug.Profile
import qualified Data.List as L
import qualified Data.Map as Map
import Control.Monad.IO.Class (liftIO)

socketPath :: FilePath
socketPath = "/tmp/ghc-debug" -- Must match the target's GHC_DEBUG_SOCKET

main :: IO ()
main = withDebuggeeConnect socketPath $ \d -> do
  putStrLn "Connected to application..."

  -- 1. Pause the app so the heap graph stays stable while we traverse it
  pause d
  run d $ do
    liftIO $ putStrLn "Requesting heap census..."

    -- Precache blocks up front; this speeds up the later traversals
    _ <- precacheBlocks

    -- 2. Perform a census by closure type (constructor name), starting
    -- from the GC roots. This downloads only the subset of the heap
    -- graph necessary for profiling.
    roots <- gcRoots
    c <- census2LevelClosureType roots

    -- 3. Print top 10 heap objects
    let topObjects = take 10 $ L.sortOn (negate . countSize . snd) (Map.toList c)

    liftIO $ putStrLn "\n=== TOP HEAP OBJECTS (Count, Size) ==="
    liftIO $ mapM_ printObject topObjects

    -- 4. Advanced: find what is retaining 'Int' thunks.
    -- If we see a high 'Int' count, we want to know WHO points to them.
    -- (In a real scenario, you'd target the closure types found in
    -- step 3; helper names vary between ghc-debug versions.)
    liftIO $ putStrLn "\n=== RETAINER ANALYSIS ==="
    retainers <- findRetainersOfConstructor (Just 5) roots "Int"
    liftIO $ displayRetainers retainers
  resume d

printObject :: (String, Count) -> IO ()
printObject (name, Count n size) = 
  putStrLn $ name <> ": " <> show n <> " objects, " <> show size <> " bytes"

-- Helper to pretty print retainer stacks
displayRetainers :: [[ClosurePtr]] -> IO ()
displayRetainers [] = putStrLn "No retainers found."
displayRetainers (r:_) = do
    putStrLn "Example path to leaking object:"
    -- Note: Real implementation would decode ClosurePtrs to names
    -- visualizing the chain: Root -> Map -> Metrics -> Int (Thunk)
    print r 

3. Analysis Results

Running Debugger.hs while Main.hs is running produces output similar to this:

=== TOP HEAP OBJECTS (Count, Size) ===
Int: 50000 objects, 800000 bytes
Metrics: 1 objects, 24 bytes
...

Interpretation:

  1. We see a massive count of Int objects.
  2. In a strict application, Int values are often unboxed or shared (small ones come from a static table). Seeing thousands of distinct Int-sized closures usually indicates arithmetic thunks waiting to be evaluated.
  3. The retainer analysis (conceptual output) reveals a chain: Root -> ... -> Map -> Metrics -> Int.

The Metrics object exists, but inside it, the Int field is pointing to a massive chain of computations rather than a raw number.
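This is exactly the pathology behind the classic foldl versus foldl' distinction. A standalone sketch of the same chain (illustrative only, not part of the service):

```haskell
{-# LANGUAGE NumericUnderscores #-}

import Data.List (foldl')

main :: IO ()
main = do
  -- foldl builds the chain from earlier: ((((0 + 1) + 1) + 1) ... + 1).
  -- No addition happens until the result is demanded by 'print', so the
  -- entire chain of thunks lives on the heap first.
  print (foldl (+) 0 [1 .. 1_000_000 :: Int])

  -- foldl' forces the accumulator to WHNF at every step; since Int's
  -- WHNF is the number itself, no chain ever forms.
  print (foldl' (+) 0 [1 .. 1_000_000 :: Int])
```

Both print the same sum; the difference is only visible in heap residency, which is precisely why such leaks survive functional testing.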

The Fix: Strict Fields

The issue is that Data.Map.Strict evaluates the Metrics value only to WHNF, and no further.

  1. Metrics (count + 1) tick is evaluated.
  2. The constructor Metrics is applied.
  3. A reference to the unevaluated thunk (count + 1) is stored in the first field.
  4. The addition is not performed.
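These steps can be observed directly. A sketch (reusing the Metrics shape from above; the error call stands in for the (count + 1) thunk) showing that Data.Map.Strict happily stores a constructor whose field is still unevaluated:

```haskell
import Control.Exception (SomeException, evaluate, try)
import qualified Data.Map.Strict as M

data Metrics = Metrics { requestCount :: Int, lastActive :: Int }

main :: IO ()
main = do
  -- Data.Map.Strict forces the *value* to WHNF on insert: the Metrics
  -- constructor is evaluated, but its lazy field is not.
  let m = M.insert ("user_1" :: String) (Metrics (error "thunk!") 0) M.empty
  putStrLn ("map size: " ++ show (M.size m))  -- insertion succeeded

  -- Only when we demand the field does the buried thunk blow up.
  r <- try (evaluate (requestCount (m M.! "user_1")))
         :: IO (Either SomeException Int)
  putStrLn (case r of
    Left _  -> "requestCount was still an unevaluated thunk"
    Right n -> "evaluated: " ++ show n)
```

The insert succeeds despite the bottom field, which is exactly how a "strict" map ends up retaining a chain of unevaluated additions.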

To fix this, we must strictly enforce evaluation of the fields inside the data structure.

The Corrected Code

Modify the data definition in Main.hs:

-- FIX: Add bang patterns (!) to enforce strictness
data Metrics = Metrics
  { requestCount :: !Int  -- <--- Strict field
  , lastActive   :: !Int  -- <--- Strict field
  } deriving (Show)

Alternatively, if you cannot modify the type definition (e.g., it comes from a library), you must force evaluation before insertion:

-- Alternative Fix using DeepSeq
import Control.DeepSeq (($!!))

updateMetrics :: Int -> Maybe Metrics -> Maybe Metrics
updateMetrics tick Nothing = 
    Just $ Metrics 1 tick
updateMetrics tick (Just (Metrics count _)) = 
    -- ($!!) fully evaluates the argument to Normal Form before wrapping in Just
    Just $!! Metrics (count + 1) tick

Why This Works

When you add the bang pattern (!Int), GHC changes the memory representation of Metrics.

  1. Without Bangs: Metrics contains a pointer to a heap object, which could be an evaluated Int or a thunk like (count + 1).
  2. With Bangs: When Metrics is constructed, the runtime must evaluate the arguments to WHNF immediately. Since Int's WHNF is the number itself, the addition executes immediately.
  3. Unpacking: With -O2, GHC will likely go further and "unpack" the Int directly into the Metrics object payload, removing the pointer entirely and reducing memory usage significantly.
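The difference in construction behaviour can be checked directly. A sketch contrasting a lazy and a strict variant of the type (hypothetical names LazyM and StrictM; the error call again stands in for a thunk):

```haskell
import Control.Exception (SomeException, evaluate, try)

data LazyM   = LazyM   Int  Int   -- lazy fields, as in the buggy version
data StrictM = StrictM !Int !Int  -- strict fields, as in the fix
                                  -- (with -O2, GHC typically unpacks these)

main :: IO ()
main = do
  -- Lazy field: the constructor is happy to wrap a bottom thunk.
  ok <- try (evaluate (LazyM (error "thunk") 0))
          :: IO (Either SomeException LazyM)
  putStrLn (either (const "LazyM: construction failed")
                   (const "LazyM: built fine, thunk smuggled in") ok)

  -- Strict field: construction itself forces the argument to WHNF,
  -- so the error fires immediately and no thunk is ever stored.
  bad <- try (evaluate (StrictM (error "thunk") 0))
           :: IO (Either SomeException StrictM)
  putStrLn (either (const "StrictM: construction forced the field (error fired)")
                   (const "StrictM: built fine") bad)
```

With strict fields there is simply nowhere for a thunk to hide: anything that would have been deferred is paid for at construction time.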

Conclusion

Memory leaks in Haskell are almost always unforced thunks hiding in long-lived data structures. Tools like ghc-debug allow you to surgically identify these leaks in running processes without the guesswork of traditional heap profiling.

  1. Instrument your entry point with withGhcDebug.
  2. Snapshot the heap with a custom client script.
  3. Identify high-count closures (usually Int, Maybe, or list nodes).
  4. Trace retainers to find the data structure holding them.
  5. Strictify fields with bang patterns to prevent thunk buildup.