
Hunting Thunks: Using ghc-debug to Fix Haskell Space Leaks in 2025

Memory leaks in garbage-collected languages are annoying. Memory leaks in Haskell are existential threats.

The problem is rarely that you forgot to free memory. The problem is that you told the runtime not to calculate it yet. In production, this manifests as the sawtooth pattern of doom: memory usage climbs steadily, the Garbage Collector (GC) works harder and harder to traverse a growing graph of unevaluated computations (thunks), and eventually, the application pauses for seconds at a time before the OOM killer intervenes.

Traditional profiling (compiling with -prof and running with +RTS -p -hc) is invasive. It requires recompilation, changes the program's runtime characteristics, and often perturbs the very timing-sensitive behavior you are trying to catch.

In 2025, the standard for diagnosing these issues in production is ghc-debug. This tool allows you to snapshot the heap of a running executable, analyze the closure graph, and pinpoint exactly which unevaluated thunk is retaining gigabytes of memory.

The Root Cause: The Haystack of Thunks

To fix a space leak, you must understand the STG (Spineless Tagless G-machine) representation of your data.

When you write let x = a + b, GHC does not compute x. It allocates a Thunk on the heap. This thunk is a closure containing:

  1. A code pointer (the entry code that will compute a + b when x is demanded).
  2. Pointers to its free variables (a and b), the captured environment.

If x is never evaluated to Weak Head Normal Form (WHNF), that thunk remains. If x is part of a long-running recursive structure (like a state accumulator), you don't just have one thunk; you have a linked list of thunks, each pointing to the previous one.

The Space Leak: A thunk itself is tiny (a header plus a few pointers). The danger is that a thunk keeps its whole environment alive. If a tiny thunk refers to a 500MB ByteString you thought you had discarded, the GC cannot collect that ByteString. This is a "retainer" leak.
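To see the pattern in miniature before diving into a real service, here is a small, self-contained sketch: the classic lazy left fold builds exactly this kind of thunk chain, while its strict cousin forces the accumulator at every step.

import Data.List (foldl')

-- leakySum never forces its accumulator: after the list is consumed, the
-- heap holds a chain (((0 + 1) + 2) + 3) ... of suspended (+) applications,
-- collapsed only when the final result is demanded.
leakySum :: [Int] -> Int
leakySum = foldl (+) 0

-- strictSum forces the accumulator to WHNF at each step,
-- so it runs in constant space regardless of list length.
strictSum :: [Int] -> Int
strictSum = foldl' (+) 0

main :: IO ()
main = print (strictSum [1 .. 10000000])

Compiled without optimization, leakySum over the same list allocates millions of thunks before a single addition happens; GHC's strictness analysis can sometimes rescue it at -O2, which is why leaks like this tend to surface only in particular build configurations.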

The Scenario: The "Strict" Map Trap

A common misconception is that using Data.Map.Strict eliminates space leaks. It forces keys and values to Weak Head Normal Form (and keeps the spine strict), but WHNF of a record is only its outermost constructor: the fields inside the value are not forced unless the value type itself enforces strictness.
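To make that concrete, here is a minimal sketch using Debug.Trace purely for illustration (Pair stands in for any record with lazy fields):

import qualified Data.Map.Strict as M
import Control.Exception (evaluate)
import Debug.Trace (trace)

data Pair = Pair Int Int   -- lazy fields, like UserMetric below

main :: IO ()
main = do
  let val = trace "outer constructor forced"
              (Pair (trace "field 1 forced" 1) (trace "field 2 forced" 2))
      m   = M.insert "k" val M.empty
  -- Data.Map.Strict forces the value to WHNF when the insert is evaluated,
  -- so only "outer constructor forced" is printed here...
  _ <- evaluate m
  putStrLn "map built; fields still unevaluated"
  -- ...the field traces fire only when something finally uses the fields.
  case M.lookup "k" m of
    Just (Pair a b) -> print (a + b)
    Nothing         -> pure ()

With strict fields, all three traces would fire together at insert time.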

Consider this production service tracking metrics.

The Leaky Application

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DeriveGeneric #-}

module Main where

import GHC.Generics (Generic)
import qualified Data.Map.Strict as M
import Control.Concurrent (threadDelay)
import Control.Monad (forever)
import Data.IORef
import GHC.Debug.Stub (withGhcDebug) -- Dependency: ghc-debug-stub

-- THE PROBLEM DATA TYPE
-- Even though we put this in a Strict Map, the fields 'count' and 'score'
-- are lazy by default in Haskell.
data UserMetric = UserMetric
  { count :: Int
  , score :: Double
  } deriving (Show, Generic)

type State = M.Map String UserMetric

updateMetric :: UserMetric -> UserMetric
updateMetric (UserMetric c s) = 
  -- These additions create thunks because UserMetric is lazy
  UserMetric (c + 1) (s + 1.5)

main :: IO ()
main = withGhcDebug $ do
    putStrLn "Starting server with ghc-debug enabled..."
    ref <- newIORef M.empty
    
    -- Simulate high-throughput event loop
    forever $ do
        modifyIORef ref $ \m -> 
            M.insertWith (\_ old -> updateMetric old) "user_123" (UserMetric 1 1.0) m
        
        -- Artificial delay to allow us to attach the debugger
        threadDelay 1000 

If you run this, resident memory grows without bound. Two things stay unevaluated: modifyIORef only queues up an unevaluated map update, and even when the map is forced, the UserMetric values inside it only ever reach WHNF; their fields build chains of (1 + 1 + ...) thunks.

The Fix: Analysis with ghc-debug

Instead of guessing, we prove the leak. We use ghc-debug-client to connect to the running process via a socket.

Step 1: Running the Analysis

Create a separate analysis script (e.g., Debugger.hs). This script connects to the socket exposed by withGhcDebug in the main app.

-- Dependency: ghc-debug-client. The analysis below follows its
-- pause / run / resume pattern; the module and function names track
-- recent ghc-debug-client releases and may differ slightly in the
-- version you have installed, so treat this as a sketch and check
-- the package documentation.
module Main where

import GHC.Debug.Client     -- withDebuggeeConnect, pause, resume, run, gcRoots
import GHC.Debug.Profile    -- censusClosureType, writeCensusByClosureType
import GHC.Debug.Retainers  -- findRetainersOfConstructor

main :: IO ()
main = withDebuggeeConnect "/tmp/ghc-debug" $ \dbg -> do
    putStrLn "Connected to application..."

    -- 1. Pause the application so the heap cannot mutate while we walk it.
    --    (GHC.Debug.Snapshot also lets you save the paused heap to disk
    --    for offline analysis.)
    pause dbg

    (census, retainers) <- run dbg $ do
        -- 2. Start the traversal from the GC roots
        roots <- gcRoots

        -- 3. Census: count live closures grouped by closure type / constructor.
        --    A THUNK count that rises linearly with uptime is the smoking gun.
        c <- censusClosureType roots

        -- 4. Find what is retaining our UserMetric closures
        --    (the root of the evil), capped at ten retainer paths.
        rs <- findRetainersOfConstructor (Just 10) roots "UserMetric"

        pure (c, rs)

    -- 5. Dump the census to disk and summarize the retainer search
    writeCensusByClosureType "census.csv" census
    putStrLn $ "Retainer paths found for UserMetric: " ++ show (length retainers)

    -- 6. Let the application continue running
    resume dbg

Note: You run the leaky app in one terminal and this debugger script in another. They communicate over a unix domain socket: ghc-debug-stub reads the socket path from the GHC_DEBUG_SOCKET environment variable, so launch the app with it set (e.g. GHC_DEBUG_SOCKET=/tmp/ghc-debug) and point the client at the same path.

The Output Evidence

When running the debugger against the leaky app, you will see the THUNK count rising linearly with the loop count. The UserMetric objects are technically in WHNF (the UserMetric constructor exists on the heap), but the fields inside them point to thunk closures rather than evaluated Int or Double values.
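You can reproduce the same field-level observation without ghc-debug in a GHCi session; :sprint prints the unevaluated parts of a value as underscores (Lazy and Strict are throwaway illustration types, and the exact output formatting may vary by GHC version):

ghci> data Lazy   = Lazy Int
ghci> data Strict = Strict !Int
ghci> let l = Lazy (1 + 2)
ghci> let s = Strict (1 + 2)
ghci> l `seq` s `seq` ()   -- force both to WHNF, as a strict map would
()
ghci> :sprint l
l = Lazy _
ghci> :sprint s
s = Strict 3

The lazy field survives as a thunk (the _) even though its constructor was forced; the strict field was evaluated the moment the constructor was built. That is exactly the difference the heap census shows at scale.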

The Solution: StrictData

We have identified that UserMetric is retaining thunks in its fields. We must enforce strictness at the data definition level.

While you can add strictness annotations (!Int) to individual fields, the robust solution for data structures intended for state accumulation is the StrictData language extension, which makes every field of every type declared in the module strict. Fields are then evaluated to WHNF immediately upon construction.
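Concretely, the two spellings are interchangeable: under StrictData the bare record in the corrected code below desugars to the explicitly annotated form. A standalone illustration:

module UserMetricStrict where

-- Explicit per-field strictness annotations: no extension required,
-- and exactly what {-# LANGUAGE StrictData #-} produces for a bare record.
data UserMetric = UserMetric
  { count :: !Int
  , score :: !Double
  } deriving Show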

The Corrected Code

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE StrictData #-} -- <--- THE FIX

module Main where

import GHC.Generics (Generic)
import qualified Data.Map.Strict as M
import Control.Concurrent (threadDelay)
import Control.Monad (forever)
import Data.IORef
import GHC.Debug.Stub (withGhcDebug)

-- With StrictData, Int and Double are unboxed/strict where possible
-- or at least evaluated to WHNF on construction.
data UserMetric = UserMetric
  { count :: Int
  , score :: Double
  } deriving (Show, Generic)

type State = M.Map String UserMetric

-- No changes needed here, but now this triggers evaluation
updateMetric :: UserMetric -> UserMetric
updateMetric (UserMetric c s) = 
  UserMetric (c + 1) (s + 1.5)

main :: IO ()
main = withGhcDebug $ do
    putStrLn "Starting optimized server..."
    ref <- newIORef M.empty
    
    forever $ do
        -- modifyIORef' forces the updated map to WHNF on every iteration,
        -- so the update runs now instead of queuing up as a thunk in the IORef.
        modifyIORef' ref $ \m ->
            -- insertWith (from Data.Map.Strict) forces the new value to WHNF,
            -- and StrictData makes that WHNF include the fields.
            M.insertWith (\_ old -> updateMetric old) "user_123" (UserMetric 1 1.0) m
        threadDelay 1000

Why This Works

Before (Lazy Fields)

  1. modifyIORef stores an unevaluated M.insertWith ... thunk in the IORef; when the map is eventually demanded, M.insertWith calls updateMetric.
  2. updateMetric returns UserMetric (thunk_1) (thunk_2).
  3. Map.Strict evaluates the UserMetric constructor to WHNF and no further.
  4. The fields remain pointers to unevaluated additions.
  5. Heap: IORef -> Map -> UserMetric -> Thunk (+) -> Thunk (+) -> ...

After (StrictData)

  1. modifyIORef' forces the new map to WHNF on every iteration, so M.insertWith calls updateMetric immediately.
  2. Because UserMetric now has strict fields, constructing UserMetric (c + 1) (s + 1.5) forces both additions on the spot.
  3. The thunks are consumed, the arithmetic is done, and the evaluated numbers are stored (unpacked to raw Int# / Double# when GHC can unbox the fields; see the sketch below).
  4. Heap: Map -> UserMetric -> evaluated Int / Double.
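Whether those numbers are stored as raw Int# / Double# or as pointers to evaluated boxes is up to the optimizer: with -O, GHC's -funbox-small-strict-fields flag (on by default) usually unpacks small strict fields, and you can insist on it with UNPACK pragmas. A hedged variant, shown as a standalone module:

module UserMetricUnpacked where

-- Each field is stored inline in the constructor as a raw machine word:
-- no pointer, no box, nothing for a thunk to hide behind.
data UserMetric = UserMetric
  { count :: {-# UNPACK #-} !Int
  , score :: {-# UNPACK #-} !Double
  } deriving Show

The trade-off is that an unpacked field has to be re-boxed whenever it is passed to a lazy or polymorphic context; for small numeric fields in long-lived accumulator state it is almost always a win.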

Conclusion

Space leaks are not mysterious ghosts; they are simply a disconnect between your mental model of data flow and the STG machine's execution strategy.

Don't blindly sprinkle strictness annotations (!) hoping for the best. Use ghc-debug to visualize the heap graph. Once you see the chain of thunks retaining your memory, the fix becomes obvious: enforce strictness at the data definition boundary using StrictData or explicit field annotations, ensuring your long-lived state contains values, not promises.