The most insidious memory leaks in Go services rarely stem from large, cached structs or uncleared maps. Instead, they manifest as a slow, inexorable rise in resident set size (RSS) that persists despite aggressive garbage collection cycles.
In my experience, 80% of these cases are not "memory" leaks in the traditional sense. They are goroutine leaks.
When a goroutine blocks indefinitely, it never terminates. Because the Go runtime considers every active goroutine a root for garbage collection, the goroutine’s stack and every variable reachable from that stack cannot be freed. A 2KB stack can easily hold references to 20MB of heap data. If your service spawns 100 leaky goroutines per minute, you will OOM.
The Root Cause: Orphaned Concurrency
Under the hood, a goroutine is a lightweight thread of execution managed by the Go runtime scheduler rather than the operating system. When a goroutine enters a blocked state (e.g., waiting on a channel send or receive, acquiring a mutex, or waiting on network I/O), the scheduler parks it (runtime.gopark).
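A toy sketch of that mechanism, with hypothetical names rather than code from any real service: the worker below parks forever on an unbuffered channel, and the large buffer reachable from its stack can never be reclaimed.
package main

import "time"

// pinMemory leaks one goroutine and ~20MB of heap: no receiver for ch will
// ever exist, so the send blocks forever and buf stays reachable from the
// parked goroutine's stack (a GC root).
func pinMemory() {
    buf := make([]byte, 20<<20) // ~20MB
    ch := make(chan []byte)     // unbuffered, never read
    go func() {
        ch <- buf // parked by the runtime forever
    }()
}

func main() {
    for i := 0; i < 10; i++ {
        pinMemory()
    }
    time.Sleep(time.Second) // ~200MB is now pinned, despite no live references in main
}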
The leak occurs when the condition required to unpark the goroutine becomes impossible to satisfy. Common scenarios include:
- Nil Channels: Sending to or receiving from a nil channel blocks forever (a minimal sketch appears just below).
- Abandoned Receivers: A goroutine tries to send to an unbuffered channel, but the receiver has already timed out or returned.
- Abandoned Senders: A goroutine waits to receive from a channel, but the sender exited without closing the channel.
The Garbage Collector cannot clean these up. It assumes that if a goroutine is blocked, it is waiting for an event that might still happen.
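The abandoned-sender and abandoned-receiver cases are exactly what the scenario later in this post demonstrates. The nil-channel case most often bites when a channel field or variable is declared but never initialized with make; a minimal, self-contained sketch with hypothetical names:
package main

import "time"

func main() {
    var done chan struct{} // nil: declared but never initialized with make

    go func() {
        <-done // receiving from a nil channel blocks forever; this goroutine leaks
    }()

    time.Sleep(100 * time.Millisecond)
    // In a long-running service, the goroutine above stays parked for the
    // life of the process and is never garbage collected.
}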
The Diagnosis: Profiling with pprof
To confirm a goroutine leak, you must inspect the runtime state. We use the standard library's net/http/pprof package.
1. Instrumentation
Ensure your service exposes pprof endpoints. If you are using net/http, this is often a one-line import.
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // Registers handlers at /debug/pprof/
)

func main() {
    // Start a diagnostic server on a separate port (e.g., 6060)
    // to avoid exposing internals to the public internet.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    // ... start your main app code ...
}
2. Capture and Analysis
When memory usage is high, capture a goroutine profile. This snapshot shows the stack traces of all currently existing goroutines.
# Download and interactively explore the goroutine profile
go tool pprof -http=:8081 http://localhost:6060/debug/pprof/goroutine
This command opens a web view. Navigate to View -> Source.
If you see a count of thousands of goroutines stuck on the same line of code—typically a select statement or a channel operation—you have found your leak.
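Alongside pprof, the raw goroutine count is a cheap leading indicator: a number that only climbs under steady traffic almost always means parked goroutines are accumulating. A minimal sketch of such a watchdog (the interval and log format are arbitrary choices, not part of pprof):
import (
    "log"
    "runtime"
    "time"
)

// logGoroutineCount is a small helper you could start once from main():
// it logs the live goroutine count every 30 seconds so you can watch
// the trend alongside RSS.
func logGoroutineCount() {
    go func() {
        for range time.Tick(30 * time.Second) {
            log.Printf("goroutines=%d", runtime.NumGoroutine())
        }
    }()
}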
The Scenario: The "Abandoned Sender" Leak
A classic pattern involves a request handler spawning a worker to perform a task with a timeout. If the timeout triggers, the handler returns, but the worker is left trying to report its result to a receiver that no longer exists.
The Leaky Code
package main
import (
"log"
"net/http"
"time"
)
// externalCall simulates a slow database or API call
func externalCall() int {
time.Sleep(500 * time.Millisecond) // Simulates latency
return 200
}
func leakyHandler(w http.ResponseWriter, r *http.Request) {
// Unbuffered channel - the root of the problem
resultCh := make(chan int)
go func() {
// If leakyHandler returns before this line executes,
// this goroutine blocks FOREVER trying to write to resultCh.
resultCh <- externalCall()
}()
select {
case res := <-resultCh:
w.WriteHeader(res)
case <-time.After(200 * time.Millisecond):
// Timeout occurs. We return 504.
// The 'resultCh' variable goes out of scope here.
// However, the anonymous goroutine above still holds a reference to it.
w.WriteHeader(http.StatusGatewayTimeout)
}
}
func main() {
http.HandleFunc("/leak", leakyHandler)
log.Fatal(http.ListenAndServe(":8080", nil))
}
In the code above, when the timeout hits (200ms), the main handler exits. The anonymous goroutine wakes up at 500ms and attempts to write to resultCh. Since resultCh is unbuffered and there are no active readers, the send blocks. The goroutine never exits.
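You can reproduce this locally without load-testing a running server. The rough test below (expected to fail, which is the point) assumes leakyHandler and externalCall live in the same package, e.g. in a leak_test.go file; the iteration count and threshold are arbitrary:
package main

import (
    "net/http/httptest"
    "runtime"
    "testing"
    "time"
)

func TestLeakyHandlerLeaksGoroutines(t *testing.T) {
    before := runtime.NumGoroutine()

    // Each call times out after 200ms and abandons its worker goroutine.
    for i := 0; i < 20; i++ {
        req := httptest.NewRequest("GET", "/leak", nil)
        leakyHandler(httptest.NewRecorder(), req)
    }

    // Give the workers time to finish externalCall (500ms) and hit the blocked send.
    time.Sleep(time.Second)

    leaked := runtime.NumGoroutine() - before
    if leaked >= 20 {
        t.Errorf("leaked %d goroutines, all parked on 'resultCh <- externalCall()'", leaked)
    }
}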
The Fix: Context Propagation and Buffered Channels
To fix this, we must ensure the background goroutine has an exit path even if the parent abandons it.
Solution 1: Buffer the Channel (The "Good Enough" Fix)
If the data being sent is small and you just want to prevent the block, make the channel buffered with a capacity of 1.
// A capacity of 1 allows the sender to drop the value and exit
// even if the receiver is gone.
resultCh := make(chan int, 1)
While this prevents the goroutine leak, it still lets externalCall run to completion, wasting CPU and I/O on a result nobody will read.
Solution 2: Context Cancellation (The Robust Fix)
The production-grade solution is to pass a context.Context to the worker. This allows us to interrupt the work immediately when the request times out.
package main
import (
"context"
"log"
"net/http"
"time"
)
// externalCall now accepts context to support early termination
func externalCall(ctx context.Context) (int, error) {
// Simulate work using a select on the context
select {
case <-time.After(500 * time.Millisecond):
return 200, nil
case <-ctx.Done():
// Clean up nicely if cancelled
return 0, ctx.Err()
}
}
func fixedHandler(w http.ResponseWriter, r *http.Request) {
// 1. Create a child context with the timeout
ctx, cancel := context.WithTimeout(r.Context(), 200*time.Millisecond)
defer cancel() // Crucial: release resources associated with the context
// 2. Buffer channel to prevent blocking if we hit a race condition
// between context cancellation and channel send.
resultCh := make(chan int, 1)
go func() {
// Pass the context down
res, err := externalCall(ctx)
if err != nil {
// Context cancelled; simple return to exit goroutine
return
}
// 3. Non-blocking send or Select-with-default is safest
// inside generic workers, though the buffer above handles this specific case.
select {
case resultCh <- res:
default:
// Receiver is gone, log if necessary and exit
}
}()
select {
case res := <-resultCh:
w.WriteHeader(res)
case <-ctx.Done():
w.WriteHeader(http.StatusGatewayTimeout)
}
}
func main() {
http.HandleFunc("/fixed", fixedHandler)
log.Fatal(http.ListenAndServe(":8080", nil))
}
Why This Works
- Context Aware: The externalCall function now listens to ctx.Done(). If the HTTP request times out, the context is cancelled instantly, and the worker stops processing immediately, saving CPU.
- Exit Path: If externalCall finishes after the handler has returned (a race condition), the select statement or the buffered channel ensures the send operation does not block.
- Resource Cleanup: defer cancel() ensures the context tree is torn down as soon as the handler exits, regardless of the outcome.
Conclusion
Memory leaks in Go are rarely about memory; they are about lifecycle management. When you see RAM usage climbing, do not start by optimizing struct sizes. Start by counting your goroutines.
If you are spawning a goroutine, you must know exactly how and when it will exit. If it relies on a channel operation, ask yourself: "What happens if the other side of this channel disappears?" If the answer is "it blocks," you have a leak. Use pprof to find them, and context to fix them.