There are few sights more frustrating in a Node.js backend environment than checking your logs and seeing the dreaded crash signature:
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
This error frequently occurs during ETL operations, such as processing multi-gigabyte CSVs, parsing JSON streams, or migrating database records. While increasing the memory limit (--max-old-space-size) offers a temporary band-aid, it is not a solution. It merely delays the inevitable crash as your dataset grows.
The root cause is rarely the file size itself, but rather a mismanagement of backpressure. When your reading stream (the Producer) pushes data into the internal buffer faster than your writing stream (the Consumer) can process it, the memory usage balloons until the V8 heap is exhausted.
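In practice, that usually looks something like the following anti-pattern, where every chunk is collected in memory with nothing ever pausing the reader. This is a deliberately naive sketch (the file path is a placeholder):

import { createReadStream } from 'node:fs';

// Anti-pattern: accumulating an entire multi-gigabyte file on the V8 heap
const chunks = [];
createReadStream('./large-dataset.csv', { encoding: 'utf8' })
  .on('data', (chunk) => chunks.push(chunk)) // nothing ever pauses the reader
  .on('end', () => {
    // The whole file is now resident in memory at once
    console.log(`Buffered ${chunks.length} chunks in memory`);
  });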
The Mechanics of a Heap Overflow
To solve the problem, we must understand the architecture of Node.js streams.
All Node.js streams (Readable, Writable, Transform) utilize an internal buffer. The size of this buffer is capped by the highWaterMark option, which defaults to 16 KB for byte streams (64 KB for fs.createReadStream) and 16 objects for streams in object mode.
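You can verify these defaults yourself by inspecting the readableHighWaterMark property on freshly created streams; the bare Readable instances below exist purely for illustration:

import { Readable } from 'node:stream';

const byteStream = new Readable({ read() {} });
const objectStream = new Readable({ objectMode: true, read() {} });

console.log(byteStream.readableHighWaterMark);   // 16384 (bytes)
console.log(objectStream.readableHighWaterMark); // 16 (objects)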
Under normal operation, a stream flows data efficiently. However, a mismatch in speed creates a bottleneck:
- The Producer (e.g., fs.createReadStream) reads from the disk at incredibly high speeds (hundreds of MB/s).
- The Consumer (e.g., writing to a DB, sending an HTTP request, or complex parsing) is significantly slower (I/O or CPU bound).
- The Failure: If the Producer does not "pause" when the Consumer is busy, Node.js buffers the pending chunks in memory (RAM).
- The Crash: These buffered chunks accumulate in the "Old Space" of the V8 heap. Once this space fills up, the Garbage Collector (GC) goes into overdrive, freezing the application, before finally crashing with an OOM error.
The Solution: Implementing Backpressure
Backpressure is the feedback mechanism that allows the Consumer to signal the Producer to stop sending data until the current backlog is processed.
While stream.pipe() handles this automatically for simple file copies, complex logic (like parsing CSV rows and validating data asynchronously) often breaks the pipe chain or requires manual handling.
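For a plain byte-for-byte copy, pipe() alone is enough, because it pauses the reader whenever the writer's buffer is full; the file names below are placeholders:

import { createReadStream, createWriteStream } from 'node:fs';

// pipe() negotiates backpressure automatically for this simple case
createReadStream('./input.csv').pipe(createWriteStream('./copy.csv'));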
Below is a robust, production-ready implementation using Node.js Pipelines and Async Iterators. This approach is superior to legacy pipe() because it automatically handles error cleanup and supports modern async/await flows.
Prerequisites
Ensure you are running Node.js 18+ (LTS) to fully utilize native stream/promises and global AbortController support if needed.
1. The Setup: Simulating the Bottleneck
First, the configuration and imports. We will simulate a slow asynchronous processing task (like a database write) to prove that memory usage remains stable.
import { createReadStream, createWriteStream } from 'node:fs';
import { pipeline } from 'node:stream/promises';
import { createInterface } from 'node:readline';
// Configuration
const INPUT_FILE = './large-dataset.csv'; // Assume a 5GB file
const OUTPUT_FILE = './processed-output.json';
// Simulated database delay (ms)
const DB_WRITE_LATENCY = 10;
2. The Transformer: Processing with Backpressure
The classic tool for this step is a Transform stream. The critical component is the transform callback: you invoke callback() only once the chunk has been fully handled, which signals that you are ready for the next one. A minimal sketch of that callback-based style (the two column names are assumptions):
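import { Transform } from 'node:stream';

const csvLineToObject = new Transform({
  objectMode: true,
  transform(line, _encoding, callback) {
    const [id, name] = String(line).split(','); // assumes a simple two-column row
    // Calling callback() both pushes the result downstream and signals
    // readiness for the next chunk -- this is the backpressure hook.
    callback(null, { id, name });
  },
});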
However, a more readable approach for modern Node.js is utilizing Async Generators. This allows us to treat the stream as an iterable, where the await keyword automatically creates backpressure.
/**
* A generator that parses raw lines into JSON objects.
* This acts as our logical transformation layer.
*/
async function* csvToJsonIterator(sourceStream) {
const rl = createInterface({
input: sourceStream,
crlfDelay: Infinity,
});
let isHeader = true;
let headers = [];
for await (const line of rl) {
if (isHeader) {
headers = line.split(',');
isHeader = false;
continue;
}
const values = line.split(',');
const record = headers.reduce((obj, header, index) => {
obj[header.trim()] = values[index]?.trim();
return obj;
}, {});
yield record;
}
}
/**
* A generator that simulates a slow writable process (e.g., DB Insert).
* It yields strings ready to be written to the output file.
*/
async function* processAndWriteIterator(iterator) {
let count = 0;
for await (const record of iterator) {
// Simulate heavy async processing / DB Backpressure
await new Promise((resolve) => setTimeout(resolve, DB_WRITE_LATENCY));
// Transform to final string format for the file
yield JSON.stringify(record) + '\n';
count++;
if (count % 1000 === 0) {
// Log memory usage to prove stability
const used = process.memoryUsage().heapUsed / 1024 / 1024;
console.log(`Processed ${count} records. Heap used: ${Math.round(used)} MB`);
}
}
}
3. The Execution: Wiring it with stream.pipeline
We use pipeline from node:stream/promises. This is the gold standard for connecting streams. It properly destroys all streams if one fails, preventing file descriptor leaks—a common issue when using pipe().
async function runETL() {
console.time('ETL Process');
try {
const source = createReadStream(INPUT_FILE, {
encoding: 'utf8',
// HighWaterMark tuning:
// Larger allows more throughput, smaller saves memory.
// 64kb is a balanced default for text.
highWaterMark: 64 * 1024
});
const destination = createWriteStream(OUTPUT_FILE);
// pipeline signature: source -> transform(s) -> destination
await pipeline(
source,
csvToJsonIterator, // Parses CSV lines into objects
processAndWriteIterator, // Handles Async logic + Backpressure
destination // Writes to disk
);
console.log('Pipeline succeeded.');
} catch (error) {
console.error('Pipeline failed:', error);
} finally {
console.timeEnd('ETL Process');
}
}
// Execute
runETL();
Deep Dive: Why This Works
The magic lies in the for await...of loop within the generators.
- Implicit Pausing: When processAndWriteIterator hits the line await new Promise(...), execution effectively pauses inside that function scope.
- Signal Propagation: Because that generator is not pulling the next item from the previous iterator (csvToJsonIterator), the previous iterator stays suspended at its own yield.
- Readline Integration: The readline interface detects that its consumer is not asking for data and stops pulling from the source stream.
- Source Level: With nothing consuming it, the readable stream's buffer fills to its highWaterMark and Node stops issuing reads against the file descriptor, so no further data is pulled from the disk.
The memory profile remains flat. Instead of loading 5GB into RAM, Node.js only holds the chunk currently being processed plus the small internal buffers (highWaterMark) of the streams involved.
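You can observe this pull-driven pausing in isolation with a plain async generator and no files at all; the delay and count below are arbitrary:

import { setTimeout as sleep } from 'node:timers/promises';

async function* fastProducer() {
  for (let i = 0; i < 5; i++) {
    console.log(`producing ${i}`); // only runs when the consumer pulls the next value
    yield i;
  }
}

for await (const value of fastProducer()) {
  await sleep(500); // the producer stays suspended at its yield during this wait
  console.log(`consumed ${value}`);
}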
Manual Backpressure Handling (The "Under the Hood" View)
If you are writing a custom class extending Writable and cannot use async iterators, you must understand the write() return value.
Here is how backpressure is manually implemented without generators. This is often required when interfacing with legacy libraries or drivers.
import { Writable } from 'node:stream';
// Placeholder for a real driver call; assume it returns a Promise (hypothetical)
const fakeDbInsert = (record) => new Promise((resolve) => setTimeout(resolve, 10));
class SlowDbWriter extends Writable {
constructor(options) {
super({ ...options, objectMode: true });
}
_write(chunk, encoding, callback) {
// 1. Perform the async operation
fakeDbInsert(chunk)
.then(() => {
// 2. Signal that we are ready for more data
callback();
})
.catch((err) => {
// 3. Propagate errors correctly
callback(err);
});
}
}
// Usage in a manual loop (Simulated)
// logic:
// const canWrite = stream.write(data);
// if (!canWrite) stream.once('drain', resumeReading);
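Expanding that pseudocode into a runnable sketch, here is one way to drive SlowDbWriter by hand while respecting the 'drain' event (the sample records are hypothetical):

async function writeAllRecords(records, writer) {
  for (const record of records) {
    // write() returns false once the internal buffer reaches highWaterMark
    if (!writer.write(record)) {
      // Pause the producer until the writer has flushed its backlog
      await new Promise((resolve) => writer.once('drain', resolve));
    }
  }
  // Signal end-of-input and wait for the stream to finish
  await new Promise((resolve, reject) => {
    writer.once('error', reject);
    writer.end(resolve);
  });
}

// await writeAllRecords([{ id: 1 }, { id: 2 }], new SlowDbWriter());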
When implementing _write, strictly ensure callback is invoked only when the asynchronous operation completes. If you call callback immediately while the promise is still pending, you defeat backpressure, causing the upstream to flood your process with data it assumes you have finished handling.
Common Pitfalls and Edge Cases
1. Mixing async/await with Event Emitters
A common mistake is attaching an async function to a standard event listener:
// DON'T DO THIS
myStream.on('data', async (chunk) => {
await saveToDb(chunk); // The stream does NOT wait for this!
});
This fails because the event emitter does not await the handler. It fires 'data' events as fast as possible, flooding the event loop and ignoring the promise returned by the handler. Always use pipeline or for await...of to enforce sequential processing.
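The safe equivalent pulls from the stream with for await...of, so the next chunk is not requested until the current one has been handled:

// DO THIS INSTEAD
for await (const chunk of myStream) {
  await saveToDb(chunk); // the stream stays paused until this resolves
}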
2. Ignoring highWaterMark
If you are processing tiny rows (e.g., a CSV with two columns), the default highWaterMark (16 objects in object mode) may be too small, causing excessive pause/resume overhead. If you are processing 4MB JSON blobs, the default may be too large, risking OOM. Tune accordingly (see the sketch after this list):
- Small objects: Increase highWaterMark to 512 or 1024.
- Large objects: Decrease highWaterMark to 1 or 2.
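A minimal sketch of both ends of that tuning, using throwaway object-mode writers (the exact numbers are assumptions to adapt to your workload):

import { Writable } from 'node:stream';

// Tiny rows: buffer more of them to reduce pause/resume churn
const smallRowWriter = new Writable({
  objectMode: true,
  highWaterMark: 1024,
  write(row, _encoding, callback) { callback(); },
});

// Multi-megabyte blobs: keep at most a couple in memory at once
const blobWriter = new Writable({
  objectMode: true,
  highWaterMark: 2,
  write(blob, _encoding, callback) { callback(); },
});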
3. Zombie Streams
When using standard pipe(), if the destination stream emits an error, the source stream is not automatically closed. This leaves file descriptors open (FD leak). Always use stream.pipeline() (callback or promise version), which attaches error handlers to all streams in the chain and ensures proper teardown.
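For completeness, the callback form behaves the same way as the promise form used earlier: if any stage errors, every stream in the chain is destroyed (the file paths are placeholders):

import { pipeline } from 'node:stream';
import { createReadStream, createWriteStream } from 'node:fs';

pipeline(
  createReadStream('./large-dataset.csv'),
  createWriteStream('./backup.csv'),
  (err) => {
    // By the time this runs, both streams have been destroyed on failure
    if (err) console.error('Pipeline failed:', err);
    else console.log('Pipeline succeeded.');
  }
);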
Conclusion
Node.js is capable of processing terabyte-scale datasets with a minimal memory footprint (often under 50MB of RAM). The key is trusting the stream architecture.
By moving away from manual event listeners (on('data')) and embracing Async Iterators wrapped in stream.pipeline, you ensure that your application respects the processing limits of your infrastructure, eliminating Heap Out of Memory errors permanently.