
Streaming AI Responses in REST APIs using Server-Sent Events (SSE)

Building applications powered by Large Language Models (LLMs) introduces a unique latency problem. Standard REST APIs wait for the entire response payload to be generated before transmitting it to the client. When an LLM takes upwards of 30 seconds to generate a complex, multi-paragraph completion, the resulting user experience degrades rapidly. UIs freeze, users abandon the page, and load balancers frequently trigger 504 Gateway Timeouts.

To solve this, modern applications stream the AI response through the REST API itself. By transmitting tokens to the client the moment they are generated, perceived latency drops from tens of seconds to roughly the time it takes to produce the first token.

Understanding the Root Cause: Buffering vs. Streaming

Traditional HTTP request/response cycles rely on server-side buffering. When a client sends a POST request, the server allocates memory, processes the request, builds the complete JSON response object, and calculates the Content-Length header before sending the payload.

LLMs, however, are autoregressive. They generate text sequentially, one token at a time. Forcing an LLM pipeline into a standard HTTP response model means the server must hold the TCP connection open and buffer the growing string of tokens in memory. If a reverse proxy or load balancer sits between the client and the server (like AWS ALB or Nginx), the connection is often severed if the server remains idle without transmitting data for a set duration (typically 60 seconds).
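One common mitigation is a heartbeat: the SSE format (introduced below) allows comment lines starting with ":", which clients silently discard but which still count as traffic to any proxy watching for idle connections. The following sketch assumes an Express-style `res.write`; the `startHeartbeat` helper and the 15-second interval are illustrative, not part of any SDK.

```typescript
// The SSE spec treats any line beginning with ":" as a comment: clients
// silently discard it, but intermediaries still see outgoing bytes, so
// their idle timeouts are never triggered.
const HEARTBEAT_FRAME = ': keep-alive\n\n';

// Hypothetical helper (not part of Express or the OpenAI SDK): write a
// comment frame on an interval and return a function that stops the timer.
function startHeartbeat(
  res: { write: (chunk: string) => void },
  intervalMs = 15_000
): () => void {
  const timer = setInterval(() => res.write(HEARTBEAT_FRAME), intervalMs);
  return () => clearInterval(timer);
}
```

Call the returned stopper both when the stream completes and in the disconnect handler, so the timer never outlives the response.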

Architectural Choice: Why Server-Sent Events for LLMs?

When choosing a streaming architecture, developers usually evaluate WebSockets versus Server-Sent Events (SSE).

WebSockets provide full-duplex, bidirectional communication. This is necessary for real-time multiplayer games or chat apps where both client and server send continuous messages. However, WebSockets introduce significant overhead. They require stateful connection tracking, complex load balancing (sticky sessions), and custom framing protocols.

A typical LLM request is inherently unidirectional. The client sends a prompt once, and the server replies with a stream of data. SSE operates over standard HTTP, making it the superior architectural choice. It utilizes Transfer-Encoding: chunked, requires zero additional networking infrastructure, and passes effortlessly through standard corporate firewalls.

Implementing an OpenAI SSE Backend (Node.js)

To construct a robust OpenAI SSE backend, the server must establish a persistent HTTP connection and format the output according to the text/event-stream specification.

Below is a production-ready Express.js implementation using TypeScript and the official OpenAI SDK.

import express, { Request, Response } from 'express';
import OpenAI from 'openai';

const app = express();
app.use(express.json());

// Initialize OpenAI client
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

app.post('/api/chat', async (req: Request, res: Response) => {
  const { prompt } = req.body;

  // 1. Establish SSE headers
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  // Disable Nginx/proxy buffering so chunks are forwarded immediately
  res.setHeader('X-Accel-Buffering', 'no');
  // Flush headers now so the client sees the stream open right away
  res.flushHeaders();

  const abortController = new AbortController();

  // 2. Handle client disconnects
  req.on('close', () => {
    abortController.abort();
    res.end();
  });

  try {
    // 3. Initiate the streaming completion
    const stream = await openai.chat.completions.create(
      {
        model: 'gpt-4o',
        messages: [{ role: 'user', content: prompt }],
        stream: true,
      },
      { signal: abortController.signal }
    );

    // 4. Iterate over the stream and format as SSE
    for await (const chunk of stream) {
      const content = chunk.choices[0]?.delta?.content || '';
      if (content) {
        // SSE format requires "data: {payload}\n\n"
        const payload = JSON.stringify({ text: content });
        res.write(`data: ${payload}\n\n`);
      }
    }

    // 5. Signal completion to the client
    res.write('data: [DONE]\n\n');
    res.end();

  } catch (error: any) {
    // The OpenAI SDK surfaces aborts as APIUserAbortError rather than a
    // plain AbortError, so also check the controller's own state.
    if (abortController.signal.aborted || error.name === 'AbortError') {
      console.log('Stream aborted by client');
    } else {
      console.error('LLM Generation Error:', error);
      res.write(`data: ${JSON.stringify({ error: 'Internal Server Error' })}\n\n`);
      res.end();
    }
  }
});

app.listen(8080, () => console.log('SSE API listening on port 8080'));

Consuming the Streaming REST API (React Frontend)

The native browser EventSource API only supports GET requests. Because LLM prompts require passing complex, lengthy payloads, a POST request is mandatory. To achieve this, we use the modern Fetch API combined with the ReadableStream interface.

This React implementation of ChatGPT-style streaming safely handles multi-byte characters and incomplete network chunks.

import React, { useState, useRef } from 'react';

export default function AIStreamingChat() {
  const [prompt, setPrompt] = useState('');
  const [completion, setCompletion] = useState('');
  const [isGenerating, setIsGenerating] = useState(false);
  const abortControllerRef = useRef<AbortController | null>(null);

  const handleSubmit = async (e: React.FormEvent) => {
    e.preventDefault();
    setCompletion('');
    setIsGenerating(true);

    abortControllerRef.current = new AbortController();

    try {
      const response = await fetch('/api/chat', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ prompt }),
        signal: abortControllerRef.current.signal,
      });

      if (!response.ok) throw new Error('Network response was not ok');
      if (!response.body) throw new Error('ReadableStream not supported');

      const reader = response.body.getReader();
      const decoder = new TextDecoder('utf-8');
      let buffer = '';

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        // Use { stream: true } to prevent breaking multi-byte UTF-8 characters
        buffer += decoder.decode(value, { stream: true });
        
        // SSE chunks are separated by double newlines
        const chunks = buffer.split('\n\n');
        
        // The last chunk might be incomplete, keep it in the buffer
        buffer = chunks.pop() || '';

        for (const chunk of chunks) {
          if (chunk.startsWith('data: ')) {
            const dataStr = chunk.replace(/^data: /, '');
            
            if (dataStr === '[DONE]') {
              setIsGenerating(false);
              return;
            }

            try {
              const parsed = JSON.parse(dataStr);
              if (parsed.text) {
                // Update state securely using the previous state callback
                setCompletion((prev) => prev + parsed.text);
              }
            } catch (err) {
              console.error('Failed to parse stream chunk:', chunk);
            }
          }
        }
      }
    } catch (error: any) {
      if (error.name !== 'AbortError') {
        console.error('Streaming error:', error);
      }
    } finally {
      setIsGenerating(false);
    }
  };

  const handleStop = () => {
    if (abortControllerRef.current) {
      abortControllerRef.current.abort();
    }
  };

  return (
    <div className="max-w-2xl mx-auto p-4 flex flex-col gap-4">
      <form onSubmit={handleSubmit} className="flex gap-2">
        <input
          type="text"
          value={prompt}
          onChange={(e) => setPrompt(e.target.value)}
          placeholder="Enter your prompt..."
          className="flex-1 border p-2 rounded"
          disabled={isGenerating}
        />
        <button type="submit" disabled={isGenerating} className="bg-blue-600 text-white px-4 py-2 rounded">
          Send
        </button>
        {isGenerating && (
          <button type="button" onClick={handleStop} className="bg-red-600 text-white px-4 py-2 rounded">
            Stop
          </button>
        )}
      </form>
      
      <div className="bg-gray-50 p-4 rounded min-h-[100px] whitespace-pre-wrap">
        {completion}
      </div>
    </div>
  );
}

Deep Dive: How the Protocol Works

The integration between the Node.js backend and the React frontend depends on a specific combination of HTTP mechanics and careful binary-data handling.

Transfer-Encoding and SSE Formatting

When the backend sets Content-Type: text/event-stream and never sets a Content-Length, Node.js automatically falls back to Transfer-Encoding: chunked for HTTP/1.1 responses. This tells the network layer that the response will arrive in multiple pieces of unknown size.

The SSE specification requires a rigid string format. Every payload must be prefixed with data: and terminated by a blank line, i.e. two newline characters (\n\n). If a frame is sent as data: {"text":"hello"}\n with only a single newline, a spec-compliant client keeps buffering and does not dispatch the event until the terminating blank line arrives.
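As a minimal sketch, the framing rule can be captured in a single helper (the name sseFrame is illustrative, not part of any library):

```typescript
// Illustrative helper: serialize a payload into a valid SSE data frame,
// with the "data: " prefix and the mandatory blank-line terminator.
function sseFrame(payload: unknown): string {
  return `data: ${JSON.stringify(payload)}\n\n`;
}
```

Used in place of the inline template string in the backend loop, this guarantees every chunk the server emits is a complete, dispatchable event.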

Safely Decoding the Stream

On the client side, the network does not guarantee that TCP packets align perfectly with your SSE chunks. A single reader.read() operation might return half of a JSON string.

This is why the frontend implementation pushes the decoded string into a buffer variable, splits the buffer by \n\n, and always retains the final array element using chunks.pop(). The pop() method safely holds the incomplete fragment in memory until the next network packet arrives to complete the JSON string. Furthermore, passing { stream: true } to the TextDecoder ensures that emojis and special multi-byte characters are not corrupted if they happen to be split across two separate binary chunks.
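The buffering logic above can be isolated into a pure function, sketched here under the hypothetical name drainSseBuffer: feed it the accumulated text, and it returns the complete data: payloads plus the leftover fragment to carry into the next read.

```typescript
// Split accumulated stream text into complete SSE payloads. The trailing
// fragment (possibly a half-received frame) is returned so the caller can
// prepend it to the next decoded chunk.
function drainSseBuffer(buffer: string): { events: string[]; rest: string } {
  const parts = buffer.split('\n\n');
  const rest = parts.pop() ?? ''; // possibly incomplete tail
  const events = parts
    .filter((part) => part.startsWith('data: '))
    .map((part) => part.slice('data: '.length));
  return { events, rest };
}
```

Extracting this into a pure function also makes the trickiest part of the client unit-testable without a network.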

Production Pitfalls and Edge Cases

Moving this architecture to production requires handling several network-layer edge cases.

Proxy Buffering Configuration

If you deploy this backend behind Nginx or an AWS Application Load Balancer, the stream might still feel delayed. This occurs because reverse proxies attempt to optimize network throughput by buffering responses until a certain byte threshold is reached.

You must explicitly disable this behavior. The header X-Accel-Buffering: no instructs Nginx to pass chunks to the client immediately. If configuring Nginx manually, ensure proxy_buffering off; is set for your API endpoint block.
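A sketch of the corresponding Nginx location block, assuming an upstream service named backend (the names and timeout values are illustrative):

```nginx
location /api/chat {
    proxy_pass http://backend;
    proxy_http_version 1.1;   # needed for chunked streaming to the upstream
    proxy_buffering off;      # forward chunks to the client immediately
    proxy_cache off;
    proxy_read_timeout 300s;  # tolerate long generations
}
```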

Handling Client Disconnects Gracefully

Users frequently navigate away from pages or cancel requests before an LLM finishes generating. If the backend continues processing, you will bleed API credits and waste compute resources.

The backend code leverages an AbortController linked to req.on('close'). When the client connection drops, the controller emits an abort signal to the openai.chat.completions.create method. The OpenAI SDK natively respects this signal and immediately terminates the upstream HTTP request to the inference server, saving both money and memory.