
Migrating from OpenAI Assistants API to Responses API: A Step-by-Step Guide

The abstraction layer provided by the OpenAI Assistants API was a powerful prototyping tool. It handled thread state, retrieval, and code interpretation automatically. However, with the architectural shift toward the granular Responses API (Stateless Inference) and Conversations API (State Management), developers must now decouple logic from storage.

Relying on the opaque Runs and Threads objects of the Assistants API often leads to "stuck" states, unpredictable latency due to polling mechanisms, and a lack of control over context window management.

This guide provides a rigorous migration path from the black-box Assistants API to a clean, custom architecture using the standard Chat Completions endpoint (acting as the Responses API) and a custom persistence layer (The Conversations API).

The Core Problem: Abstraction vs. Control

The "Assistants API" is essentially a wrapper around two distinct behaviors:

  1. State Management: Storing message history (Threads).
  2. Inference Orchestration: Looping through tool calls and managing tokens (Runs).

The deprecation roadmap signals a move toward explicitly separating these concerns. The Responses API focuses strictly on generating the next token sequence based on input, while the Conversations API is the architectural pattern where you, the developer, own the database.

Migrating solves three critical issues:

  1. No Polling: No more while (run.status !== 'completed'). Responses are streamed instantly.
  2. Context Control: You decide exactly which previous messages constitute the context window, optimizing costs.
  3. Debuggability: You can inspect the exact prompt sent to the model, unlike the hidden system prompts in Assistants.

Step 1: Architecting the Conversations API (The Schema)

In the Assistants API, OpenAI hosted your data. In this new architecture, you must persist the conversation history. We will use a standard SQL schema (represented here with Prisma ORM syntax) to replace the Thread object.

This ensures you own the user data and can satisfy data residency requirements.

// schema.prisma

model Conversation {
  id        String    @id @default(uuid())
  userId    String
  createdAt DateTime  @default(now())
  messages  Message[]
  
  @@index([userId])
}

model Message {
  id             String       @id @default(uuid())
  conversationId String
  role           String       // 'system', 'user', 'assistant', 'tool'
  content        String       @db.Text
  toolCallId     String?      // For linking tool outputs to calls
  createdAt      DateTime     @default(now())
  
  conversation   Conversation @relation(fields: [conversationId], references: [id])
  
  @@index([conversationId])
}
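Because the schema stores role as a plain string, a row could in principle carry a value the Chat Completions API will reject. A small type guard can filter such rows before they are mapped into API messages. The helper names below (isChatRole, sanitizeRoles) are illustrative, not part of any SDK:

```typescript
// Roles accepted by the Chat Completions API; the DB stores them as plain strings.
const CHAT_ROLES = ['system', 'user', 'assistant', 'tool'] as const;
type ChatRole = (typeof CHAT_ROLES)[number];

// Type guard: narrows an arbitrary DB string to a known chat role.
function isChatRole(role: string): role is ChatRole {
  return (CHAT_ROLES as readonly string[]).includes(role);
}

// Drop any rows with unknown roles before building the model context.
function sanitizeRoles(rows: { role: string; content: string }[]) {
  return rows.filter(row => isChatRole(row.role));
}
```

In practice you might log and alert on rejected rows rather than silently dropping them, since a bad role usually indicates a write-path bug.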

Step 2: Implementing the Service Layer

We need a service that acts as the bridge between your database and OpenAI. This replaces the runs.create logic. We will build a strictly typed AgentService in TypeScript.

This solution uses the latest OpenAI Node.js SDK (v4+).

Prerequisites

Ensure you have the SDK installed:

npm install openai @prisma/client

The Migration Code

Here is the complete implementation of the "Responses" and "Conversations" loop. This replaces the Assistant's auto-iteration.

import OpenAI from 'openai';
import { PrismaClient } from '@prisma/client';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const prisma = new PrismaClient();

// Type definitions for clarity
type ChatMessage = OpenAI.Chat.Completions.ChatCompletionMessageParam;

export class AgentService {
  
  /**
   * The "Conversations API" - Retrieval and Context Management
   * Replaces: openai.beta.threads.messages.list()
   */
  private async getConversationContext(conversationId: string, limit: number = 10): Promise<ChatMessage[]> {
    const history = await prisma.message.findMany({
      where: { conversationId },
      orderBy: { createdAt: 'desc' },
      take: limit, // Optimization: Only load recent context
    });

    // Reverse to chronological order for the LLM.
    // Cast per message: ChatCompletionMessageParam is a discriminated
    // union keyed on `role`, which the DB stores as a plain string.
    return history.reverse().map(msg => ({
      role: msg.role,
      content: msg.content,
      tool_call_id: msg.toolCallId || undefined
    } as ChatMessage));
  }

  /**
   * The "Responses API" - Inference and Execution
   * Replaces: openai.beta.threads.runs.create()
   */
  async generateResponse(conversationId: string, userPrompt: string) {
    // 1. Persist User Message (Stateful)
    await prisma.message.create({
      data: { conversationId, role: 'user', content: userPrompt }
    });

    // 2. Fetch Context (The Conversation)
    const context = await this.getConversationContext(conversationId);
    
    // 3. Define Tools (migrated from Assistant definition)
    const tools: OpenAI.Chat.Completions.ChatCompletionTool[] = [
      {
        type: 'function',
        function: {
          name: 'getCurrentWeather',
          description: 'Get weather for a location',
          parameters: {
            type: 'object',
            properties: { location: { type: 'string' } },
            required: ['location'],
          },
        },
      },
    ];

    // 4. Call OpenAI (The Response)
    const response = await openai.chat.completions.create({
      model: 'gpt-4-turbo-preview',
      messages: [
        { role: 'system', content: 'You are a helpful technical support assistant.' },
        ...context
      ],
      tools: tools,
      tool_choice: 'auto', 
    });

    const responseMessage = response.choices[0].message;

    // 5. Handle Tool Calls or Final Response
    if (responseMessage.tool_calls) {
      // Logic for tool execution goes here (Recursion required)
      return await this.handleToolExecution(conversationId, responseMessage, context);
    } else {
      // 6. Persist Assistant Response
      await prisma.message.create({
        data: { 
          conversationId, 
          role: 'assistant', 
          content: responseMessage.content || "" 
        }
      });
      
      return responseMessage.content;
    }
  }

  /**
   * Handling Recursive Tool Execution
   * This logic was previously hidden inside the "Run" object
   */
  private async handleToolExecution(
    conversationId: string, 
    message: OpenAI.Chat.Completions.ChatCompletionMessage,
    previousContext: ChatMessage[]
  ) {
    // Save the assistant's "intent" to call a tool
    await prisma.message.create({
      data: { 
        conversationId, 
        role: 'assistant', 
        content: JSON.stringify(message.tool_calls) // Storing raw calls for audit
      }
    });

    const toolMessages: ChatMessage[] = [];

    // Execute functions
    for (const toolCall of message.tool_calls!) {
      if (toolCall.function.name === 'getCurrentWeather') {
        const args = JSON.parse(toolCall.function.arguments);
        const weatherResult = `The weather in ${args.location} is 72°F and sunny.`; // Mock result

        // Create the tool output message
        toolMessages.push({
          role: 'tool',
          tool_call_id: toolCall.id,
          content: weatherResult
        });

        // Persist tool output
        await prisma.message.create({
          data: {
            conversationId,
            role: 'tool',
            content: weatherResult,
            toolCallId: toolCall.id
          }
        });
      }
    }

    // RECURSIVE CALL: Send tool outputs back to OpenAI for final answer
    const followUpResponse = await openai.chat.completions.create({
      model: 'gpt-4-turbo-preview',
      messages: [
        { role: 'system', content: 'You are a helpful technical support assistant.' },
        ...previousContext,
        message, // The assistant message requesting tools
        ...toolMessages // The results of those tools
      ]
    });

    const finalContent = followUpResponse.choices[0].message.content;

    // Persist final answer
    await prisma.message.create({
      data: { conversationId, role: 'assistant', content: finalContent || "" }
    });

    return finalContent;
  }
}

Deep Dive: Why this Architecture Wins

1. The Latency Budget

The Assistants API Run object functions asynchronously. You create a run, then poll runs.retrieve (typically every 500ms) until the status reaches requires_action or completed. This polling easily adds 1-2 seconds of overhead latency to every interaction.

By moving to the Responses API (chat.completions), we receive the token stream immediately. Even if we aren't streaming to the client, we eliminate the network round-trip overhead of polling.
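When you do stream, the SDK's chat.completions.create({ ..., stream: true }) returns an async iterable of chunks. The consumer below is a minimal sketch that works against anything chunk-shaped, so it can be exercised with a mock as easily as with the real stream; StreamChunk and collectStream are names introduced here for illustration:

```typescript
// Subset of the SDK's streamed chunk shape that we actually read.
type StreamChunk = { choices: { delta: { content?: string } }[] };

// Concatenate streamed deltas into the full assistant message.
async function collectStream(stream: AsyncIterable<StreamChunk>): Promise<string> {
  let full = '';
  for await (const chunk of stream) {
    full += chunk.choices[0]?.delta?.content ?? '';
  }
  return full;
}
```

In a real handler you would also forward each delta to the client (for example over SSE) as it arrives, rather than only accumulating it.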

2. Token Optimization

The Assistants API automatically manages context, often stuffing the entire thread history into the model. This becomes prohibitively expensive as conversations grow.

In the getConversationContext method above, we applied a take: 10 filter. In a production environment, you would implement a smarter retrieval strategy—perhaps using RAG (Retrieval Augmented Generation) to fetch only semantically relevant past messages, rather than just the last 10. This allows for long-running conversations that don't hit the 128k context limit or drain your budget.
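As a sketch of a slightly smarter strategy than a fixed message count, the helper below trims history to a rough token budget using the common ~4 characters per token heuristic. A production version would use a real tokenizer (such as tiktoken); trimToBudget and the heuristic are assumptions made for illustration:

```typescript
type SimpleMessage = { role: string; content: string };

// Rough heuristic: ~4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the most recent messages that fit within the token budget,
// returned in chronological order.
function trimToBudget(messages: SimpleMessage[], budget: number): SimpleMessage[] {
  const kept: SimpleMessage[] = [];
  let used = 0;
  // Walk backwards from the newest message; stop once the budget is exhausted.
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i].content);
    if (used + cost > budget) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  return kept;
}
```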

3. Deterministic State

When an Assistant Run fails, the Thread enters an inconsistent state. Recovering from a stuck run often requires cancelling the run or cloning the thread.

In our custom Conversations API, the database is the source of truth. If the API call fails, the database simply doesn't record the assistant response. The user can retry immediately without needing to "unlock" a thread.
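Because a failed call leaves no partial state behind, retrying can be a plain generic wrapper around the API call. This is a minimal sketch with arbitrary backoff numbers, not a prescription:

```typescript
// Retry an async operation with simple exponential backoff.
async function withRetry<T>(
  op: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 100
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // Exponential backoff: 100ms, 200ms, 400ms, ...
      await new Promise(res => setTimeout(res, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}
```

Wrapping the openai.chat.completions.create call in withRetry keeps the database untouched until a call actually succeeds.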

Common Pitfalls and Edge Cases

Handling "Function Hallucinations"

In the Assistants API, the platform parses and surfaces function arguments for you. When using the raw Responses API, the model might occasionally output invalid JSON for tool arguments.

Solution: Always wrap JSON.parse(toolCall.function.arguments) in a try/catch block. If parsing fails, feed a system error message back to the model asking it to correct the format.

Concurrency

The Assistants API locks a thread while a run is active. In your custom implementation, a user might send two messages quickly.

Solution: Implement Optimistic UI updates on the frontend, but on the backend, use a queue (like Redis or BullMQ) to process messages for a specific conversationId sequentially. This prevents race conditions where context is fetched before the previous answer is committed.
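A durable Redis/BullMQ setup is beyond this guide, but the core serialization idea can be sketched in-process with a per-conversation promise chain. This is an in-memory stand-in for a single server, not a substitute for a real queue across multiple instances; enqueue is a name introduced here:

```typescript
// Per-conversation promise chains: each new task waits for the previous one.
const chains = new Map<string, Promise<unknown>>();

// Run `task` only after all earlier tasks for the same conversationId finish.
function enqueue<T>(conversationId: string, task: () => Promise<T>): Promise<T> {
  const prev = chains.get(conversationId) ?? Promise.resolve();
  // Chain regardless of whether the previous task failed, so one error
  // does not wedge the whole conversation.
  const next = prev.catch(() => {}).then(task);
  chains.set(conversationId, next);
  return next;
}
```

Every call to generateResponse for a given conversationId would go through enqueue, guaranteeing that context is never fetched before the previous answer is committed.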

Conclusion

Migrating from the Assistants API to a custom "Responses + Conversations" architecture is not just about avoiding deprecation—it is about maturing your AI product. You gain lower latency, reduced costs through context management, and complete ownership of your data.

While the Assistants API served as an excellent scaffold, direct integration with the inference engine provides the control required for enterprise-grade stability.