ℹ New API Surface

The OpenAI Responses API launched in early 2025 as the recommended path for new agentic applications. The Assistants API is deprecated, with an EOL of August 26, 2026. The Agents SDK (Python + TypeScript) is the official open-source framework built on top of the Responses API.

OpenAI Responses API & Agents SDK

The Responses API is OpenAI's unified inference interface designed for agentic applications. It combines model inference, built-in tool execution, and optional stateful conversation management in a single API surface. The companion Agents SDK provides a lightweight framework for multi-agent orchestration — agents, handoffs, guardrails, and tracing — without requiring a full third-party orchestration layer like LangChain or LangGraph.

Together, they represent OpenAI's opinionated path for building production AI agents that stay within the OpenAI ecosystem. We have production experience with both and work with teams migrating from the deprecated Assistants API.


Responses API: Core Concepts

Unified Tool Execution

Built-in tools — web_search_preview, file_search, computer_use_preview — are first-class citizens of the API. You declare which tools are available; the model decides when to invoke them; OpenAI executes them and returns results as part of the response. No polling loops, no separate Run state machine.
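A minimal sketch of what that declaration looks like as a request payload. The model name, vector store ID, and query are illustrative placeholders; the tool entries follow the public API reference.

```python
# Sketch: declaring built-in tools on a Responses API request.
def build_request(user_query: str) -> dict:
    return {
        "model": "gpt-4o",
        "input": user_query,
        "tools": [
            {"type": "web_search_preview"},           # live web access
            {"type": "file_search",                   # semantic search over vector stores
             "vector_store_ids": ["vs_example123"]},  # placeholder ID
        ],
        # tool_choice defaults to "auto": the model decides when to invoke a tool
    }

payload = build_request("What changed in the latest SDK release?")
```

You send one request; any tool invocations and their results come back inside the same response.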

Stateful Conversations

Pass store: true on any call and OpenAI stores the response server-side. On subsequent turns, pass previous_response_id instead of a full message array, and OpenAI reconstructs the context automatically. Opt out at any time by returning to manual message history management — the two patterns interoperate.
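A sketch of the chaining pattern: only the new user message travels over the wire on turn two. The response ID below is an illustrative placeholder.

```python
# Sketch: stateful turns via previous_response_id instead of a message array.
def first_turn(user_msg: str) -> dict:
    return {"model": "gpt-4o", "input": user_msg, "store": True}

def next_turn(user_msg: str, prev_id: str) -> dict:
    # OpenAI reconstructs the conversation from the stored response,
    # so the full history is not resent by the client.
    return {"model": "gpt-4o", "input": user_msg,
            "store": True, "previous_response_id": prev_id}

turn1 = first_turn("Summarize the Agents SDK.")
turn2 = next_turn("And the TypeScript version?", prev_id="resp_abc123")
```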

Streaming First-Class

All response types — text, tool calls, reasoning summaries — stream as typed events. The ResponseStream abstraction in the SDK surfaces events like response.output_text.delta, response.function_call_arguments.delta, and lifecycle events. Streaming is not an afterthought; it is the primary design mode.

Reasoning Models

The o1, o3, and o3-mini series are fully supported. Set reasoning.effort to low, medium, or high to trade latency for answer quality. Streaming reasoning summaries (not full chain-of-thought) are available where models support it.
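A sketch of the request shape for a reasoning model, with the effort knob validated up front. The model name and prompt are placeholders.

```python
# Sketch: trading latency for answer quality via reasoning.effort.
def reasoning_request(prompt: str, effort: str = "medium") -> dict:
    if effort not in ("low", "medium", "high"):
        raise ValueError(f"invalid effort: {effort}")
    return {
        "model": "o3-mini",
        "input": prompt,
        "reasoning": {"effort": effort},  # low = faster, high = more deliberate
    }

req = reasoning_request("Prove this invariant holds.", effort="high")
```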

Structured Outputs

JSON Schema enforcement is built into the API via text.format. When you define a schema, the model is constrained to produce valid JSON matching it. This removes the need for output parsers and retry logic, and makes downstream validation deterministic.
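A sketch of the text.format payload with a toy schema. The schema name and fields are illustrative; the sample output line shows why no retry logic is needed, since the model's text is guaranteed to parse.

```python
# Sketch: constraining output to a JSON Schema via text.format.
import json

invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["vendor", "total"],
    "additionalProperties": False,
}

payload = {
    "model": "gpt-4o",
    "input": "Extract the invoice fields from the attached text.",
    "text": {
        "format": {
            "type": "json_schema",
            "name": "invoice",           # illustrative schema name
            "schema": invoice_schema,
            "strict": True,              # constrain decoding to the schema
        }
    },
}

# Downstream, the output text parses deterministically:
parsed = json.loads('{"vendor": "Acme", "total": 41.5}')
```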

Background Mode

Long-running agentic tasks (multi-step research, code generation, document analysis) can run asynchronously via background: true. You receive a response ID immediately and poll for completion — a correct pattern for tasks that exceed HTTP timeout limits or need to survive client disconnects.
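A sketch of the polling side, with `get_status` standing in for a real status request; the backoff logic is the point. In real code the loop would sleep between attempts.

```python
# Sketch: idempotent polling for a background run with exponential backoff.
def poll(get_status, response_id: str, max_attempts: int = 6) -> str:
    delay = 1.0
    for _ in range(max_attempts):
        status = get_status(response_id)
        if status in ("completed", "failed", "cancelled"):
            return status
        # time.sleep(delay) here in real code
        delay = min(delay * 2, 30.0)  # double the wait, capped at 30s
    return "timeout"

# Fake status source: two in-flight checks, then done.
statuses = iter(["queued", "in_progress", "completed"])
result = poll(lambda _id: next(statuses), "resp_bg_123")
```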


Built-in Tools

Web Search

web_search_preview gives the model live web access. Results are cited inline with source URLs. Useful for agents that need current information beyond training data cutoffs — news analysis, pricing research, documentation lookup. The model decides when a query warrants a search; you can also force tool use via tool_choice.

File Search

Semantic search over Vector Stores you create via the Files API. Supports hybrid search (keyword + semantic), multiple vector stores per call, and attribute filters. Replaces Assistants API File Search with the same underlying infrastructure but a simpler invocation model. Max 10,000 files per vector store.
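A sketch of a file_search tool entry using multiple stores and an attribute filter. The store IDs and the filter key are illustrative placeholders; the filter shape follows the comparison-filter form in the API reference.

```python
# Sketch: file_search across two vector stores, restricted by an attribute filter.
file_search_tool = {
    "type": "file_search",
    "vector_store_ids": ["vs_contracts", "vs_policies"],  # multiple stores per call
    "max_num_results": 8,
    "filters": {            # only documents tagged year == 2024
        "type": "eq",
        "key": "year",
        "value": 2024,
    },
}
```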

Computer Use (Preview)

computer_use_preview allows the model to control a desktop or browser environment — click, type, scroll, screenshot. Built on the CUA (Computer Use Agent) model. Appropriate for automation workflows where structured APIs don't exist. Currently preview; production use requires careful sandboxing and human oversight.

Function Calling (Custom Tools)

Your own tools alongside built-in ones. Define JSON Schema function specs; the model generates valid arguments; your code executes and returns results. The Responses API processes function results within the same request-response cycle via the function_call_output input type — simpler than the Assistants API Runs pattern.
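A sketch of one round trip: a hypothetical get_weather tool, a simulated model function call, and the function_call_output item your code sends back. The handler's weather lookup is a stub.

```python
# Sketch: function-calling round trip with a hypothetical weather tool.
import json

get_weather_spec = {
    "type": "function",
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def handle_function_call(call: dict) -> dict:
    args = json.loads(call["arguments"])           # model-generated arguments
    result = {"city": args["city"], "temp_c": 18}  # stub: your code runs the tool
    return {                                       # result goes back as an input item
        "type": "function_call_output",
        "call_id": call["call_id"],
        "output": json.dumps(result),
    }

# What a function-call item from the model looks like (simulated here):
model_call = {"type": "function_call", "name": "get_weather",
              "call_id": "call_1", "arguments": '{"city": "Oslo"}'}
tool_output = handle_function_call(model_call)
```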


OpenAI Agents SDK

The Agents SDK (openai-agents in Python, @openai/agents in TypeScript/JS) is an open-source framework that wraps the Responses API with primitives for multi-agent orchestration. It is intentionally lightweight — no graph topology required, no proprietary abstractions — designed to be readable and easy to extend.

Agents

The core primitive. An Agent is a named configuration object: system instructions, a model, a list of tools, and optional output type (for structured outputs). Agents are stateless — all state lives in the Runner context. You can define dozens of specialized agents (triage, research, writer, validator) and wire them together via handoffs.

Handoffs

Agents can transfer control to other agents via handoff tools. The current agent calls transfer_to_[agent_name], passes a context message, and execution continues with the target agent in the same conversation thread. This is how you build triage → specialist workflows. Handoffs are implemented as regular function calls — fully transparent in traces.
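Because handoffs are just function calls, the mechanic can be modeled in a few lines of plain Python. This is a sketch of the pattern, not the SDK's implementation; the agent names and routing rule are illustrative, and the hard turn cap foreshadows why max_turns matters in production.

```python
# Sketch of the handoff mechanic: transfer_to_<agent> is an ordinary
# function call, so routing reduces to switching the active agent.
AGENTS = {
    "triage": lambda msg: ("transfer_to_billing", msg) if "invoice" in msg
              else ("answer", "General help: " + msg),
    "billing": lambda msg: ("answer", "Billing help: " + msg),
}

def run(msg: str, agent: str = "triage", max_turns: int = 5) -> str:
    for _ in range(max_turns):             # hard cap prevents handoff loops
        action, payload = AGENTS[agent](msg)
        if action == "answer":
            return payload
        agent = action.removeprefix("transfer_to_")  # handoff: switch agent
    raise RuntimeError("max_turns exceeded")

reply = run("Question about an invoice")
```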

Guardrails

Parallel safety and validation checks that run alongside the main agent response. An input guardrail fires before the main agent response — useful for topic classification, PII detection, or abuse screening. An output guardrail fires after — useful for policy compliance, hallucination checks, or format validation. Guardrails can trip a GuardrailTripwireTriggered exception that halts the workflow.
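A sketch of the tripwire pattern, independent of the SDK: the keyword check below stands in for a fast classifier model call, and the exception halts the workflow before the main agent runs.

```python
# Sketch: an input guardrail as a binary classifier with a tripwire exception.
class GuardrailTripwireTriggered(Exception):
    pass

def input_guardrail(user_msg: str) -> None:
    # Binary decision only: trip or pass. A real guardrail would call
    # a cheap model here instead of matching a keyword.
    if "ssn" in user_msg.lower():
        raise GuardrailTripwireTriggered("possible PII in input")

def guarded_run(user_msg: str) -> str:
    input_guardrail(user_msg)        # fires before the main agent response
    response = "agent answer"        # main agent call would go here
    # an output guardrail would evaluate `response` before returning it
    return response
```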

Runner & Context

The Runner.run() method executes an agent loop: call model, process tool calls, repeat until a final output or handoff. The RunContext carries user-defined state across the entire run — database connections, user session data, accumulated results. Context is typed and dependency-injected; agents receive it via tool function signatures.

Tracing

Every SDK run generates a full trace: agent invocations, tool calls, handoff decisions, guardrail evaluations, token usage. Traces send to the OpenAI platform by default (visible in the dashboard) and can be exported to third-party observability tools. This is the main observability advantage of the SDK vs. raw API calls.

Voice Pipeline

The SDK includes a VoicePipeline abstraction for speech-to-speech agent applications: STT → agent → TTS, with streaming audio in and out. Useful for phone agents, voice assistants, and real-time meeting bots. Audio input goes through Whisper; output through TTS; the agent runs on the Responses API in between.


Responses API vs. Assistants API

|  | Responses API | Assistants API (deprecated) |
| --- | --- | --- |
| Status | Active — recommended path | Deprecated — EOL Aug 26, 2026 |
| Latency | 1–3 seconds TTFT | 4–28 seconds (reported 2025) |
| Tool execution | Inline — same request cycle | Async Runs with polling |
| Conversation state | Optional (previous_response_id) | Managed (Thread IDs in your DB) |
| Built-in web search | Yes (web_search_preview) | No |
| Cost model | Token-based only | + File re-processing per Run |
| Observability | Full streaming event trace | Limited Run inspection |
| SDK support | Agents SDK (Python + TS) | No official framework |

Production Issues We've Encountered

Store Mode Data Residency

When store: true is set, conversation history is stored in OpenAI's infrastructure. For healthcare, legal, or financial applications with data governance requirements, this may be incompatible with compliance obligations.

Our approach: Default to manual message history management with store: false. Maintain conversation history in your own database, giving you full control, portability, and auditability. Only use store mode for non-sensitive, internal tooling.

Web Search Hallucinated Citations

The web_search_preview tool occasionally returns citations that do not match the actual retrieved content — particularly on low-traffic URLs or paywalled sources the model has seen in training data.

Our approach: Always surface citations to end users with clickable links — let humans verify. For automated pipelines, treat web search as a signal, not ground truth. Implement a validation step that fetches and spot-checks cited URLs before including them in downstream outputs.

Handoff Loops in Multi-Agent Workflows

Agents can get into circular handoff patterns — agent A defers to agent B, which defers back to A, indefinitely. The SDK does not enforce a hard handoff depth limit by default. This surfaces as a hanging run rather than an error.

Our approach: Set explicit max_turns on the Runner. Design triage agents with clear escalation rules and a “cannot resolve” fallback path. Monitor handoff counts per run in traces and alert on runs exceeding your expected depth.

Guardrail Latency Overhead

Guardrails run in parallel but still add measurable latency — particularly output guardrails that must wait for the primary response to complete before evaluating. A guardrail that calls a second model adds a full inference round-trip to every user interaction.

Our approach: Use fast, cheap models (gpt-4o-mini, o3-mini at low effort) for guardrails. Design guardrails as binary classifiers, not complex evaluators. For latency-sensitive applications, consider async guardrail evaluation with post-hoc intervention rather than blocking the user response.

Background Mode Polling Complexity

Background runs require your application to manage polling — request status, handle timeouts, surface partial results, and recover from failures. Teams underestimate the infrastructure required: job queue, status persistence, webhook or SSE for completion notification.

Our approach: Treat background runs as async jobs from day one. Use a proper job queue (BullMQ, Celery, or similar) to manage background run lifecycles. Store run IDs with job metadata; implement idempotent polling with exponential backoff; surface progress to users via a status endpoint rather than leaving them with a spinner.

Context Injection Costs at Scale

Agentic runs accumulate context rapidly — tool call results, intermediate reasoning, handoff messages. A 10-step agent run with moderate tool outputs can easily consume 50,000+ input tokens. At GPT-4o pricing, this makes per-session costs an order of magnitude higher than simple chat applications.

Our approach: Model each agent run's token budget before deployment. Use gpt-4o-mini for intermediate steps that don't require flagship reasoning. Compress intermediate tool results to relevant excerpts before injecting into the next step. Set per-session cost caps with automatic termination.
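The budget math is simple enough to sketch. The price below is an illustrative placeholder, not a quoted rate; check the current pricing page before using this for real estimates.

```python
# Sketch: estimating per-run input-token cost before deployment.
def input_cost_usd(total_input_tokens: int, price_per_1m: float) -> float:
    return total_input_tokens * price_per_1m / 1_000_000

# The 10-step run from above (50k input tokens) at a placeholder
# rate of $2.50 per 1M input tokens:
estimate = input_cost_usd(50_000, 2.50)
```

Multiply by expected sessions per day and the need for per-session cost caps becomes obvious.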


Migrating from the Assistants API

The Assistants API reaches end-of-life on August 26, 2026. Here is the conceptual mapping and migration path:

  1. Audit current usage: Document all Assistants, Threads, and which tools each Assistant uses (File Search, Code Interpreter, Function Calling).
  2. Map Assistants → Agent configs: Each Assistant becomes a system prompt + tool list in your codebase. Assistant IDs disappear; configs live in code or your config store.
  3. Map Threads → conversation storage: Thread IDs are replaced with either previous_response_id chaining (simple flows) or explicit message history in your database (recommended for complex or compliance-sensitive flows).
  4. Map File Search → Responses API file_search: Vector Stores created via the Files API are compatible. The invocation changes from a Run tool to a file_search tool in the tools array — less setup, same underlying infrastructure.
  5. Map Code Interpreter → function calling: The Responses API does not have a built-in Code Interpreter equivalent. Replace with a function tool that calls a sandboxed execution environment (e.g., E2B, Modal, or a container you manage).
  6. Replace polling Runs → streaming Responses: Runs required async polling with status checks. The Responses API is synchronous (or streamed) — dramatically simpler client code.
  7. Run in parallel: For large migrations, run both implementations side-by-side against real traffic, compare outputs, then cut over.
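Step 2 of the list above can be sketched as a plain mapping function. The field names on both sides are illustrative, not the exact Assistants API object shape; the point is that the Assistant ID disappears and the config becomes code.

```python
# Sketch: an Assistant record becomes a plain agent config in your codebase.
def assistant_to_agent_config(assistant: dict) -> dict:
    return {
        "instructions": assistant["instructions"],  # system prompt moves as-is
        "model": assistant["model"],
        "tools": [
            # Assistants File Search maps to the Responses file_search tool
            {"type": "file_search",
             "vector_store_ids": assistant.get("vector_store_ids", [])}
            if t == "file_search" else {"type": t}
            for t in assistant.get("tools", [])
        ],
    }

legacy = {"instructions": "You are a support agent.", "model": "gpt-4o",
          "tools": ["file_search"], "vector_store_ids": ["vs_docs"]}
config = assistant_to_agent_config(legacy)
```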

Agents SDK vs. LangGraph

|  | OpenAI Agents SDK | LangGraph |
| --- | --- | --- |
| Model support | OpenAI-first; other providers via LiteLLM | Any LLM provider |
| Learning curve | Low — minimal abstractions | Medium — graph topology required |
| Stateful workflows | Via previous_response_id | First-class (checkpointers) |
| Human-in-the-loop | Manual implementation | Built-in interrupt/resume |
| Observability | OpenAI dashboard + export | LangSmith + custom |
| Production maturity | Early 2025 launch | Battle-tested 2024+ |
| Vendor lock-in | OpenAI ecosystem | Minimal (model-agnostic) |

See our LangChain & LangGraph deep-dive for more on the LangGraph side of this comparison.

FAQ

What is the Responses API?

The Responses API is OpenAI's next-generation inference API, the successor to both Chat Completions and the Assistants API. It adds built-in tool use (web search, file search, computer use), stateful conversation management via previous_response_id chaining, and streaming as a first-class design. Chat Completions remains available and fully supported; the Responses API is the recommended path for new agentic and tool-using applications.

Building on the Responses API or migrating off Assistants?

Talk to Us