OpenAI Responses API & Agents SDK
The OpenAI Responses API launched in early 2025 as the recommended path for new agentic applications. The Assistants API is deprecated, with an end-of-life date of August 26, 2026. The Agents SDK (Python and TypeScript) is the official open-source framework built on top of the Responses API.
The Responses API is OpenAI's unified inference interface designed for agentic applications. It combines model inference, built-in tool execution, and optional stateful conversation management in a single API surface. The companion Agents SDK provides a lightweight framework for multi-agent orchestration — agents, handoffs, guardrails, and tracing — without requiring a full third-party orchestration layer like LangChain or LangGraph.
Together, they represent OpenAI's opinionated path for building production AI agents that stay within the OpenAI ecosystem. We have production experience with both and work with teams migrating from the deprecated Assistants API.
Responses API: Core Concepts
Unified Tool Execution
Built-in tools — web_search_preview, file_search, computer_use_preview — are first-class citizens of the API. You declare which tools are available; the model decides when to invoke them; OpenAI executes them and returns results as part of the response. No polling loops, no separate Run state machine.
Stateful Conversations
Pass store: true on any call and OpenAI stores the response server-side. On subsequent turns, pass previous_response_id instead of a full message array, and OpenAI reconstructs the context automatically. You can opt out at any time by returning to manual message history management; the two patterns interoperate.
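As a sketch, the two patterns look like this. These dicts only illustrate request shape and are never sent anywhere; resp_abc123 is a placeholder response ID.

```python
# Sketch of the two conversation-state patterns: server-side storage via
# store / previous_response_id, vs. manual history management.

def first_turn_request(user_text: str) -> dict:
    # Turn 1: ask OpenAI to persist the response server-side.
    return {
        "model": "gpt-4o",
        "input": [{"role": "user", "content": user_text}],
        "store": True,
    }

def next_turn_request(prev_id: str, user_text: str) -> dict:
    # Turn 2+: chain off the stored response instead of resending history.
    return {
        "model": "gpt-4o",
        "previous_response_id": prev_id,
        "input": [{"role": "user", "content": user_text}],
    }

def manual_turn_request(history: list, user_text: str) -> dict:
    # Opt-out pattern: you own the full message array, Chat Completions style.
    return {
        "model": "gpt-4o",
        "input": history + [{"role": "user", "content": user_text}],
    }

req = next_turn_request("resp_abc123", "And in French?")
```

Note that the manual pattern and the chained pattern can be mixed turn by turn, which is what makes incremental migration between them practical.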
Streaming First-Class
All response types — text, tool calls, reasoning summaries — stream as typed events. The ResponseStream abstraction in the SDK surfaces events like response.text.delta, response.tool_call.arguments.delta, and lifecycle events. Streaming is not an afterthought; it is the primary design mode.
Reasoning Models
The o1, o3, and o3-mini series are fully supported. Set reasoning.effort to low, medium, or high to trade latency for answer quality. Streaming reasoning summaries (not full chain-of-thought) are available where models support it.
Structured Outputs
JSON Schema enforcement is built into the API via text.format. When you define a schema, the model is constrained to produce valid JSON matching it. This removes the need for output parsers and retry logic, and makes downstream validation deterministic.
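A minimal sketch of the structured-outputs flow, assuming the text.format parameter shape described above. The order schema is a hypothetical example, and model_output stands in for what a schema-constrained model would emit.

```python
import json

# Hypothetical schema for an order-extraction task.
order_schema = {
    "type": "object",
    "properties": {
        "sku": {"type": "string"},
        "quantity": {"type": "integer"},
    },
    "required": ["sku", "quantity"],
    "additionalProperties": False,
}

# Request fragment: the schema rides along under text.format.
request_fragment = {
    "text": {
        "format": {
            "type": "json_schema",
            "name": "order",
            "schema": order_schema,
            "strict": True,
        }
    }
}

# Because the model is constrained to the schema, downstream parsing is a
# plain json.loads with no retry logic or output-parser layer:
model_output = '{"sku": "A-17", "quantity": 3}'
order = json.loads(model_output)
```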
Background Mode
Long-running agentic tasks (multi-step research, code generation, document analysis) can run asynchronously via background: true. You receive a response ID immediately and poll for completion — a correct pattern for tasks that exceed HTTP timeout limits or need to survive client disconnects.
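The create-then-poll flow can be sketched as follows. StubClient stands in for the real SDK client (which exposes create and retrieve calls); the loop structure is the part that carries over to production.

```python
import time

# Stub standing in for the OpenAI client so the flow is runnable offline.
class StubClient:
    def __init__(self):
        self._polls = 0

    def create(self, **kwargs):
        # Background request returns an ID immediately.
        return {"id": "resp_bg_1", "status": "queued"}

    def retrieve(self, response_id):
        # Simulate a job that completes on the third poll.
        self._polls += 1
        status = "completed" if self._polls >= 3 else "in_progress"
        return {"id": response_id, "status": status}

def run_in_background(client, prompt, poll_interval=0.01):
    resp = client.create(model="gpt-4o", input=prompt, background=True)
    # Poll until the run reaches a terminal state.
    while True:
        current = client.retrieve(resp["id"])
        if current["status"] in ("completed", "failed", "cancelled"):
            return current
        time.sleep(poll_interval)

result = run_in_background(StubClient(), "Summarize this 300-page report")
```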
Built-in Tools
Web Search
web_search_preview gives the model live web access. Results are cited inline with source URLs. Useful for agents that need current information beyond training data cutoffs — news analysis, pricing research, documentation lookup. The model decides when a query warrants a search; you can also force tool use via tool_choice.
File Search
Semantic search over Vector Stores you create via the Files API. Supports hybrid search (keyword + semantic), multiple vector stores per call, and attribute filters. Replaces Assistants API File Search with the same underlying infrastructure but a simpler invocation model. Max 10,000 files per vector store.
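A sketch of how a file_search tool entry slots into the tools array. The vector store IDs are placeholders, and the attribute-filter shape shown is an assumption to illustrate the pattern; check the current API reference for exact field names.

```python
# Hypothetical file_search tool entry with two vector stores and a filter.
file_search_tool = {
    "type": "file_search",
    "vector_store_ids": ["vs_products", "vs_policies"],  # placeholder IDs
    "max_num_results": 8,
    "filters": {"type": "eq", "key": "region", "value": "eu"},
}

# The tool rides along in the tools array of an ordinary request:
request = {
    "model": "gpt-4o",
    "input": "What is our EU return policy?",
    "tools": [file_search_tool],
}
```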
Computer Use (Preview)
computer_use_preview allows the model to control a desktop or browser environment — click, type, scroll, screenshot. Built on the CUA (Computer Use Agent) model. Appropriate for automation workflows where structured APIs don't exist. Currently preview; production use requires careful sandboxing and human oversight.
Function Calling (Custom Tools)
You can register your own tools alongside the built-in ones. Define JSON Schema function specs; the model generates valid arguments; your code executes the function and returns results. The Responses API processes function results within the same request-response cycle via the tool_result input type, which is simpler than the Assistants API Runs pattern.
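The declare-dispatch-return cycle can be sketched like this. get_weather is an illustrative stub, not a real API, and the tool_result packaging follows the input type named above.

```python
import json

# Hypothetical custom tool spec, declared alongside built-in tools.
tools = [{
    "type": "function",
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 18}   # stubbed lookup

HANDLERS = {"get_weather": get_weather}

def dispatch(tool_call: dict) -> dict:
    # The model emits schema-valid arguments; we execute the handler and
    # package the result to feed back as a tool result on the next request.
    args = json.loads(tool_call["arguments"])
    output = HANDLERS[tool_call["name"]](**args)
    return {
        "type": "tool_result",
        "call_id": tool_call["call_id"],
        "output": json.dumps(output),
    }

model_emitted = {"name": "get_weather", "call_id": "call_1",
                 "arguments": '{"city": "Oslo"}'}
result = dispatch(model_emitted)
```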
OpenAI Agents SDK
The Agents SDK (openai-agents in Python, @openai/agents in TypeScript/JS) is an open-source framework that wraps the Responses API with primitives for multi-agent orchestration. It is intentionally lightweight — no graph topology required, no proprietary abstractions — designed to be readable and easy to extend.
Agents
The core primitive. An Agent is a named configuration object: system instructions, a model, a list of tools, and optional output type (for structured outputs). Agents are stateless — all state lives in the Runner context. You can define dozens of specialized agents (triage, research, writer, validator) and wire them together via handoffs.
Handoffs
Agents can transfer control to other agents via handoff tools. The current agent calls transfer_to_[agent_name], passes a context message, and execution continues with the target agent in the same conversation thread. This is how you build triage → specialist workflows. Handoffs are implemented as regular function calls — fully transparent in traces.
Guardrails
Parallel safety and validation checks that run alongside the main agent response. An input guardrail fires before the main agent response — useful for topic classification, PII detection, or abuse screening. An output guardrail fires after — useful for policy compliance, hallucination checks, or format validation. Guardrails can trip a GuardrailTripwireTriggered exception that halts the workflow.
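A minimal sketch of the tripwire pattern. GuardrailTripped and the keyword classifier are illustrative stand-ins for the SDK's GuardrailTripwireTriggered exception and a fast model-based classifier.

```python
# Stand-in for the SDK's tripwire exception.
class GuardrailTripped(Exception):
    pass

BLOCKED_TOPICS = {"medical advice", "legal advice"}

def topic_guardrail(user_input: str) -> None:
    # A fast binary classifier (here a keyword stub) runs on the input.
    if any(t in user_input.lower() for t in BLOCKED_TOPICS):
        raise GuardrailTripped("off-topic input")

def run_agent(user_input: str) -> str:
    topic_guardrail(user_input)            # tripwire halts the workflow
    return f"Agent handled: {user_input}"  # main agent stub

ok = run_agent("Reset my password")
try:
    run_agent("Give me legal advice about my contract")
    tripped = False
except GuardrailTripped:
    tripped = True
```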
Runner & Context
The Runner.run() method executes an agent loop: call model, process tool calls, repeat until a final output or handoff. The RunContext carries user-defined state across the entire run — database connections, user session data, accumulated results. Context is typed and dependency-injected; agents receive it via tool function signatures.
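A simplified sketch of the loop and context injection. RunContext here is a plain dataclass mimicking the SDK's shape, and the loop stands in for the real model-call/tool-call cycle.

```python
from dataclasses import dataclass, field

# Illustrative stand-in for the SDK's typed run context.
@dataclass
class RunContext:
    user_id: str
    results: list = field(default_factory=list)

def lookup_orders(ctx: RunContext) -> str:
    # A tool receives the typed context through its function signature.
    ctx.results.append(f"orders for {ctx.user_id}")
    return ctx.results[-1]

def run(agent_tools, ctx: RunContext, max_turns: int = 5) -> str:
    # Simplified loop: call model, process tool calls, repeat until a
    # final output (here, the first non-empty tool result).
    for _ in range(max_turns):
        output = agent_tools[0](ctx)   # stand-in for "model chose this tool"
        if output:
            return output
    raise RuntimeError("max_turns exceeded")

ctx = RunContext(user_id="u_42")
final = run([lookup_orders], ctx)
```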
Tracing
Every SDK run generates a full trace: agent invocations, tool calls, handoff decisions, guardrail evaluations, token usage. Traces send to the OpenAI platform by default (visible in the dashboard) and can be exported to third-party observability tools. This is the main observability advantage of the SDK vs. raw API calls.
Voice Pipeline
The SDK includes a VoicePipeline abstraction for speech-to-speech agent applications: STT → agent → TTS, with streaming audio in and out. Useful for phone agents, voice assistants, and real-time meeting bots. Audio input goes through Whisper; output through TTS; the agent runs on the Responses API in between.
Responses API vs. Assistants API
| | Responses API | Assistants API (deprecated) |
|---|---|---|
| Status | Active — recommended path | Deprecated — EOL Aug 26, 2026 |
| Latency | 1–3 seconds TTFT | 4–28 seconds (reported 2025) |
| Tool execution | Inline — same request cycle | Async Runs with polling |
| Conversation state | Optional (previous_response_id) | Managed (Thread IDs in your DB) |
| Built-in web search | Yes (web_search_preview) | No |
| Cost model | Token-based only | +File re-processing per Run |
| Observability | Full streaming event trace | Limited Run inspection |
| SDK support | Agents SDK (Python + TS) | No official framework |
Production Issues We've Encountered
Store Mode Data Residency
When store: true is set, conversation history is stored in OpenAI's infrastructure. For healthcare, legal, or financial applications with data governance requirements, this may be incompatible with compliance obligations.
Our approach: Default to manual message history management with store: false. Maintain conversation history in your own database, giving you full control, portability, and auditability. Only use store mode for non-sensitive, internal tooling.
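A minimal sketch of the manual pattern, with an in-memory dict standing in for your database.

```python
# Conversation history lives in your store, not OpenAI's. The dict below
# stands in for a database table keyed by session ID.
conversations = {}

def build_request(session_id: str, user_text: str) -> dict:
    history = conversations.setdefault(session_id, [])
    history.append({"role": "user", "content": user_text})
    # store: false keeps the turn out of OpenAI's infrastructure.
    return {"model": "gpt-4o", "input": list(history), "store": False}

def record_reply(session_id: str, assistant_text: str) -> None:
    conversations[session_id].append(
        {"role": "assistant", "content": assistant_text}
    )

req = build_request("sess_1", "Hello")
record_reply("sess_1", "Hi! How can I help?")
req2 = build_request("sess_1", "What did I just say?")
```

Because you hold the history, audit queries, retention policies, and provider portability are ordinary database problems rather than API features.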
Web Search Hallucinated Citations
The web_search_preview tool occasionally returns citations that do not match the actual retrieved content — particularly on low-traffic URLs or paywalled sources the model has seen in training data.
Our approach: Always surface citations to end users with clickable links — let humans verify. For automated pipelines, treat web search as a signal, not ground truth. Implement a validation step that fetches and spot-checks cited URLs before including them in downstream outputs.
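The spot-check step can be sketched as below. fetch() is a stub standing in for an HTTP GET, and the quote-containment check is a deliberately simple verification heuristic.

```python
# Fake fetched pages so the sketch runs offline.
PAGES = {"https://example.com/pricing": "Plans start at $20/month for teams."}

def fetch(url: str) -> str:
    # Stand-in for an HTTP GET of the cited URL.
    return PAGES.get(url, "")

def validate_citations(citations: list) -> list:
    # Mark each citation verified only if its quoted snippet actually
    # appears in the fetched page body.
    checked = []
    for c in citations:
        body = fetch(c["url"])
        checked.append({**c, "verified": c["quote"] in body})
    return checked

results = validate_citations([
    {"url": "https://example.com/pricing", "quote": "$20/month"},
    {"url": "https://example.com/missing", "quote": "free forever"},
])
```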
Handoff Loops in Multi-Agent Workflows
Agents can get into circular handoff patterns — agent A defers to agent B, which defers back to A, indefinitely. The SDK does not enforce a hard handoff depth limit by default. This surfaces as a hanging run rather than an error.
Our approach: Set explicit max_turns on the Runner. Design triage agents with clear escalation rules and a “cannot resolve” fallback path. Monitor handoff counts per run in traces and alert on runs exceeding your expected depth.
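The depth guard amounts to counting transfers and aborting past a cap, as in this stubbed sketch (max_handoffs plays the role of max_turns on the Runner; agent behavior is reduced to a routing table).

```python
def run_with_handoff_cap(start: str, routes: dict, max_handoffs: int = 5):
    # routes maps an agent to the agent it hands off to; absence from the
    # table means "this agent produces a final answer".
    agent, hops = start, 0
    while agent in routes:
        hops += 1
        if hops > max_handoffs:
            # Surface an explicit abort instead of a hanging run.
            return {"status": "aborted", "hops": hops}
        agent = routes[agent]
    return {"status": "completed", "final_agent": agent, "hops": hops}

# A → B → A circular handoff trips the cap instead of spinning forever:
looping = run_with_handoff_cap("A", {"A": "B", "B": "A"})
healthy = run_with_handoff_cap("triage", {"triage": "billing"})
```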
Guardrail Latency Overhead
Guardrails run in parallel but still add measurable latency — particularly output guardrails that must wait for the primary response to complete before evaluating. A guardrail that calls a second model adds a full inference round-trip to every user interaction.
Our approach: Use fast, cheap models (gpt-4o-mini, o3-mini low) for guardrails. Design guardrails as binary classifiers, not complex evaluators. For latency-sensitive applications, consider async guardrail evaluation with post-hoc intervention rather than blocking the user response.
Background Mode Polling Complexity
Background runs require your application to manage polling — request status, handle timeouts, surface partial results, and recover from failures. Teams underestimate the infrastructure required: job queue, status persistence, webhook or SSE for completion notification.
Our approach: Treat background runs as async jobs from day one. Use a proper job queue (BullMQ, Celery, or similar) to manage background run lifecycles. Store run IDs with job metadata; implement idempotent polling with exponential backoff; surface progress to users via a status endpoint rather than leaving them with a spinner.
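Two of those pieces, the backoff schedule and idempotent enqueueing, can be sketched in a few lines; the jobs dict stands in for your queue's persistent store.

```python
def backoff_schedule(base=1.0, factor=2.0, cap=60.0, attempts=6):
    # Exponential backoff, capped: 1s, 2s, 4s, ... up to `cap`.
    return [min(cap, base * factor ** i) for i in range(attempts)]

# Stand-in for persistent job records keyed by job ID.
jobs = {}

def enqueue(job_id: str, response_id: str) -> dict:
    # Idempotent: re-enqueueing the same job_id (duplicate webhook delivery,
    # client retry) returns the existing record instead of creating a new one.
    return jobs.setdefault(
        job_id,
        {"response_id": response_id, "status": "pending", "attempt": 0},
    )

schedule = backoff_schedule()
job = enqueue("job-1", "resp_bg_9")
same = enqueue("job-1", "resp_bg_9")   # duplicate delivery, no new record
```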
Context Injection Costs at Scale
Agentic runs accumulate context rapidly — tool call results, intermediate reasoning, handoff messages. A 10-step agent run with moderate tool outputs can easily consume 50,000+ input tokens. At GPT-4o pricing, this makes per-session costs an order of magnitude higher than simple chat applications.
Our approach: Model each agent run's token budget before deployment. Use gpt-4o-mini for intermediate steps that don't require flagship reasoning. Compress intermediate tool results to relevant excerpts before injecting into the next step. Set per-session cost caps with automatic termination.
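A back-of-envelope cost model makes the budgeting step concrete. The per-million-token prices below are illustrative placeholders, not current OpenAI list prices.

```python
def run_cost(steps, avg_input_tokens, avg_output_tokens,
             input_price_per_1m, output_price_per_1m):
    # Total session cost for a multi-step agent run, given average token
    # counts per step and per-million-token pricing.
    input_cost = steps * avg_input_tokens / 1_000_000 * input_price_per_1m
    output_cost = steps * avg_output_tokens / 1_000_000 * output_price_per_1m
    return round(input_cost + output_cost, 4)

# 10-step run, 5k input + 800 output tokens per step, placeholder pricing:
cost = run_cost(steps=10, avg_input_tokens=5_000, avg_output_tokens=800,
                input_price_per_1m=2.50, output_price_per_1m=10.00)
```

Running the same model against gpt-4o-mini-class pricing for intermediate steps is usually where the order-of-magnitude savings show up.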
Migrating from the Assistants API
The Assistants API reaches end-of-life on August 26, 2026. Here is the conceptual mapping and migration path:
- Audit current usage: Document all Assistants, Threads, and which tools each Assistant uses (File Search, Code Interpreter, Function Calling).
- Map Assistants → Agent configs: Each Assistant becomes a system prompt + tool list in your codebase. Assistant IDs disappear; configs live in code or your config store.
- Map Threads → conversation storage: Thread IDs are replaced with either previous_response_id chaining (simple flows) or explicit message history in your database (recommended for complex or compliance-sensitive flows).
- Map File Search → Responses API file_search: Vector Stores created via the Files API are compatible. The invocation changes from a Run tool to a file_search tool in the tools array — less setup, same underlying infrastructure.
- Map Code Interpreter → function calling: The Responses API does not have a built-in Code Interpreter equivalent. Replace it with a function tool that calls a sandboxed execution environment (e.g., E2B, Modal, or a container you manage).
- Replace polling Runs → streaming Responses: Runs required async polling with status checks. The Responses API is synchronous (or streamed) — dramatically simpler client code.
- Run in parallel: For large migrations, run both implementations side-by-side against real traffic, compare outputs, then cut over.
Agents SDK vs. LangGraph
| | OpenAI Agents SDK | LangGraph |
|---|---|---|
| Model support | OpenAI models only | Any LLM provider |
| Learning curve | Low — minimal abstractions | Medium — graph topology required |
| Stateful workflows | Via previous_response_id | First-class (checkpointers) |
| Human-in-the-loop | Manual implementation | Built-in interrupt/resume |
| Observability | OpenAI dashboard + export | LangSmith + custom |
| Production maturity | Early 2025 launch | Battle-tested 2024+ |
| Vendor lock-in | OpenAI ecosystem | Minimal (model-agnostic) |
See our LangChain & LangGraph deep-dive for more on the LangGraph side of this comparison.
FAQ
What is the OpenAI Responses API?
The Responses API is OpenAI's next-generation inference API that succeeds both Chat Completions and the Assistants API. It adds built-in tool use (web search, file search, computer use), stateful conversation management via the previous_response_id chaining pattern, and streaming as a first-class design. Chat Completions remains available and fully supported; the Responses API is the recommended path for new agentic and tool-using applications.
What is the OpenAI Agents SDK?
The OpenAI Agents SDK (available in Python and TypeScript) is an open-source framework for building multi-agent workflows on top of the Responses API. It provides Agent primitives (system prompt + tools + model config), Handoffs (transfer of control between agents), Guardrails (parallel safety checks), and built-in tracing. It is lighter than LangGraph and stays tightly coupled to OpenAI's tooling, which makes it a reasonable choice when you are fully committed to the OpenAI ecosystem.
Should we use the Agents SDK or LangGraph?
Use the Agents SDK if your stack is OpenAI-exclusive and you want a minimal framework with first-party support. Choose LangGraph if you need vendor-agnostic model routing, complex stateful workflows, human-in-the-loop patterns with persistent checkpointing, or production observability through LangSmith. LangGraph has a steeper learning curve but handles more complex agentic topologies. We have production experience with both and can help you choose.
How does the Responses API handle conversation state?
Two patterns: manual (stateless) or automatic. In manual mode, you pass a full input array of messages, the same pattern as Chat Completions. In automatic mode, you use store: true on the first call, then pass previous_response_id on subsequent calls and OpenAI manages the conversation thread server-side. The automatic mode is convenient but means your conversation history lives in OpenAI's infrastructure, not yours; factor this into your data governance and cost model.
How do we migrate from the Assistants API?
Conceptually: Assistants → system prompts in your code; Threads → previous_response_id chaining (or your own message history); Runs → single Responses API calls; File Search → the built-in file_search tool with Vector Stores (same underlying infra, new API surface); Code Interpreter → not directly replicated in the Responses API, so use function calling to a sandboxed execution environment instead. Plan 3–6 weeks depending on implementation complexity.