AI Frameworks

OpenAI API

Flagship Model: GPT-4o
Context Window: 128K tokens
New API: Responses API
Agents SDK: openai-agents

The OpenAI API is the most widely used interface for accessing large language model capabilities in production. The Responses API, launched in 2025, is OpenAI's next-generation inference surface, superseding Chat Completions for new agentic work and replacing the deprecated Assistants API. It adds built-in tool execution (web search, file search, computer use), stateful conversation management, and first-class streaming. The companion Agents SDK provides lightweight multi-agent orchestration: Agents, Handoffs, Guardrails, and tracing.

Axevate builds production systems on OpenAI models across customer support automation, document Q&A pipelines, agentic workflows, and content generation. Our experience includes cost optimization (commonly 40–60% token reduction from initial implementations), structured output design, function calling reliability, and Assistants API migration planning.


1. Responses API: The New Standard

The Responses API supersedes Chat Completions (which remains supported) and replaces the deprecated Assistants API as OpenAI's recommended inference interface for agentic and tool-using applications. The core model is familiar: send an input (messages or a previous_response_id), declare which tools are available, get a response. What changes is the scope: built-in tools execute inline within the API call rather than requiring an async Run-then-poll cycle, and conversation state can optionally be managed server-side via previous_response_id chaining.

Built-in tools include web_search_preview (live web access with inline citations), file_search (semantic search over Vector Stores), and computer_use_preview (desktop/browser control, currently preview). Custom function tools work alongside built-in tools in the same call. The model decides which tools to invoke and in what order; the API handles execution and injects results before returning the final response.

Streaming is first-class: all response types — text deltas, tool call arguments, reasoning summaries — surface as typed events via the ResponseStream abstraction. This is a meaningful improvement over the Assistants API's polling model, which required clients to repeatedly check Run status until completion.

2. OpenAI Agents SDK

The Agents SDK (openai-agents in Python, @openai/agents in TypeScript) is an open-source framework for multi-agent workflows built on top of the Responses API. Its core primitives are Agents (system prompt + tools + model config, stateless), Handoffs (transfer of execution control between agents via tool call), and Guardrails (parallel input/output checks that can trip a circuit-breaker exception).

Handoffs enable triage → specialist patterns: a routing agent classifies the user's intent and hands off to a domain-specific agent. The handoff mechanism is implemented as a regular function tool — fully visible in traces, debuggable, and interruptible. Multiple handoffs in a single run are supported; circular handoff loops are a known risk that requires explicit max_turns limits.

The SDK's built-in tracing sends every agent invocation, tool call, handoff decision, and guardrail result to the OpenAI dashboard by default, with export capability for third-party observability tools. This is the primary observability advantage over raw API calls. A VoicePipeline abstraction also ships with the SDK for STT → agent → TTS streaming audio applications.

3. Chat Completions: Still the Workhorse

Chat Completions remains fully supported and appropriate for the majority of production workloads — classification, summarization, simple generation, FAQ answering, structured extraction. Every request sends an array of messages and returns a completion. The Responses API is the recommended upgrade path for tool-using and agentic applications, but teams with working Chat Completions integrations have no urgency to migrate.

Model selection is the first cost and quality decision. GPT-4o is the current flagship: strong at reasoning, long-context tasks, and multimodal inputs. GPT-4o-mini is 10–20x cheaper with performance sufficient for most production workloads. The o1 and o3 reasoning series trade latency for extended reasoning capability and suit complex mathematical or logical tasks. The correct approach is to benchmark each use case against multiple models and route to the cheapest model that meets your quality bar.
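The routing rule is mechanical once you have per-use-case eval scores. A sketch, where the scores are placeholders you would measure with your own benchmark harness (the prices reflect published per-1M-input-token list pricing but should be re-checked):

```python
# Sketch: route to the cheapest model that clears the quality bar.

# Per-use-case eval score (0-1, from your benchmark) and $ per 1M input tokens.
CANDIDATES = [
    {"model": "gpt-4o-mini", "score": 0.87, "price": 0.15},
    {"model": "gpt-4o",      "score": 0.95, "price": 2.50},
    {"model": "o3-mini",     "score": 0.93, "price": 1.10},
]

def cheapest_passing(quality_bar: float, candidates=CANDIDATES) -> str:
    passing = [c for c in candidates if c["score"] >= quality_bar]
    if not passing:
        raise ValueError("no model meets the quality bar; revisit prompts or bar")
    return min(passing, key=lambda c: c["price"])["model"]
```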

Structured outputs enforce that the model's response conforms to a JSON schema you define, via the response_format parameter in Chat Completions (text.format in the Responses API). This removes the need for output parsers and retry logic, makes downstream validation deterministic, and dramatically simplifies monitoring. Use structured outputs whenever the application needs to process or store the LLM response programmatically.

4. Embeddings and Semantic Search

The Embeddings API converts text to dense vector representations. text-embedding-3-large (3072 dimensions, reducible to 256 with minimal accuracy loss) is appropriate for production RAG systems requiring high recall. text-embedding-3-small is 5x cheaper with performance sufficient for most similarity search use cases.

Embeddings power RAG pipelines, semantic search, clustering, and classification. In a RAG setup: embed your knowledge base, store vectors in pgvector, Qdrant, or Pinecone, and at query time embed the user's question, retrieve top-N similar chunks, and pass them as context to the LLM. The Responses API's built-in file_search tool provides a managed version of this pipeline for straightforward document Q&A use cases.

5. Assistants API: Deprecation and Migration

The Assistants API is deprecated with an end-of-life date of August 26, 2026. Production applications built on it must migrate. The Responses API is the official migration path: Assistants become system prompts in code, Threads become previous_response_id chaining or explicit message history in your database, and Runs become synchronous Responses API calls. File Search migrates to the built-in file_search tool using the same Vector Store infrastructure. Code Interpreter has no direct equivalent in the Responses API — replace with function calling to a sandboxed execution environment.

The motivation for migrating extends beyond the deadline: the Assistants API had documented production issues including response latencies reaching 28+ seconds (vs. 1–3s for the Responses API), cost overhead from re-processing thread files on every Run, and unreliable function calling in long threads. The Responses API resolves all of these.


How We Use It in Practice

Real architectural problems across industries — and how we approach them.

Media & Content Operations

Tiered Content Pipeline: GPT-4o-mini Draft + GPT-4o Quality Gate

A digital publisher producing 500+ articles per month needed AI-assisted drafting at a cost that worked — but GPT-4o on every piece would have cost $8,000/month for their volume. Sending everything to GPT-4o-mini produced output that editors rejected at a 40% rate. The team had no systematic way to route by complexity.

Our approach

Two-stage OpenAI pipeline: GPT-4o-mini generates a draft and simultaneously scores it on three dimensions (factual density, argument complexity, SEO sensitivity) using structured outputs (JSON schema enforced). Scores above threshold route to GPT-4o for a rewrite pass. Scores below threshold go directly to editorial queue. Cloudflare AI Gateway handles semantic caching across similar brief types — 30% of drafts are now served from cache. Token cost dropped 65% from the all-GPT-4o baseline while editor rejection rates fell to under 10%.
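The routing rule at the center of this pipeline is simple threshold logic. A sketch, where the three dimensions come from the case study but the threshold value and the max-score composite are illustrative assumptions:

```python
# Sketch: decide whether a GPT-4o-mini draft needs a GPT-4o rewrite pass.
from dataclasses import dataclass

@dataclass
class DraftScores:            # produced by the structured-output scoring call
    factual_density: float    # each dimension scored 0-1
    argument_complexity: float
    seo_sensitivity: float

def needs_rewrite(scores: DraftScores, threshold: float = 0.6) -> bool:
    """True -> route the draft to GPT-4o; False -> straight to editorial queue."""
    composite = max(scores.factual_density,
                    scores.argument_complexity,
                    scores.seo_sensitivity)
    return composite >= threshold
```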

eCommerce / Retail

Semantic Product Search: OpenAI Embeddings + pgvector + Shopify Catalog

A specialty outdoor retailer with 40,000 SKUs found that keyword search was failing customers: queries like 'waterproof jacket for cold wet weather' returned zero results, while 'rain jacket' returned too many undifferentiated options. Shopify's built-in search had no semantic layer. A dedicated vector search service meant a second database to maintain and sync.

Our approach

OpenAI text-embedding-3-small generates embeddings for every product description and attribute field at index time, stored as pgvector HNSW-indexed columns directly in PostgreSQL alongside the product catalog. At query time, the search string is embedded and the top-30 vector matches are returned, then re-ranked by a lightweight GPT-4o-mini call that scores relevance against the customer's full query intent. A Shopify webhook pipeline keeps embeddings current as inventory changes. Search result click-through improved by 38%; add-to-cart from search pages up 22%.

Financial Services / Compliance

Regulated AI Assistant: Store-Off Architecture with Full Audit Trail

A wealth management firm wanted an AI assistant to help advisors draft client communications and summarize portfolio data. Compliance required that all conversation data stay within their own infrastructure, be auditable, and be deletable on request. OpenAI's default store mode and the Assistants API both routed conversation history through OpenAI's servers, creating a data residency conflict with their custodial agreement.

Our approach

Responses API with store: false throughout — all conversation history managed in a PostgreSQL database under the firm's own AWS environment. Each advisor session gets a thread_id; messages are stored with advisor ID, client ID, and timestamp, enabling per-client deletion and full audit query. GPT-4o structured outputs enforce that all draft communications are returned as typed JSON (subject, body, tone_flag) so compliance tooling can pattern-match before delivery. Multi-provider fallback via Cloudflare AI Gateway routes to Anthropic Claude Sonnet as secondary if OpenAI returns sustained errors — advisors noticed zero downtime during two OpenAI incidents since deployment.

FAQ

Should we start with Chat Completions or the Responses API?

For simple text generation, summarization, or classification: Chat Completions is fine and there is no urgency to change. For tool-using or agentic applications — anything that calls external APIs, searches documents, or involves multiple sequential reasoning steps — start with the Responses API. It's OpenAI's recommended path and has better built-in tooling, streaming, and observability.

Ready to build with OpenAI API?

Talk to Us