
Replacing OpenAI in OpenAI Agents SDK With Your Fine-Tuned Local Model
The OpenAI Agents SDK is intentionally model-agnostic. Swap the OpenAI client for an Ertas-trained model running on Ollama and you keep the developer experience while killing per-token costs. A drop-in tutorial.
Updated 2026-05-10: reflects OpenAI Agents SDK v0.17.0 (May 7, 2026), which bumped the default RealtimeAgent model to gpt-realtime-2. Pin to a specific SDK version in production rather than tracking the moving default; the swap pattern below is unchanged either way.
The OpenAI Agents SDK is one of the cleanest agent frameworks shipped in 2026. It is the production-grade successor to Swarm, with first-class tool calling, handoffs between agents, structured outputs, and a tracing system that has set the bar for the rest of the ecosystem. Despite the name on the box, it is not OpenAI-only. The SDK accepts any OpenAI-compatible HTTP endpoint as the backing model, which means the developer experience travels with you when you swap the model out.
That detail is the entire point of this guide. You can build an agent against the SDK, prototype it against a hosted OpenAI model, and then point the same code at a fine-tuned 7B model running locally on Ollama. The agent code does not change. The traces do not change. The bill does change, by roughly an order of magnitude or more, depending on traffic.
Why this matters
Agentic apps burn through tokens in a way chatbots do not. A typical chat turn is one or two model calls. A typical agent task is five to thirty model calls, because the agent thinks, picks tools, reads tool output, decides what to do next, and so on. The token bill scales with the loop, not with the user-facing turn.
The economics this produces are unusual. An agent product at 1K monthly actives running on baseline cloud APIs is often paying only around $120 per month for inference. Founders look at that number and assume the unit economics are fine. They are not. At 40K MAU, the same product is paying $3,000 per month or more, often closer to $6,000 once you account for retries, longer contexts, and complex flows. The cost line grows roughly linearly with users, while revenue per user grows nowhere near that fast.
A fine-tuned local model fixes the curve. The cost is fixed: a GPU instance for the backend path, a one-time model download for the on-device path. Add another 100K users and the bill is the same.
The setup: a customer-support triage agent
We will build a triage agent that classifies inbound support requests, looks up the customer, creates a ticket if appropriate, and escalates to a human if the request is high-severity. It has three tools and a clear scope, which makes it a good candidate for fine-tuning.
Start with a fresh Python project and install the OpenAI Agents SDK.
pip install openai-agents
Here is the agent. It is intentionally short — about 25 lines of meaningful code — because the SDK is well-designed.
from agents import Agent, Runner, function_tool
from pydantic import BaseModel

class TriageDecision(BaseModel):
    category: str
    severity: int
    ticket_id: str | None = None
    escalated: bool = False

# crm, tickets, and on_call are stand-ins for your existing service clients.
@function_tool
def lookup_customer(email: str) -> dict:
    """Look up customer details by email address."""
    return crm.get_customer(email)

@function_tool
def create_ticket(customer_id: str, summary: str, severity: int) -> str:
    """Create a support ticket. Returns the ticket ID."""
    return tickets.create(customer_id, summary, severity)

@function_tool
def escalate(customer_id: str, reason: str) -> bool:
    """Escalate to a human agent for high-severity issues."""
    return on_call.page(customer_id, reason)

triage_agent = Agent(
    name="Support Triage",
    instructions="Triage inbound support requests. Look up the customer, "
    "create a ticket, and escalate severity 4 or 5 issues.",
    tools=[lookup_customer, create_ticket, escalate],
    output_type=TriageDecision,
)

result = Runner.run_sync(
    triage_agent,
    "alice@example.com says her production database is unreachable and customers are seeing errors.",
)
print(result.final_output)
By default this runs against an OpenAI hosted model, authenticated with your OPENAI_API_KEY and using the SDK's default model. You will see the agent call lookup_customer, then create_ticket, then escalate because severity is high, and emit a validated TriageDecision. The flow takes a few seconds and produces a structured object you can route into your support system.
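Because output_type is set, result.final_output is a validated TriageDecision rather than raw text, so downstream routing is plain Python. Roughly what that looks like for the request above; the field values and the route_to_oncall hook are illustrative, not captured output:

decision = result.final_output  # TriageDecision instance, already validated by the SDK
print(decision)
# e.g. TriageDecision(category='incident', severity=5, ticket_id='TCK-1042', escalated=True)

if decision.escalated:
    route_to_oncall(decision)  # hypothetical hook into your support system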
So far this is plain OpenAI Agents SDK usage. Now we swap the model.
The swap: point the SDK at your local model
The OpenAI Agents SDK talks to models through an AsyncOpenAI client, either set globally or passed per agent via OpenAIChatCompletionsModel and OpenAIResponsesModel. The client accepts a base_url, and that is the entire seam. Point it at any OpenAI-compatible endpoint and the SDK uses it for completions, tool calls, and structured outputs the same way it would with the hosted API.
For local development, Ollama is the path of least resistance. It exposes an OpenAI-compatible endpoint at http://localhost:11434/v1 out of the box.
Here is the swap. This is the only change required to move the agent from OpenAI's hosted model to a fine-tuned local one.
from openai import AsyncOpenAI
from agents import set_default_openai_client, set_default_openai_api

client = AsyncOpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")
set_default_openai_client(client)
set_default_openai_api("chat_completions")

triage_agent = Agent(
    name="Support Triage",
    model="ertas-triage-7b",
    instructions="...",
    tools=[lookup_customer, create_ticket, escalate],
    output_type=TriageDecision,
)
Five lines. The Runner.run_sync call, the tool definitions, the structured output schema — all unchanged. The agent now runs entirely against a model on your laptop or your own GPU instance.
Why fine-tuning matters here
You can stop after the swap if you only care about cost. But you should not, because a generic open-weight 7B model is not actually good enough for this kind of agent. The numbers are usually disappointing in a specific way: the model picks the right tool most of the time, fills in the right parameter names most of the time, and gets parameter values right most of the time. The problem is that "most of the time" compounds across the loop: roughly 90% accuracy per call works out to 0.9^5 ≈ 59% end-to-end success over a five-call task.
Representative ranges for a triage-style agent like the one above, drawn from public BFCL v4 sub-scores and from the typical post-fine-tune outcomes Studio reports against held-out evaluation sets:
- Generic Llama 3.1 8B: tool-name and parameter-name accuracy in the low-to-mid 80s on broad tool sets, parameter-value accuracy slightly lower. End-to-end task completion in the 70s once those error rates compound across a 5-call loop.
- Generic Qwen3-7B / Qwen3-8B: low-to-mid 80s across all three sub-metrics on broad tool sets. End-to-end completion in the high 70s.
- Ertas-fine-tuned 7B–8B trained on 400–600 representative examples: typically clears 95% on each sub-metric for the trained tool surface. End-to-end completion in the mid-90s once schemas are stable.
Your own numbers will sit somewhere in those bands depending on tool count, schema complexity, and dataset quality. The fine-tuning gap is the difference between a fun prototype and a production system. A 95% completion rate is in the same range as a frontier hosted model on this kind of bounded agentic task. A 79% completion rate is not.
The training data shape that gets you there is straightforward. You want around 500 examples covering the tool combinations the agent actually uses in production: single-tool calls (the simple cases), multi-tool sequences (the realistic cases), validation edge cases (when the model needs to recognize incomplete information and ask back), and refusals (when the model needs to recognize out-of-scope requests). Studio's Data Craft module generates the bulk of these from a structured prompt template — you write 30 to 50 seed examples by hand, the rest comes from a guided generation pass that validates each example against your tool schemas before it lands in the dataset.
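For concreteness, here is what a single multi-tool training example might look like, written as a Python dict in chat format; the exact layout Data Craft emits may differ, and every value below is illustrative rather than generated data. Each example mirrors one full agent loop: request in, tool calls and tool results interleaved, structured decision out. Serialize one dict per line to produce the JSONL file.

example = {
    "messages": [
        {"role": "system", "content": "Triage inbound support requests. Look up the customer, "
                                       "create a ticket, and escalate severity 4 or 5 issues."},
        {"role": "user", "content": "bob@example.com cannot export invoices, every attempt returns a 500."},
        # Step 1: the model looks up the customer.
        {"role": "assistant", "tool_calls": [{"id": "call_1", "type": "function", "function": {
            "name": "lookup_customer", "arguments": "{\"email\": \"bob@example.com\"}"}}]},
        {"role": "tool", "tool_call_id": "call_1", "content": "{\"customer_id\": \"cus_481\", \"plan\": \"pro\"}"},
        # Step 2: the model files a ticket with the looked-up customer ID.
        {"role": "assistant", "tool_calls": [{"id": "call_2", "type": "function", "function": {
            "name": "create_ticket", "arguments": "{\"customer_id\": \"cus_481\", "
                                                  "\"summary\": \"Invoice export failing with 500\", \"severity\": 3}"}}]},
        {"role": "tool", "tool_call_id": "call_2", "content": "\"TCK-2211\""},
        # Final turn: the structured TriageDecision the agent should emit.
        {"role": "assistant", "content": "{\"category\": \"billing\", \"severity\": 3, "
                                         "\"ticket_id\": \"TCK-2211\", \"escalated\": false}"},
    ]
}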
A QLoRA fine-tune of a 7B base on 500 examples completes in under an hour on the standard GPU tier. Studio's eval suite reports the three accuracy metrics above against a held-out validation set, and flags the specific tool/parameter combinations where the model is still weak so you can target the dataset at the gaps and run incremental training.
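If you want to sanity-check those sub-metrics outside Studio, they are simple to compute yourself. A minimal sketch, assuming you have paired up gold and predicted tool calls per step; the data shape here is an assumption for illustration, not Studio's eval format:

def tool_call_accuracy(gold_calls, predicted_calls):
    """Compare predicted tool calls against gold calls, one (gold, predicted) pair per step.

    Each call is a dict like {"name": "create_ticket", "arguments": {...}} with
    "arguments" as a dict. Returns (tool_name_acc, param_name_acc, param_value_acc).
    """
    if not gold_calls:
        return (0.0, 0.0, 0.0)
    name_hits = param_name_hits = param_value_hits = 0
    for gold, pred in zip(gold_calls, predicted_calls):
        if pred is None:
            continue  # the model produced no call for this step
        if pred["name"] == gold["name"]:
            name_hits += 1
        if set(pred["arguments"]) == set(gold["arguments"]):
            param_name_hits += 1  # right parameter names, values not checked
        if pred["arguments"] == gold["arguments"]:
            param_value_hits += 1  # names and values both exact
    total = len(gold_calls)
    return (name_hits / total, param_name_hits / total, param_value_hits / total)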
Tracing parity
The OpenAI Agents SDK has a tracing system that captures every model call, every tool invocation, and every handoff in a unified trace. You can view traces in OpenAI's dashboard, or send them to Logfire, Datadog, or any OTLP backend.
The thing that matters operationally: the tracing system works identically whether the backing model is OpenAI's hosted endpoint or your local Ollama-served model. Every model call, every tool call, every retry shows up in the same trace UI with the same structure. You can develop the agent against the OpenAI hosted model where iteration is fastest, ship the production version against your fine-tuned local model, and inspect traces from both in the same dashboard.
This is more important than it sounds. One of the reasons agent teams stay on hosted APIs longer than they should is that observability is so good against hosted models — the traces are clean, the dashboard works, the team has muscle memory for it. The instinct to swap the model out gets killed by the assumption that the observability layer goes with it. With the OpenAI Agents SDK, it does not. The trace UI is part of the SDK, not part of the hosted model service.
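To see the parity in practice, you can group an entire agent run under one named trace regardless of which model is backing it. A minimal sketch, assuming the trace helper exported by the SDK and the local default client configured as above:

from agents import Runner, trace

# Every model call, tool call, and handoff inside this block lands in a single
# "Support triage" trace, whether the backing model is hosted or local.
with trace("Support triage"):
    result = Runner.run_sync(
        triage_agent,
        "alice@example.com says her production database is unreachable and customers are seeing errors.",
    )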
Production deployment patterns
Three deployment patterns cover most of the cases that come through real agent products.
Backend pattern: vLLM behind a private endpoint. For agents that run on your servers and call your own infrastructure, the right deployment is vLLM serving the fine-tuned model on a single GPU instance, with the OpenAI Agents SDK pointed at the vLLM endpoint instead of OpenAI's. vLLM's OpenAI-compatible server supports tool calling, structured outputs, and batching well enough for production traffic. A single A10 instance handles a few hundred concurrent agent flows depending on context length. Studio's export flow produces both a GGUF binary (for Ollama and llama.cpp) and the original safetensors checkpoint (for vLLM and TGI), so you do not have to choose at training time.
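On the client side the change has the same shape as the Ollama swap, just a different base_url. A sketch, assuming vLLM's OpenAI-compatible server is already running with the fine-tuned checkpoint loaded; the host, port, and model name are placeholders for your deployment:

from openai import AsyncOpenAI
from agents import set_default_openai_client, set_default_openai_api

# vLLM serves an OpenAI-compatible API under /v1; host and port depend on how you launch it.
vllm_client = AsyncOpenAI(base_url="http://vllm.internal:8000/v1", api_key="not-needed")
set_default_openai_client(vllm_client)
set_default_openai_api("chat_completions")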
On-device pattern: ship the model into the app. For mobile and desktop apps where running the agent on the user's device is feasible, the Ertas Deployment CLI ships the fine-tuned model as a GGUF binary into iOS, Android, Flutter, or React Native projects with llama.cpp's mobile bindings. The deployed model exposes the same OpenAI-compatible interface locally on the device. The OpenAI Agents SDK Python or TypeScript code (typically running on a backend you control) points at the device's local endpoint when the architecture calls for it, or you bridge to a native runtime through a thin RPC layer. The Ertas docs cover both paths.
Hybrid pattern: fast path local, complex path API. For agents where the workload distribution is uneven (most tasks are routine, a few are exotic) the right answer is to route. The fast path runs the fine-tuned local model. The slow or unusual path falls back to a frontier hosted model. The OpenAI Agents SDK makes this clean: define two Agent instances with different model settings, write a top-level router agent that hands off based on classification, and you have an architecture that pays hosted-API prices only for the calls that actually need them. In practice this cuts token cost by 80 to 95% while preserving frontier-model behaviour on the genuinely hard cases.
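A minimal sketch of that routing shape, using the per-agent model seam from earlier. The frontier model name is a placeholder, the tools and TriageDecision come from the agent code above, and whether you route by handoff or by an explicit classifier is a design choice:

from openai import AsyncOpenAI
from agents import Agent, Runner, OpenAIChatCompletionsModel

# Two clients: the fine-tuned local model via Ollama, and the hosted API for hard cases.
local_client = AsyncOpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")
hosted_client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

# Fast path: fine-tuned local model handles routine triage.
fast_triage = Agent(
    name="Routine Triage",
    model=OpenAIChatCompletionsModel(model="ertas-triage-7b", openai_client=local_client),
    instructions="Triage routine support requests. Look up the customer, create a ticket, "
    "and escalate severity 4 or 5 issues.",
    tools=[lookup_customer, create_ticket, escalate],
    output_type=TriageDecision,
)

# Slow path: frontier hosted model for the exotic cases (model name is a placeholder).
complex_triage = Agent(
    name="Complex Triage",
    model=OpenAIChatCompletionsModel(model="your-frontier-model", openai_client=hosted_client),
    instructions="Handle unusual, ambiguous, or multi-issue support requests.",
    tools=[lookup_customer, create_ticket, escalate],
    output_type=TriageDecision,
)

# Router: classifies the request and hands off to one of the two paths.
router = Agent(
    name="Triage Router",
    model=OpenAIChatCompletionsModel(model="ertas-triage-7b", openai_client=local_client),
    instructions="Decide whether the request is routine or complex, then hand off. "
    "Do not answer it yourself.",
    handoffs=[fast_triage, complex_triage],
)

result = Runner.run_sync(router, "alice@example.com says her production database is unreachable.")
print(result.final_output)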
The composition
The OpenAI Agents SDK is the best-in-class developer experience for Python and TypeScript agent code in 2026. The fine-tuned Ertas-trained model is the best-in-class economic model for production agentic workloads. They were not designed for each other, and that is precisely why they compose so well — the SDK's model-agnostic design means it does not need to know anything about your model, and your fine-tuned model does not need to know anything about the SDK. The OpenAI-compatible HTTP interface is the contract, and as long as both sides honour it, the rest is plumbing.
For a team that has prototyped against OpenAI and is now staring at the cost curve, the migration is unusually painless. Five lines of configuration. One fine-tune in Studio. Same code, same traces, same agent flow. Different bill.