
How to Migrate From OpenAI API to a Fine-Tuned Local Model: A 90-Day Playbook
A concrete 90-day plan for moving your highest-volume AI workloads from cloud API to owned fine-tuned models — with evaluation framework, training guide, and parallel cutover strategy.
OpenAI signed a contract with the US Department of Defense. Anthropic walked away from a similar deal. Model deprecations continue. Pricing is unpredictable. And you're running a production AI workload that depends on infrastructure you don't control.
If you've decided it's time to own at least some of your AI stack, this is the playbook.
This isn't a theoretical guide. It's a 90-day operational plan for moving a real AI workload (your highest-volume, most predictable tasks) from a cloud API to a fine-tuned model you own and run locally. By day 90, you'll have a production AI system with no per-token costs, no vendor behavior changes, and no exposure to anyone else's strategic pivots.
When NOT to Migrate
Before starting, be honest about which workloads are good candidates and which aren't.
Don't migrate tasks that:
- Require frontier reasoning on novel, open-ended problems (creative writing at the highest level, complex multi-step reasoning across broad domains)
- Have very low volume (under 1,000 API calls per month; cloud is cheaper at that scale)
- Change their input/output requirements frequently, making training data maintenance expensive
- Genuinely need the latest model capabilities that open-source hasn't caught up to yet
Good migration candidates have:
- High volume (10,000+ calls per month)
- Consistent, narrow task scope (classification, extraction, summarization with a defined format, Q&A over a specific domain)
- Available training data from logs or existing labeled examples
- Quality requirements where 90-95% accuracy on the specific task is sufficient (which it is for most domain-specific workloads)
The Pre-Migration Audit
Before writing a line of training data, inventory your AI workloads. For each use case, capture: monthly volume, monthly API cost estimate, task type (classification/extraction/generation/Q&A), whether input/output format is consistent, and whether you have or can create 200+ good examples.
Score each use case: high volume × consistent task × training data available = high migration priority. Pick your top 1-3 candidates. Start with one. Don't try to migrate everything at once.
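To make the scoring concrete, here's a minimal sketch in Python, assuming you've captured the audit fields above; the field names, weights, and saturation points are illustrative, not prescriptive:

```python
from dataclasses import dataclass

@dataclass
class UseCase:
    name: str
    monthly_calls: int
    consistent_format: bool   # stable input/output schema?
    examples_available: int   # logged or labeled examples on hand

def migration_priority(uc: UseCase) -> float:
    """Rough heuristic: high volume x consistent task x training data available."""
    volume = min(uc.monthly_calls / 10_000, 3.0)      # saturate above 30k calls
    consistency = 1.0 if uc.consistent_format else 0.2
    data = min(uc.examples_available / 200, 1.0)      # 200+ examples = full score
    return volume * consistency * data

candidates = [
    UseCase("ticket-triage", 45_000, True, 800),
    UseCase("ad-copy-drafts", 2_000, False, 50),
]
for uc in sorted(candidates, key=migration_priority, reverse=True):
    print(f"{uc.name}: {migration_priority(uc):.2f}")
```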
Days 1–30: Build Your Evaluation Foundation
The most important work in the migration is evaluation. You need to be able to measure whether the fine-tuned model actually matches your current API setup before you switch anything in production.
Build Your Training Dataset
From your API logs (most API providers let you export), collect examples where the model produced good outputs. You're looking for:
- 200–500 high-quality input/output pairs that represent the real distribution of your use case
- Coverage of common patterns (the 80% of inputs that look similar) and edge cases (the 20% that are trickier)
- Clean outputs — don't include examples where the API produced something you had to manually fix
Format them as JSONL, one {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]} object per line.
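As a reference for that format, here's a minimal sketch that writes vetted log pairs to JSONL; the pairs list is a stand-in for however your provider exports logs:

```python
import json

# Stand-in for exported API logs: (input, good_output) pairs you've vetted
pairs = [
    ("Classify: 'My invoice is wrong'", "billing"),
    ("Classify: 'App crashes on launch'", "bug"),
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for user_input, assistant_output in pairs:
        record = {"messages": [
            {"role": "user", "content": user_input},
            {"role": "assistant", "content": assistant_output},
        ]}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```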
If you don't have API logs, create examples manually. 200 high-quality examples beat 2,000 noisy ones. Quality over quantity is the single most important fine-tuning principle.
Build Your Eval Set
Separate from training data: 50–100 held-out examples that you won't use for training. This is how you measure model quality.
Make sure your eval set includes edge cases that could expose failures — unusual input formats, boundary conditions, the hardest 10% of your real workload.
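A simple way to carve out the held-out set, assuming all your examples live in one JSONL file; the file names, the 100-example split, and the seed are arbitrary choices:

```python
import json
import random

with open("all_examples.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(records)

eval_set, train_set = records[:100], records[100:]  # 100 held-out examples

for path, subset in [("eval.jsonl", eval_set), ("train.jsonl", train_set)]:
    with open(path, "w", encoding="utf-8") as f:
        for r in subset:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```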
Establish Your API Baseline
Run your entire eval set through your current API setup and record every output. Calculate your baseline metrics (accuracy for classification, ROUGE/BLEU for generation, human judgment scores for open-ended tasks).
Define your acceptance criteria now: what does the fine-tuned model need to achieve to replace the API? A common target is ±5% of the API baseline on your eval set. Some teams target matching performance; others accept a small degradation in exchange for the cost and control benefits.
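Here's a sketch of capturing that baseline with the OpenAI Python client, assuming a classification-style task where exact-match accuracy is meaningful; the model name and file paths are placeholders:

```python
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

baseline = []
with open("eval.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        user_msg = record["messages"][0]
        expected = record["messages"][1]["content"]
        resp = client.chat.completions.create(
            model="gpt-4o",     # whatever model you run in production today
            messages=[user_msg],
            temperature=0,      # stable outputs for a stable baseline
        )
        baseline.append({
            "input": user_msg["content"],
            "expected": expected,
            "api_output": resp.choices[0].message.content,
        })

# Exact-match accuracy only makes sense for classification-style tasks;
# swap in ROUGE/BLEU or human scoring for open-ended generation.
correct = sum(r["api_output"].strip() == r["expected"].strip() for r in baseline)
print(f"API baseline accuracy: {correct / len(baseline):.1%}")

with open("api_baseline.json", "w", encoding="utf-8") as f:
    json.dump(baseline, f, indent=2)
```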
Days 31–60: Fine-Tune and Validate
Choose Your Base Model
For most domain-specific tasks, a 7B–14B parameter model fine-tuned on your data can match or exceed GPT-4-class performance on that specific task. Bigger base models aren't always better: they're slower, more expensive to run, and the fine-tuning gain is often smaller.
Recommended starting points:
- Llama 3.1 8B or 3.3 70B (Meta): commercial use permitted; the Llama Community License allows redistribution of fine-tuned models
- Qwen 2.5 7B or 14B (Alibaba): commercial use permitted; strong multilingual performance
- Mistral 7B (Mistral): Apache 2.0 license — fully permissive for commercial use
- Phi-4 (Microsoft): MIT license; excellent performance at small scale
If your task involves long documents or complex reasoning, start with the 14B range. If your task is narrow and high-volume (classification, extraction), 7B is usually sufficient.
Fine-Tune with Ertas Studio
Upload your training dataset, select your base model, and configure LoRA (Low-Rank Adaptation) settings. LoRA fine-tuning trains a small adapter layer on top of the frozen base model — efficient, fast, and the resulting adapter is typically 50–200MB rather than the full model size.
Setup takes about 2 minutes. Training a 7B model with 500 examples typically completes in under an hour on cloud GPUs.
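If you're curious what those settings map to under the hood, this is roughly the equivalent LoRA configuration in Hugging Face's peft library; a sketch assuming a Llama 3.1 8B base, and something Ertas Studio handles for you so you never touch this code:

```python
from transformers import AutoModelForCausalLM  # pip install transformers peft
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

config = LoraConfig(
    r=8,                                  # adapter rank: the adapter's capacity
    lora_alpha=16,                        # scaling factor, commonly 2x the rank
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically <1% of the base model's weights
```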
Evaluate Against Your Baseline
Run the fine-tuned model on your full eval set. Compare to the API baseline you established in Days 1–30.
If quality meets your acceptance criteria: export to GGUF format and proceed to deployment.
If quality falls short, the most common fixes are:
- Expand training data (add 200 more high-quality examples targeting the cases where the model fails; a sketch for surfacing those cases follows this list)
- Adjust LoRA rank (higher rank = more capacity to learn; try rank 16 or 32 if you started at 8)
- Try a larger base model (7B → 14B)
- Review your training data quality — inconsistent examples confuse the model more than insufficient quantity
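A minimal way to surface those failure cases, assuming you saved the fine-tuned model's eval outputs in the same shape as the API baseline file; the file name and field names here are assumptions:

```python
import json

with open("finetuned_eval.json", encoding="utf-8") as f:
    results = json.load(f)  # [{"input": ..., "expected": ..., "model_output": ...}]

failures = [r for r in results
            if r["model_output"].strip() != r["expected"].strip()]

# Each failure is a template for new training examples: write the output
# you wanted, find 3-5 similar real inputs, and add them to train.jsonl.
for r in failures:
    print(f"INPUT:    {r['input'][:80]}")
    print(f"EXPECTED: {r['expected'][:80]}")
    print(f"GOT:      {r['model_output'][:80]}\n")
```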
Most teams need 1–2 iterations. Three rounds of refinement is a reasonable ceiling before reconsidering the task's migration readiness.
Export to GGUF
GGUF is an open format that runs on Ollama, llama.cpp, LM Studio, and other inference runtimes. Exporting to GGUF gives you a portable model that runs on any compatible hardware — no cloud dependency, no inference API, just the weights you own.
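As a quick portability check, here's a sketch that loads the exported file with llama-cpp-python; the model path, context size, and prompt are placeholders:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(model_path="./your-model.gguf", n_ctx=4096)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Classify: 'My invoice is wrong'"}],
    max_tokens=64,
    temperature=0,
)
print(out["choices"][0]["message"]["content"])
```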
Days 61–90: Parallel Deployment and Cutover
This is where you de-risk the transition. You run the fine-tuned model alongside the API, gradually shifting traffic.
Week 9: Deploy the fine-tuned model on your local infrastructure (Ollama is the simplest starting point — ollama run your-model). Route 10% of production traffic to the fine-tuned model. Monitor outputs for quality issues.
Week 10: If the quality metrics from the first week match your expectations, route 25% of traffic to the fine-tuned model. Start tracking cost savings.
Week 11: Route 50% of traffic. Review any cases that triggered a fallback to the API — these are your edge case candidates for the next fine-tuning iteration.
Week 12: If all metrics hold, route 100% of traffic to the fine-tuned model. Keep the API integration code in place but disable it for this workload. Leave it as a fallback for 30 days while you build confidence, then evaluate whether to remove it entirely.
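Here's a minimal sketch of the traffic splitter, assuming Ollama is serving the model on its default local port; the model name and the cloud fallback are placeholders for your existing integration:

```python
import random
import requests

FINETUNED_SHARE = 0.10   # week 9: 10%; raise to 0.25, 0.50, 1.0 as metrics hold

def call_finetuned(prompt: str) -> str:
    """Query the local model via Ollama's REST API (default port 11434)."""
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "your-model",   # the name you gave the imported GGUF
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]

def call_api(prompt: str) -> str:
    ...  # your existing cloud API integration, kept as the fallback path

def handle(prompt: str) -> str:
    if random.random() < FINETUNED_SHARE:
        try:
            return call_finetuned(prompt)
        except Exception:
            # Log these fallbacks: the inputs are candidates for the next
            # fine-tuning iteration (see week 11).
            return call_api(prompt)
    return call_api(prompt)
```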
The Economics
For a real example: an agency running 15 client automation workflows on OpenAI's API was spending AU$4,200/month. Per-client LoRA adapters running locally on shared infrastructure cost AU$14.50/month. That's a 99.6% reduction — and the fine-tuned models actually outperformed the API on the specific domain tasks.
At 8,000 users on an indie SaaS app, a $620/month cloud API bill becomes approximately $28/month on local inference. At 40,000 users, the cloud bill grows to $3,000/month while local stays at approximately $28/month: the per-query cost is essentially zero once the infrastructure is running.
Break-even on the fine-tuning investment (your time for training data + fine-tuning compute costs) is typically 2–4 months at moderate volume.
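As a worked example of that break-even, using the indie SaaS numbers above and an assumed one-time migration investment (your figure will differ):

```python
api_cost_per_month = 620.0        # current cloud API spend
local_cost_per_month = 28.0       # local inference infrastructure
migration_investment = 1_500.0    # assumed: training data time + fine-tuning compute

monthly_savings = api_cost_per_month - local_cost_per_month
breakeven_months = migration_investment / monthly_savings
print(f"Break-even in {breakeven_months:.1f} months")  # ~2.5 months here
```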
What You Own at Day 90
A production AI workload with:
- Pinned model version: you decide when it updates, not your vendor
- Deterministic behavior: the model doesn't change until you train a new version
- Zero per-query costs: local inference runs at infrastructure cost, not per-token
- Full portability: GGUF runs on any compatible hardware
- Governance completeness: you know exactly what the model was trained on, when, and by whom
No vendor behavior changes. No strategic pivots. No deprecation notices. No price increases. The model is yours.
Ertas Studio handles the entire pipeline from dataset upload to GGUF export — no Python, no YAML configs, no CLI. Start with a free account and fine-tune your first model before committing to anything.