
From API-Dependent to Model Owner: A 90-Day Migration Playbook
A phased, risk-managed plan for migrating your AI workloads from cloud APIs to fine-tuned models you own. Week-by-week breakdown with concrete milestones for each phase.
You've read about vendor dependency risks. You've done the independence checklist. You know the cost math works in favour of owned models. Now you need a plan.
This playbook covers the first 90 days of migrating from API-dependent to model-owning. It's designed for teams without ML expertise, assuming you have access to your API logs and domain data. The goal isn't to eliminate all API usage — it's to own your most critical AI capabilities and build a foundation for continued independence.
Before You Start: The Migration Mindset
Two principles make the difference between a smooth migration and a painful one:
Parallel run, not cold switch. You're not ripping out your API integration and replacing it with a fine-tuned model on day one. You're running both side-by-side, comparing quality, and routing traffic gradually. The API stays live until the fine-tuned model proves itself.
Start narrow, expand systematically. Don't try to migrate everything at once. Pick one task. Get it right. Build confidence and institutional knowledge. Then repeat.
Phase 1: Audit (Days 1-14)
Week 1: Inventory Your AI Touchpoints
Map every place your application or workflow calls an AI API. For each touchpoint, document:
| Field | Example |
|---|---|
| Task description | Classify support tickets into categories |
| Provider/model | OpenAI GPT-4o-mini |
| Monthly volume | 12,000 requests |
| Monthly cost | $340 |
| Input format | Unstructured text (1-3 paragraphs) |
| Output format | Single category label from predefined list |
| Quality requirement | 90%+ accuracy |
| Criticality | High — routes tickets to correct team |
| Training data available | Yes — 18 months of classified tickets in CRM |
Most teams discover they have 3-8 distinct AI tasks in production. Some have more.
Week 2: Score and Prioritise
Score each task on three dimensions:
Fine-tuning suitability (1-5):
- Consistent input/output format → higher score
- Large volume → higher score
- Available training data → higher score
- Domain-specific vocabulary or knowledge → higher score
- Subjective or creative output → lower score
Business impact (1-5):
- High monthly cost → higher score
- Customer-facing → higher score
- SLA-sensitive → higher score
- Revenue-generating → higher score
Migration complexity (1-5, lower is better):
- Simple classification/extraction → low complexity
- Multi-step reasoning → medium complexity
- Open-ended generation → higher complexity
- Multi-modal (text + images) → highest complexity
Priority = Suitability × Impact ÷ Complexity
Your highest-scoring task is your pilot migration target. In most businesses, it's one of these:
- Customer support ticket classification/routing
- Content generation in a specific format
- Data extraction from structured documents
- FAQ/knowledge base response generation
- Lead qualification or scoring
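As a quick sketch, the priority formula can be applied to your task inventory in a few lines of Python. The task names and scores below are hypothetical examples of a completed audit worksheet:

```python
# Rank AI tasks by Priority = Suitability x Impact / Complexity.
# Scores are illustrative 1-5 values from the audit worksheet.
tasks = {
    "ticket classification": {"suitability": 5, "impact": 4, "complexity": 1},
    "FAQ response generation": {"suitability": 3, "impact": 4, "complexity": 3},
    "lead scoring": {"suitability": 4, "impact": 3, "complexity": 2},
}

def priority(task):
    return task["suitability"] * task["impact"] / task["complexity"]

ranked = sorted(tasks, key=lambda name: priority(tasks[name]), reverse=True)
for name in ranked:
    print(f"{name}: {priority(tasks[name]):.1f}")
# ticket classification scores 20.0 and becomes the pilot target
```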
Phase 2: Pilot (Days 15-45)
Week 3: Prepare Your Training Dataset
Your API logs are your training data. Extract input/output pairs from your production system.
Minimum dataset size: 500 high-quality examples. This is enough for a well-defined task with consistent format.
Recommended: 1,000-2,000 examples. This gives the model more edge cases to learn from.
Quality over quantity. 500 carefully reviewed examples outperform 5,000 noisy ones. Spend time on data quality, not just volume.
Dataset preparation steps:
1. Export raw data. Pull input/output pairs from your API logs, CRM, or database. Format as JSONL with the chat message structure your training tool expects.
2. Filter for quality. Remove examples where the API output was incorrect, poorly formatted, or required manual correction. You want only examples of the task done right.
3. Deduplicate. Near-identical examples add noise. Remove duplicates and near-duplicates.
4. Balance categories. If you're training a classifier, ensure reasonable representation across all categories. Extreme imbalance (90% category A, 2% category B) causes the model to underperform on minority categories.
5. Split the data. Reserve 10-15% as a test set that won't be used in training. This is your evaluation benchmark.
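The steps above can be sketched in plain Python. The chat-style JSONL schema shown here is a common convention, but verify the exact structure your training tool expects; the records are hypothetical:

```python
import json
import random

# Raw input/output pairs exported from API logs (illustrative records).
raw = [
    {"input": "My invoice is wrong", "output": "billing"},
    {"input": "My invoice is wrong", "output": "billing"},   # duplicate
    {"input": "App crashes on login", "output": "technical"},
    {"input": "Cancel my subscription", "output": "account"},
]

# Deduplicate: identical input/output pairs add noise, so keep one copy.
seen = set()
examples = []
for r in raw:
    key = (r["input"], r["output"])
    if key not in seen:
        seen.add(key)
        examples.append(r)

# Convert to a chat-message JSONL structure.
records = [
    {"messages": [
        {"role": "user", "content": r["input"]},
        {"role": "assistant", "content": r["output"]},
    ]}
    for r in examples
]

# Split: reserve ~15% as a held-out test set.
random.seed(42)
random.shuffle(records)
cut = max(1, int(len(records) * 0.15))
test_set, train_set = records[:cut], records[cut:]

with open("train.jsonl", "w") as f:
    for rec in train_set:
        f.write(json.dumps(rec) + "\n")
```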
Week 4-5: Fine-Tune the Model
Select your base model. For most business tasks:
- 7B parameters — Fast inference, runs on consumer hardware, good for classification and extraction
- 14B parameters — Better for generation tasks, requires more compute but still practical
- Llama 3, Qwen 2.5, or Mistral — all production-quality, all commercially permissive
Choose your training approach:
- LoRA/QLoRA — The standard approach. Trains lightweight adapters (50-200MB) on top of frozen base weights. Memory-efficient, fast to train, and the adapter is portable.
- Full fine-tuning — Modifies all weights. Better for complex tasks but requires more compute. Usually unnecessary for well-defined business tasks.
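The "50-200MB" adapter figure is easy to sanity-check. Assuming a Llama-2-style 7B model (hidden size 4096, 32 layers) with rank-32 LoRA on the four attention projections, stored in 16-bit:

```python
# Back-of-envelope LoRA adapter size for a 7B-class model.
hidden = 4096        # hidden dimension (Llama-2 7B style)
layers = 32          # transformer layers
matrices = 4         # q, k, v, o projections per layer
rank = 32            # LoRA rank

# Each adapted matrix adds two low-rank factors: A (hidden x rank)
# and B (rank x hidden), i.e. 2 * rank * hidden parameters.
params = layers * matrices * 2 * rank * hidden
size_mb = params * 2 / 1e6   # 2 bytes per parameter at fp16

print(f"{params:,} adapter params, ~{size_mb:.0f} MB")
```

That lands around 67 MB, comfortably inside the quoted range; adapting more modules or raising the rank pushes it toward the upper end.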
Training configuration (starting point):
- Learning rate: 2e-4
- Batch size: 4-8
- Epochs: 2-3
- LoRA rank: 32
Using Ertas: Upload your JSONL dataset, select your base model, and start training. The platform handles GPU provisioning, hyperparameter management, and progress tracking. Setup takes about 2 minutes. Training time depends on dataset size and model — typically 15-60 minutes for a LoRA fine-tune.
Run 2-3 experiments. Try different base models, LoRA ranks, or training durations. Side-by-side comparison across experiments helps you find the best configuration.
Week 6: Evaluate
Run your held-out test set through both the API model and your fine-tuned model. Compare:
Quantitative metrics:
- Accuracy (for classification/extraction tasks)
- Format compliance (does the output match your expected structure?)
- Consistency (same answer for equivalent inputs?)
- Latency (response time per request)
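For the classification case, a minimal evaluation harness can compute accuracy and format compliance side by side. The label set and model outputs below are hypothetical:

```python
# Compare API-model and fine-tuned-model outputs on the held-out test set.
VALID_LABELS = {"billing", "technical", "account"}

def evaluate(predictions, references):
    """Return accuracy and format compliance for a list of model outputs."""
    correct = sum(p == r for p, r in zip(predictions, references))
    well_formed = sum(p in VALID_LABELS for p in predictions)
    n = len(references)
    return {"accuracy": correct / n, "format_compliance": well_formed / n}

references = ["billing", "technical", "account", "billing"]
api_preds  = ["billing", "technical", "billing", "billing"]
ft_preds   = ["billing", "technical", "account", "refundz"]  # one malformed label

print("API:       ", evaluate(api_preds, references))
print("Fine-tuned:", evaluate(ft_preds, references))
```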
Quality threshold: For domain-specific tasks with good training data, expect:
- 90-95% accuracy on classification and extraction
- Within 5-10% of the API model on generation quality
- Format compliance above 98%
If the fine-tuned model falls short:
- Add more training examples in the areas where it underperforms
- Check for data quality issues (mislabelled examples, inconsistent formats)
- Try a larger base model (7B → 14B)
- Increase the LoRA rank for more capacity
Most quality gaps are fixed with better data, not bigger models.
Phase 3: Validate (Days 46-60)
Week 7-8: Shadow Deployment
Deploy your fine-tuned model alongside the API. Route all production traffic through both models, but only serve the API model's response to users.
Compare outputs in real-time:
- Log both responses for every request
- Flag disagreements for human review
- Track quality metrics over real production traffic (not just test set performance)
- Monitor for edge cases that didn't appear in your training data
Shadow deployment catches issues that static evaluation misses:
- Input distribution shifts (real traffic patterns differ from training data)
- Rare edge cases (inputs your test set didn't cover)
- Format variations (users don't always write like your training examples)
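A shadow-mode handler can be a thin wrapper: call both models, serve the API response, and log any disagreement for review. `call_api_model` and `call_finetuned_model` here are hypothetical stand-ins for your own clients:

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")

def call_api_model(text):        # placeholder for your API client
    return "billing"

def call_finetuned_model(text):  # placeholder for your local model
    return "account"

def handle_request(text):
    api_out = call_api_model(text)
    ft_out = call_finetuned_model(text)
    record = {"input": text, "api": api_out, "finetuned": ft_out,
              "agree": api_out == ft_out}
    log.info(json.dumps(record))      # log both outputs for offline analysis
    if not record["agree"]:
        log.warning("disagreement - flag for human review")
    return api_out                    # users still see the API model's response
```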
Week 8-9: A/B Test
Once shadow deployment confirms quality parity, run a real A/B test:
- Route 10-20% of production traffic to the fine-tuned model
- Serve the fine-tuned model's response to those users
- Compare business metrics: user satisfaction, task completion rate, error rate
- Expand to 50% if metrics hold
- Monitor for at least one full week at each traffic percentage
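Deterministic, hash-based bucketing keeps each user on the same arm for the whole test. A sketch, assuming a stable user identifier:

```python
import hashlib

ROLLOUT_PERCENT = 10  # start at 10-20%, expand as metrics hold

def use_finetuned_model(user_id: str) -> bool:
    """Assign a stable 0-99 bucket per user; route low buckets to the new model."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < ROLLOUT_PERCENT

# The same user always lands in the same bucket, so their experience is stable
# across requests, and expanding to 50% only means raising ROLLOUT_PERCENT.
assignments = {uid: use_finetuned_model(uid) for uid in ("user-1", "user-2")}
print(assignments)
```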
Decision criteria for proceeding:
- Quality within 5% of the API model on your key metrics
- No increase in user complaints or error reports
- Format compliance above 95%
- Latency within acceptable range for your application
Start your 90-day migration. Ertas handles the hardest parts — dataset prep, training, evaluation, GGUF export — all in a visual interface. Pre-subscribe at early-bird pricing →
Phase 4: Expand (Days 61-90)
Week 9-10: Production Cutover for Pilot Task
With A/B testing validated, route 100% of your pilot task traffic to the fine-tuned model.
Cutover checklist:
- Export model to GGUF format
- Deploy on your production inference infrastructure (Ollama, vLLM, or llama.cpp)
- Configure monitoring and alerting for quality metrics
- Maintain API fallback (keep the API integration live but dormant — you can route back if needed)
- Update your documentation and runbooks
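The dormant API fallback fits naturally into the inference path. A sketch, where `query_local_model` and `query_api` are hypothetical wrappers around your local inference endpoint and the old API client:

```python
def query_local_model(text):
    # placeholder for your Ollama / vLLM / llama.cpp client
    raise ConnectionError("local inference server is down")

def query_api(text):
    # dormant API integration, kept for emergencies
    return "billing"

def classify(text, max_retries=1):
    """Prefer the owned model; fall back to the API only on failure."""
    for _ in range(max_retries):
        try:
            return query_local_model(text)
        except (ConnectionError, TimeoutError):
            continue
    return query_api(text)  # fallback path; alert on this in production

print(classify("My invoice is wrong"))
```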
Measure the impact:
- Monthly cost reduction (API bill decrease)
- Latency improvement (local inference is typically faster)
- Reliability improvement (no dependency on external API uptime)
- Quality metrics (should maintain the levels validated during A/B testing)
Week 11-12: Begin Next Migration
Apply the same process to your second-highest priority task. This goes faster because you've built the institutional knowledge:
- Your data pipeline is established
- Your evaluation framework exists
- Your deployment infrastructure is running
- Your team understands the fine-tuning workflow
Typical time for subsequent migrations: 3-4 weeks (versus 6 weeks for the first one).
Week 12: Establish Ongoing Cadence
Set up the systems that keep your fine-tuned models current:
Retraining schedule. As your business evolves, your models need updates. Monthly or quarterly retraining with fresh data keeps performance high. Use your production logs as new training data — the model's own outputs (validated by humans) feed back into future training.
Quality monitoring. Track accuracy metrics on an ongoing basis. Set alerts for quality degradation. If accuracy drops below your threshold, trigger a retraining cycle.
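Quality monitoring can start as a rolling accuracy check over human-validated predictions. A sketch with hypothetical threshold and window values:

```python
from collections import deque

THRESHOLD = 0.90   # retrain trigger, matching the 90%+ quality requirement
WINDOW = 200       # most recent human-validated predictions

recent = deque(maxlen=WINDOW)

def record_outcome(correct: bool) -> bool:
    """Record one validated prediction; return True if retraining should trigger."""
    recent.append(correct)
    if len(recent) < WINDOW:
        return False   # not enough data yet
    accuracy = sum(recent) / len(recent)
    return accuracy < THRESHOLD

# Simulate 200 validated outcomes with a 12% error rate: accuracy 0.88 < 0.90.
alert = False
for i in range(200):
    alert = record_outcome(i % 100 >= 12)
print("retrain triggered:", alert)
```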
Version management. Keep previous model versions available for rollback. Track which model version is deployed in each environment.
Common Pitfalls (and How to Avoid Them)
Pitfall 1: Trying to Migrate Everything at Once
The mistake: Spending weeks building an elaborate migration plan for all 8 AI tasks, then attempting to execute in parallel.
The fix: Ship one migration first. Learn from it. Apply those learnings to the next one. Sequential beats parallel when you're building new organisational capability.
Pitfall 2: Insufficient Training Data Quality
The mistake: Dumping 10,000 raw API logs into a training dataset without review. The logs include incorrect outputs, inconsistent formats, and edge cases the API model handled poorly.
The fix: Spend more time on data curation and less on data volume. Review examples. Remove bad ones. Ensure format consistency. A curated dataset of 800 examples outperforms an unreviewed dataset of 5,000.
Pitfall 3: Skipping Shadow Deployment
The mistake: Going straight from evaluation on a test set to production deployment. The test set doesn't capture the full distribution of real-world inputs.
The fix: Always shadow deploy. Always A/B test. The extra 2-3 weeks of validation prevent production incidents that take longer than 2-3 weeks to recover from.
Pitfall 4: Optimising for the Wrong Metric
The mistake: Pursuing 99% accuracy when your API model only achieves 85%. The fine-tuned model hits 92% — better than the API — but the team keeps iterating because it's not "perfect."
The fix: Your benchmark is the current API model, not theoretical perfection. If the fine-tuned model matches or exceeds the API on your metrics, that's a successful migration.
Pitfall 5: Forgetting the Fallback
The mistake: Removing the API integration after migrating to the fine-tuned model. Three months later, you need to retrain the model and have no fallback during the training window.
The fix: Keep the API integration dormant. You're not paying for it if you're not calling it. But having it available for emergencies — even briefly — is worth the minimal maintenance cost.
The Ertas Shortcut
The playbook above works with any fine-tuning toolchain. But much of the manual work — GPU provisioning, training configuration, dataset formatting, GGUF export — can be compressed with the right platform.
With Ertas, Phases 2-3 compress significantly:
- Dataset upload replaces manual JSONL preparation (or use the visual editor)
- One-click training replaces GPU setup, config files, and monitoring scripts
- Built-in evaluation replaces custom evaluation pipelines
- Side-by-side comparison across experiments replaces manual tracking
- GGUF export replaces quantisation toolchains
A migration that takes 6 weeks with manual tooling can compress to 2-3 weeks with an integrated platform. The hardest parts — the ones where teams get stuck — are exactly the parts the platform handles.
After 90 Days
At the end of this playbook, you should have:
- 1-2 production tasks running on fine-tuned models you own
- Proven cost savings documented and quantified
- An evaluation framework ready for future migrations
- Deployment infrastructure running and monitored
- A prioritised list of next tasks to migrate
- Institutional knowledge of the fine-tuning workflow
You're no longer fully API-dependent. You own critical AI capabilities. Your costs are more predictable. Your product is more resilient.
And the next time an AI provider sends a deprecation notice or a pricing change, you'll have options — not just obligations.
Start your migration. Ertas handles the entire pipeline — dataset to GGUF — in a visual interface, no code required. Pre-subscribe at early-bird pricing. See plans →