
From API-Dependent to Model Owner: A 90-Day Migration Playbook
A phased, risk-managed plan for migrating your AI workloads from cloud APIs to fine-tuned models you own. Week-by-week breakdown with concrete milestones for each phase.
You've read about vendor dependency risks. You've done the independence checklist. You know the cost math works in favour of owned models. Now you need a plan.
This playbook covers the first 90 days of migrating from API-dependent to model-owning. It's designed for teams without ML expertise, assuming you have access to your API logs and domain data. The goal isn't to eliminate all API usage — it's to own your most critical AI capabilities and build a foundation for continued independence.
Before You Start: The Migration Mindset
Two principles make the difference between a smooth migration and a painful one:
Parallel run, not cold switch. You're not ripping out your API integration and replacing it with a fine-tuned model on day one. You're running both side-by-side, comparing quality, and routing traffic gradually. The API stays live until the fine-tuned model proves itself.
Start narrow, expand systematically. Don't try to migrate everything at once. Pick one task. Get it right. Build confidence and institutional knowledge. Then repeat.
Phase 1: Audit (Days 1-14)
Week 1: Inventory Your AI Touchpoints
Map every place your application or workflow calls an AI API. For each touchpoint, document:
| Field | Example |
|---|---|
| Task description | Classify support tickets into categories |
| Provider/model | OpenAI GPT-4o-mini |
| Monthly volume | 12,000 requests |
| Monthly cost | $340 |
| Input format | Unstructured text (1-3 paragraphs) |
| Output format | Single category label from predefined list |
| Quality requirement | 90%+ accuracy |
| Criticality | High — routes tickets to correct team |
| Training data available | Yes — 18 months of classified tickets in CRM |
Most teams discover they have 3-8 distinct AI tasks in production. Some have more.
Week 2: Score and Prioritise
Score each task on three dimensions:
Fine-tuning suitability (1-5):
- Consistent input/output format → higher score
- Large volume → higher score
- Available training data → higher score
- Domain-specific vocabulary or knowledge → higher score
- Subjective or creative output → lower score
Business impact (1-5):
- High monthly cost → higher score
- Customer-facing → higher score
- SLA-sensitive → higher score
- Revenue-generating → higher score
Migration complexity (1-5, lower is better):
- Simple classification/extraction → low complexity
- Multi-step reasoning → medium complexity
- Open-ended generation → higher complexity
- Multi-modal (text + images) → highest complexity
Priority = Suitability × Impact ÷ Complexity
Your highest-scoring task is your pilot migration target. In most businesses, it's one of these:
- Customer support ticket classification/routing
- Content generation in a specific format
- Data extraction from structured documents
- FAQ/knowledge base response generation
- Lead qualification or scoring
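As a quick sketch, the priority formula can be applied to your task inventory in a few lines of Python. The task names and scores below are hypothetical examples of a completed audit worksheet:

```python
# Rank AI tasks by Priority = Suitability x Impact / Complexity.
# Scores are illustrative 1-5 values from the audit worksheet.
tasks = {
    "ticket classification": {"suitability": 5, "impact": 4, "complexity": 1},
    "FAQ response generation": {"suitability": 3, "impact": 4, "complexity": 3},
    "lead scoring": {"suitability": 4, "impact": 3, "complexity": 2},
}

def priority(task):
    return task["suitability"] * task["impact"] / task["complexity"]

ranked = sorted(tasks, key=lambda name: priority(tasks[name]), reverse=True)
for name in ranked:
    print(f"{name}: {priority(tasks[name]):.1f}")
# ticket classification scores 20.0 and becomes the pilot target
```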
Phase 2: Pilot (Days 15-45)
Week 3: Prepare Your Training Dataset
Your API logs are your training data. Extract input/output pairs from your production system.
Minimum dataset size: 500 high-quality examples. This is enough for a well-defined task with consistent format.
Recommended: 1,000-2,000 examples. This gives the model more edge cases to learn from.
Quality over quantity. 500 carefully reviewed examples outperform 5,000 noisy ones. Spend time on data quality, not just volume.
Dataset preparation steps:
1. Export raw data. Pull input/output pairs from your API logs, CRM, or database. Format as JSONL with the chat message structure your training tool expects.
2. Filter for quality. Remove examples where the API output was incorrect, poorly formatted, or required manual correction. You want only examples of the task done right.
3. Deduplicate. Near-identical examples add noise. Remove duplicates and near-duplicates.
4. Balance categories. If you're training a classifier, ensure reasonable representation across all categories. Extreme imbalance (90% category A, 2% category B) causes the model to underperform on minority categories.
5. Split the data. Reserve 10-15% as a test set that won't be used in training. This is your evaluation benchmark.
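The steps above can be sketched in plain Python. The chat-style JSONL schema shown here is a common convention, but verify the exact structure your training tool expects; the records are hypothetical:

```python
import json
import random

# Raw input/output pairs exported from API logs (illustrative records).
raw = [
    {"input": "My invoice is wrong", "output": "billing"},
    {"input": "My invoice is wrong", "output": "billing"},   # duplicate
    {"input": "App crashes on login", "output": "technical"},
    {"input": "Cancel my subscription", "output": "account"},
]

# Deduplicate: identical input/output pairs add noise, so keep one copy.
seen = set()
examples = []
for r in raw:
    key = (r["input"], r["output"])
    if key not in seen:
        seen.add(key)
        examples.append(r)

# Convert to a chat-message JSONL structure.
records = [
    {"messages": [
        {"role": "user", "content": r["input"]},
        {"role": "assistant", "content": r["output"]},
    ]}
    for r in examples
]

# Split: reserve ~15% as a held-out test set.
random.seed(42)
random.shuffle(records)
cut = max(1, int(len(records) * 0.15))
test_set, train_set = records[:cut], records[cut:]

with open("train.jsonl", "w") as f:
    for rec in train_set:
        f.write(json.dumps(rec) + "\n")
```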
Week 4-5: Fine-Tune the Model
Select your base model. For most business tasks:
- 7B parameters — Fast inference, runs on consumer hardware, good for classification and extraction
- 14B parameters — Better for generation tasks, requires more compute but still practical
- Llama 3, Qwen 2.5, or Mistral — all production-quality, all commercially permissive
Choose your training approach:
- LoRA/QLoRA — The standard approach. Trains lightweight adapters (50-200MB) on top of frozen base weights. Memory-efficient, fast to train, and the adapter is portable.
- Full fine-tuning — Modifies all weights. Better for complex tasks but requires more compute. Usually unnecessary for well-defined business tasks.
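The "50-200MB" adapter figure is easy to sanity-check. Assuming a Llama-2-style 7B model (hidden size 4096, 32 layers) with rank-32 LoRA on the four attention projections, stored in 16-bit:

```python
# Back-of-envelope LoRA adapter size for a 7B-class model.
hidden = 4096        # hidden dimension (Llama-2 7B style)
layers = 32          # transformer layers
matrices = 4         # q, k, v, o projections per layer
rank = 32            # LoRA rank

# Each adapted matrix adds two low-rank factors: A (hidden x rank)
# and B (rank x hidden), i.e. 2 * rank * hidden parameters.
params = layers * matrices * 2 * rank * hidden
size_mb = params * 2 / 1e6   # 2 bytes per parameter at fp16

print(f"{params:,} adapter params, ~{size_mb:.0f} MB")
```

That lands around 67 MB, comfortably inside the quoted range; adapting more modules or raising the rank pushes it toward the upper end.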
Training configuration (starting point):
- Learning rate: 2e-4
- Batch size: 4-8
- Epochs: 2-3
- LoRA rank: 32
Using Ertas: Upload your JSONL dataset, select your base model, and start training. The platform handles GPU provisioning, hyperparameter management, and progress tracking. Setup takes about 2 minutes. Training time depends on dataset size and model — typically 15-60 minutes for a LoRA fine-tune.
Run 2-3 experiments. Try different base models, LoRA ranks, or training durations. Side-by-side comparison across experiments helps you find the best configuration.
Week 6: Evaluate
Run your held-out test set through both the API model and your fine-tuned model. Compare:
Quantitative metrics:
- Accuracy (for classification/extraction tasks)
- Format compliance (does the output match your expected structure?)
- Consistency (same answer for equivalent inputs?)
- Latency (response time per request)
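For the classification case, a minimal evaluation harness can compute accuracy and format compliance side by side. The label set and model outputs below are hypothetical:

```python
# Compare API-model and fine-tuned-model outputs on the held-out test set.
VALID_LABELS = {"billing", "technical", "account"}

def evaluate(predictions, references):
    """Return accuracy and format compliance for a list of model outputs."""
    correct = sum(p == r for p, r in zip(predictions, references))
    well_formed = sum(p in VALID_LABELS for p in predictions)
    n = len(references)
    return {"accuracy": correct / n, "format_compliance": well_formed / n}

references = ["billing", "technical", "account", "billing"]
api_preds  = ["billing", "technical", "billing", "billing"]
ft_preds   = ["billing", "technical", "account", "refundz"]  # one malformed label

print("API:       ", evaluate(api_preds, references))
print("Fine-tuned:", evaluate(ft_preds, references))
```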
Quality threshold: For domain-specific tasks with good training data, expect:
- 90-95% accuracy on classification and extraction
- Within 5-10% of the API model on generation quality
- Format compliance above 98%
If the fine-tuned model falls short:
- Add more training examples in the areas where it underperforms
- Check for data quality issues (mislabelled examples, inconsistent formats)
- Try a larger base model (7B → 14B)
- Increase the LoRA rank for more capacity
Most quality gaps are fixed with better data, not bigger models.
Phase 3: Validate (Days 46-60)
Week 7-8: Shadow Deployment
Deploy your fine-tuned model alongside the API. Route all production traffic through both models, but only serve the API model's response to users.
Compare outputs in real-time:
- Log both responses for every request
- Flag disagreements for human review
- Track quality metrics over real production traffic (not just test set performance)
- Monitor for edge cases that didn't appear in your training data
Shadow deployment catches issues that static evaluation misses:
- Input distribution shifts (real traffic patterns differ from training data)
- Rare edge cases (inputs your test set didn't cover)
- Format variations (users don't always write like your training examples)
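A shadow-mode handler can be a thin wrapper: call both models, serve the API response, and log any disagreement for review. `call_api_model` and `call_finetuned_model` here are hypothetical stand-ins for your own clients:

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")

def call_api_model(text):        # placeholder for your API client
    return "billing"

def call_finetuned_model(text):  # placeholder for your local model
    return "account"

def handle_request(text):
    api_out = call_api_model(text)
    ft_out = call_finetuned_model(text)
    record = {"input": text, "api": api_out, "finetuned": ft_out,
              "agree": api_out == ft_out}
    log.info(json.dumps(record))      # log both outputs for offline analysis
    if not record["agree"]:
        log.warning("disagreement - flag for human review")
    return api_out                    # users still see the API model's response
```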
Week 8-9: A/B Test
Once shadow deployment confirms quality parity, run a real A/B test:
- Route 10-20% of production traffic to the fine-tuned model
- Serve the fine-tuned model's response to those users
- Compare business metrics: user satisfaction, task completion rate, error rate
- Expand to 50% if metrics hold
- Monitor for at least one full week at each traffic percentage
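Deterministic, hash-based bucketing keeps each user on the same arm for the whole test. A sketch, assuming a stable user identifier:

```python
import hashlib

ROLLOUT_PERCENT = 10  # start at 10-20%, expand as metrics hold

def use_finetuned_model(user_id: str) -> bool:
    """Assign a stable 0-99 bucket per user; route low buckets to the new model."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < ROLLOUT_PERCENT

# The same user always lands in the same bucket, so their experience is stable
# across requests, and expanding to 50% only means raising ROLLOUT_PERCENT.
assignments = {uid: use_finetuned_model(uid) for uid in ("user-1", "user-2")}
print(assignments)
```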
Decision criteria for proceeding:
- Quality within 5% of the API model on your key metrics
- No increase in user complaints or error reports
- Format compliance above 95%
- Latency within acceptable range for your application
Start your 90-day migration. Ertas handles the hardest parts — dataset prep, training, evaluation, GGUF export — all in a visual interface. Pre-subscribe at early-bird pricing →
Phase 4: Expand (Days 61-90)
Week 9-10: Production Cutover for Pilot Task
With A/B testing validated, route 100% of your pilot task traffic to the fine-tuned model.
Cutover checklist:
- Export model to GGUF format
- Deploy on your production inference infrastructure (Ollama, vLLM, or llama.cpp)
- Configure monitoring and alerting for quality metrics
- Maintain API fallback (keep the API integration live but dormant — you can route back if needed)
- Update your documentation and runbooks
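The dormant API fallback fits naturally into the inference path. A sketch, where `query_local_model` and `query_api` are hypothetical wrappers around your local inference endpoint and the old API client:

```python
def query_local_model(text):
    # placeholder for your Ollama / vLLM / llama.cpp client
    raise ConnectionError("local inference server is down")

def query_api(text):
    # dormant API integration, kept for emergencies
    return "billing"

def classify(text, max_retries=1):
    """Prefer the owned model; fall back to the API only on failure."""
    for _ in range(max_retries):
        try:
            return query_local_model(text)
        except (ConnectionError, TimeoutError):
            continue
    return query_api(text)  # fallback path; alert on this in production

print(classify("My invoice is wrong"))
```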
Measure the impact:
- Monthly cost reduction (API bill decrease)
- Latency improvement (local inference is typically faster)
- Reliability improvement (no dependency on external API uptime)
- Quality metrics (should maintain the levels validated during A/B testing)
Week 11-12: Begin Next Migration
Apply the same process to your second-highest priority task. This goes faster because you've built the institutional knowledge:
- Your data pipeline is established
- Your evaluation framework exists
- Your deployment infrastructure is running
- Your team understands the fine-tuning workflow
Typical time for subsequent migrations: 3-4 weeks (versus 6 weeks for the first one).
Week 12: Establish Ongoing Cadence
Set up the systems that keep your fine-tuned models current:
Retraining schedule. As your business evolves, your models need updates. Monthly or quarterly retraining with fresh data keeps performance high. Use your production logs as new training data — the model's own outputs (validated by humans) feed back into future training.
Quality monitoring. Track accuracy metrics on an ongoing basis. Set alerts for quality degradation. If accuracy drops below your threshold, trigger a retraining cycle.
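Quality monitoring can start as a rolling accuracy check over human-validated predictions. A sketch with hypothetical threshold and window values:

```python
from collections import deque

THRESHOLD = 0.90   # retrain trigger, matching the 90%+ quality requirement
WINDOW = 200       # most recent human-validated predictions

recent = deque(maxlen=WINDOW)

def record_outcome(correct: bool) -> bool:
    """Record one validated prediction; return True if retraining should trigger."""
    recent.append(correct)
    if len(recent) < WINDOW:
        return False   # not enough data yet
    accuracy = sum(recent) / len(recent)
    return accuracy < THRESHOLD

# Simulate 200 validated outcomes with a 12% error rate: accuracy 0.88 < 0.90.
alert = False
for i in range(200):
    alert = record_outcome(i % 100 >= 12)
print("retrain triggered:", alert)
```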
Version management. Keep previous model versions available for rollback. Track which model version is deployed in each environment.
Common Pitfalls (and How to Avoid Them)
Pitfall 1: Trying to Migrate Everything at Once
The mistake: Spending weeks building an elaborate migration plan for all 8 AI tasks, then attempting to execute in parallel.
The fix: Ship one migration first. Learn from it. Apply those learnings to the next one. Sequential beats parallel when you're building new organisational capability.
Pitfall 2: Insufficient Training Data Quality
The mistake: Dumping 10,000 raw API logs into a training dataset without review. The logs include incorrect outputs, inconsistent formats, and edge cases the API model handled poorly.
The fix: Spend more time on data curation and less on data volume. Review examples. Remove bad ones. Ensure format consistency. A curated dataset of 800 examples outperforms an unreviewed dataset of 5,000.
Pitfall 3: Skipping Shadow Deployment
The mistake: Going straight from evaluation on a test set to production deployment. The test set doesn't capture the full distribution of real-world inputs.
The fix: Always shadow deploy. Always A/B test. The extra 2-3 weeks of validation prevent production incidents that take longer than 2-3 weeks to recover from.
Pitfall 4: Optimising for the Wrong Metric
The mistake: Pursuing 99% accuracy when your API model only achieves 85%. The fine-tuned model hits 92% — better than the API — but the team keeps iterating because it's not "perfect."
The fix: Your benchmark is the current API model, not theoretical perfection. If the fine-tuned model matches or exceeds the API on your metrics, that's a successful migration.
Pitfall 5: Forgetting the Fallback
The mistake: Removing the API integration after migrating to the fine-tuned model. Three months later, you need to retrain the model and have no fallback during the training window.
The fix: Keep the API integration dormant. You're not paying for it if you're not calling it. But having it available for emergencies — even briefly — is worth the minimal maintenance cost.
The Ertas Shortcut
The playbook above works with any fine-tuning toolchain. But much of the manual work — GPU provisioning, training configuration, dataset formatting, GGUF export — can be compressed with the right platform.
With Ertas, Phases 2-3 compress significantly:
- Dataset upload replaces manual JSONL preparation (or use the visual editor)
- One-click training replaces GPU setup, config files, and monitoring scripts
- Built-in evaluation replaces custom evaluation pipelines
- Side-by-side comparison across experiments replaces manual tracking
- GGUF export replaces quantisation toolchains
A migration that takes 6 weeks with manual tooling can compress to 2-3 weeks with an integrated platform. The hardest parts — the ones where teams get stuck — are exactly the parts the platform handles.
After 90 Days
At the end of this playbook, you should have:
- 1-2 production tasks running on fine-tuned models you own
- Proven cost savings documented and quantified
- An evaluation framework ready for future migrations
- Deployment infrastructure running and monitored
- A prioritised list of next tasks to migrate
- Institutional knowledge of the fine-tuning workflow
You're no longer fully API-dependent. You own critical AI capabilities. Your costs are more predictable. Your product is more resilient.
And the next time an AI provider sends a deprecation notice or a pricing change, you'll have options — not just obligations.
Start your migration. Ertas handles the entire pipeline — dataset to GGUF — in a visual interface, no code required. Pre-subscribe at early-bird pricing. See plans →