
    From API-Dependent to Model Owner: A 90-Day Migration Playbook

    A phased, risk-managed plan for migrating your AI workloads from cloud APIs to fine-tuned models you own. Week-by-week breakdown with concrete milestones for each phase.

Ertas Team

    You've read about vendor dependency risks. You've done the independence checklist. You know the cost math works in favour of owned models. Now you need a plan.

    This playbook covers the first 90 days of migrating from API-dependent to model-owning. It's designed for teams without ML expertise, assuming you have access to your API logs and domain data. The goal isn't to eliminate all API usage — it's to own your most critical AI capabilities and build a foundation for continued independence.

    Before You Start: The Migration Mindset

    Two principles make the difference between a smooth migration and a painful one:

    Parallel run, not cold switch. You're not ripping out your API integration and replacing it with a fine-tuned model on day one. You're running both side-by-side, comparing quality, and routing traffic gradually. The API stays live until the fine-tuned model proves itself.

    Start narrow, expand systematically. Don't try to migrate everything at once. Pick one task. Get it right. Build confidence and institutional knowledge. Then repeat.

    Phase 1: Audit (Days 1-14)

    Week 1: Inventory Your AI Touchpoints

    Map every place your application or workflow calls an AI API. For each touchpoint, document:

• Task description: Classify support tickets into categories
• Provider/model: OpenAI GPT-4o-mini
• Monthly volume: 12,000 requests
• Monthly cost: $340
• Input format: Unstructured text (1-3 paragraphs)
• Output format: Single category label from a predefined list
• Quality requirement: 90%+ accuracy
• Criticality: High (routes tickets to the correct team)
• Training data available: Yes (18 months of classified tickets in the CRM)

    Most teams discover they have 3-8 distinct AI tasks in production. Some have more.

    Week 2: Score and Prioritise

    Score each task on three dimensions:

    Fine-tuning suitability (1-5):

    • Consistent input/output format → higher score
    • Large volume → higher score
    • Available training data → higher score
    • Domain-specific vocabulary or knowledge → higher score
    • Subjective or creative output → lower score

    Business impact (1-5):

    • High monthly cost → higher score
    • Customer-facing → higher score
    • SLA-sensitive → higher score
    • Revenue-generating → higher score

    Migration complexity (1-5, lower is better):

    • Simple classification/extraction → low complexity
    • Multi-step reasoning → medium complexity
    • Open-ended generation → higher complexity
    • Multi-modal (text + images) → highest complexity

    Priority = Suitability × Impact ÷ Complexity
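
As a worked example, here's the scoring in code; the task names and their scores are hypothetical:

```python
# Hypothetical scores for three inventoried tasks (1-5 on each dimension).
tasks = {
    "ticket_classification": {"suitability": 5, "impact": 4, "complexity": 1},
    "lead_scoring": {"suitability": 4, "impact": 5, "complexity": 2},
    "faq_generation": {"suitability": 3, "impact": 3, "complexity": 3},
}

def priority(s: dict) -> float:
    # Priority = Suitability x Impact / Complexity (higher is better).
    return s["suitability"] * s["impact"] / s["complexity"]

for name, scores in sorted(tasks.items(), key=lambda kv: priority(kv[1]), reverse=True):
    print(f"{name}: {priority(scores):.1f}")
# ticket_classification: 20.0  <- pilot migration target
# lead_scoring: 10.0
# faq_generation: 3.0
```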

    Your highest-scoring task is your pilot migration target. In most businesses, it's one of these:

    • Customer support ticket classification/routing
    • Content generation in a specific format
    • Data extraction from structured documents
    • FAQ/knowledge base response generation
    • Lead qualification or scoring

    Phase 2: Pilot (Days 15-45)

    Week 3: Prepare Your Training Dataset

    Your API logs are your training data. Extract input/output pairs from your production system.

    Minimum dataset size: 500 high-quality examples. This is enough for a well-defined task with consistent format.

    Recommended: 1,000-2,000 examples. Gives the model more edge cases to learn from.

    Quality over quantity. 500 carefully reviewed examples outperform 5,000 noisy ones. Spend time on data quality, not just volume.

Dataset preparation steps (a code sketch follows the list):

    1. Export raw data. Pull input/output pairs from your API logs, CRM, or database. Format as JSONL with the chat message structure your training tool expects.

    2. Filter for quality. Remove examples where the API output was incorrect, poorly formatted, or required manual correction. You want only examples of the task done right.

    3. Deduplicate. Near-identical examples add noise. Remove duplicates and near-duplicates.

    4. Balance categories. If you're training a classifier, ensure reasonable representation across all categories. Extreme imbalance (90% category A, 2% category B) causes the model to underperform on minority categories.

    5. Split the data. Reserve 10-15% as a test set that won't be used in training. This is your evaluation benchmark.
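
A minimal sketch of steps 1, 3, and 5, assuming your export is a JSON list of records with "input" and "output" fields; the file names and chat-message keys are placeholders to adapt to your training tool:

```python
import json
import random

# Step 1: load exported input/output pairs (field names are assumptions).
with open("api_logs_export.json") as f:
    records = json.load(f)

# Step 3: deduplicate on normalised input text.
seen, unique = set(), []
for r in records:
    key = r["input"].strip().lower()
    if key not in seen:
        seen.add(key)
        unique.append(r)

# Step 5: reserve ~10% as the held-out evaluation benchmark.
random.seed(42)
random.shuffle(unique)
split = int(len(unique) * 0.9)
train, test = unique[:split], unique[split:]

def write_jsonl(path, rows):
    # Write JSONL in a chat-message structure; adjust roles/keys to
    # whatever your training tool expects.
    with open(path, "w") as f:
        for r in rows:
            example = {"messages": [
                {"role": "user", "content": r["input"]},
                {"role": "assistant", "content": r["output"]},
            ]}
            f.write(json.dumps(example) + "\n")

write_jsonl("train.jsonl", train)
write_jsonl("test.jsonl", test)
```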

    Week 4-5: Fine-Tune the Model

    Select your base model. For most business tasks:

    • 7B parameters — Fast inference, runs on consumer hardware, good for classification and extraction
    • 14B parameters — Better for generation tasks, requires more compute but still practical
    • Llama 3, Qwen 2.5, or Mistral — All production-quality, all commercially permissive

    Choose your training approach:

    • LoRA/QLoRA — The standard approach. Trains lightweight adapters (50-200MB) on top of frozen base weights. Memory-efficient, fast to train, and the adapter is portable.
    • Full fine-tuning — Modifies all weights. Better for complex tasks but requires more compute. Usually unnecessary for well-defined business tasks.

    Training configuration (starting point):

    • Learning rate: 2e-4
    • Batch size: 4-8
    • Epochs: 2-3
    • LoRA rank: 32
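
The playbook is toolchain-agnostic, but for teams hand-rolling the run, here's roughly what that starting configuration looks like with Hugging Face's datasets, peft, and trl libraries; the base model ID, file name, and lora_alpha value are assumptions:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Training pairs produced in Week 3 (file name is a placeholder).
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# LoRA adapter trained on top of frozen base weights.
peft_config = LoraConfig(
    r=32,            # LoRA rank from the starting configuration
    lora_alpha=64,   # a common choice of 2x rank (an assumption)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="ticket-classifier-lora",
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    num_train_epochs=3,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",  # any of the bases above; placeholder choice
    train_dataset=dataset,
    peft_config=peft_config,
    args=args,
)
trainer.train()
```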

    Using Ertas: Upload your JSONL dataset, select your base model, and start training. The platform handles GPU provisioning, hyperparameter management, and progress tracking. Setup takes about 2 minutes. Training time depends on dataset size and model — typically 15-60 minutes for a LoRA fine-tune.

    Run 2-3 experiments. Try different base models, LoRA ranks, or training durations. Side-by-side comparison across experiments helps you find the best configuration.

    Week 6: Evaluate

    Run your held-out test set through both the API model and your fine-tuned model. Compare:

    Quantitative metrics:

    • Accuracy (for classification/extraction tasks)
    • Format compliance (does the output match your expected structure?)
    • Consistency (same answer for equivalent inputs?)
    • Latency (response time per request)
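
To make the comparison concrete, here's a minimal evaluation sketch for a classification task, measuring accuracy and format compliance against the Week 3 test set; call_api_model and call_finetuned_model are placeholder names for your two inference paths:

```python
import json

def evaluate(predict, test_path="test.jsonl", labels=None):
    """Accuracy and format compliance for a classification task."""
    correct = compliant = total = 0
    with open(test_path) as f:
        for line in f:
            ex = json.loads(line)
            prompt = ex["messages"][0]["content"]
            expected = ex["messages"][1]["content"].strip()
            got = predict(prompt).strip()
            total += 1
            if labels is None or got in labels:
                compliant += 1   # output is a valid category label
            if got == expected:
                correct += 1
    return correct / total, compliant / total

LABELS = {"billing", "technical", "account", "other"}  # example category set

# call_api_model / call_finetuned_model: placeholders for your inference paths.
for name, fn in [("api", call_api_model), ("fine-tuned", call_finetuned_model)]:
    acc, fmt = evaluate(fn, labels=LABELS)
    print(f"{name}: accuracy={acc:.1%}, format compliance={fmt:.1%}")
```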

    Quality threshold: For domain-specific tasks with good training data, expect:

    • 90-95% accuracy on classification and extraction
    • Within 5-10% of the API model on generation quality
    • Format compliance above 98%

    If the fine-tuned model falls short:

    • Add more training examples in the areas where it underperforms
    • Check for data quality issues (mislabelled examples, inconsistent formats)
    • Try a larger base model (7B → 14B)
    • Increase the LoRA rank for more capacity

    Most quality gaps are fixed with better data, not bigger models.

    Phase 3: Validate (Days 46-60)

    Week 7-8: Shadow Deployment

    Deploy your fine-tuned model alongside the API. Route all production traffic through both models, but only serve the API model's response to users.

Compare outputs in real time:

    • Log both responses for every request
    • Flag disagreements for human review
    • Track quality metrics over real production traffic (not just test set performance)
    • Monitor for edge cases that didn't appear in your training data

    Shadow deployment catches issues that static evaluation misses:

    • Input distribution shifts (real traffic patterns differ from training data)
    • Rare edge cases (inputs your test set didn't cover)
    • Format variations (users don't always write like your training examples)
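
In code, the dual-call pattern is small; this sketch reuses the same placeholder inference functions as above and logs to a local JSONL file:

```python
import json
import time

def handle_request(prompt):
    """Shadow deployment: users see the API answer; both answers are logged."""
    api_answer = call_api_model(prompt)          # placeholder inference calls
    shadow_answer = call_finetuned_model(prompt)

    record = {
        "ts": time.time(),
        "prompt": prompt,
        "api": api_answer,
        "shadow": shadow_answer,
        "disagree": api_answer.strip() != shadow_answer.strip(),
    }
    with open("shadow_log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")  # disagreements go to human review

    return api_answer  # only the API response is served
```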

    Week 8-9: A/B Test

    Once shadow deployment confirms quality parity, run a real A/B test:

    • Route 10-20% of production traffic to the fine-tuned model
    • Serve the fine-tuned model's response to those users
    • Compare business metrics: user satisfaction, task completion rate, error rate
    • Expand to 50% if metrics hold
    • Monitor for at least one full week at each traffic percentage
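
A common way to implement the split is deterministic bucketing on a stable user ID, so each user consistently sees the same arm; this is a sketch, and the rollout percentage is the knob you turn at each stage:

```python
import hashlib

def route_to_finetuned(user_id: str, rollout_percent: int = 10) -> bool:
    """Deterministic bucketing: the same user always gets the same arm."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

def handle_request(user_id, prompt):
    if route_to_finetuned(user_id, rollout_percent=10):
        return call_finetuned_model(prompt), "fine-tuned"
    return call_api_model(prompt), "api"  # tag the arm for metric comparison
```

Hash-based bucketing keeps assignments stable without storing per-user state, which makes the business-metric comparison clean.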

    Decision criteria for proceeding:

    • Quality within 5% of the API model on your key metrics
    • No increase in user complaints or error reports
    • Format compliance above 95%
    • Latency within acceptable range for your application

    Start your 90-day migration. Ertas handles the hardest parts — dataset prep, training, evaluation, GGUF export — all in a visual interface. Pre-subscribe at early-bird pricing →

    Phase 4: Expand (Days 61-90)

    Week 9-10: Production Cutover for Pilot Task

    With A/B testing validated, route 100% of your pilot task traffic to the fine-tuned model.

    Cutover checklist:

    • Export model to GGUF format
    • Deploy on your production inference infrastructure (Ollama, vLLM, or llama.cpp)
    • Configure monitoring and alerting for quality metrics
    • Maintain API fallback (keep the API integration live but dormant — you can route back if needed)
    • Update your documentation and runbooks
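
The fallback pattern from the checklist might look like this; the sketch assumes an Ollama server on its default port, and the model name and call_api_model are placeholders:

```python
import requests

def generate(prompt: str) -> str:
    """Local model first; fall back to the hosted API if inference fails."""
    try:
        resp = requests.post(
            "http://localhost:11434/api/generate",  # Ollama's default endpoint
            json={"model": "ticket-classifier", "prompt": prompt, "stream": False},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["response"]
    except requests.RequestException:
        # Dormant fallback: only billed when actually called.
        return call_api_model(prompt)  # placeholder for your old API path
```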

    Measure the impact:

    • Monthly cost reduction (API bill decrease)
    • Latency improvement (local inference is typically faster)
    • Reliability improvement (no dependency on external API uptime)
    • Quality metrics (should maintain the levels validated during A/B testing)

    Week 11-12: Begin Next Migration

    Apply the same process to your second-highest priority task. This goes faster because you've built the institutional knowledge:

    • Your data pipeline is established
    • Your evaluation framework exists
    • Your deployment infrastructure is running
    • Your team understands the fine-tuning workflow

    Typical time for subsequent migrations: 3-4 weeks (versus 6 weeks for the first one).

    Week 12: Establish Ongoing Cadence

    Set up the systems that keep your fine-tuned models current:

    Retraining schedule. As your business evolves, your models need updates. Monthly or quarterly retraining with fresh data keeps performance high. Use your production logs as new training data — the model's own outputs (validated by humans) feed back into future training.

    Quality monitoring. Track accuracy metrics on an ongoing basis. Set alerts for quality degradation. If accuracy drops below your threshold, trigger a retraining cycle.
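
A minimal sketch of that alerting loop, assuming you sample production predictions for human review; the window size, threshold, and alert hook are all placeholders:

```python
from collections import deque

WINDOW = 500       # most recent human-reviewed predictions
THRESHOLD = 0.90   # retrain trigger, matching the pilot's accuracy target

recent = deque(maxlen=WINDOW)

def record_outcome(was_correct: bool):
    """Call this each time a reviewer validates a production prediction."""
    recent.append(was_correct)
    if len(recent) == WINDOW:
        accuracy = sum(recent) / WINDOW
        if accuracy < THRESHOLD:
            trigger_retraining_alert(accuracy)  # placeholder: page/Slack/etc.
```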

    Version management. Keep previous model versions available for rollback. Track which model version is deployed in each environment.

    Common Pitfalls (and How to Avoid Them)

    Pitfall 1: Trying to Migrate Everything at Once

    The mistake: Spending weeks building an elaborate migration plan for all 8 AI tasks, then attempting to execute in parallel.

    The fix: Ship one migration first. Learn from it. Apply those learnings to the next one. Sequential beats parallel when you're building new organisational capability.

    Pitfall 2: Insufficient Training Data Quality

    The mistake: Dumping 10,000 raw API logs into a training dataset without review. The logs include incorrect outputs, inconsistent formats, and edge cases the API model handled poorly.

    The fix: Spend more time on data curation and less on data volume. Review examples. Remove bad ones. Ensure format consistency. A curated dataset of 800 examples outperforms an unreviewed dataset of 5,000.

    Pitfall 3: Skipping Shadow Deployment

    The mistake: Going straight from evaluation on a test set to production deployment. The test set doesn't capture the full distribution of real-world inputs.

    The fix: Always shadow deploy. Always A/B test. The extra 2-3 weeks of validation prevent production incidents that take longer than 2-3 weeks to recover from.

    Pitfall 4: Optimising for the Wrong Metric

    The mistake: Pursuing 99% accuracy when your API model only achieves 85%. The fine-tuned model hits 92% — better than the API — but the team keeps iterating because it's not "perfect."

    The fix: Your benchmark is the current API model, not theoretical perfection. If the fine-tuned model matches or exceeds the API on your metrics, that's a successful migration.

    Pitfall 5: Forgetting the Fallback

    The mistake: Removing the API integration after migrating to the fine-tuned model. Three months later, you need to retrain the model and have no fallback during the training window.

    The fix: Keep the API integration dormant. You're not paying for it if you're not calling it. But having it available for emergencies — even briefly — is worth the minimal maintenance cost.

    The Ertas Shortcut

    The playbook above works with any fine-tuning toolchain. But much of the manual work — GPU provisioning, training configuration, dataset formatting, GGUF export — can be compressed with the right platform.

    With Ertas, Phases 2-3 compress significantly:

    • Dataset upload replaces manual JSONL preparation (or use the visual editor)
    • One-click training replaces GPU setup, config files, and monitoring scripts
    • Built-in evaluation replaces custom evaluation pipelines
    • Side-by-side comparison across experiments replaces manual tracking
    • GGUF export replaces quantisation toolchains

    A migration that takes 6 weeks with manual tooling can compress to 2-3 weeks with an integrated platform. The hardest parts — the ones where teams get stuck — are exactly the parts the platform handles.

    After 90 Days

    At the end of this playbook, you should have:

    • 1-2 production tasks running on fine-tuned models you own
    • Proven cost savings documented and quantified
    • An evaluation framework ready for future migrations
    • Deployment infrastructure running and monitored
    • A prioritised list of next tasks to migrate
    • Institutional knowledge of the fine-tuning workflow

    You're no longer fully API-dependent. You own critical AI capabilities. Your costs are more predictable. Your product is more resilient.

    And the next time an AI provider sends a deprecation notice or a pricing change, you'll have options — not just obligations.


    Start your migration. Ertas handles the entire pipeline — dataset to GGUF — in a visual interface, no code required. Pre-subscribe at early-bird pricing. See plans →

