
    From Prototype to Product: Replacing API Calls with Fine-Tuned Models

    Your Lovable/Bolt prototype works. Users are signing up. But every API call eats your margin. Here's the step-by-step playbook for migrating from cloud APIs to fine-tuned local models in production.

    Ertas Team

    Your prototype works. You built it with Lovable or Bolt or Replit, plugged in the OpenAI API, and users started signing up. Maybe you've got 500 users. Maybe 2,000. The product-market fit feels real.

    But here's the problem you didn't think about at the "let's just ship it" stage: every single AI interaction costs money. Not a lot per call — maybe $0.003 to $0.02 depending on the model and token count — but it adds up. At 2,000 users making 10 AI requests per day, that's 20,000 calls a day; even at the cheap end of that range, you're burning through roughly $1,800/month in API costs alone.

    Your prototype works. Your unit economics don't.

    This guide is the step-by-step playbook for migrating from cloud APIs to fine-tuned local models. It's written for vibecoders who have a working product and need to make the economics work — without breaking what's already working.

    The Prototype-to-Production Gap Nobody Talks About

    Every AI tutorial ends at the same place: "and now your app uses GPT-4!" Nobody talks about what happens at month 3 when you've got users and the API bill is a significant line item.

    The gap between prototype and production isn't code quality or test coverage. For vibe-coded apps, it's cost structure. Your prototype assumed infinite cheap API calls. Production requires predictable, bounded costs.

    Here's the good news: you don't have to rewrite your app. You don't need to become an ML engineer. You need to replace the most expensive API calls with a model you own and run yourself. The rest stays the same.

    Phase 1: Audit Your API Usage

    Before you change anything, understand what you're spending and where. Log every API call for two weeks. For each call, track the following (a minimal logging sketch follows the list):

    • Endpoint/feature: Which feature in your app triggered it?
    • Input tokens: How big is the prompt?
    • Output tokens: How big is the response?
    • Frequency: How many times per day does this call happen?
    • Cost per call: Input tokens × input price + output tokens × output price.
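
    If your app calls OpenAI from Node/TypeScript, a thin wrapper around your existing client captures all of this. Here's a minimal sketch, assuming the official openai SDK; the per-token prices, feature names, and log shape are illustrative, not prescriptive:

    import OpenAI from "openai";

    const openai = new OpenAI();

    // Illustrative per-token prices; swap in your model's actual rates.
    const INPUT_PRICE = 2.5 / 1_000_000;  // $ per input token
    const OUTPUT_PRICE = 10 / 1_000_000;  // $ per output token

    // Wraps a chat completion and logs one JSON line per call:
    // which feature triggered it, token counts, and estimated cost.
    export async function loggedCompletion(feature: string, prompt: string) {
      const res = await openai.chat.completions.create({
        model: "gpt-4o",
        messages: [{ role: "user", content: prompt }],
      });
      const inputTokens = res.usage?.prompt_tokens ?? 0;
      const outputTokens = res.usage?.completion_tokens ?? 0;
      console.log(JSON.stringify({
        ts: new Date().toISOString(),
        feature,
        inputTokens,
        outputTokens,
        cost: inputTokens * INPUT_PRICE + outputTokens * OUTPUT_PRICE,
      }));
      return res.choices[0].message.content;
    }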

    You'll usually find something like this:

    Feature                    Calls/Day   Avg Cost/Call   Daily Cost   Monthly Cost   % of Total
    Chat responses             8,000       $0.008          $64          $1,920         50%
    Content classification     4,500       $0.002          $9           $270           7%
    Summary generation         3,200       $0.005          $16          $480           13%
    Tone analysis              2,800       $0.001          $2.80        $84            2%
    Complex analysis           600         $0.06           $36          $1,080         28%

    This audit is the most important step. It tells you exactly where to focus.

    Phase 2: Identify Fine-Tuning Candidates

    Not every API call should be replaced. Look for calls that are:

    • High volume: The calls that happen thousands of times per day. Even small per-call savings add up.
    • Repetitive: The model does basically the same type of task every time. Classify this. Summarize that. Extract these fields.
    • Domain-specific: Your app has a specific vocabulary, format, or domain. Generic GPT-4 is overkill.
    • Template-like outputs: The response follows a consistent structure or format.

    In the example above, the best candidates are:

    1. Content classification — High volume, simple task, structured output. A fine-tuned 3.8B model handles this easily.
    2. Tone analysis — Simple classification task. Doesn't need a frontier model.
    3. Summary generation — Repetitive task with consistent format. Fine-tuned 7B model territory.
    4. Chat responses — The biggest spend. More complex, but if your chat follows patterns (customer support, domain Q&A), a fine-tuned 8B model works well.

    What stays on the API:

    1. Complex analysis — Low volume, high complexity, variable reasoning. Keep this on GPT-4o or Claude. It's about 28% of your spend, but it's the slice that genuinely needs frontier-model capability.

    Phase 3: Collect Training Data from Your API Logs

    You've been paying OpenAI for months. The silver lining: every API call you've made is a training example.

    For each fine-tuning candidate, extract 2,000–5,000 input-output pairs from your logs. Filter for quality:

    • Remove examples where users complained or corrected the output
    • Remove examples with error responses or truncated outputs
    • Remove outliers (abnormally long prompts, edge cases)
    • Keep examples that represent the "happy path" of each feature

    Format as JSONL:

    {"input": "Classify the following customer message: 'I need to update my billing address'", "output": "category: account_management, intent: update_info, urgency: low"}
    

    Pro tip: if you don't have enough examples for a specific feature, run your current API setup for another week and log aggressively. 2,000 clean examples is the minimum for a solid fine-tune. 5,000 is better.
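
    Here's a minimal extraction sketch, assuming your logs are JSON lines shaped like the logging wrapper above plus the raw prompt, response, and a userFlagged field (all hypothetical names); adjust the filters to whatever your logs actually contain:

    import * as fs from "node:fs";
    import * as readline from "node:readline";

    // Pulls clean input-output pairs for one feature out of a JSONL log file.
    async function extractTrainingData(logPath: string, feature: string, outPath: string) {
      const out = fs.createWriteStream(outPath);
      const rl = readline.createInterface({ input: fs.createReadStream(logPath) });
      for await (const line of rl) {
        const entry = JSON.parse(line);
        if (entry.feature !== feature) continue;
        if (entry.error || entry.userFlagged) continue;               // errors and complaints
        if (!entry.response || entry.prompt.length > 8000) continue;  // truncated or outliers
        out.write(JSON.stringify({ input: entry.prompt, output: entry.response }) + "\n");
      }
      out.end();
    }

    await extractTrainingData("api-calls.jsonl", "content_classification", "classify-train.jsonl");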

    Phase 4: Fine-Tune with Ertas

    For each candidate feature, train a separate LoRA adapter. Why separate? Because each feature has different requirements, and separate adapters let you update one without touching others.

    Here's the actual workflow:

    1. Upload your dataset to Ertas Vault. One dataset per feature.
    2. Select your base model. For simple tasks (classification, extraction): Phi-4-mini (3.8B). For moderate tasks (summaries, generation): Qwen 2.5 7B or Llama 3.1 8B.
    3. Configure training. The defaults work well. If you want to tweak: 3–5 epochs, learning rate 2e-4, LoRA rank 16–32.
    4. Train. Takes 20–60 minutes per adapter depending on dataset size.
    5. Evaluate. Ertas shows you accuracy metrics against a held-out test set. You want 90%+ match rate on your task before proceeding.

    If accuracy is below 85%, the fix is almost always data quality. Go back to your training examples, remove noisy ones, add more representative ones, and retrain.
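
    If you want to sanity-check the match rate yourself once the model is running locally (see Phase 5), an exact-match evaluation is a few lines. This sketch assumes a held-out JSONL test set in the same input/output format and any prompt-in, string-out model function:

    import * as fs from "node:fs";

    // Exact-match rate of a model against a held-out JSONL test set.
    async function matchRate(testPath: string, model: (prompt: string) => Promise<string>) {
      const lines = fs.readFileSync(testPath, "utf8").trim().split("\n");
      let hits = 0;
      for (const line of lines) {
        const { input, output } = JSON.parse(line);
        const prediction = (await model(input)).trim();
        if (prediction === output.trim()) hits++;
      }
      return hits / lines.length; // aim for 0.90+ before migrating traffic
    }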

    Phase 5: Deploy Alongside Your App

    Export each fine-tuned model as a GGUF file (Q5_K_M quantization). Deploy to a VPS running Ollama.
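
    Registering an exported GGUF with Ollama is a two-line Modelfile plus one command (the file and model names below are placeholders):

    # Modelfile
    FROM ./classify-model.Q5_K_M.gguf

    ollama create classify-model -f Modelfile

    Repeat once per adapter, so each feature gets its own named model.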

    Your production setup looks like this:

    ┌──────────────┐     ┌──────────────────────┐
    │  Your App    │────▶│  Your VPS (Ollama)   │
    │  (Vercel /   │     │                      │
    │   Railway)   │     │  ├─ classify-model   │
    │              │     │  ├─ summary-model    │
    │              │────▶│  └─ chat-model       │
    │              │     └──────────────────────┘
    │              │
    │              │     ┌──────────────────────┐
    │              │────▶│  OpenAI API          │
    │              │     │  (complex tasks only)│
    └──────────────┘     └──────────────────────┘
    

    Ollama serves an OpenAI-compatible API on port 11434. Your app code barely changes — you're swapping a URL and model name, not rewriting logic.
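
    Concretely, with the official openai SDK the swap is one constructor argument; the host and model names below are placeholders:

    import OpenAI from "openai";

    // Same SDK, different base URL: Ollama exposes the OpenAI chat API under /v1.
    const local = new OpenAI({
      baseURL: "http://your-vps-host:11434/v1",
      apiKey: "ollama", // required by the SDK, ignored by Ollama
    });

    const res = await local.chat.completions.create({
      model: "classify-model", // the name you gave `ollama create`
      messages: [{ role: "user", content: "Classify: 'I need to update my billing address'" }],
    });
    console.log(res.choices[0].message.content);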

    For multiple models on one server, Ollama handles model loading and unloading automatically. A 32GB RAM VPS ($50–80/mo) comfortably serves 2–3 fine-tuned 7B models.

    Phase 6: Gradual Migration

    Don't flip the switch all at once. Migrate gradually:

    Week 1 — Shadow mode. Send every request to both your fine-tuned model AND the API. Compare outputs. Don't serve the fine-tuned output to users yet. Log discrepancies.

    Week 2 — 10% routing. Route 10% of traffic to the fine-tuned model. Monitor user behavior — are completion rates, engagement metrics, and error rates stable?

    Week 3 — 50% routing. Half your traffic goes to the fine-tuned model. Your API bill should drop noticeably. Watch for edge cases the fine-tuned model mishandles.

    Week 4 — Full cutover. Route all candidate traffic to the fine-tuned model. Keep the API as a fallback for requests where confidence is low.
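
    A minimal version of this routing logic, covering both shadow mode and the percentage rollout, might look like the sketch below; the two ModelCall functions stand in for your Ollama-backed and OpenAI-backed calls:

    type ModelCall = (prompt: string) => Promise<string>;

    // Week 1, shadow mode: users get the API output, disagreements get logged.
    async function shadowCall(prompt: string, local: ModelCall, cloud: ModelCall) {
      const [apiOut, localOut] = await Promise.all([
        cloud(prompt),
        local(prompt).catch(() => null), // a local failure must never reach users
      ]);
      if (localOut !== apiOut) {
        console.log(JSON.stringify({ prompt, apiOut, localOut }));
      }
      return apiOut;
    }

    // Weeks 2-4: percentage rollout, with the API kept as a fallback.
    function makeRouter(local: ModelCall, cloud: ModelCall, rolloutPct: number): ModelCall {
      return async (prompt) => {
        if (Math.random() * 100 < rolloutPct) {
          try {
            return await local(prompt);
          } catch (err) {
            console.error("local model failed, falling back to API", err);
          }
        }
        return cloud(prompt);
      };
    }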

    This gradual approach lets you catch problems before they affect all your users. It's the difference between "we migrated smoothly" and "we broke the app for a day."

    The Before and After Architecture

    Before: All API

    • Every AI feature → OpenAI API
    • Cost: $3,834/month and growing
    • Risk: One API pricing change breaks your business
    • Latency: 200–800ms per request (network + inference)

    After: Hybrid

    • Classification, tone, summaries, chat → Fine-tuned models on VPS
    • Complex analysis → OpenAI API (kept for edge cases)
    • Cost: $80/month (VPS) + $14.50/month (Ertas) + $1,080/month (API for complex analysis) = $1,174.50/month
    • Savings: $2,659.50/month (a 69% reduction)
    • Latency: 50–200ms for local models, faster than API

    Going from $3,834 to roughly $1,175 per month is what happens when you stop paying per token for tasks a small fine-tuned model handles better anyway. The remaining API spend is confined to the one feature that genuinely needs a frontier model.

    What to Keep on APIs

    Be honest about what fine-tuned models can't do:

    • Open-ended creative tasks where the user can ask literally anything
    • Complex multi-step reasoning that requires a frontier model's capability
    • Tasks where you have < 500 training examples (not enough to fine-tune well)
    • New features you're still iterating on (use the API until the feature stabilizes, then fine-tune)

    The goal isn't zero API costs. It's routing the right requests to the right models. Frontier models for frontier tasks. Your models for your tasks.

    The Migration Mindset

    Here's the thing most vibecoders miss: the migration from API to fine-tuned models isn't a one-time project. It's a mindset shift.

    Every new feature starts on the API. That's fine — it's the fastest way to validate. But once a feature stabilizes and you know what good output looks like, you fine-tune. You move it to your stack. You own it.

    Over time, your API costs asymptotically approach a small fixed number — just the genuinely complex stuff. Everything routine runs on models you own, on infrastructure you control, at costs that don't scale with your users.

    That's what production looks like for a vibe-coded app in 2026.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
