    Tags: A/B testing · fine-tuning · GPT-4 · production · migration · evaluation · SaaS

    A/B Testing Your Fine-Tuned Model Against GPT-4 in Production

    How to safely migrate from cloud AI APIs to a fine-tuned model by running a production A/B test. Covers routing architecture, what to measure, statistical significance, and the gradual migration path from 10% to 100%.

    Ertas Team

    You've fine-tuned a model on your domain data. The eval results look promising. Offline tests show accuracy matching GPT-4 on your specific tasks. But deploying to production feels risky — what if it's worse for real users in ways your eval dataset didn't capture?

    The answer isn't guessing. It's A/B testing. Route a percentage of production traffic to your fine-tuned model, measure outcomes, and migrate gradually as confidence builds.

    The Routing Architecture

    The simplest implementation: an API gateway that routes requests to either your cloud API or your local fine-tuned model based on a split percentage.

    User Request
         ↓
    API Gateway (router)
         ↓
    ┌─────────────────────┐
    │  10% → Ollama       │  (fine-tuned model, local)
    │  90% → OpenAI API   │  (GPT-4o, cloud)
    └─────────────────────┘
         ↓
    Log: model_used, input, output, metrics
         ↓
    Response to User
    

    Implementation options:

    • Simple: Random assignment per request (Math.random() < 0.1 → Ollama, else → OpenAI)
    • Better: Consistent assignment per user session (same user always gets the same model during a session, preventing inconsistent behavior)
    • Best: Feature flag service (LaunchDarkly, PostHog, custom) that lets you adjust the split without deployment

    Ollama exposes an OpenAI-compatible API, so the router's job is just switching the base URL; the request format is identical on both paths.
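The "better" option above — consistent per-session assignment plus a base-URL switch — can be sketched in a few lines. The split value, backend URLs, and model names here are placeholder assumptions, not a prescribed setup:

```javascript
// Deterministically map a session ID to [0, 1) with a small string hash (FNV-1a),
// so the same session always lands in the same bucket.
function hashToUnit(sessionId) {
  let h = 0x811c9dc5;
  for (let i = 0; i < sessionId.length; i++) {
    h ^= sessionId.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) / 0x100000000; // unsigned 32-bit → [0, 1)
}

// Same user, same session, same model — avoids mid-session behavior changes.
function pickVariant(sessionId, splitToFineTuned) {
  return hashToUnit(sessionId) < splitToFineTuned ? 'fine-tuned' : 'gpt-4o';
}

// Both backends speak the OpenAI chat format, so routing is only a URL switch.
const BACKENDS = {
  'fine-tuned': { baseUrl: 'http://localhost:11434/v1', model: 'my-finetune' }, // Ollama
  'gpt-4o':     { baseUrl: 'https://api.openai.com/v1', model: 'gpt-4o' },
};

function routeRequest(sessionId, splitToFineTuned) {
  const variant = pickVariant(sessionId, splitToFineTuned);
  return { variant, ...BACKENDS[variant] };
}
```

With a feature-flag service, `splitToFineTuned` becomes a flag value read at request time, so the percentage can move from 10% to 50% to 90% without a deployment.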

    What to Measure

    Primary Metrics (Business Outcomes)

    These tell you whether the fine-tuned model produces the same user outcomes as GPT-4:

    Metric | What it measures | How to capture
    Task completion rate | Did the user accomplish what they came to do? | Track downstream actions (e.g., user clicks "resolved" after support response)
    User satisfaction | Is the user happy with the AI output? | Thumbs up/down on responses, NPS, or CSAT
    Error rate | Did the model produce outputs requiring human correction? | Track manual overrides or corrections
    Conversion impact | Does the AI feature drive the desired business action? | Track conversion events downstream of AI interaction

    Secondary Metrics (Model Quality)

    These tell you about the model's technical performance:

    Metric | What it measures
    Response latency | Time from request to complete response (local should be faster)
    Format compliance | Percentage of responses matching expected output structure
    Hallucination rate | Factual errors in responses (requires spot-check review)
    Fallback rate | How often the model produces a non-response or error

    Cost Metrics

    Metric | GPT-4o | Fine-tuned local
    Cost per request | $0.003-0.01 | ~$0
    Monthly cost (at test volume) | Track actual spend | Hardware cost (fixed)
    Projected monthly cost at 100% | Extrapolate from test | Same fixed cost
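The "extrapolate from test" row is plain arithmetic. A sketch, with the volume and per-request price as illustrative assumptions rather than real numbers:

```javascript
// Extrapolate monthly cloud spend at 100% traffic from observed test-phase numbers.
function projectedMonthlyCost(requestsPerDay, costPerRequest) {
  return requestsPerDay * 30 * costPerRequest;
}

// Example: 500 requests/day at an average of $0.006 per request.
const cloudAt100 = projectedMonthlyCost(500, 0.006); // → 90 ($/month)
// The local model's cost is fixed hardware/power, independent of volume.
```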

    How Many Queries You Need

    A/B tests require statistical significance. The number of queries needed depends on the difference you're trying to detect and your current metrics:

    Rule of thumb: To detect a 5% difference in task completion rate with 95% confidence:

    • If current rate is 80%: ~1,500 queries per variant
    • If current rate is 90%: ~3,500 queries per variant
    • If current rate is 95%: ~7,300 queries per variant

    At 500 queries per day, a 10% split sends ~50 queries/day to the fine-tuned model. At that rate:

    • Detecting a 5% difference: ~30-70 days
    • Detecting a 10% difference: ~7-18 days

    Practical guidance: Run the test for at least 2 weeks regardless of volume, to capture day-of-week and time-of-day patterns.
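The rule-of-thumb counts above come from standard two-proportion power analysis. A sketch of the underlying formula, assuming 95% confidence and 80% power (the exact numbers you get depend on the power you choose and on whether the detectable difference is absolute or relative, so treat any single figure as an estimate):

```javascript
// Per-variant sample size to detect an absolute drop in a success rate,
// using the standard two-proportion z-test approximation:
// n = (z_alpha + z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
function sampleSizePerVariant(baseRate, minDetectableDrop, zAlpha = 1.96, zBeta = 0.84) {
  const p1 = baseRate;
  const p2 = baseRate - minDetectableDrop; // the worse rate we want to catch
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / minDetectableDrop ** 2);
}
```

Smaller differences need disproportionately more queries: halving the detectable drop roughly quadruples the required sample, which is why a 10% split at modest volume can take weeks to reach significance.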

    The Gradual Migration Path

    Phase 1: Validation (10% traffic, 2-4 weeks)

    Route 10% of traffic to the fine-tuned model. Monitor all metrics. The goal isn't to prove the fine-tuned model is better — it's to prove it's not significantly worse.

    Pass criteria:

    • Task completion rate within 3% of GPT-4
    • No increase in error rate
    • No increase in user complaints
    • Format compliance meets your threshold

    If the fine-tuned model fails these criteria, don't abandon it. Investigate the failures, add training examples for the failure patterns, retrain, and test again.

    Phase 2: Expansion (50% traffic, 2-4 weeks)

    If Phase 1 passes, increase to 50%. At this volume, you'll see rarer edge cases surface. Monitor the same metrics.

    This phase also tests infrastructure: can your local hardware handle 50% of production traffic? Are there latency issues under load? Does Ollama behave well under sustained request volume?

    Phase 3: Majority (90% traffic, 1-2 weeks)

    Flip the ratio: 90% fine-tuned, 10% GPT-4. The GPT-4 path serves as a control group and fallback. This is your "soft launch" — the fine-tuned model handles the vast majority of traffic, but you still have a safety net.

    Phase 4: Full Migration (100% traffic)

    Disable the GPT-4 path. All traffic goes through your fine-tuned model. Keep GPT-4 credentials active for emergency rollback.

    Total migration timeline: 6-12 weeks from first test to full migration. This timeline feels slow, but the confidence you gain is worth the patience. Shipping a bad model to 100% of users is far more expensive than an extra month of testing.

    Common A/B Test Results

    Based on typical fine-tuning outcomes:

    Result 1: Fine-Tuned Model Wins on Domain Tasks

    The most common result. The fine-tuned model outperforms GPT-4 on your specific domain tasks — better accuracy, more consistent format, fewer hallucinations about your product.

    Action: Migrate to fine-tuned model. The cost savings are a bonus on top of quality improvement.

    Result 2: Fine-Tuned Model Matches GPT-4

    The fine-tuned model produces equivalent outcomes. Metrics are within noise of each other.

    Action: Migrate to fine-tuned model. Equal quality at dramatically lower cost is an easy decision.

    Result 3: Fine-Tuned Model Loses on Specific Scenarios

    The fine-tuned model handles 90% of cases well but struggles with a specific subset — unusual queries, complex multi-step reasoning, or scenarios underrepresented in training data.

    Action: Two options:

    1. Add training examples for the failure scenarios and retrain
    2. Use a hybrid approach: fine-tuned model for the 90%, GPT-4 fallback for the 10% (still massive cost savings)
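Option 2 can be sketched as a try-then-fallback wrapper. The health check here is a placeholder — in practice it might be a format check, an empty-output check, or a timeout — and the two call functions stand in for whatever client code you already have:

```javascript
// Hybrid routing: try the fine-tuned model first; fall back to GPT-4 when the
// local response errors out or fails a cheap health check.
async function hybridComplete(prompt, callFineTuned, callGpt4, looksHealthy) {
  try {
    const local = await callFineTuned(prompt);
    if (looksHealthy(local)) return { output: local, model: 'fine-tuned' };
  } catch (err) {
    // local path failed — fall through to the cloud path
  }
  const cloud = await callGpt4(prompt);
  return { output: cloud, model: 'gpt-4o' };
}
```

Because the fallback only fires on the hard 10%, the cost profile stays close to the fully local setup while quality holds on the edge cases.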

    Result 4: Fine-Tuned Model Is Significantly Worse

    This is rare when you've done proper offline evaluation first. If it happens:

    • Check your training data quality (data quality checklist)
    • Verify quantization isn't causing issues (test at Q8_0 instead of Q4_K_M)
    • Ensure the base model is appropriate for your task
    • Consider whether your task genuinely needs frontier intelligence (when not to fine-tune)

    Implementation Tips

    Log Everything

    Log every request and response for both variants. You'll need this data for:

    • Debugging failures in the fine-tuned model
    • Building training data for the next retraining cycle
    • Proving to stakeholders that the migration was data-driven
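A minimal shape for those log records — the field names are illustrative, not a fixed schema:

```javascript
// One record per request, capturing everything the analysis phase needs.
function buildLogEntry({ sessionId, variant, input, output, latencyMs, costUsd }) {
  return {
    timestamp: new Date().toISOString(),
    sessionId,
    modelUsed: variant,  // 'fine-tuned' | 'gpt-4o' — which arm served this request
    input,
    output,
    latencyMs,
    costUsd,             // ~0 for the local model, per-token price for the cloud one
    feedback: null,      // filled in later by thumbs up/down or manual corrections
  };
}
```

The `modelUsed` field is what lets support trace a reported quality issue back to the arm that served it, and the `input`/`output` pairs double as candidate training data for the next retraining cycle.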

    Use the Same Temperature

    Set temperature = 0 (or a fixed low value) for both models during the test. Variable temperature introduces noise that makes comparison harder.

    Test the Full Pipeline

    Don't just test model output quality. Test the full pipeline: request → model → response parsing → downstream action. A model that produces correct output in a slightly different format can break your parser.
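A cheap way to catch that failure mode is to run every logged response through the same parser the pipeline uses and track the compliance rate. A sketch, assuming (as an example) that the expected output is a JSON object with a string `answer` field:

```javascript
// True when a raw model response parses into the structure downstream code
// expects — here, a JSON object with a string "answer" field.
function isFormatCompliant(raw) {
  try {
    const parsed = JSON.parse(raw);
    return typeof parsed === 'object' && parsed !== null && typeof parsed.answer === 'string';
  } catch {
    return false; // not valid JSON at all
  }
}

// Format-compliance rate over a batch of logged responses.
function complianceRate(responses) {
  if (responses.length === 0) return 1;
  return responses.filter(isFormatCompliant).length / responses.length;
}
```

Note that a response like `Sure! {"answer": "..."}` fails this check even though the JSON is "in there" — exactly the kind of near-miss that breaks parsers in production.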

    Have a Rollback Plan

    Keep the GPT-4 path operational even after full migration. If something goes wrong in production, you should be able to flip back to GPT-4 within minutes. This is a configuration change, not a deployment.

    Communicate With Your Team

    Let customer support, sales, and other teams know about the test. If users report quality differences, support needs to know which model served the response (check your logs).

    Getting Started

    1. Fine-tune your model on Ertas and validate offline with your eval dataset
    2. Deploy via Ollama alongside your existing OpenAI integration
    3. Implement a simple router (10% Ollama / 90% OpenAI)
    4. Add logging for all metrics (task completion, satisfaction, errors, latency, cost)
    5. Run Phase 1 for 2-4 weeks
    6. Analyze results, adjust training if needed, and progress through phases

    The A/B test removes the risk from migration. You're not guessing whether your fine-tuned model is good enough — you're measuring it. And in most cases, the data says: it's better AND cheaper.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
