
A/B Testing Your Fine-Tuned Model Against GPT-4 in Production
How to safely migrate from cloud AI APIs to a fine-tuned model by running a production A/B test. Covers routing architecture, what to measure, statistical significance, and the gradual migration path from 10% to 100%.
You've fine-tuned a model on your domain data. The eval results look promising. Offline tests show accuracy matching GPT-4 on your specific tasks. But deploying to production feels risky — what if it's worse for real users in ways your eval dataset didn't capture?
The answer isn't guessing. It's A/B testing. Route a percentage of production traffic to your fine-tuned model, measure outcomes, and migrate gradually as confidence builds.
The Routing Architecture
The simplest implementation: an API gateway that routes requests to either your cloud API or your local fine-tuned model based on a split percentage.
User Request
↓
API Gateway (router)
↓
┌─────────────────────┐
│ 10% → Ollama │ (fine-tuned model, local)
│ 90% → OpenAI API │ (GPT-4o, cloud)
└─────────────────────┘
↓
Log: model_used, input, output, metrics
↓
Response to User
Implementation options:
- Simple: Random assignment per request (Math.random() < 0.1 → Ollama, else → OpenAI)
- Better: Consistent assignment per user session (same user always gets the same model during a session, preventing inconsistent behavior)
- Best: Feature flag service (LaunchDarkly, PostHog, custom) that lets you adjust the split without deployment
Ollama exposes an OpenAI-compatible API, so the router's job is mostly switching the base URL and model name; the request format is otherwise identical.
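A minimal router sketch in TypeScript under those assumptions: consistent per-session assignment via a hash, the split read from configuration, and Ollama serving the fine-tuned model on its default local port (the model names and environment variables are placeholders):

```typescript
import { createHash } from "node:crypto";

// Split percentage read from config so changing it is a config change, not a deploy.
const OLLAMA_SPLIT = Number(process.env.OLLAMA_SPLIT ?? "0.10");

// Consistent assignment: hash the session ID so a user stays on one model
// for the whole session instead of bouncing between variants per request.
function useFineTuned(sessionId: string): boolean {
  const hash = createHash("sha256").update(sessionId).digest();
  const bucket = hash.readUInt32BE(0) / 0xffffffff; // map the hash into [0, 1]
  return bucket < OLLAMA_SPLIT;
}

// Both backends accept the OpenAI chat-completions request shape,
// so routing is a base URL + model name + API key switch.
async function route(sessionId: string, messages: object[]) {
  const fineTuned = useFineTuned(sessionId);
  const baseUrl = fineTuned
    ? "http://localhost:11434/v1"   // Ollama's OpenAI-compatible endpoint
    : "https://api.openai.com/v1";
  const model = fineTuned ? "my-finetuned-model" : "gpt-4o"; // placeholder model names
  const started = Date.now();

  const res = await fetch(`${baseUrl}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      ...(fineTuned ? {} : { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` }),
    },
    body: JSON.stringify({ model, messages, temperature: 0 }),
  });
  const data = await res.json();

  // Record which variant served the request, plus latency, for the A/B analysis.
  console.log(JSON.stringify({
    variant: fineTuned ? "fine-tuned" : "gpt-4o",
    latencyMs: Date.now() - started,
  }));
  return data;
}
```

Because the split fraction lives in configuration, moving between phases later (and rolling back in an emergency) is a one-line config change rather than a deployment.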
What to Measure
Primary Metrics (Business Outcomes)
These tell you whether the fine-tuned model produces the same user outcomes as GPT-4:
| Metric | What it measures | How to capture |
|---|---|---|
| Task completion rate | Did the user accomplish what they came to do? | Track downstream actions (e.g., user clicks "resolved" after support response) |
| User satisfaction | Is the user happy with the AI output? | Thumbs up/down on responses, NPS, or CSAT |
| Error rate | Did the model produce outputs requiring human correction? | Track manual overrides or corrections |
| Conversion impact | Does the AI feature drive the desired business action? | Track conversion events downstream of AI interaction |
Secondary Metrics (Model Quality)
These tell you about the model's technical performance:
| Metric | What it measures |
|---|---|
| Response latency | Time from request to complete response (local should be faster) |
| Format compliance | Percentage of responses matching expected output structure |
| Hallucination rate | Factual errors in responses (requires spot-check review) |
| Fallback rate | How often the model produces a non-response or error |
Cost Metrics
| Metric | GPT-4o | Fine-tuned local |
|---|---|---|
| Cost per request | $0.003-0.01 | ~$0 |
| Monthly cost (at test volume) | Track actual spend | Hardware cost (fixed) |
| Projected monthly cost at 100% | Extrapolate from test | Same fixed cost |
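As a rough worked example using the per-request figures above and the 500-queries-per-day volume assumed in the next section: 500 requests/day x 30 days x $0.003-0.01 per request is roughly $45-150/month for GPT-4o at test volume, and about ten times that if volume grows to 5,000 queries/day, while the local model's hardware cost stays flat.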
How Many Queries You Need
A/B tests require statistical significance. The number of queries needed depends on the difference you're trying to detect and your current metrics:
Rule of thumb: To detect a 5% difference in task completion rate with 95% confidence:
- If current rate is 80%: ~1,500 queries per variant
- If current rate is 90%: ~3,500 queries per variant
- If current rate is 95%: ~7,300 queries per variant
At 500 queries per day, a 10% split sends ~50 queries/day to the fine-tuned model. At that rate:
- Detecting a 5% difference: ~30-70 days
- Detecting a 10% difference: ~7-18 days
Practical guidance: Run the test for at least 2 weeks regardless of volume, to capture day-of-week and time-of-day patterns.
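If you'd rather plug in your own baseline and minimum detectable difference than lean on the rule of thumb above, a minimal sketch of the standard two-proportion sample-size formula looks like this (normal approximation; the defaults assume a two-sided 95% confidence level and 90% power, and the answer is sensitive to those choices):

```typescript
// Per-variant sample size to detect an absolute difference between two proportions,
// using the standard normal-approximation formula.
// zAlpha = 1.96 for a two-sided 95% confidence level;
// zBeta  = 1.28 for 90% power (use 0.84 for 80% power).
function sampleSizePerVariant(
  baselineRate: number,    // e.g. 0.80 task completion today
  detectableDiff: number,  // absolute difference to detect, e.g. 0.05
  zAlpha = 1.96,
  zBeta = 1.28,
): number {
  const p1 = baselineRate;
  const p2 = baselineRate - detectableDiff; // testing for a drop
  const pBar = (p1 + p2) / 2;
  const numerator =
    zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
    zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));
  return Math.ceil(numerator ** 2 / (p1 - p2) ** 2);
}

// Days to collect that many queries in the smaller arm of the split.
function daysToSignificance(queriesPerDay: number, splitFraction: number, n: number): number {
  return Math.ceil(n / (queriesPerDay * splitFraction));
}

console.log(sampleSizePerVariant(0.80, 0.05));     // ≈ 1,460 per variant
console.log(daysToSignificance(500, 0.10, 1460));  // ≈ 30 days at 50 queries/day
```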
The Gradual Migration Path
Phase 1: Validation (10% traffic, 2-4 weeks)
Route 10% of traffic to the fine-tuned model. Monitor all metrics. The goal isn't to prove the fine-tuned model is better — it's to prove it's not significantly worse.
Pass criteria:
- Task completion rate within 3% of GPT-4
- No increase in error rate
- No increase in user complaints
- Format compliance meets your threshold
If the fine-tuned model fails these criteria, don't abandon it. Investigate the failures, add training examples for the failure patterns, retrain, and test again.
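For the first criterion, a quick quantitative check is a two-proportion comparison over the logged outcomes. A sketch, reading the 3% margin as 3 percentage points (the counts in the usage example are made-up inputs, not real results):

```typescript
// Difference in task completion rate between variants, with a 95% confidence interval.
// If the interval's lower bound is above -0.03, the fine-tuned model is within
// the 3-percentage-point margin of GPT-4 at roughly 95% confidence.
function completionRateDiff(
  fineTunedCompleted: number, fineTunedTotal: number,
  gpt4Completed: number, gpt4Total: number,
) {
  const pFt = fineTunedCompleted / fineTunedTotal;
  const pGpt = gpt4Completed / gpt4Total;
  const diff = pFt - pGpt;
  const se = Math.sqrt(
    (pFt * (1 - pFt)) / fineTunedTotal + (pGpt * (1 - pGpt)) / gpt4Total,
  );
  return { diff, lower: diff - 1.96 * se, upper: diff + 1.96 * se };
}

// Example inputs: 1,180/1,450 completions in the fine-tuned arm vs 10,600/13,000 for GPT-4.
console.log(completionRateDiff(1180, 1450, 10600, 13000));
// => diff ≈ -0.002, lower ≈ -0.023 → within the 3-point margin
```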
Phase 2: Expansion (50% traffic, 2-4 weeks)
If Phase 1 passes, increase to 50%. At this volume, you'll see rarer edge cases surface. Monitor the same metrics.
This phase also tests infrastructure: can your local hardware handle 50% of production traffic? Are there latency issues under load? Does Ollama behave well under sustained request volume?
Phase 3: Majority (90% traffic, 1-2 weeks)
Flip the ratio: 90% fine-tuned, 10% GPT-4. The GPT-4 path serves as a control group and fallback. This is your "soft launch" — the fine-tuned model handles the vast majority of traffic, but you still have a safety net.
Phase 4: Full Migration (100% traffic)
Disable the GPT-4 path. All traffic goes through your fine-tuned model. Keep GPT-4 credentials active for emergency rollback.
Total migration timeline: 6-12 weeks from first test to full migration. This timeline feels slow, but the confidence you gain is worth the patience. Shipping a bad model to 100% of users is far more expensive than an extra month of testing.
Common A/B Test Results
Based on typical fine-tuning outcomes:
Result 1: Fine-Tuned Model Wins on Domain Tasks
The most common result. The fine-tuned model outperforms GPT-4 on your specific domain tasks — better accuracy, more consistent format, fewer hallucinations about your product.
Action: Migrate to fine-tuned model. The cost savings are a bonus on top of quality improvement.
Result 2: Fine-Tuned Model Matches GPT-4
The fine-tuned model produces equivalent outcomes. Metrics are within noise of each other.
Action: Migrate to fine-tuned model. Equal quality at dramatically lower cost is an easy decision.
Result 3: Fine-Tuned Model Loses on Specific Scenarios
The fine-tuned model handles 90% of cases well but struggles with a specific subset — unusual queries, complex multi-step reasoning, or scenarios underrepresented in training data.
Action: Two options:
- Add training examples for the failure scenarios and retrain
- Use a hybrid approach: fine-tuned model for the 90%, GPT-4 fallback for the 10% (still massive cost savings)
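A sketch of the second option: call the fine-tuned model first, validate the output, and fall back to GPT-4 only when validation or the local call fails (the validator, model names, and endpoints are placeholders for your own setup):

```typescript
// Hybrid routing: the fine-tuned model handles everything it can;
// GPT-4 is only called when the local output fails validation or errors out.
async function hybridComplete(
  messages: object[],
  isValid: (output: string) => boolean, // e.g. JSON-parses, matches schema, non-empty
): Promise<{ output: string; servedBy: "fine-tuned" | "gpt-4o" }> {
  try {
    const local = await chat("http://localhost:11434/v1", "my-finetuned-model", messages);
    if (isValid(local)) return { output: local, servedBy: "fine-tuned" };
  } catch {
    // Fall through to the cloud model on local errors or timeouts.
  }
  const cloud = await chat("https://api.openai.com/v1", "gpt-4o", messages);
  return { output: cloud, servedBy: "gpt-4o" };
}

// Minimal OpenAI-compatible chat call shared by both paths.
async function chat(baseUrl: string, model: string, messages: object[]): Promise<string> {
  const res = await fetch(`${baseUrl}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      ...(baseUrl.includes("openai.com")
        ? { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` }
        : {}),
    },
    body: JSON.stringify({ model, messages, temperature: 0 }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```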
Result 4: Fine-Tuned Model Is Significantly Worse
This is rare when you've done proper offline evaluation first. If it happens:
- Check your training data quality (data quality checklist)
- Verify quantization isn't causing issues (test at Q8_0 instead of Q4_K_M)
- Ensure the base model is appropriate for your task
- Consider whether your task genuinely needs frontier intelligence (when not to fine-tune)
Implementation Tips
Log Everything
Log every request and response for both variants. You'll need this data for:
- Debugging failures in the fine-tuned model
- Building training data for the next retraining cycle
- Proving to stakeholders that the migration was data-driven
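A minimal log record might look like the sketch below; the field names are illustrative, not a fixed schema:

```typescript
import { appendFileSync } from "node:fs";

// One record per request, written for both variants.
interface AbTestLogRecord {
  requestId: string;
  sessionId: string;
  variant: "fine-tuned" | "gpt-4o";
  input: string;
  output: string;
  latencyMs: number;
  formatValid: boolean;          // did the response parse into the expected structure?
  taskCompleted: boolean | null; // filled in later from the downstream action
  userFeedback: "up" | "down" | null;
  costUsd: number;               // ~0 for the local model, token-based for the API
  timestamp: string;
}

// Append-only JSONL is enough to start; move to a warehouse table when volume grows.
function logAbRecord(record: AbTestLogRecord): void {
  appendFileSync("ab-test-log.jsonl", JSON.stringify(record) + "\n");
}
```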
Use the Same Temperature
Set temperature = 0 (or a fixed low value) for both models during the test. Variable temperature introduces noise that makes comparison harder.
Test the Full Pipeline
Don't just test model output quality. Test the full pipeline: request → model → response parsing → downstream action. A model that produces correct output in a slightly different format can break your parser.
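One way to cover that, sketched here under the assumption that you already have a parseResponse function your downstream code uses (the names are placeholders):

```typescript
// Run the same canned prompts through both variants and make sure the downstream
// parser accepts every response, not just that the text "looks right".
async function checkPipeline(
  prompts: string[],
  complete: (variant: "fine-tuned" | "gpt-4o", prompt: string) => Promise<string>,
  parseResponse: (raw: string) => unknown, // your existing downstream parser
): Promise<void> {
  for (const variant of ["fine-tuned", "gpt-4o"] as const) {
    for (const prompt of prompts) {
      const raw = await complete(variant, prompt);
      try {
        parseResponse(raw); // throws if the output format drifted
      } catch (err) {
        console.error(`Parser failed for ${variant} on prompt: ${prompt}`, err);
      }
    }
  }
}
```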
Have a Rollback Plan
Keep the GPT-4 path operational even after full migration. If something goes wrong in production, you should be able to flip back to GPT-4 within minutes. This is a configuration change, not a deployment.
Communicate With Your Team
Let customer support, sales, and other teams know about the test. If users report quality differences, support needs to know which model served the response (check your logs).
Getting Started
- Fine-tune your model on Ertas and validate offline with your eval dataset
- Deploy via Ollama alongside your existing OpenAI integration
- Implement a simple router (10% Ollama / 90% OpenAI)
- Add logging for all metrics (task completion, satisfaction, errors, latency, cost)
- Run Phase 1 for 2-4 weeks
- Analyze results, adjust training if needed, and progress through phases
The A/B test removes the risk from migration. You're not guessing whether your fine-tuned model is good enough — you're measuring it. And in most cases, the data says: it's better AND cheaper.