
A/B Testing Your Fine-Tuned Model Against GPT-4 in Production
How to safely migrate from cloud AI APIs to a fine-tuned model by running a production A/B test. Covers routing architecture, what to measure, statistical significance, and the gradual migration path from 10% to 100%.
You've fine-tuned a model on your domain data. The eval results look promising. Offline tests show accuracy matching GPT-4 on your specific tasks. But deploying to production feels risky — what if it's worse for real users in ways your eval dataset didn't capture?
The answer isn't guessing. It's A/B testing. Route a percentage of production traffic to your fine-tuned model, measure outcomes, and migrate gradually as confidence builds.
The Routing Architecture
The simplest implementation: an API gateway that routes requests to either your cloud API or your local fine-tuned model based on a split percentage.
User Request
↓
API Gateway (router)
↓
┌─────────────────────┐
│ 10% → Ollama │ (fine-tuned model, local)
│ 90% → OpenAI API │ (GPT-4o, cloud)
└─────────────────────┘
↓
Log: model_used, input, output, metrics
↓
Response to User
Implementation options:
- Simple: Random assignment per request (Math.random() < 0.1 → Ollama, else → OpenAI)
- Better: Consistent assignment per user session (same user always gets the same model during a session, preventing inconsistent behavior)
- Best: Feature flag service (LaunchDarkly, PostHog, custom) that lets you adjust the split without deployment
Ollama exposes an OpenAI-compatible API, so the router's job is mostly switching the base URL and model name; the request format is otherwise identical.
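A minimal router sketch in TypeScript under those assumptions: consistent per-session assignment via a hash, the split read from configuration, and Ollama serving the fine-tuned model on its default local port (the model names and environment variables are placeholders):

```typescript
import { createHash } from "node:crypto";

// Split percentage read from config so changing it is a config change, not a deploy.
const OLLAMA_SPLIT = Number(process.env.OLLAMA_SPLIT ?? "0.10");

// Consistent assignment: hash the session ID so a user stays on one model
// for the whole session instead of bouncing between variants per request.
function useFineTuned(sessionId: string): boolean {
  const hash = createHash("sha256").update(sessionId).digest();
  const bucket = hash.readUInt32BE(0) / 0xffffffff; // map the hash into [0, 1]
  return bucket < OLLAMA_SPLIT;
}

// Both backends accept the OpenAI chat-completions request shape,
// so routing is a base URL + model name + API key switch.
async function route(sessionId: string, messages: object[]) {
  const fineTuned = useFineTuned(sessionId);
  const baseUrl = fineTuned
    ? "http://localhost:11434/v1"   // Ollama's OpenAI-compatible endpoint
    : "https://api.openai.com/v1";
  const model = fineTuned ? "my-finetuned-model" : "gpt-4o"; // placeholder model names
  const started = Date.now();

  const res = await fetch(`${baseUrl}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      ...(fineTuned ? {} : { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` }),
    },
    body: JSON.stringify({ model, messages, temperature: 0 }),
  });
  const data = await res.json();

  // Record which variant served the request, plus latency, for the A/B analysis.
  console.log(JSON.stringify({
    variant: fineTuned ? "fine-tuned" : "gpt-4o",
    latencyMs: Date.now() - started,
  }));
  return data;
}
```

Because the split fraction lives in configuration, moving between phases later (and rolling back in an emergency) is a one-line config change rather than a deployment.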
What to Measure
Primary Metrics (Business Outcomes)
These tell you whether the fine-tuned model produces the same user outcomes as GPT-4:
| Metric | What it measures | How to capture |
|---|---|---|
| Task completion rate | Did the user accomplish what they came to do? | Track downstream actions (e.g., user clicks "resolved" after support response) |
| User satisfaction | Is the user happy with the AI output? | Thumbs up/down on responses, NPS, or CSAT |
| Error rate | Did the model produce outputs requiring human correction? | Track manual overrides or corrections |
| Conversion impact | Does the AI feature drive the desired business action? | Track conversion events downstream of AI interaction |
Secondary Metrics (Model Quality)
These tell you about the model's technical performance:
| Metric | What it measures |
|---|---|
| Response latency | Time from request to complete response (local should be faster) |
| Format compliance | Percentage of responses matching expected output structure |
| Hallucination rate | Factual errors in responses (requires spot-check review) |
| Fallback rate | How often the model produces a non-response or error |
Cost Metrics
| Metric | GPT-4o | Fine-tuned local |
|---|---|---|
| Cost per request | $0.003-0.01 | ~$0 |
| Monthly cost (at test volume) | Track actual spend | Hardware cost (fixed) |
| Projected monthly cost at 100% | Extrapolate from test | Same fixed cost |
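As a rough worked example using the per-request figures above and the 500-queries-per-day volume assumed in the next section: 500 requests/day x 30 days x $0.003-0.01 per request is roughly $45-150/month for GPT-4o at test volume, and about ten times that if volume grows to 5,000 queries/day, while the local model's hardware cost stays flat.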
How Many Queries You Need
A/B tests require statistical significance. The number of queries needed depends on the difference you're trying to detect and your current metrics:
Rule of thumb: To detect a 5% difference in task completion rate with 95% confidence:
- If current rate is 80%: ~1,500 queries per variant
- If current rate is 90%: ~3,500 queries per variant
- If current rate is 95%: ~7,300 queries per variant
At 500 queries per day, a 10% split sends ~50 queries/day to the fine-tuned model. At that rate:
- Detecting a 5% difference: ~30-70 days
- Detecting a 10% difference: ~7-18 days
Practical guidance: Run the test for at least 2 weeks regardless of volume, to capture day-of-week and time-of-day patterns.
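If you'd rather plug in your own baseline and minimum detectable difference than lean on the rule of thumb above, a minimal sketch of the standard two-proportion sample-size formula looks like this (normal approximation; the defaults assume a two-sided 95% confidence level and 90% power, and the answer is sensitive to those choices):

```typescript
// Per-variant sample size to detect an absolute difference between two proportions,
// using the standard normal-approximation formula.
// zAlpha = 1.96 for a two-sided 95% confidence level;
// zBeta  = 1.28 for 90% power (use 0.84 for 80% power).
function sampleSizePerVariant(
  baselineRate: number,    // e.g. 0.80 task completion today
  detectableDiff: number,  // absolute difference to detect, e.g. 0.05
  zAlpha = 1.96,
  zBeta = 1.28,
): number {
  const p1 = baselineRate;
  const p2 = baselineRate - detectableDiff; // testing for a drop
  const pBar = (p1 + p2) / 2;
  const numerator =
    zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
    zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));
  return Math.ceil(numerator ** 2 / (p1 - p2) ** 2);
}

// Days to collect that many queries in the smaller arm of the split.
function daysToSignificance(queriesPerDay: number, splitFraction: number, n: number): number {
  return Math.ceil(n / (queriesPerDay * splitFraction));
}

console.log(sampleSizePerVariant(0.80, 0.05));     // ≈ 1,460 per variant
console.log(daysToSignificance(500, 0.10, 1460));  // ≈ 30 days at 50 queries/day
```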
The Gradual Migration Path
Phase 1: Validation (10% traffic, 2-4 weeks)
Route 10% of traffic to the fine-tuned model. Monitor all metrics. The goal isn't to prove the fine-tuned model is better — it's to prove it's not significantly worse.
Pass criteria:
- Task completion rate within 3% of GPT-4
- No increase in error rate
- No increase in user complaints
- Format compliance meets your threshold
If the fine-tuned model fails these criteria, don't abandon it. Investigate the failures, add training examples for the failure patterns, retrain, and test again.
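For the first criterion, a quick quantitative check is a two-proportion comparison over the logged outcomes. A sketch, reading the 3% margin as 3 percentage points (the counts in the usage example are made-up inputs, not real results):

```typescript
// Difference in task completion rate between variants, with a 95% confidence interval.
// If the interval's lower bound is above -0.03, the fine-tuned model is within
// the 3-percentage-point margin of GPT-4 at roughly 95% confidence.
function completionRateDiff(
  fineTunedCompleted: number, fineTunedTotal: number,
  gpt4Completed: number, gpt4Total: number,
) {
  const pFt = fineTunedCompleted / fineTunedTotal;
  const pGpt = gpt4Completed / gpt4Total;
  const diff = pFt - pGpt;
  const se = Math.sqrt(
    (pFt * (1 - pFt)) / fineTunedTotal + (pGpt * (1 - pGpt)) / gpt4Total,
  );
  return { diff, lower: diff - 1.96 * se, upper: diff + 1.96 * se };
}

// Example inputs: 1,180/1,450 completions in the fine-tuned arm vs 10,600/13,000 for GPT-4.
console.log(completionRateDiff(1180, 1450, 10600, 13000));
// => diff ≈ -0.002, lower ≈ -0.023 → within the 3-point margin
```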
Phase 2: Expansion (50% traffic, 2-4 weeks)
If Phase 1 passes, increase to 50%. At this volume, you'll see rarer edge cases surface. Monitor the same metrics.
This phase also tests infrastructure: can your local hardware handle 50% of production traffic? Are there latency issues under load? Does Ollama behave well under sustained request volume?
Phase 3: Majority (90% traffic, 1-2 weeks)
Flip the ratio: 90% fine-tuned, 10% GPT-4. The GPT-4 path serves as a control group and fallback. This is your "soft launch" — the fine-tuned model handles the vast majority of traffic, but you still have a safety net.
Phase 4: Full Migration (100% traffic)
Disable the GPT-4 path. All traffic goes through your fine-tuned model. Keep GPT-4 credentials active for emergency rollback.
Total migration timeline: 6-12 weeks from first test to full migration. This timeline feels slow, but the confidence you gain is worth the patience. Shipping a bad model to 100% of users is far more expensive than an extra month of testing.
Common A/B Test Results
Based on typical fine-tuning outcomes:
Result 1: Fine-Tuned Model Wins on Domain Tasks
The most common result. The fine-tuned model outperforms GPT-4 on your specific domain tasks — better accuracy, more consistent format, fewer hallucinations about your product.
Action: Migrate to fine-tuned model. The cost savings are a bonus on top of quality improvement.
Result 2: Fine-Tuned Model Matches GPT-4
The fine-tuned model produces equivalent outcomes. Metrics are within noise of each other.
Action: Migrate to fine-tuned model. Equal quality at dramatically lower cost is an easy decision.
Result 3: Fine-Tuned Model Loses on Specific Scenarios
The fine-tuned model handles 90% of cases well but struggles with a specific subset — unusual queries, complex multi-step reasoning, or scenarios underrepresented in training data.
Action: Two options:
- Add training examples for the failure scenarios and retrain
- Use a hybrid approach: fine-tuned model for the 90%, GPT-4 fallback for the 10% (still massive cost savings)
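A sketch of the second option: call the fine-tuned model first, validate the output, and fall back to GPT-4 only when validation or the local call fails (the validator, model names, and endpoints are placeholders for your own setup):

```typescript
// Hybrid routing: the fine-tuned model handles everything it can;
// GPT-4 is only called when the local output fails validation or errors out.
async function hybridComplete(
  messages: object[],
  isValid: (output: string) => boolean, // e.g. JSON-parses, matches schema, non-empty
): Promise<{ output: string; servedBy: "fine-tuned" | "gpt-4o" }> {
  try {
    const local = await chat("http://localhost:11434/v1", "my-finetuned-model", messages);
    if (isValid(local)) return { output: local, servedBy: "fine-tuned" };
  } catch {
    // Fall through to the cloud model on local errors or timeouts.
  }
  const cloud = await chat("https://api.openai.com/v1", "gpt-4o", messages);
  return { output: cloud, servedBy: "gpt-4o" };
}

// Minimal OpenAI-compatible chat call shared by both paths.
async function chat(baseUrl: string, model: string, messages: object[]): Promise<string> {
  const res = await fetch(`${baseUrl}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      ...(baseUrl.includes("openai.com")
        ? { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` }
        : {}),
    },
    body: JSON.stringify({ model, messages, temperature: 0 }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```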
Result 4: Fine-Tuned Model Is Significantly Worse
This is rare when you've done proper offline evaluation first. If it happens:
- Check your training data quality (data quality checklist)
- Verify quantization isn't causing issues (test at Q8_0 instead of Q4_K_M)
- Ensure the base model is appropriate for your task
- Consider whether your task genuinely needs frontier intelligence (when not to fine-tune)
Implementation Tips
Log Everything
Log every request and response for both variants. You'll need this data for:
- Debugging failures in the fine-tuned model
- Building training data for the next retraining cycle
- Proving to stakeholders that the migration was data-driven
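A minimal log record might look like the sketch below; the field names are illustrative, not a fixed schema:

```typescript
import { appendFileSync } from "node:fs";

// One record per request, written for both variants.
interface AbTestLogRecord {
  requestId: string;
  sessionId: string;
  variant: "fine-tuned" | "gpt-4o";
  input: string;
  output: string;
  latencyMs: number;
  formatValid: boolean;          // did the response parse into the expected structure?
  taskCompleted: boolean | null; // filled in later from the downstream action
  userFeedback: "up" | "down" | null;
  costUsd: number;               // ~0 for the local model, token-based for the API
  timestamp: string;
}

// Append-only JSONL is enough to start; move to a warehouse table when volume grows.
function logAbRecord(record: AbTestLogRecord): void {
  appendFileSync("ab-test-log.jsonl", JSON.stringify(record) + "\n");
}
```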
Use the Same Temperature
Set temperature = 0 (or a fixed low value) for both models during the test. Variable temperature introduces noise that makes comparison harder.
Test the Full Pipeline
Don't just test model output quality. Test the full pipeline: request → model → response parsing → downstream action. A model that produces correct output in a slightly different format can break your parser.
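One way to cover that, sketched here under the assumption that you already have a parseResponse function your downstream code uses (the names are placeholders):

```typescript
// Run the same canned prompts through both variants and make sure the downstream
// parser accepts every response, not just that the text "looks right".
async function checkPipeline(
  prompts: string[],
  complete: (variant: "fine-tuned" | "gpt-4o", prompt: string) => Promise<string>,
  parseResponse: (raw: string) => unknown, // your existing downstream parser
): Promise<void> {
  for (const variant of ["fine-tuned", "gpt-4o"] as const) {
    for (const prompt of prompts) {
      const raw = await complete(variant, prompt);
      try {
        parseResponse(raw); // throws if the output format drifted
      } catch (err) {
        console.error(`Parser failed for ${variant} on prompt: ${prompt}`, err);
      }
    }
  }
}
```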
Have a Rollback Plan
Keep the GPT-4 path operational even after full migration. If something goes wrong in production, you should be able to flip back to GPT-4 within minutes. This is a configuration change, not a deployment.
Communicate With Your Team
Let customer support, sales, and other teams know about the test. If users report quality differences, support needs to know which model served the response (check your logs).
Getting Started
- Fine-tune your model on Ertas and validate offline with your eval dataset
- Deploy via Ollama alongside your existing OpenAI integration
- Implement a simple router (10% Ollama / 90% OpenAI)
- Add logging for all metrics (task completion, satisfaction, errors, latency, cost)
- Run Phase 1 for 2-4 weeks
- Analyze results, adjust training if needed, and progress through phases
The A/B test removes the risk from migration. You're not guessing whether your fine-tuned model is good enough — you're measuring it. And in most cases, the data says: it's better AND cheaper.