
Side-by-Side Model Comparison: How to Pick the Best Fine-Tuned Model Before Deploying
You fine-tuned three model variants. Which one ships to production? Automated metrics aren't enough — here's a systematic approach to comparing fine-tuned models side-by-side, with scoring rubrics and decision frameworks.
You've fine-tuned three model variants: one with 200 training examples, one with 500, and one with a different base model. The training loss curves look similar. Perplexity scores are within 5% of each other. Which one ships?
Automated metrics tell part of the story. They don't tell the whole story. A model with 0.3% lower perplexity might hallucinate more. A model with higher training loss might produce more natural-sounding responses. A model that scores well on your benchmark might fail on the exact edge cases your users encounter most.
The answer: run all three side-by-side on the same prompts and compare outputs systematically.
Why Automated Metrics Aren't Enough
Perplexity Measures Surprise, Not Quality
Perplexity measures how "surprised" the model is by the test data. Lower perplexity generally means the model predicts the training distribution better. But a model that memorizes training data has great perplexity and terrible generalization.
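To make the "surprise" framing concrete, here is a minimal sketch of how perplexity falls out of per-token log-probabilities. The `perplexity` helper is illustrative, not from any particular library:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the average negative log-likelihood per token.
    Lower means the model was less 'surprised' by the text."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns probability 0.5 to every token (logprob = ln 0.5)
# has perplexity 2 -- exactly as 'surprised' as a fair coin flip.
uniform = [math.log(0.5)] * 10
print(perplexity(uniform))
```

Note what the number cannot capture: a model that memorized the eval text would score near 1.0 while generalizing terribly.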
Loss Curves Show Training Progress, Not Production Fitness
A smooth, converging loss curve means training went well. It doesn't mean the model handles your real-world inputs correctly. Overfitting shows up as great training metrics and poor production performance.
BLEU/ROUGE Measure Overlap, Not Correctness
These metrics compare generated text to reference text. They reward word overlap, not factual accuracy or task completion. A model that uses different (but correct) phrasing scores poorly. A model that copies training data verbatim scores well.
Domain-Specific Quality Is Invisible to Generic Metrics
If your model needs to use specific terminology, follow specific formats, or handle domain-specific edge cases, no generic metric captures this. Only domain-aware evaluation — which requires looking at actual outputs — tells you whether the model meets your production requirements.
The Side-by-Side Method
Step 1: Build Your Eval Dataset
If you don't have one yet, build an evaluation dataset of 50-100 representative prompts. Include:
- Common cases (60%): The bread-and-butter queries your model handles daily
- Edge cases (20%): Unusual inputs, ambiguous requests, boundary conditions
- Failure-prone cases (10%): Scenarios where previous models have failed
- Adversarial cases (10%): Deliberately tricky inputs designed to expose weaknesses
For each prompt, write the expected output — or at minimum, describe what a correct output looks like. See our guide on building eval datasets from conversations.
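A lightweight way to keep the 60/20/10/10 mix honest is to tag each prompt with a category and check the ratios programmatically. The entry format below is a hypothetical sketch, not a required schema:

```python
from collections import Counter

# Hypothetical eval-set format: each entry carries a prompt, a category tag,
# and a description of what a correct output looks like.
eval_set = [
    {"prompt": "Summarize this refund policy...", "category": "common",
     "expected": "2-3 sentence summary using only terms from the source"},
    {"prompt": "Extract fields from an empty email", "category": "edge",
     "expected": "Empty JSON object, no invented fields"},
    # ... 50-100 entries in practice
]

TARGET_MIX = {"common": 0.60, "edge": 0.20,
              "failure_prone": 0.10, "adversarial": 0.10}

def check_mix(entries):
    """Return {category: (actual_share, target_share)} for a quick audit."""
    counts = Counter(e["category"] for e in entries)
    total = len(entries)
    return {cat: (counts.get(cat, 0) / total, target)
            for cat, target in TARGET_MIX.items()}
```

Running `check_mix` before each evaluation round catches the common drift where new prompts are all easy "common" cases.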
Step 2: Run All Variants on the Same Prompts
Feed every prompt in your eval dataset through every model variant. Capture all outputs. This must happen at the same quantization level and with the same inference parameters (temperature, top_p, etc.) — otherwise you're comparing configurations, not models.
Ertas's canvas interface supports running multiple models simultaneously on the same prompt set, displaying outputs side-by-side for direct comparison.
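In script form, Step 2 is a simple prompt-by-model matrix. The `generate` function below is a stand-in for your actual inference call (e.g. an OpenAI-compatible local server); the point is that every variant sees identical prompts and identical sampling parameters:

```python
# Fixed sampling parameters shared by ALL variants -- change these per model
# and you are comparing configurations, not models.
PARAMS = {"temperature": 0.2, "top_p": 0.9, "max_tokens": 512}

def generate(model: str, prompt: str, **params) -> str:
    # Stand-in for a real inference call; replace with your runtime's API.
    return f"[{model}] response to: {prompt}"

def run_matrix(models: list[str], prompts: list[str]) -> dict:
    """Return {(model, prompt): output} with identical params everywhere."""
    return {(m, p): generate(m, p, **PARAMS) for m in models for p in prompts}

outputs = run_matrix(["model-a", "model-b", "model-c"],
                     ["What is your refund policy?"])
```

Persist the full matrix (prompt, model, params, output) so scores can be audited later.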
Step 3: Score Each Output
For each prompt × model combination, score on these dimensions:
| Dimension | What to check | Score |
|---|---|---|
| Accuracy | Is the factual content correct? | 1-5 |
| Completeness | Does it cover all aspects of the query? | 1-5 |
| Format compliance | Does it follow the expected output structure? | 1-5 |
| Tone/style | Does it match your brand voice or domain conventions? | 1-5 |
| Hallucination | Does it invent facts, cite nonexistent sources, or make up data? | Binary (0/1) |
| Edge case handling | Does it handle the tricky cases correctly? | 1-5 |
Step 4: Aggregate and Decide
Calculate average scores per dimension per model:
| Dimension | Model A (200 examples) | Model B (500 examples) | Model C (Qwen base) |
|---|---|---|---|
| Accuracy | 4.1 | 4.3 | 4.0 |
| Completeness | 3.8 | 4.2 | 4.4 |
| Format compliance | 4.5 | 4.6 | 3.9 |
| Tone/style | 3.5 | 4.0 | 3.7 |
| Hallucination rate | 4% | 2% | 6% |
| Edge case handling | 3.2 | 3.8 | 3.5 |
| Weighted total | 3.85 | 4.18 | 3.90 |
In this example, Model B (500 examples, same base) wins across most dimensions. But the decision isn't always this clear.
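The weighted total row comes from a computation like the sketch below. The weights here are hypothetical placeholders (not the ones behind the table above), and mapping a hallucination rate onto the 1–5 scale as `5 * (1 - rate)` is one assumption among several reasonable ones:

```python
# Hypothetical weights -- tune to your use case; they must sum to 1.
WEIGHTS = {"accuracy": 0.30, "completeness": 0.15, "format_compliance": 0.20,
           "tone": 0.10, "no_hallucination": 0.15, "edge_case": 0.10}

def weighted_total(avg_scores: dict[str, float]) -> float:
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(avg_scores[dim] * w for dim, w in WEIGHTS.items())

# Model B's per-dimension averages from the table; its 2% hallucination
# rate is mapped to the 1-5 scale as 5 * (1 - 0.02) = 4.9.
model_b = {"accuracy": 4.3, "completeness": 4.2, "format_compliance": 4.6,
           "tone": 4.0, "no_hallucination": 4.9, "edge_case": 3.8}
print(weighted_total(model_b))
```

Record the weights alongside the scores: a weighted total is meaningless for future comparisons if you can't reproduce the weighting.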
When the "Worse" Model Is Actually Better
Sometimes the model with lower aggregate scores is the right production choice:
Model A has the best format compliance but the worst tone. If your use case is structured data extraction (JSON output), format compliance matters more than tone. Pick Model A.
Model C hallucinates more but handles edge cases better. For customer-facing Q&A where a wrong answer is worse than no answer, the lower hallucination rate wins. But for an internal tool whose outputs a human reviews before use, Model C's stronger edge case handling can outweigh its higher hallucination rate.
Model B scores best overall but is 2x the adapter size. If you're deploying to edge hardware with tight memory constraints, Model A's smaller adapter might be the practical choice despite lower scores.
The scoring framework surfaces the tradeoffs. The decision depends on your specific priorities.
Comparison Workflow for Agencies
If you're delivering fine-tuned models to clients (QA before delivery), the side-by-side comparison doubles as client-facing quality evidence:
- Train 2-3 variants with different configurations
- Run the comparison using the client's own example queries
- Present the results — show the client actual outputs from each variant
- Let the client choose which variant best fits their needs
- Document the selection for model versioning
This transparency builds trust. Clients see that you tested multiple approaches and selected the best one based on evidence, not guesswork.
Tips for Effective Comparison
Use Real Production Queries
Don't just use your training data for evaluation (that's testing on the training set). Use queries from actual production usage, customer emails, or realistic scenarios your users would actually type.
Test at Production Quantization
If you'll deploy at Q4_K_M, evaluate at Q4_K_M. Quantization can affect output quality differently across models — a model that edges out another at F16 might lose at Q4.
Include "No Good Answer" Prompts
Include prompts where the correct response is "I don't know" or "I need more information." Models that always generate an answer (even when they shouldn't) are dangerous in production. The best model knows its limits.
Don't Rely on a Single Evaluation Run
LLM outputs have variance, especially at non-zero temperatures. Run each prompt 2-3 times and score the average. If a model produces great output 2 out of 3 times and terrible output once, that inconsistency matters.
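A small helper makes the inconsistency visible instead of letting averaging hide it. This is a sketch; the `max_spread` threshold is an arbitrary assumption to tune:

```python
import statistics

def stability(run_scores: list[float], max_spread: float = 1.0) -> dict:
    """Average repeated-run scores for one prompt and flag inconsistency.
    A model scoring 5, 5, 1 across three runs averages ~3.7 but is far
    riskier than one scoring 4, 4, 3."""
    spread = max(run_scores) - min(run_scores)
    return {
        "mean": statistics.mean(run_scores),
        "spread": spread,
        "inconsistent": spread > max_spread,
    }
```

Treat flagged prompts as failures to investigate, not as mediocre averages to rank.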
Weight Dimensions by Business Impact
Not all dimensions matter equally. For a support chatbot, accuracy and tone matter most. For a data extraction pipeline, format compliance and hallucination rate matter most. Weight your scoring accordingly.
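The same raw scores can crown different winners once weights reflect business impact. The sketch below reuses Model A's and Model C's averages from the comparison table (hallucination rates mapped to the 1–5 scale as `5 * (1 - rate)`); both weight profiles are hypothetical:

```python
# Per-dimension averages for Models A and C from the comparison table.
model_a = {"accuracy": 4.1, "completeness": 3.8, "format_compliance": 4.5,
           "tone": 3.5, "no_hallucination": 4.8, "edge_case": 3.2}
model_c = {"accuracy": 4.0, "completeness": 4.4, "format_compliance": 3.9,
           "tone": 3.7, "no_hallucination": 4.7, "edge_case": 3.5}

# Hypothetical weight profiles for the two use cases above.
SUPPORT_CHATBOT = {"accuracy": 0.35, "tone": 0.25, "completeness": 0.15,
                   "format_compliance": 0.05, "no_hallucination": 0.15,
                   "edge_case": 0.05}
EXTRACTION_PIPELINE = {"accuracy": 0.15, "tone": 0.0, "completeness": 0.10,
                       "format_compliance": 0.40, "no_hallucination": 0.30,
                       "edge_case": 0.05}

def score(avgs: dict[str, float], weights: dict[str, float]) -> float:
    return sum(avgs[d] * w for d, w in weights.items())

# With these weights, Model C wins the chatbot profile (tone and edge cases
# count), while Model A wins the extraction profile (format compliance counts).
```

The ranking flip is the whole point: without explicit weights, "best model" is an unanswerable question.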
Getting Started
- Build an eval dataset of 50-100 prompts (guide here)
- Fine-tune 2-3 model variants on Ertas (different data sizes, different base models, or different hyperparameters)
- Run all variants through the eval dataset on the canvas
- Score outputs on accuracy, completeness, format, tone, hallucination, and edge cases
- Aggregate scores and make a decision based on your priorities
- Deploy the winner and save the eval dataset for future retraining comparisons
The eval dataset and scoring rubric you build now become permanent assets. Every time you retrain, you compare the new model against the same benchmark. Over time, you build a clear picture of model improvement — and you never ship a regression.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Keep reading

How to Evaluate Your Fine-Tuned Model: A Non-Technical Guide
Practical framework for evaluating fine-tuned model quality without ML expertise — covering accuracy checks, output consistency, edge case testing, and production readiness for agencies and product teams.

Fine-Tuned Model Ops: The Complete Lifecycle Guide
The full lifecycle of fine-tuned models in production — from data preparation through deployment, monitoring, and retraining. Stage-by-stage breakdown with time estimates, maturity levels, and failure modes.

Building Reliable AI Agents with Fine-Tuned Local Models: Complete Guide
Most AI agents are just GPT-4 wrappers — expensive, unreliable at scale, and dependent on cloud APIs. Fine-tuned local models hit 98%+ accuracy on your specific tools at zero per-query cost. Here's the complete architecture.