
Side-by-Side Model Comparison: How to Pick the Best Fine-Tuned Model Before Deploying
You fine-tuned three model variants. Which one ships to production? Automated metrics aren't enough — here's a systematic approach to comparing fine-tuned models side-by-side, with scoring rubrics and decision frameworks.
You've fine-tuned three model variants: one with 200 training examples, one with 500, and one with a different base model. The training loss curves look similar. Perplexity scores are within 5% of each other. Which one ships?
Automated metrics tell part of the story. They don't tell the whole story. A model with 0.3% lower perplexity might hallucinate more. A model with higher training loss might produce more natural-sounding responses. A model that scores well on your benchmark might fail on the exact edge cases your users encounter most.
The answer: run all three side-by-side on the same prompts and compare outputs systematically.
Why Automated Metrics Aren't Enough
Perplexity Measures Surprise, Not Quality
Perplexity measures how "surprised" the model is by the test data. Lower perplexity generally means the model predicts the training distribution better. But a model that memorizes training data has great perplexity and terrible generalization.
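To make the "surprise" framing concrete, here is a minimal sketch of how perplexity falls out of per-token log-probabilities. The `perplexity` helper is illustrative, not from any particular library:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the average negative log-likelihood per token.
    Lower means the model was less 'surprised' by the text."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns probability 0.5 to every token (logprob = ln 0.5)
# has perplexity 2 -- exactly as 'surprised' as a fair coin flip.
uniform = [math.log(0.5)] * 10
print(perplexity(uniform))
```

Note what the number cannot capture: a model that memorized the eval text would score near 1.0 while generalizing terribly.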
Loss Curves Show Training Progress, Not Production Fitness
A smooth, converging loss curve means training went well. It doesn't mean the model handles your real-world inputs correctly. Overfitting shows up as great training metrics and poor production performance.
BLEU/ROUGE Measure Overlap, Not Correctness
These metrics compare generated text to reference text. They reward word overlap, not factual accuracy or task completion. A model that uses different (but correct) phrasing scores poorly. A model that copies training data verbatim scores well.
Domain-Specific Quality Is Invisible to Generic Metrics
If your model needs to use specific terminology, follow specific formats, or handle domain-specific edge cases, no generic metric captures this. Only domain-aware evaluation — which requires looking at actual outputs — tells you whether the model meets your production requirements.
The Side-by-Side Method
Step 1: Build Your Eval Dataset
If you don't have one yet, build an evaluation dataset of 50-100 representative prompts. Include:
- Common cases (60%): The bread-and-butter queries your model handles daily
- Edge cases (20%): Unusual inputs, ambiguous requests, boundary conditions
- Failure-prone cases (10%): Scenarios where previous models have failed
- Adversarial cases (10%): Deliberately tricky inputs designed to expose weaknesses
For each prompt, write the expected output — or at minimum, describe what a correct output looks like. See our guide on building eval datasets from conversations.
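A lightweight way to keep the 60/20/10/10 mix honest is to tag each prompt with a category and check the ratios programmatically. The entry format below is a hypothetical sketch, not a required schema:

```python
from collections import Counter

# Hypothetical eval-set format: each entry carries a prompt, a category tag,
# and a description of what a correct output looks like.
eval_set = [
    {"prompt": "Summarize this refund policy...", "category": "common",
     "expected": "2-3 sentence summary using only terms from the source"},
    {"prompt": "Extract fields from an empty email", "category": "edge",
     "expected": "Empty JSON object, no invented fields"},
    # ... 50-100 entries in practice
]

TARGET_MIX = {"common": 0.60, "edge": 0.20,
              "failure_prone": 0.10, "adversarial": 0.10}

def check_mix(entries):
    """Return {category: (actual_share, target_share)} for a quick audit."""
    counts = Counter(e["category"] for e in entries)
    total = len(entries)
    return {cat: (counts.get(cat, 0) / total, target)
            for cat, target in TARGET_MIX.items()}
```

Running `check_mix` before each evaluation round catches the common drift where new prompts are all easy "common" cases.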
Step 2: Run All Variants on the Same Prompts
Feed every prompt in your eval dataset through every model variant. Capture all outputs. This must happen at the same quantization level and with the same inference parameters (temperature, top_p, etc.) — otherwise you're comparing configurations, not models.
Ertas's canvas interface supports running multiple models simultaneously on the same prompt set, displaying outputs side-by-side for direct comparison.
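In script form, Step 2 is a simple prompt-by-model matrix. The `generate` function below is a stand-in for your actual inference call (e.g. an OpenAI-compatible local server); the point is that every variant sees identical prompts and identical sampling parameters:

```python
# Fixed sampling parameters shared by ALL variants -- change these per model
# and you are comparing configurations, not models.
PARAMS = {"temperature": 0.2, "top_p": 0.9, "max_tokens": 512}

def generate(model: str, prompt: str, **params) -> str:
    # Stand-in for a real inference call; replace with your runtime's API.
    return f"[{model}] response to: {prompt}"

def run_matrix(models: list[str], prompts: list[str]) -> dict:
    """Return {(model, prompt): output} with identical params everywhere."""
    return {(m, p): generate(m, p, **PARAMS) for m in models for p in prompts}

outputs = run_matrix(["model-a", "model-b", "model-c"],
                     ["What is your refund policy?"])
```

Persist the full matrix (prompt, model, params, output) so scores can be audited later.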
Step 3: Score Each Output
For each prompt × model combination, score on these dimensions:
| Dimension | What to check | Score |
|---|---|---|
| Accuracy | Is the factual content correct? | 1-5 |
| Completeness | Does it cover all aspects of the query? | 1-5 |
| Format compliance | Does it follow the expected output structure? | 1-5 |
| Tone/style | Does it match your brand voice or domain conventions? | 1-5 |
| Hallucination | Does it invent facts, cite nonexistent sources, or make up data? | Binary (0/1) |
| Edge case handling | Does it handle the tricky cases correctly? | 1-5 |
Step 4: Aggregate and Decide
Calculate average scores per dimension per model:
| Dimension | Model A (200 examples) | Model B (500 examples) | Model C (Qwen base) |
|---|---|---|---|
| Accuracy | 4.1 | 4.3 | 4.0 |
| Completeness | 3.8 | 4.2 | 4.4 |
| Format compliance | 4.5 | 4.6 | 3.9 |
| Tone/style | 3.5 | 4.0 | 3.7 |
| Hallucination rate | 4% | 2% | 6% |
| Edge case handling | 3.2 | 3.8 | 3.5 |
| Weighted total | 3.85 | 4.18 | 3.90 |
In this example, Model B (500 examples, same base) wins across most dimensions. But the decision isn't always this clear.
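The weighted total row comes from a computation like the sketch below. The weights here are hypothetical placeholders (not the ones behind the table above), and mapping a hallucination rate onto the 1–5 scale as `5 * (1 - rate)` is one assumption among several reasonable ones:

```python
# Hypothetical weights -- tune to your use case; they must sum to 1.
WEIGHTS = {"accuracy": 0.30, "completeness": 0.15, "format_compliance": 0.20,
           "tone": 0.10, "no_hallucination": 0.15, "edge_case": 0.10}

def weighted_total(avg_scores: dict[str, float]) -> float:
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(avg_scores[dim] * w for dim, w in WEIGHTS.items())

# Model B's per-dimension averages from the table; its 2% hallucination
# rate is mapped to the 1-5 scale as 5 * (1 - 0.02) = 4.9.
model_b = {"accuracy": 4.3, "completeness": 4.2, "format_compliance": 4.6,
           "tone": 4.0, "no_hallucination": 4.9, "edge_case": 3.8}
print(weighted_total(model_b))
```

Record the weights alongside the scores: a weighted total is meaningless for future comparisons if you can't reproduce the weighting.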
When the "Worse" Model Is Actually Better
Sometimes the model with lower aggregate scores is the right production choice:
Model A has the best format compliance but the worst tone. If your use case is structured data extraction (JSON output), format compliance matters more than tone. Pick Model A.
Model C hallucinates more but handles edge cases better. For customer-facing Q&A where a wrong answer is worse than no answer, the lower hallucination rate wins. But for an internal tool whose outputs a human reviews before use, Model C's stronger edge case handling can outweigh its higher hallucination rate.
Model B scores best overall but is 2x the adapter size. If you're deploying to edge hardware with tight memory constraints, Model A's smaller adapter might be the practical choice despite lower scores.
The scoring framework surfaces the tradeoffs. The decision depends on your specific priorities.
Comparison Workflow for Agencies
If you're delivering fine-tuned models to clients (QA before delivery), the side-by-side comparison doubles as client-facing quality evidence:
- Train 2-3 variants with different configurations
- Run the comparison using the client's own example queries
- Present the results — show the client actual outputs from each variant
- Let the client choose which variant best fits their needs
- Document the selection for model versioning
This transparency builds trust. Clients see that you tested multiple approaches and selected the best one based on evidence, not guesswork.
Tips for Effective Comparison
Use Real Production Queries
Don't just use your training data for evaluation (that's testing on the training set). Use queries from actual production usage, customer emails, or realistic scenarios your users would actually type.
Test at Production Quantization
If you'll deploy at Q4_K_M, evaluate at Q4_K_M. Quantization can affect output quality differently across models — a model that edges out another at F16 might lose at Q4.
Include "No Good Answer" Prompts
Include prompts where the correct response is "I don't know" or "I need more information." Models that always generate an answer (even when they shouldn't) are dangerous in production. The best model knows its limits.
Don't Rely on a Single Evaluation Run
LLM outputs have variance, especially at non-zero temperatures. Run each prompt 2-3 times and score the average. If a model produces great output 2 out of 3 times and terrible output once, that inconsistency matters.
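A small helper makes the inconsistency visible instead of letting averaging hide it. This is a sketch; the `max_spread` threshold is an arbitrary assumption to tune:

```python
import statistics

def stability(run_scores: list[float], max_spread: float = 1.0) -> dict:
    """Average repeated-run scores for one prompt and flag inconsistency.
    A model scoring 5, 5, 1 across three runs averages ~3.7 but is far
    riskier than one scoring 4, 4, 3."""
    spread = max(run_scores) - min(run_scores)
    return {
        "mean": statistics.mean(run_scores),
        "spread": spread,
        "inconsistent": spread > max_spread,
    }
```

Treat flagged prompts as failures to investigate, not as mediocre averages to rank.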
Weight Dimensions by Business Impact
Not all dimensions matter equally. For a support chatbot, accuracy and tone matter most. For a data extraction pipeline, format compliance and hallucination rate matter most. Weight your scoring accordingly.
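The same raw scores can crown different winners once weights reflect business impact. The sketch below reuses Model A's and Model C's averages from the comparison table (hallucination rates mapped to the 1–5 scale as `5 * (1 - rate)`); both weight profiles are hypothetical:

```python
# Per-dimension averages for Models A and C from the comparison table.
model_a = {"accuracy": 4.1, "completeness": 3.8, "format_compliance": 4.5,
           "tone": 3.5, "no_hallucination": 4.8, "edge_case": 3.2}
model_c = {"accuracy": 4.0, "completeness": 4.4, "format_compliance": 3.9,
           "tone": 3.7, "no_hallucination": 4.7, "edge_case": 3.5}

# Hypothetical weight profiles for the two use cases above.
SUPPORT_CHATBOT = {"accuracy": 0.35, "tone": 0.25, "completeness": 0.15,
                   "format_compliance": 0.05, "no_hallucination": 0.15,
                   "edge_case": 0.05}
EXTRACTION_PIPELINE = {"accuracy": 0.15, "tone": 0.0, "completeness": 0.10,
                       "format_compliance": 0.40, "no_hallucination": 0.30,
                       "edge_case": 0.05}

def score(avgs: dict[str, float], weights: dict[str, float]) -> float:
    return sum(avgs[d] * w for d, w in weights.items())

# With these weights, Model C wins the chatbot profile (tone and edge cases
# count), while Model A wins the extraction profile (format compliance counts).
```

The ranking flip is the whole point: without explicit weights, "best model" is an unanswerable question.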
Getting Started
- Build an eval dataset of 50-100 prompts (guide here)
- Fine-tune 2-3 model variants on Ertas (different data sizes, different base models, or different hyperparameters)
- Run all variants through the eval dataset on the canvas
- Score outputs on accuracy, completeness, format, tone, hallucination, and edge cases
- Aggregate scores and make a decision based on your priorities
- Deploy the winner and save the eval dataset for future retraining comparisons
The eval dataset and scoring rubric you build now become permanent assets. Every time you retrain, you compare the new model against the same benchmark. Over time, you build a clear picture of model improvement — and you never ship a regression.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Keep reading

How to Evaluate Your Fine-Tuned Model: A Non-Technical Guide
Practical framework for evaluating fine-tuned model quality without ML expertise — covering accuracy checks, output consistency, edge case testing, and production readiness for agencies and product teams.

Fine-Tuned Model Ops: The Complete Lifecycle Guide
The full lifecycle of fine-tuned models in production — from data preparation through deployment, monitoring, and retraining. Stage-by-stage breakdown with time estimates, maturity levels, and failure modes.

Building Reliable AI Agents with Fine-Tuned Local Models: Complete Guide
Most AI agents are just GPT-4 wrappers — expensive, unreliable at scale, and dependent on cloud APIs. Fine-tuned local models hit 98%+ accuracy on your specific tools at zero per-query cost. Here's the complete architecture.