Back to blog
    Side-by-Side Model Comparison: How to Pick the Best Fine-Tuned Model Before Deploying
    evaluationmodel-comparisonfine-tuningquality-assurancedeploymentcanvas

    Side-by-Side Model Comparison: How to Pick the Best Fine-Tuned Model Before Deploying

    You fine-tuned three model variants. Which one ships to production? Automated metrics aren't enough — here's a systematic approach to comparing fine-tuned models side-by-side, with scoring rubrics and decision frameworks.

    EErtas Team·

    You've fine-tuned three model variants: one with 200 training examples, one with 500, and one with a different base model. The training loss curves look similar. Perplexity scores are within 5% of each other. Which one ships?

    Automated metrics tell part of the story. They don't tell the whole story. A model with 0.3% lower perplexity might hallucinate more. A model with higher training loss might produce more natural-sounding responses. A model that scores well on your benchmark might fail on the exact edge cases your users encounter most.

    The answer: run all three side-by-side on the same prompts and compare outputs systematically.

    Why Automated Metrics Aren't Enough

    Perplexity Measures Surprise, Not Quality

    Perplexity measures how "surprised" the model is by the test data. Lower perplexity generally means the model predicts the training distribution better. But a model that memorizes training data has great perplexity and terrible generalization.

    Loss Curves Show Training Progress, Not Production Fitness

    A smooth, converging loss curve means training went well. It doesn't mean the model handles your real-world inputs correctly. Overfitting shows up as great training metrics and poor production performance.

    BLEU/ROUGE Measure Overlap, Not Correctness

    These metrics compare generated text to reference text. They reward word overlap, not factual accuracy or task completion. A model that uses different (but correct) phrasing scores poorly. A model that copies training data verbatim scores well.

    Domain-Specific Quality Is Invisible to Generic Metrics

    If your model needs to use specific terminology, follow specific formats, or handle domain-specific edge cases, no generic metric captures this. Only domain-aware evaluation — which requires looking at actual outputs — tells you whether the model meets your production requirements.

    The Side-by-Side Method

    Step 1: Build Your Eval Dataset

    If you don't have one yet, build an evaluation dataset of 50-100 representative prompts. Include:

    • Common cases (60%): The bread-and-butter queries your model handles daily
    • Edge cases (20%): Unusual inputs, ambiguous requests, boundary conditions
    • Failure-prone cases (10%): Scenarios where previous models have failed
    • Adversarial cases (10%): Deliberately tricky inputs designed to expose weaknesses

    For each prompt, write the expected output — or at minimum, describe what a correct output looks like. See our guide on building eval datasets from conversations.

    Step 2: Run All Variants on the Same Prompts

    Feed every prompt in your eval dataset through every model variant. Capture all outputs. This must happen at the same quantization level and with the same inference parameters (temperature, top_p, etc.) — otherwise you're comparing configurations, not models.

    Ertas's canvas interface supports running multiple models simultaneously on the same prompt set, displaying outputs side-by-side for direct comparison.

    Step 3: Score Each Output

    For each prompt × model combination, score on these dimensions:

    DimensionWhat to checkScore
    AccuracyIs the factual content correct?1-5
    CompletenessDoes it cover all aspects of the query?1-5
    Format complianceDoes it follow the expected output structure?1-5
    Tone/styleDoes it match your brand voice or domain conventions?1-5
    HallucinationDoes it invent facts, cite nonexistent sources, or make up data?Binary (0/1)
    Edge case handlingDoes it handle the tricky cases correctly?1-5

    Step 4: Aggregate and Decide

    Calculate average scores per dimension per model:

    DimensionModel A (200 examples)Model B (500 examples)Model C (Qwen base)
    Accuracy4.14.34.0
    Completeness3.84.24.4
    Format compliance4.54.63.9
    Tone/style3.54.03.7
    Hallucination rate4%2%6%
    Edge case handling3.23.83.5
    Weighted total3.854.183.90

    In this example, Model B (500 examples, same base) wins across most dimensions. But the decision isn't always this clear.

    When the "Worse" Model Is Actually Better

    Sometimes the model with lower aggregate scores is the right production choice:

    Model A has the best format compliance but the worst tone. If your use case is structured data extraction (JSON output), format compliance matters more than tone. Pick Model A.

    Model C hallucinates more but handles edge cases better. If your use case is customer-facing Q&A where wrong answers are worse than no answers, the lower hallucination rate of Model B matters more than Model C's edge case handling.

    Model B scores best overall but is 2x the adapter size. If you're deploying to edge hardware with tight memory constraints, Model A's smaller adapter might be the practical choice despite lower scores.

    The scoring framework surfaces the tradeoffs. The decision depends on your specific priorities.

    Comparison Workflow for Agencies

    If you're delivering fine-tuned models to clients (QA before delivery), the side-by-side comparison doubles as client-facing quality evidence:

    1. Train 2-3 variants with different configurations
    2. Run the comparison using the client's own example queries
    3. Present the results — show the client actual outputs from each variant
    4. Let the client choose which variant best fits their needs
    5. Document the selection for model versioning

    This transparency builds trust. Clients see that you tested multiple approaches and selected the best one based on evidence, not guesswork.

    Tips for Effective Comparison

    Use Real Production Queries

    Don't just use your training data for evaluation (that's testing on the training set). Use queries from actual production usage, customer emails, or realistic scenarios your users would actually type.

    Test at Production Quantization

    If you'll deploy at Q4_K_M, evaluate at Q4_K_M. Quantization can affect output quality differently across models — a model that edges out another at F16 might lose at Q4.

    Include "No Good Answer" Prompts

    Include prompts where the correct response is "I don't know" or "I need more information." Models that always generate an answer (even when they shouldn't) are dangerous in production. The best model knows its limits.

    Don't Rely on a Single Evaluation Run

    LLM outputs have variance, especially at non-zero temperatures. Run each prompt 2-3 times and score the average. If a model produces great output 2 out of 3 times and terrible output once, that inconsistency matters.

    Weight Dimensions by Business Impact

    Not all dimensions matter equally. For a support chatbot, accuracy and tone matter most. For a data extraction pipeline, format compliance and hallucination rate matter most. Weight your scoring accordingly.

    Getting Started

    1. Build an eval dataset of 50-100 prompts (guide here)
    2. Fine-tune 2-3 model variants on Ertas (different data sizes, different base models, or different hyperparameters)
    3. Run all variants through the eval dataset on the canvas
    4. Score outputs on accuracy, completeness, format, tone, hallucination, and edge cases
    5. Aggregate scores and make a decision based on your priorities
    6. Deploy the winner and save the eval dataset for future retraining comparisons

    The eval dataset and scoring rubric you build now become permanent assets. Every time you retrain, you compare the new model against the same benchmark. Over time, you build a clear picture of model improvement — and you never ship a regression.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

    Keep reading