
    How to Evaluate Your Fine-Tuned Model: A Non-Technical Guide

    Practical framework for evaluating fine-tuned model quality without ML expertise — covering accuracy checks, output consistency, edge case testing, and production readiness for agencies and product teams.

    Ertas Team · Updated

    You fine-tuned a model. Training completed without errors. The loss curve went down. Now what?

    Most teams ship at this point. They run a few prompts manually, the outputs look reasonable, and the model goes to production. Two weeks later, a client reports the model is hallucinating product features that do not exist, or formatting responses in a way that breaks the downstream integration.

    The problem is not the model. The problem is that "looks reasonable" is not an evaluation strategy.

    Evaluation is the most skipped step in the fine-tuning pipeline. It is also the step that determines whether your model actually works in production or just works in your demo. This guide gives you five practical evaluation approaches that require zero ML expertise — just domain knowledge and a willingness to be systematic about it.

    Why Evaluation Matters More Than Training

    Here is a number that surprises most teams: a model can achieve excellent training metrics and still fail 15-30% of real production queries. Training loss measures how well the model learned your training data. It does not measure how well the model handles inputs it has never seen.

    For agencies delivering models to clients, the gap between "trained successfully" and "works in production" is where reputations are made or broken. A single high-profile failure — a legal AI citing a non-existent statute, a healthcare bot giving incorrect dosage information — can undo months of relationship building.

    Evaluation is not a nice-to-have quality step. It is the difference between a model you can confidently bill for and a model you are hoping works.

    Approach 1: Human Review Sampling

    The simplest and most underrated evaluation method. Pull 50-100 representative inputs from your expected production traffic, run them through the model, and have a domain expert review every output.

    How to do it:

    1. Collect 50-100 inputs that represent your actual use case. If the model handles customer support, use real support tickets. If it generates legal summaries, use real case briefs.
    2. Run each input through your fine-tuned model and capture the output.
    3. Have someone with domain knowledge rate each output on a simple scale: Correct, Partially Correct, or Wrong.
    4. Calculate your accuracy rate. For most production use cases, you want 90%+ Correct and under 5% Wrong.

    What this catches: Systematic errors that automated metrics miss. A model might score well on perplexity but consistently misuse industry-specific terminology. Human reviewers catch this immediately.

    The 50-example minimum: Below 50 examples, your accuracy estimate has too much variance to be useful. At 50 examples, if you see 45 correct outputs, your true accuracy is likely between 82% and 97% (95% confidence interval). At 100 examples, that range tightens to roughly 84-96%. More examples give you more confidence, but 50 is the floor for a meaningful signal.
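
    If you would rather script the tally than keep it in a spreadsheet, a minimal sketch in Python might look like the one below. It assumes the reviewer's ratings live in a CSV with a rating column; the file and column names are placeholders, not a required format.

```python
# Tally reviewer ratings and estimate accuracy with a 95% confidence interval.
# Assumed CSV columns: input, output, rating  (rating = Correct / Partially Correct / Wrong)
import csv
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple:
    # Wilson score interval; slightly more conservative than a quick normal-approximation estimate.
    p = correct / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half

with open("review_sample.csv", newline="") as f:  # hypothetical file name
    ratings = [row["rating"].strip() for row in csv.DictReader(f)]

total = len(ratings)
correct = sum(r == "Correct" for r in ratings)
wrong = sum(r == "Wrong" for r in ratings)
low, high = wilson_interval(correct, total)

print(f"Correct: {correct}/{total} ({correct / total:.0%})   Wrong: {wrong / total:.0%}")
print(f"95% confidence interval for accuracy: {low:.0%} to {high:.0%}")
```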

    Pro tip: Do not let the person who prepared the training data do the evaluation. They are too close to the expected outputs and will unconsciously rate borderline cases as correct. Fresh eyes catch more issues.

    Approach 2: A/B Comparison Against Baseline

    Side-by-side comparison is one of the most informative evaluation techniques, and it requires no statistical background.

    How to do it:

    1. Choose your baseline. This could be the base model before fine-tuning, a prompted GPT-4, or your previous model version.
    2. Run the same 50-100 test inputs through both models.
    3. Present the outputs side-by-side to a reviewer (blind — do not label which is which).
    4. For each pair, the reviewer picks which output is better, or marks them as equal.
    5. Count wins, losses, and ties.

    Interpreting results: Your fine-tuned model should win at least 60% of head-to-head comparisons against the base model to justify deployment. If it wins less than 50%, something went wrong in training. If it wins 50-60%, the fine-tuning produced marginal improvement — consider whether the operational cost of maintaining a custom model is worth it.
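
    If you want to script the blinding step, one rough approach is to randomize which model shows up in which column and keep the answer key in a separate file the reviewer never opens. A sketch with placeholder rows and file names:

```python
# Build a blind review sheet: randomly swap which model appears in column A,
# and keep the answer key in a separate file the reviewer never sees.
import csv
import random

# Placeholder rows of (input, fine_tuned_output, baseline_output).
rows = [
    ("Summarize clause 4.2 of the attached contract", "fine-tuned output ...", "baseline output ..."),
]

with open("review_sheet.csv", "w", newline="") as sheet_file, \
        open("answer_key.csv", "w", newline="") as key_file:
    sheet = csv.writer(sheet_file)
    key = csv.writer(key_file)
    sheet.writerow(["input", "output_A", "output_B"])
    key.writerow(["input", "fine_tuned_is"])
    for prompt, fine_tuned, baseline in rows:
        if random.random() < 0.5:
            sheet.writerow([prompt, fine_tuned, baseline])
            key.writerow([prompt, "A"])
        else:
            sheet.writerow([prompt, baseline, fine_tuned])
            key.writerow([prompt, "B"])
```

    When the reviewer hands their picks back, join them against the key and count how often the fine-tuned side won; that win rate is what the 60% threshold above refers to.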

    What this catches: Regression. Fine-tuning can improve performance on your target task while degrading general capabilities. A/B comparison reveals whether the model got better at your specific task but worse at basic reasoning, grammar, or instruction following.

    A common failure mode: the fine-tuned model nails the output format perfectly but the content quality drops. Without side-by-side comparison, you might not notice because the outputs look right at a glance.

    Approach 3: Golden Test Set

    A golden test set is a curated collection of inputs with known-correct outputs. It is the closest thing to a unit test suite for your model.

    How to build one:

    1. Start with 30-50 examples that cover your core use cases.
    2. For each example, write the ideal output — the exact response you want the model to produce.
    3. Include difficulty tiers: 60% straightforward cases, 25% moderate complexity, 15% hard edge cases.
    4. Store this as a versioned file (JSONL works well) that you never use for training.
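
    There is no special tooling required for the file itself. For illustration, here is a minimal Python sketch that writes a couple of golden examples to JSONL; the field names are an assumption, not a requirement, so use whatever fields fit your task.

```python
# Write a small golden test set as JSONL: one example per line, kept out of training forever.
import json

golden_examples = [
    {
        "id": "support-001",
        "input": "Customer asks how to reset their password.",
        "ideal_output": "Point the customer to the reset link on the login page and confirm the email was sent.",
        "difficulty": "straightforward",
        "required_facts": ["reset link", "login page"],
    },
    {
        "id": "support-017",
        "input": "Customer demands a refund three months after purchase.",
        "ideal_output": "Politely decline, cite the 30-day refund window, and offer to escalate to a human agent.",
        "difficulty": "hard",
        "required_facts": ["30-day", "escalate"],
    },
]

with open("golden_test_set.jsonl", "w") as f:
    for example in golden_examples:
        f.write(json.dumps(example) + "\n")
```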

    How to score it:

    For classification tasks, accuracy is straightforward — the model either picks the right category or it does not. For generation tasks, scoring requires more nuance:

    • Exact match rate: What percentage of outputs match the golden answer exactly? Useful for structured outputs like JSON or category labels.
    • Semantic match rate: What percentage are functionally equivalent even if worded differently? Requires human judgment.
    • Key fact inclusion: For factual tasks, list the 3-5 facts each answer must include. Score the percentage of required facts present.
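
    Exact match and key fact inclusion are both easy to script if you want to. A rough sketch, reusing the illustrative field names from the golden file above and a placeholder run_model function that stands in for however you actually call your model:

```python
# Score a golden test set: exact match rate plus key-fact inclusion rate.
import json

def run_model(prompt: str) -> str:
    # Placeholder: replace with however you call your fine-tuned model.
    raise NotImplementedError

with open("golden_test_set.jsonl") as f:
    examples = [json.loads(line) for line in f]

exact_matches = 0
fact_scores = []
for example in examples:
    output = run_model(example["input"])
    if output.strip() == example["ideal_output"].strip():
        exact_matches += 1
    facts = example.get("required_facts", [])
    if facts:
        # Naive substring check; a human should still spot-check these.
        present = sum(fact.lower() in output.lower() for fact in facts)
        fact_scores.append(present / len(facts))

print(f"Exact match rate: {exact_matches / len(examples):.0%}")
if fact_scores:
    print(f"Average key-fact inclusion: {sum(fact_scores) / len(fact_scores):.0%}")
```

    Semantic match is the one check that still needs human judgment; no script replaces a domain expert actually reading the outputs.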

    The critical rule: Never train on your golden test set. The moment test examples leak into training data, your evaluation becomes meaningless. Keep these files separate, and audit regularly to ensure no contamination.
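
    The audit can be automated. A small sketch, assuming both files are JSONL with an input field; this only catches exact duplicates, so paraphrased overlaps still need a manual look.

```python
# Contamination audit: flag any golden test input that also appears in the training data.
import json

def load_inputs(path: str) -> set:
    with open(path) as f:
        return {json.loads(line)["input"].strip().lower() for line in f}

training_inputs = load_inputs("training_data.jsonl")    # hypothetical file names
golden_inputs = load_inputs("golden_test_set.jsonl")

leaked = training_inputs & golden_inputs
if leaked:
    print(f"Contamination: {len(leaked)} golden inputs also appear in the training data")
else:
    print("No exact overlap between the golden set and the training data")
```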

    Maintaining over time: Add 5-10 new examples monthly from real production failures. Cases where the model got it wrong in production are the most valuable test cases because they represent real gaps.

    Approach 4: Edge Case Battery

    Edge cases are where fine-tuned models fail most dramatically. A model can handle 95% of standard queries perfectly and completely fall apart on the remaining 5% — and those 5% are often the cases clients remember.

    Build your edge case battery around these categories:

    Ambiguous inputs. Queries that could be interpreted multiple ways. A well-behaved model should either ask for clarification or handle the most likely interpretation while acknowledging alternatives.

    Out-of-scope inputs. Queries the model should not answer. If you fine-tuned a legal document summarizer, what happens when someone asks it to write marketing copy? The model should decline gracefully, not hallucinate a response.

    Adversarial inputs. Inputs designed to break the model — prompt injection attempts, extremely long inputs, inputs in unexpected languages, inputs with contradictory information. You need 10-20 of these.

    Boundary conditions. Inputs at the extremes of your expected range. The shortest possible valid input. The longest. Inputs with unusual formatting. Inputs that combine multiple sub-tasks.

    How to run it:

    Create a spreadsheet with 30-50 edge cases across these categories. For each, define the expected behavior (not necessarily a specific output, but what category of response is acceptable). Run them through the model and flag any case where the behavior is unexpected.
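
    If the spreadsheet is exported to CSV, a short script can run the battery and hand the outputs back for flagging. A sketch with illustrative column names and the same placeholder run_model as above:

```python
# Run the edge-case battery from a CSV export and write outputs back for manual flagging.
# Assumed columns: case_id, category, input, expected_behavior
import csv

def run_model(prompt: str) -> str:
    # Placeholder: replace with however you call your fine-tuned model.
    raise NotImplementedError

with open("edge_cases.csv", newline="") as f:
    cases = list(csv.DictReader(f))

results = []
for case in cases:
    results.append({**case, "model_output": run_model(case["input"]), "flag": ""})

with open("edge_case_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(results[0].keys()))
    writer.writeheader()
    writer.writerows(results)
```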

    Pass criteria: Zero catastrophic failures (no offensive outputs, no dangerous advice, no data leakage). Graceful handling of at least 80% of edge cases. Identified failure modes documented for client communication.

    Approach 5: Production Monitoring

    Evaluation does not end at deployment. The most important evaluation happens in production, where real users generate inputs you never anticipated.

    What to monitor:

    • Output length distribution. A sudden change in average output length often signals a problem. If your model typically generates 200-word responses and starts producing 50-word responses, something shifted.
    • Refusal rate. Track how often the model declines to answer. A spike in refusals might indicate the model is being too conservative, or that it is receiving out-of-distribution inputs.
    • Latency per request. Fine-tuned models should have consistent inference time. Latency spikes can indicate input handling issues.
    • User feedback signals. If your application includes thumbs up/down or retry behavior, track these. A retry rate above 15% suggests users are not satisfied with first-attempt outputs.
    • Error rate by input category. Break down performance by the type of query. You might find the model handles category A perfectly but struggles with category B — information that drives your next training data collection.

    Sampling for ongoing review: Even after deployment, pull 20-30 random production outputs weekly for human review. This catches slow degradation that automated metrics miss. If your weekly accuracy drops below your baseline, investigate immediately.
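
    You do not need a dedicated observability stack to start. A rough sketch, assuming you can export a week of requests as JSONL with fields like output, refused, and user_retried (all illustrative names):

```python
# Weekly production check: output-length drift, refusal rate, retry rate,
# plus a random sample pulled out for human review.
import json
import random
import statistics

with open("production_log_week.jsonl") as f:    # hypothetical weekly log export
    requests = [json.loads(line) for line in f]

lengths = [len(r["output"].split()) for r in requests]
refusal_rate = sum(bool(r.get("refused")) for r in requests) / len(requests)
retry_rate = sum(bool(r.get("user_retried")) for r in requests) / len(requests)

print(f"Median output length: {statistics.median(lengths)} words")
print(f"Refusal rate: {refusal_rate:.1%}   Retry rate: {retry_rate:.1%}")

# Pull 20-30 random outputs for this week's human review pass.
for request in random.sample(requests, k=min(25, len(requests))):
    print(request["output"][:200])
```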

    Common Evaluation Mistakes

    Mistake 1: Evaluating on training data. If your test examples overlap with training examples, your accuracy numbers are meaningless. The model is not demonstrating generalization — it is demonstrating memorization.

    Mistake 2: Evaluating only happy-path inputs. Running 50 standard queries and seeing 50 correct outputs does not mean the model works. It means the model works on standard queries. Edge cases are where production failures live.

    Mistake 3: Using a single metric. Accuracy alone does not tell you enough. A model with 90% accuracy that fails catastrophically on 2% of inputs (producing offensive or dangerous content) is worse than a model with 85% accuracy that fails gracefully.

    Mistake 4: Evaluating once and shipping. Models do not degrade on their own, but production traffic does change over time. Monthly re-evaluation catches distribution drift before clients notice.

    Mistake 5: Skipping evaluation because you trust the training data. Good training data is necessary but not sufficient. The model might learn the wrong patterns from correct data — overfitting to superficial features rather than the underlying task.

    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

    How Ertas Studio Helps With Eval Workflows

    Ertas Studio includes built-in evaluation tooling designed for teams without ML expertise:

    Side-by-side comparison interface. Run test inputs through multiple model versions and compare outputs in a clean, reviewable format. No scripts required.

    Golden test set management. Upload your test set once and re-run it against every new model version with a single click. Track accuracy trends across versions automatically.

    Export evaluation reports. Generate shareable reports showing model performance across your test suite — useful for client presentations and internal sign-off.

    The goal is to make evaluation as easy as training. If eval requires a Python script and a Jupyter notebook, most teams will skip it. If eval requires clicking a button and reviewing a table, most teams will actually do it.

