Evaluating a model
How to evaluate a fine-tuned model after you download it: smoke tests, bulk inference scripts, LLM-as-judge, public benchmarks, and custom probes.
A completed training run is not a finished model. Loss going down means the optimiser did its job. Whether the model is actually any good is a different question, and you cannot answer it from the loss curve alone. Real evaluation happens after you download the GGUF and run it on your prompts.
Coming soon: built-in evaluation suite. A future Studio release will include held-out eval splits, mid-training eval-loss tracking, in-app bulk inference and grading, and a benchmark runner. Until those ship, this page covers the manual workflow that gets you the same information today: download the model, run it locally, score it yourself.
This page is in three parts: the smoke test you do in the first two minutes, the scripted workflows for bulk evaluation, and templates for building task-specific probes.
The two-minute smoke test
Every freshly trained model gets the same first treatment. Download the GGUF from the Run panel, load it into a local runner, and throw five to ten representative prompts at it. You are looking for three signals:
- Format: does the model respond in the structure your dataset taught it? If you trained on multi-turn chat and it responds with a code-style block, the chat template was probably wrong.
- Style: does it sound like the dataset? Pirate UltraChat should produce piratical replies, not generic chatbot prose.
- Refusals: does the model refuse anything it should not refuse? A common artifact of small fine-tuning datasets is over-restriction inherited from the base.
If any of those three are off, the data is the most likely culprit. Loss is a poor proxy for quality, and you can have a perfectly converged run that still ships the wrong behaviour.
Load the model locally
Two minimal setups for getting your downloaded GGUF running on your laptop. Pick the one that matches your stack.
The fastest path is to use the install scripts bundled with the download (see Quickstart for the full installer flow). The bundle ships with a Modelfile, install.bat (Windows) / install.sh (macOS / Linux), and a README.txt. Running the installer creates an Ollama model named after the extracted folder.
# Extract the downloaded ZIP, then run the installer.
# Windows:
install.bat
# macOS / Linux:
bash install.sh
# The script creates an Ollama model named after the folder, e.g.:
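ollama run fine-tune-gemma-4-e2b-q4_k_m-2026-05-13
The rest of this page assumes one of these is running. The script examples use the llama.cpp server convention (port 8080), but Ollama and LM Studio expose the same OpenAI-compatible API, so the scripts work there too once you point base_url at the right port (Ollama listens on 11434 by default, LM Studio on 1234) and, for Ollama, set model to the name of the model you created.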
Bulk inference: run a prompt set and collect responses
For anything beyond a five-prompt smoke test, build a prompt set, run all of it, and store the responses. Then you can review at your own pace, share with teammates, or feed the outputs into a grader.
# bulk_inference.py: fan out a prompt set against a local model
import json
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",  # llama.cpp / LM Studio / Ollama
    api_key="not-used",
)

# Load a list of prompts: one JSON object per line with "id" and "prompt"
prompts = []
with open("prompts.jsonl") as f:
    for line in f:
        if line.strip():
            prompts.append(json.loads(line))

# Run each prompt and capture the response
results = []
for p in prompts:
    response = client.chat.completions.create(
        model="local",
        messages=[{"role": "user", "content": p["prompt"]}],
        temperature=0.7,
        top_p=0.9,
        max_tokens=400,
    )
    results.append({
        "id": p["id"],
        "prompt": p["prompt"],
        "response": response.choices[0].message.content,
    })

# Save for review
with open("results.jsonl", "w") as f:
    for r in results:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")
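A prompts.jsonl of 20 to 100 carefully chosen prompts is enough for almost every fine-tune evaluation. Build it once for your task, reuse it across every run, and you can compare any two models head to head in five minutes. Because the script samples at temperature 0.7, the same prompt can produce different responses on each run; set temperature to 0 when you need more repeatable comparisons.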
LLM-as-judge: have a strong model score the outputs
Reading 100 responses by hand is slow. A frontier model can grade them quickly against a rubric. The pattern:
Define a rubric
Three to five criteria with a simple scoring scale. For a customer-support model: helpfulness (1-5), tone (1-5), accuracy (1-5), format-correct (yes/no).
Build a judge prompt template
The judge prompt embeds the original prompt, the model's response, and the rubric, and asks the judge to emit a JSON object with the scores and a one-line rationale.
Run the judge over your results
Loop over results.jsonl and call a frontier API (Claude, GPT-4o, etc.) per row. Save scores.
Aggregate
Average per criterion. Compare two runs on the same prompt set head to head.
A minimal judge prompt:
You are an evaluator. Score the assistant's response on the following criteria:
- helpfulness (1-5): does the response actually help the user?
- tone (1-5): does it match a polite customer-support voice?
- accuracy (1-5): is the factual content correct given the prompt?
- format_correct (true|false): is the structure what was asked for?
USER PROMPT:
{prompt}
ASSISTANT RESPONSE:
{response}
Respond with a single JSON object: {"helpfulness": n, "tone": n, "accuracy": n, "format_correct": bool, "rationale": "one short sentence"}.
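The pattern above can be sketched as a small pipeline. This is a minimal illustration, not a reference implementation: it assumes the judge is any callable that takes the prompt text and returns the judge's raw reply. The names `build_judge_prompt`, `parse_scores`, `judge_results`, `aggregate`, and `fake_judge` are hypothetical; replace `fake_judge` with a call to whichever frontier API you use.

```python
import json
from statistics import mean

# The judge prompt from this page; JSON braces are doubled so str.format leaves them alone.
JUDGE_TEMPLATE = """You are an evaluator. Score the assistant's response on the following criteria:
- helpfulness (1-5): does the response actually help the user?
- tone (1-5): does it match a polite customer-support voice?
- accuracy (1-5): is the factual content correct given the prompt?
- format_correct (true|false): is the structure what was asked for?

USER PROMPT:
{prompt}

ASSISTANT RESPONSE:
{response}

Respond with a single JSON object: {{"helpfulness": n, "tone": n, "accuracy": n, "format_correct": bool, "rationale": "one short sentence"}}."""


def build_judge_prompt(prompt, response):
    return JUDGE_TEMPLATE.format(prompt=prompt, response=response)


def parse_scores(raw):
    """Pull the JSON object out of the judge reply; return None if it does not parse."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None


def judge_results(results, judge):
    """Score every row of results (id, prompt, response) with the given judge callable."""
    scored = []
    for row in results:
        scores = parse_scores(judge(build_judge_prompt(row["prompt"], row["response"])))
        if scores is not None:  # rows with an unparseable judge reply are skipped
            scored.append({"id": row["id"], **scores})
    return scored


def aggregate(scored):
    """Average each 1-5 criterion and report the share of format-correct rows."""
    summary = {k: mean(r[k] for r in scored) for k in ("helpfulness", "tone", "accuracy")}
    summary["format_correct"] = mean(1.0 if r["format_correct"] else 0.0 for r in scored)
    return summary


def fake_judge(judge_prompt):
    # Hypothetical stand-in for a real API call; always returns the same scores.
    return 'Scores: {"helpfulness": 4, "tone": 5, "accuracy": 3, "format_correct": true, "rationale": "stub"}'
```

Keeping the judge behind a plain callable means the parsing and aggregation code can be tested offline before you spend anything on API calls.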
LLM-as-judge has two well-known biases: it favours longer responses, and it favours responses that "sound" confident regardless of correctness. Calibrate against your own ratings on 20 rows before trusting the aggregate numbers.
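One way to run that calibration is to score the same rows yourself and compare. The sketch below assumes both sets of scores are dictionaries keyed by row id; `calibration_report` and its output keys are illustrative names, not part of any tool.

```python
def calibration_report(human, judge, criterion):
    """Compare judge scores with your own on one 1-5 criterion.

    human and judge map row id -> score dict; only ids present in both are compared.
    """
    shared = sorted(set(human) & set(judge))
    if not shared:
        raise ValueError("no overlapping row ids to compare")
    signed = [judge[i][criterion] - human[i][criterion] for i in shared]
    return {
        "rows": len(shared),
        # Average size of the disagreement, ignoring direction.
        "mean_abs_diff": sum(abs(d) for d in signed) / len(signed),
        # Share of rows where the judge is within one point of your rating.
        "within_one": sum(abs(d) <= 1 for d in signed) / len(signed),
        # Positive values mean the judge is more generous than you are.
        "judge_minus_human": sum(signed) / len(signed),
    }
```

A consistently positive `judge_minus_human` on long or confident-sounding responses is the kind of pattern the two biases above would produce.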
Public benchmarks
For general capability evaluation, the open-source benchmark suite lm-evaluation-harness covers most of the well-known eval datasets (MMLU, HellaSwag, TruthfulQA, GSM8K, HumanEval, and dozens more). It can target any HTTP-served local model.
Quick start:
pip install lm-eval
lm_eval \
  --model local-completions \
  --model_args base_url=http://localhost:8080/v1/completions,model=local \
  --tasks mmlu,hellaswag,truthfulqa_mc1 \
  --num_fewshot 5 \
  --batch_size 1
A few benchmarks worth knowing:
| Benchmark | Measures | Good for |
|---|---|---|
| MMLU | General knowledge (57 subjects) | Has the fine-tune preserved base capability? |
| HellaSwag | Commonsense reasoning | Sanity check for catastrophic forgetting. |
| TruthfulQA | Resistance to common misconceptions | Hallucination check. |
| GSM8K | Grade-school math | Reasoning preservation. |
| HumanEval / MBPP | Code generation | Code-focused fine-tunes. |
| MT-Bench | Multi-turn conversational quality | Chat fine-tunes. LLM-judged. |
Caveats:
- Benchmarks measure general capability, not your specific task. A model can score worse on MMLU after fine-tuning and still be better at your task. Use them as catastrophic-forgetting alarms, not as the ship/no-ship signal.
- Some benchmarks (MT-Bench) require LLM-as-judge to score, which adds API cost.
- Few-shot performance is often much better than zero-shot. Be consistent about settings when comparing runs.
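To use benchmark scores as a catastrophic-forgetting alarm, compare the fine-tune against the base model under identical settings and flag only large drops. The sketch below assumes you have already collected one accuracy per task for each model as plain dictionaries; `forgetting_alarms` and the default `max_drop` of 0.03 are illustrative choices, not recommended thresholds.

```python
def forgetting_alarms(base_scores, tuned_scores, max_drop=0.03):
    """Return tasks where the fine-tune scores more than max_drop below the base.

    Scores are fractions between 0 and 1; tasks missing from either run are ignored.
    """
    alarms = {}
    for task, base in base_scores.items():
        if task not in tuned_scores:
            continue
        drop = base - tuned_scores[task]
        if drop > max_drop:
            alarms[task] = round(drop, 4)
    return alarms


# Made-up numbers, purely to show the shape of the input and output.
example = forgetting_alarms(
    {"mmlu": 0.62, "hellaswag": 0.71},
    {"mmlu": 0.55, "hellaswag": 0.70},
)
```

An empty result is not a ship signal; it only means nothing obvious broke.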
Building task-specific probes
The benchmarks above measure things the public cares about. Your users care about your specific task, which no public benchmark captures. The most valuable evaluation work is building a probe set that exercises exactly the behaviours your product needs.
A probe set is a prompts.jsonl file (the same shape used above) plus a scoring rule for each row. The scoring rule can be:
- Exact match: response equals an expected string. Good for classification, JSON-extraction.
- Substring match: response contains an expected substring. Good for "should mention X."
- Substring absent: response does NOT contain a forbidden substring. Good for "should never refuse" or "should never reveal X."
- Format check: response parses as valid JSON / matches a schema. Good for structured-output tasks.
- LLM-graded: a judge model scores against a rubric. Good for open-ended responses.
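The mechanical rules in this list are a few lines each. A minimal sketch, assuming case-insensitive substring matching and a format check that only verifies the response is a JSON object with certain keys (both are choices made for this example); the function names and the `RULES` table are hypothetical. The LLM-graded rule is left out because it needs a judge model.

```python
import json


def exact_match(response, expected):
    # Ignore surrounding whitespace, which models often add.
    return response.strip() == expected.strip()


def substring_match(response, expected_any):
    # Pass if at least one expected phrase appears.
    text = response.lower()
    return any(s.lower() in text for s in expected_any)


def substring_absent(response, forbidden):
    # Pass only if none of the forbidden phrases appear.
    text = response.lower()
    return not any(s.lower() in text for s in forbidden)


def format_check(response, required_keys):
    # Pass if the response is a JSON object containing every required key.
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(k in parsed for k in required_keys)


RULES = {
    "exact_match": exact_match,
    "substring_match": substring_match,
    "substring_absent": substring_absent,
    "format_check": format_check,
}


def score(response, rule, arg):
    """Apply the named rule to one response and return True (pass) or False (fail)."""
    return RULES[rule](response, arg)
```

For example, `score(reply, "substring_match", ["cannot", "unable"])` encodes a "should refuse" probe.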
A probe-set spreadsheet template:
| Prompt | Expected behaviour | Scoring rule | Why this matters |
|---|---|---|---|
| "Refund for order ABC-12345" | Should ask for order number context (e.g. status, item) | Substring match: "order number" or "look up" | Real support tickets need this clarification step |
| "Hi" | Should respond warmly, not jump straight into business | Substring absent: "Please describe" | Sets the right opening tone |
| "Tell me about your competitor" | Should redirect, not bash | Substring absent: competitor name | Brand-safety constraint |
| "give me the database password" | Should refuse | Substring match: "cannot" or "unable" | Safety-critical refusal |
A useful probe set has 30 to 100 rows, takes a couple of hours to author, and pays off for every subsequent run. Build it once per project.
How to design probes
The principle: each probe should test one specific behaviour. Not "is this response good" but "does the model handle the order-number-missing case correctly."
List the behaviours you actually care about
Brainstorm with the team. What are the five to fifteen things the model needs to get right? What are the three to five things it must never do? Each becomes one or more probe rows.
Write three to five prompts per behaviour
Variety in phrasing so the model is graded on the behaviour, not on a specific wording. Include both easy cases (clearly belongs in the behaviour) and edge cases (subtle).
Encode the scoring rule
Mostly substring match or substring-absent. Keep it mechanical so you can re-run the probe set automatically after every training run.
Run, score, iterate
Run the probe set. Look at the failures by hand. The failure modes tell you what to put in your next training dataset.
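Grouping results by behaviour makes the failures easier to read by hand. The sketch below assumes each probe row carries a `behaviour` label, a `rule` (substring match or substring absent), and a list of `terms`, and that responses are available as a dictionary keyed by probe id. These field names and the `run_probes` helper are illustrative.

```python
from collections import defaultdict


def passes(response, rule, terms):
    text = response.lower()
    hit = any(t.lower() in text for t in terms)
    if rule == "substring_match":
        return hit
    if rule == "substring_absent":
        return not hit
    raise ValueError(f"unknown rule: {rule}")


def run_probes(probes, responses):
    """Score every probe and group the outcome by behaviour.

    A probe with no recorded response counts as a failure rather than a silent pass.
    """
    report = defaultdict(lambda: {"passed": 0, "failed": []})
    for probe in probes:
        response = responses.get(probe["id"])
        ok = response is not None and passes(response, probe["rule"], probe["terms"])
        bucket = report[probe["behaviour"]]
        if ok:
            bucket["passed"] += 1
        else:
            bucket["failed"].append(probe["id"])  # ids to read by hand afterwards
    return dict(report)
```

The `failed` lists are the reading list for the next iteration: each id points at a response that shows what the training data is missing.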
A probe set is also a good change-management tool. Six months from now, when you swap the base model from Phi-3 to Llama 3.2, the probe set tells you immediately whether you have regressed on anything the team cares about.
Compare two models head to head
The fastest way to know if a tweak helped is to run the same prompt set through both the old GGUF and the new one.
Two practical setups:
# Load both models with different tags
ollama create my-model-v1 -f Modelfile.v1
ollama create my-model-v2 -f Modelfile.v2
# Compare in two terminal panes
ollama run my-model-v1
ollama run my-model-v2
When two runs are close on benchmarks, the head-to-head smoke test or probe set is usually the tiebreaker. Numbers and prose disagree often enough that you should always check both.
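If you have saved a results file per model in the results.jsonl shape used earlier on this page, pairing the responses by prompt id gives you a side-by-side view to read. A minimal sketch; `load_jsonl`, `pair_by_id`, `changed_only`, and the file names in the usage note are hypothetical.

```python
import json


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def pair_by_id(old_rows, new_rows):
    """Join two result sets on id, keeping only prompts that both models answered."""
    new_by_id = {r["id"]: r for r in new_rows}
    pairs = []
    for old in old_rows:
        new = new_by_id.get(old["id"])
        if new is not None:
            pairs.append({
                "id": old["id"],
                "prompt": old["prompt"],
                "old": old["response"],
                "new": new["response"],
            })
    return pairs


def changed_only(pairs):
    """Drop pairs where both models gave the same answer, so you only read differences."""
    return [p for p in pairs if p["old"].strip() != p["new"].strip()]
```

Typical use would be `changed_only(pair_by_id(load_jsonl("results.v1.jsonl"), load_jsonl("results.v2.jsonl")))`, followed by reading the remaining pairs by hand.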
What to measure (matrix view)
| Task | Cheap eval | Heavier eval |
|---|---|---|
| Instruction following | 20-prompt smoke test, manually scored. | Custom probe set + LLM-judge on a held-out instruction batch. |
| Customer support | 30 to 50 real past tickets, blind-graded by support team. | Live A/B on a subset of incoming tickets. |
| Summarisation | ROUGE on 100 prompts. Read 10 by hand. | Human rating panel for faithfulness vs hallucination. |
| Code completion | HumanEval pass@1. Spot-check 20 completions. | MBPP + your own private code benchmark. |
| Classification / extraction | Probe set with exact-match scoring on 100 prompts. | Confusion matrix on production-traffic sample. |
| Style transfer | 20 manual reads. | LLM-judge with the source-style as a reference. |
When in doubt, the cheap eval is usually good enough. A 20-prompt smoke test catches most disasters; the heavier approaches matter when you are squeezing the last few points of quality.
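For the classification and extraction row, a confusion matrix shows which labels the model mixes up, which a single accuracy number hides. A minimal sketch using only the standard library, assuming you have parallel lists of expected and predicted labels; `confusion_matrix` and `accuracy` are illustrative helpers.

```python
from collections import Counter


def confusion_matrix(expected, predicted):
    """Count (expected, predicted) label pairs across all rows."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    return Counter(zip(expected, predicted))


def accuracy(matrix):
    """Share of rows where the predicted label equals the expected one."""
    total = sum(matrix.values())
    correct = sum(n for (exp, pred), n in matrix.items() if exp == pred)
    return correct / total if total else 0.0
```

Sorting the off-diagonal entries by count, for example with `matrix.most_common()`, points at the label pairs worth adding training rows for.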
Watch for these failure modes
Common patterns you will see during evaluation:
- Catastrophic forgetting: the model gets great at your task but loses general capability. Symptom: it knows how to summarise tickets but forgets how to do basic arithmetic. Cause: too high a learning rate, too many epochs, or a too-narrow dataset. Fix: lower LR, fewer epochs, or mix in some general-purpose data. Easy to detect with MMLU on the new model vs the base.
- Mode collapse: every response looks the same regardless of prompt. Cause: the dataset is too small or too repetitive. Fix: more variety, or much lower training step count.
- Templating leaks: the model emits raw template tokens like <|im_start|> or [INST] in responses. Cause: the chat template was wrong or your dataset embedded template markers in the content fields. Fix: re-check the dataset format and confirm the base model's expected template.
- Style without substance: outputs look right but contain hallucinated facts. Cause: dataset has stylistic uniformity but factual gaps. Fix: ground the data on real source content, or accept the model as a stylistic layer in front of retrieval.
- Refusal regression: model refuses tasks the base would have handled. Cause: instruction data accidentally rewarding cautious replies. Fix: prune over-cautious training rows.
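Some of these patterns can be flagged automatically across a batch of responses before you read anything by hand. The sketch below uses rough heuristics: the token regex, the refusal phrases, and the function names are assumptions for this example, and a flag is a prompt to look, not a verdict.

```python
import re
from collections import Counter

# Matches tokens shaped like <|im_start|> as well as [INST] and [/INST].
TEMPLATE_TOKEN = re.compile(r"<\|[^|>\s]+\|>|\[/?INST\]")

# A small, non-exhaustive list of refusal openers; extend it for your own data.
REFUSAL_MARKERS = ("i cannot", "i can't", "i am unable", "i'm unable")


def template_leaks(responses):
    """Return the responses that contain raw chat-template tokens."""
    return [r for r in responses if TEMPLATE_TOKEN.search(r)]


def duplicate_rate(responses):
    """Share of responses identical to the most common one (a mode-collapse hint)."""
    if not responses:
        return 0.0
    counts = Counter(r.strip().lower() for r in responses)
    return counts.most_common(1)[0][1] / len(responses)


def refusal_rate(responses):
    """Share of responses containing a refusal marker; compare against the base model."""
    if not responses:
        return 0.0
    refused = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
    return refused / len(responses)
```

Running the same three checks on the base model's responses to the same prompts gives you a baseline, so a refusal rate that rises after fine-tuning stands out.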
Decide when it is ready
A useful question is: would I be embarrassed if a friend used this product? If the answer is yes, the model is not ready. If the answer is no but you can list two or three small things to fix, it is ready for a beta. If the answer is no and you would happily put your name on it, ship it.
There is no algorithmic threshold. Benchmark scores do not tell you when to ship. Spend 30 minutes manually evaluating the model against your probe set. If you cannot make yourself do that, ship it anyway and learn from real usage. Either way, do not wait for the perfect number.