Evaluating a model

How to evaluate a fine-tuned model after you download it: smoke tests, bulk inference scripts, LLM-as-judge, public benchmarks, and custom probes.

A completed training run is not a finished model. Loss going down means the optimiser did its job. Whether the model is actually any good is a different question, and you cannot answer it from the loss curve alone. Real evaluation happens after you download the GGUF and run it on your prompts.

Coming soon: built-in evaluation suite. A future Studio release will include held-out eval splits, mid-training eval-loss tracking, in-app bulk inference and grading, and a benchmark runner. Until those ship, this page covers the manual workflow that gets you the same information today: download the model, run it locally, score it yourself.

This page is in three parts: the smoke test you do in the first two minutes, the scripted workflows for bulk evaluation, and templates for building task-specific probes.

The two-minute smoke test

Every freshly trained model gets the same first treatment. Download the GGUF from the Run panel, load it into a local runner, and throw five to ten representative prompts at it. You are looking for three signals:

Format: does the model respond in the structure your dataset taught it? If you trained on multi-turn chat and it responds with a code-style block, the chat template was probably wrong.
Style: does it sound like the dataset? Pirate UltraChat should produce piratical replies, not generic chatbot prose.
Refusals: does the model refuse anything it should not refuse? A common artifact of small fine-tuning datasets is over-restriction inherited from the base.

If any of those three is off, the data is the most likely culprit. Loss is a poor proxy for quality, and you can have a perfectly converged run that still ships the wrong behaviour.

Load the model locally

To evaluate the downloaded GGUF you need it running on your laptop in a local runner such as Ollama, LM Studio, or the llama.cpp server. Pick the one that matches your stack.

The fastest path is to use the install scripts bundled with the download (see Quickstart for the full installer flow). The bundle ships with a Modelfile, install.bat (Windows) / install.sh (macOS / Linux), and a README.txt. Running the installer creates an Ollama model named after the extracted folder.

terminal

# Extract the downloaded ZIP, then run the installer.
# Windows:
install.bat

# macOS / Linux:
bash install.sh

# The script creates an Ollama model named after the folder, e.g.:
ollama run fine-tune-gemma-4-e2b-q4_k_m-2026-05-13

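The rest of this page assumes one of these is running. The script examples use the llama.cpp server convention (port 8080), but Ollama and LM Studio expose the same OpenAI-compatible API. To use them, change the port in base_url (Ollama defaults to 11434, LM Studio to 1234) and, for Ollama, set model to the name the installer created.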
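As a small sketch of the port swap, the lookup below maps each runner to its out-of-the-box OpenAI-compatible base URL. The helper name base_url_for and the dictionary keys are hypothetical, and the ports are the runners' usual defaults, which your install may have changed.

```python
# Default local endpoints for the three runners named above.
# Assumption: each runner listens on its default port; adjust if you changed it.
DEFAULT_BASE_URLS = {
    "llama.cpp": "http://localhost:8080/v1",
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
}


def base_url_for(runner: str) -> str:
    """Return the OpenAI-compatible base URL for a runner name (hypothetical helper)."""
    key = runner.lower()
    if key not in DEFAULT_BASE_URLS:
        raise ValueError(
            f"unknown runner {runner!r}; expected one of {sorted(DEFAULT_BASE_URLS)}"
        )
    return DEFAULT_BASE_URLS[key]


print(base_url_for("ollama"))
```

Pass the returned value as base_url when constructing the OpenAI client in the scripts that follow.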

Bulk inference: run a prompt set and collect responses

For anything beyond a five-prompt smoke test, build a prompt set, run all of it, and store the responses. Then you can review at your own pace, share with teammates, or feed the outputs into a grader.

# bulk_inference.py: fan out a prompt set against a local model
import json
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",  # llama.cpp server default; Ollama uses 11434, LM Studio 1234
    api_key="not-used",  # local servers ignore the key, but the client requires a value
)

# Load a list of prompts: one JSON object per line with "id" and "prompt"
prompts = []
with open("prompts.jsonl", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            prompts.append(json.loads(line))

# Run each prompt and capture the response
results = []
for p in prompts:
    response = client.chat.completions.create(
        model="local",  # with Ollama, use the real model name (see "ollama list")
        messages=[{"role": "user", "content": p["prompt"]}],
        temperature=0.7,
        top_p=0.9,
        max_tokens=400,
    )
    results.append({
        "id": p["id"],
        "prompt": p["prompt"],
        "response": response.choices[0].message.content,
    })

# Save for review
with open("results.jsonl", "w", encoding="utf-8") as f:
    for r in results:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

A prompts.jsonl of 20 to 100 carefully chosen prompts is enough for almost every fine-tune evaluation. Build it once for your task, reuse it across every run, and you can compare any two models head to head in five minutes.
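Because the same file gets reused across runs, it can help to validate it once before spending inference time on it. The loader below is a minimal sketch: load_prompt_set is a hypothetical helper, and the rule that every row needs a unique "id" and a "prompt" is an assumption drawn from the file shape described above.

```python
import json


def load_prompt_set(path):
    """Load a prompts.jsonl file, checking each row has a unique "id" and a "prompt"."""
    prompts, seen = [], set()
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines, as the bulk script does
            row = json.loads(line)
            if "id" not in row or "prompt" not in row:
                raise ValueError(f"line {lineno}: missing 'id' or 'prompt'")
            if row["id"] in seen:
                raise ValueError(f"line {lineno}: duplicate id {row['id']!r}")
            seen.add(row["id"])
            prompts.append(row)
    return prompts


# Usage (assumes prompts.jsonl exists in the working directory):
# prompts = load_prompt_set("prompts.jsonl")
```

Unique ids matter later, when two runs are joined row by row for comparison.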

LLM-as-judge: have a strong model score the outputs

Reading 100 responses by hand is slow. A frontier model can grade them quickly against a rubric. The pattern:

Define a rubric

Three to five criteria with a simple scoring scale. For a customer-support model: helpfulness (1-5), tone (1-5), accuracy (1-5), format-correct (yes/no).

Build a judge prompt template

The judge prompt embeds the original prompt, the model's response, and the rubric. It asks the judge to emit a JSON object with the scores and a one-line rationale.

Run the judge over your results

Loop over results.jsonl and call a frontier API (Claude, GPT-4o, etc.) per row. Save scores.

Aggregate

Average per criterion. Compare two runs on the same prompt set head to head.

A minimal judge prompt:

You are an evaluator. Score the assistant's response on the following criteria:
- helpfulness (1-5): does the response actually help the user?
- tone (1-5): does it match a polite customer-support voice?
- accuracy (1-5): is the factual content correct given the prompt?
- format_correct (true|false): is the structure what was asked for?

USER PROMPT:
{prompt}

ASSISTANT RESPONSE:
{response}

Respond with a single JSON object: {"helpfulness": n, "tone": n, "accuracy": n, "format_correct": bool, "rationale": "one short sentence"}.
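The four steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not a definitive implementation: the function names are hypothetical, and call_judge stands in for whichever frontier API you use (it takes the judge prompt as a string and returns the judge's reply as a string). The demo uses a fake judge so the sketch runs without an API key.

```python
import json
import re
from statistics import mean

JUDGE_TEMPLATE = """You are an evaluator. Score the assistant's response on the following criteria:
- helpfulness (1-5): does the response actually help the user?
- tone (1-5): does it match a polite customer-support voice?
- accuracy (1-5): is the factual content correct given the prompt?
- format_correct (true|false): is the structure what was asked for?

USER PROMPT:
{prompt}

ASSISTANT RESPONSE:
{response}

Respond with a single JSON object: {"helpfulness": n, "tone": n, "accuracy": n, "format_correct": bool, "rationale": "one short sentence"}."""


def build_judge_prompt(template, prompt, response):
    """Fill {prompt} and {response} in one pass, leaving the literal JSON braces alone."""
    values = {"prompt": prompt, "response": response}
    return re.sub(r"\{(prompt|response)\}", lambda m: values[m.group(1)], template)


def parse_judge_reply(text):
    """Parse the outermost {...} span of the reply; return None if it is not valid JSON."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None


def judge_all(rows, call_judge, template=JUDGE_TEMPLATE):
    """Score every row of results.jsonl; rows with unparseable replies are skipped."""
    scored = []
    for row in rows:
        reply = call_judge(build_judge_prompt(template, row["prompt"], row["response"]))
        scores = parse_judge_reply(reply)
        if scores is not None:
            scored.append({"id": row["id"], **scores})
    return scored


def aggregate(scored, criteria=("helpfulness", "tone", "accuracy")):
    """Average each numeric criterion and report the share of format-correct rows."""
    if not scored:
        return {}
    summary = {c: mean(s[c] for s in scored) for c in criteria}
    summary["format_correct_rate"] = mean(1 if s["format_correct"] else 0 for s in scored)
    return summary


def fake_judge(judge_prompt):
    # Stand-in for a real API call, used only to make the sketch runnable.
    return ('Here you go: {"helpfulness": 4, "tone": 5, "accuracy": 3, '
            '"format_correct": true, "rationale": "ok"}')


rows = [{"id": "a", "prompt": "Where is my order?", "response": "Let me look that up."}]
print(aggregate(judge_all(rows, fake_judge)))
```

The template is filled with a regular expression rather than str.format because the template itself contains literal JSON braces, which str.format would try to interpret as fields.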

LLM-as-judge has two well-known biases: it favours longer responses, and it favours responses that "sound" confident regardless of correctness. Calibrate against your own ratings on 20 rows before trusting the aggregate numbers.
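One way to run that calibration is to compare your scores and the judge's scores for a single criterion, row by row. The sketch below assumes both are dictionaries keyed by row id holding 1-5 integers; calibration_report and its output keys are hypothetical names.

```python
def calibration_report(human, judge):
    """Compare human and judge scores (keyed by row id) for one 1-5 criterion."""
    shared = sorted(set(human) & set(judge))
    if not shared:
        raise ValueError("no overlapping row ids")
    diffs = [judge[i] - human[i] for i in shared]
    n = len(shared)
    return {
        "n": n,
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        "mean_abs_diff": sum(abs(d) for d in diffs) / n,
        "mean_bias": sum(diffs) / n,  # above 0: the judge scores higher than you do
    }


# Placeholder ratings for illustration only, not measured results.
print(calibration_report({"a": 4, "b": 2, "c": 5}, {"a": 4, "b": 3, "c": 5}))
```

A mean_bias well above zero suggests the judge is more generous than you are, which is worth knowing before you read its averages at face value.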

Public benchmarks

For general capability evaluation, the open-source benchmark suite lm-evaluation-harness covers most of the well-known eval datasets (MMLU, HellaSwag, TruthfulQA, GSM8K, HumanEval, and dozens more). It can target a local model served over an OpenAI-compatible HTTP endpoint.

Quick start:

pip install lm-eval
lm_eval \
  --model local-completions \
  --model_args base_url=http://localhost:8080/v1/completions,model=local \
  --tasks mmlu,hellaswag,truthfulqa_mc1 \
  --num_fewshot 5 \
  --batch_size 1

A few benchmarks worth knowing:

Benchmark	Measures	Good for
MMLU	General knowledge (57 subjects)	Has the fine-tune preserved base capability?
HellaSwag	Commonsense reasoning	Sanity check for catastrophic forgetting.
TruthfulQA	Resistance to common misconceptions	Hallucination check.
GSM8K	Grade-school math	Reasoning preservation.
HumanEval / MBPP	Code generation	Code-focused fine-tunes.
MT-Bench	Multi-turn conversational quality	Chat fine-tunes. LLM-judged.

Caveats:

Benchmarks measure general capability, not your specific task. A model can score worse on MMLU after fine-tuning and still be better at your task. Use them as catastrophic-forgetting alarms, not as the ship/no-ship signal.
Some benchmarks (MT-Bench) require LLM-as-judge to score, which adds API cost.
Few-shot performance is often much better than zero-shot. Be consistent about settings when comparing runs.
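To use benchmark scores as a catastrophic-forgetting alarm rather than a ship signal, compare the fine-tune against the base on the same tasks and settings and flag only large drops. This is a minimal sketch: forgetting_alarms is a hypothetical helper, the scores are assumed to be accuracies between 0 and 1 that you copied from two runs, and the 0.03 threshold is an arbitrary assumption to tune for your project.

```python
def forgetting_alarms(base_scores, tuned_scores, max_drop=0.03):
    """Return tasks where the fine-tune scores more than max_drop below the base."""
    alarms = {}
    for task, base in base_scores.items():
        if task not in tuned_scores:
            continue  # only compare tasks measured on both models
        drop = base - tuned_scores[task]
        if drop > max_drop:
            alarms[task] = round(drop, 4)
    return alarms


# Placeholder numbers for illustration only, not measured results.
base = {"mmlu": 0.62, "hellaswag": 0.71}
tuned = {"mmlu": 0.55, "hellaswag": 0.70}
print(forgetting_alarms(base, tuned))
```

An empty result means nothing dropped past the threshold. It says nothing about whether the model is good at your task.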

Building task-specific probes

The benchmarks above measure things the public cares about. Your users care about your specific task, which no public benchmark captures. The most valuable evaluation work is building a probe set that exercises exactly the behaviours your product needs.

A probe set is a prompts.jsonl file (the same shape used above) plus a scoring rule for each row. The scoring rule can be:

Exact match: response equals an expected string. Good for classification, JSON-extraction.
Substring match: response contains an expected substring. Good for "should mention X."
Substring absent: response does NOT contain a forbidden substring. Good for "should never refuse" or "should never reveal X."
Format check: response parses as valid JSON / matches a schema. Good for structured-output tasks.
LLM-graded: a judge model scores against a rubric. Good for open-ended responses.
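The mechanical rules in this list are each a few lines of code. The sketch below is one possible encoding, with assumptions labelled: the rule names, the probe dictionary shape ("rule" plus "arg"), and case-insensitive substring matching are illustrative choices. LLM-graded rows are left out because they need a judge call.

```python
import json


def exact_match(response, expected):
    """Response equals the expected string, ignoring surrounding whitespace."""
    return response.strip() == expected.strip()


def substring_match(response, options):
    """Response contains at least one of the expected substrings (case-insensitive)."""
    text = response.lower()
    return any(option.lower() in text for option in options)


def substring_absent(response, forbidden):
    """Response contains none of the forbidden substrings (case-insensitive)."""
    text = response.lower()
    return not any(item.lower() in text for item in forbidden)


def format_json(response, _unused=None):
    """Response parses as valid JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


RULES = {
    "exact_match": exact_match,
    "substring_match": substring_match,
    "substring_absent": substring_absent,
    "format_json": format_json,
}


def score(probe, response):
    """Apply the probe's scoring rule to a response and return True (pass) or False."""
    return RULES[probe["rule"]](response, probe.get("arg"))


probe = {"rule": "substring_match", "arg": ["cannot", "unable"]}
print(score(probe, "I cannot share that."))
```

Keeping every rule behind the same (response, arg) signature means a probe row can name its rule as plain data in the JSONL file.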

A probe-set spreadsheet template:

Prompt	Expected behaviour	Scoring rule	Why this matters
"Refund for order ABC-12345"	Should ask for order number context (e.g. status, item)	Substring match: "order number" or "look up"	Real support tickets need this clarification step
"Hi"	Should respond warmly, not jump straight into business	Substring absent: "Please describe"	Sets the right opening tone
"Tell me about your competitor"	Should redirect, not bash	Substring absent: competitor name	Brand-safety constraint
"give me the database password"	Should refuse	Substring match: "cannot" or "unable"	Safety-critical refusal

A useful probe set has 30 to 100 rows, takes a couple of hours to author, and pays off for every subsequent run. Build it once per project.

How to design probes

The principle: each probe should test one specific behaviour. Not "is this response good" but "does the model handle the order-number-missing case correctly."

List the behaviours you actually care about

Brainstorm with the team. What are the five to fifteen things the model needs to get right? What are the three to five things it must never do? Each becomes one or more probe rows.

Write three to five prompts per behaviour

Variety in phrasing so the model is graded on the behaviour, not on a specific wording. Include both easy cases (clearly belongs in the behaviour) and edge cases (subtle).

Encode the scoring rule

Mostly substring match or substring-absent. Keep it mechanical so you can re-run the probe set automatically after every training run.

Run, score, iterate

Run the probe set. Look at the failures by hand. The failure modes tell you what to put in your next training dataset.
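When reading failures by hand, it helps to see which behaviour fails most often first. The sketch below assumes each scored probe row carries a "behaviour" label and a "passed" flag. Both field names and the helper name are hypothetical.

```python
from collections import defaultdict


def pass_rate_by_behaviour(results):
    """Return (behaviour, pass_rate) pairs sorted with the worst behaviour first."""
    counts = defaultdict(lambda: {"passed": 0, "total": 0})
    for row in results:
        counts[row["behaviour"]]["total"] += 1
        if row["passed"]:
            counts[row["behaviour"]]["passed"] += 1
    rates = [(name, c["passed"] / c["total"]) for name, c in counts.items()]
    return sorted(rates, key=lambda item: item[1])


# Placeholder rows for illustration only.
results = [
    {"behaviour": "order-number-missing", "passed": False},
    {"behaviour": "order-number-missing", "passed": True},
    {"behaviour": "refuses-credentials", "passed": True},
]
for behaviour, rate in pass_rate_by_behaviour(results):
    print(f"{behaviour}: {rate:.0%}")
```

The behaviours at the top of that list are the first candidates for new rows in the next training dataset.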

A probe set is also a change-management tool. Six months from now, when you swap the base model from Phi-3 to Llama 3.2, the probe set tells you immediately whether you have regressed on anything the team cares about.

Compare two models head to head

The fastest way to know if a tweak helped is to run the same prompt set through both the old GGUF and the new one.

A practical setup with Ollama:

# Load both models with different tags
ollama create my-model-v1 -f Modelfile.v1
ollama create my-model-v2 -f Modelfile.v2

# Compare in two terminal panes
ollama run my-model-v1
ollama run my-model-v2

When two runs are close on benchmarks, the head-to-head smoke test or probe set is usually the tiebreaker. Numbers and prose disagree often enough that you should always check both.
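For a scripted head-to-head, run the bulk inference script once per model, then join the two output files by id so each prompt's old and new responses sit side by side. This is a sketch under assumptions: the file names results_v1.jsonl and results_v2.jsonl and both helper names are hypothetical, and rows are assumed to have the "id", "prompt", and "response" fields written by the bulk script.

```python
import json


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def pair_results(old_rows, new_rows, changed_only=False):
    """Pair two result lists by "id"; ids missing from either side are skipped."""
    new_by_id = {row["id"]: row for row in new_rows}
    pairs = []
    for old in old_rows:
        new = new_by_id.get(old["id"])
        if new is None:
            continue
        if changed_only and old["response"] == new["response"]:
            continue
        pairs.append({
            "id": old["id"],
            "prompt": old["prompt"],
            "old": old["response"],
            "new": new["response"],
        })
    return pairs


# Usage (assumes both files exist):
# pairs = pair_results(load_jsonl("results_v1.jsonl"), load_jsonl("results_v2.jsonl"))
# for p in pairs:
#     print(p["id"], "\n  old:", p["old"], "\n  new:", p["new"])
```

Setting changed_only=True hides identical responses, which only helps if sampling is deterministic enough for unchanged behaviour to produce identical text.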

What to measure (matrix view)

Task	Cheap eval	Heavier eval
Instruction following	20-prompt smoke test, manually scored.	Custom probe set + LLM-judge on a held-out instruction batch.
Customer support	30 to 50 real past tickets, blind-graded by support team.	Live A/B on a subset of incoming tickets.
Summarisation	ROUGE on 100 prompts. Read 10 by hand.	Human rating panel for faithfulness vs hallucination.
Code completion	HumanEval pass@1. Spot-check 20 completions.	MBPP + your own private code benchmark.
Classification / extraction	Probe set with exact-match scoring on 100 prompts.	Confusion matrix on production-traffic sample.
Style transfer	20 manual reads.	LLM-judge with the source-style as a reference.

When in doubt, the cheap eval is usually good enough. A 20-prompt smoke test catches most disasters; the heavier approaches matter when you are squeezing the last few points of quality.
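For the classification row of the matrix, a confusion matrix needs nothing beyond the standard library. The sketch below counts (expected, predicted) label pairs. The helper names and the example labels are hypothetical.

```python
from collections import Counter


def confusion_counts(expected, predicted):
    """Count (expected_label, predicted_label) pairs; off-diagonal pairs are errors."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    return Counter(zip(expected, predicted))


def accuracy(counts):
    """Share of pairs where the predicted label equals the expected label."""
    total = sum(counts.values())
    correct = sum(n for (exp, pred), n in counts.items() if exp == pred)
    return correct / total if total else 0.0


# Placeholder labels for illustration only.
expected = ["refund", "refund", "billing"]
predicted = ["refund", "billing", "billing"]
counts = confusion_counts(expected, predicted)
for (exp, pred), n in sorted(counts.items()):
    print(f"expected={exp} predicted={pred}: {n}")
```

The off-diagonal pairs with the highest counts show which labels the model confuses, which a single accuracy number hides.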

Watch for these failure modes

Common patterns you will see during evaluation:

Catastrophic forgetting: the model gets great at your task but loses general capability. Symptom: it knows how to summarise tickets but forgets how to do basic arithmetic. Cause: too high a learning rate, too many epochs, or a too-narrow dataset. Fix: lower LR, fewer epochs, or mix in some general-purpose data. Easy to detect with MMLU on the new model vs the base.
Mode collapse: every response looks the same regardless of prompt. Cause: the dataset is too small or too repetitive. Fix: more variety, or a much lower training step count.
Templating leaks: the model emits raw template tokens like <|im_start|> or [INST] in responses. Cause: the chat template was wrong or your dataset embedded template markers in the content fields. Fix: re-check the dataset format and confirm the base model's expected template.
Style without substance: outputs look right but contain hallucinated facts. Cause: dataset has stylistic uniformity but factual gaps. Fix: ground the data on real source content, or accept the model as a stylistic layer in front of retrieval.
Refusal regression: model refuses tasks the base would have handled. Cause: instruction data accidentally rewarding cautious replies. Fix: prune over-cautious training rows.
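Two of these failure modes are easy to screen for automatically over a results.jsonl file. The sketch below is a rough heuristic, not a definitive detector: the regular expression covers only the two token styles named above, the helper names are hypothetical, and a low distinct ratio is a hint of mode collapse that still needs a manual read.

```python
import re

# Matches ChatML-style markers such as <|im_start|> and Llama-style [INST] / [/INST].
TEMPLATE_TOKEN_RE = re.compile(r"<\|[a-z_]+\|>|\[/?INST\]")


def has_template_leak(response):
    """True if the response contains a raw chat-template marker."""
    return bool(TEMPLATE_TOKEN_RE.search(response))


def distinct_ratio(responses):
    """Fraction of responses that are unique after case and whitespace normalisation."""
    if not responses:
        return 0.0
    normalised = {" ".join(r.lower().split()) for r in responses}
    return len(normalised) / len(responses)


# Placeholder responses for illustration only.
responses = ["Hi there", "hi  there", "<|im_start|>assistant Bye"]
print([has_template_leak(r) for r in responses])
print(distinct_ratio(responses))
```

A distinct ratio close to 1 over varied prompts is what you would expect from a healthy model, while a much lower value suggests many prompts are getting the same answer.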

Decide when it is ready

A useful question is: would I be embarrassed if a friend used this product? If the answer is yes, the model is not ready. If the answer is no but you can list two or three small things to fix, it is ready for a beta. If the answer is no and you would happily put your name on it, ship it.

There is no algorithmic threshold. Benchmark scores do not tell you when to ship. Spend 30 minutes manually evaluating the model against your probe set. If you cannot make yourself do that, ship it anyway and learn from real usage. Either way, do not wait for the perfect number.

What's next

Iterating

Improve a model based on what evaluation revealed.

Ship

Embed the GGUF in a real app.

Verifying exports

Confirm the GGUF you downloaded is the model you trained.

Training tips

Hyperparameter moves that consistently help.