How to QA a Fine-Tuned Model Before Client Delivery

Traditional software is deterministic. You write a test, it passes or fails, and you ship with confidence. AI models don't work that way. The same input can produce different outputs on different runs. "Correct" is a spectrum, not a binary. And the failure mode isn't a crash — it's a plausible-sounding answer that's subtly, dangerously wrong.

This is why QA for fine-tuned models is both more important and more difficult than QA for traditional software. Skip it and you'll deliver models that embarrass you in front of clients. Over-engineer it and you'll spend more time testing than building.

This guide is the practical middle ground: a 4-phase QA process that takes 4-8 hours per model and catches the problems that matter.

Why "Just Run the Tests" Doesn't Work

Before we get to the process, let's be clear about what makes AI QA different:

Non-determinism. Run the same prompt through the same model 10 times and you might get 10 different outputs. All of them might be acceptable. Or 9 might be great and 1 might be terrible. Your QA process needs to account for this variance.

Subjective quality. Is "The refund will be processed in 3-5 business days" better or worse than "Your refund should arrive within a week"? Both are factually correct. The answer depends on the client's brand voice, their customers' expectations, and their internal policies.

Distribution shift. Your model might score 96% on your test set but fail on inputs that look nothing like your test data. Real-world inputs are messier, more ambiguous, and more adversarial than any test set you'll create.

Cascading effects. A model that generates slightly wrong metadata in a pipeline can cause downstream systems to break in ways that look completely unrelated to the model.

You need a QA process that handles all of these. Here's ours.

Phase 1: Automated Evaluation (1-2 Hours)

This is the baseline. If your model fails automated eval, nothing else matters — go back and retrain.

The Golden Test Set

Every client model needs a golden test set: 100-500 examples of inputs and expected outputs that represent the core use cases. These should be:

Representative. Cover all major input categories the model will see in production.
Labeled by experts. Not generated by another AI model. Human-labeled ground truth.
Versioned. The test set evolves as the client's needs change. Track which version you evaluated against.
Stratified. Include proportional examples of easy, medium, and hard cases.

Metrics to Compute

Run the golden test set through the new model and compute:

Accuracy or task-specific correctness. For classification tasks, this is straightforward accuracy or F1. For generation tasks, use a combination of ROUGE scores and semantic similarity to reference outputs. Target: >90% for most use cases.

Hallucination rate. Count outputs that contain factual claims not supported by the input or the model's knowledge base. For grounded tasks (RAG, extraction), this should be below 5%. For open-ended generation, below 10%.

Format compliance. If the model should output JSON, how often does it produce valid JSON? If it should follow a template, how often does it match? Target: >99%.

Latency. Measure P50 and P95 latency on the target hardware. If P95 exceeds your SLA (typically 2 seconds for real-time use cases), you have a problem.

Regression Check

Compare every metric against the previous version. If any metric dropped by more than 2 percentage points, flag it. Regressions are common — a model that gets better at handling one category often gets slightly worse at another.

Create a comparison table:

Metric              | Previous (v1.2) | New (v1.3) | Delta
--------------------|-----------------|------------|------
Accuracy            | 93.4%           | 95.1%      | +1.7%
Hallucination rate  | 3.2%            | 2.8%       | -0.4%
Format compliance   | 99.1%           | 99.4%      | +0.3%
P95 latency         | 1.4s            | 1.3s       | -0.1s
Refund category acc | 91.2%           | 88.7%      | -2.5% ⚠️

That refund category regression? It might be acceptable in context (overall accuracy improved), or it might be a dealbreaker if the client specifically asked you to improve refund handling. This is why automated eval alone isn't enough.

Phase 2: Edge Case Battery (1-2 Hours)

Automated eval tells you how the model performs on typical inputs. Edge case testing tells you how it fails.

Building the Edge Case Set

Create 50-100 adversarial and unusual inputs specific to the client's domain. Categories to cover:

Ambiguous inputs. Queries that could be interpreted multiple ways. Does the model ask for clarification or guess wrong?

Boundary inputs. Requests that are technically in-scope but barely. A legal review model asked about tax law when it was trained on contract law.

Adversarial inputs. Prompt injection attempts, requests to ignore instructions, attempts to extract system prompts or training data.

Empty or minimal inputs. What happens with a blank input? A single word? A question mark?

Extremely long inputs. Inputs that approach or exceed the context window. Does the model gracefully truncate, or does it hallucinate?

Out-of-scope inputs. Requests clearly outside the model's domain. Does it politely decline or confidently generate garbage?

Multi-language inputs. If the client operates in English but has customers who write in Spanish, what happens?

Formatting edge cases. Inputs with unusual formatting — all caps, no punctuation, heavy use of emojis, markdown, code blocks.

Grading Edge Cases

For each edge case, grade the output on a 3-point scale:

Pass: The model handled it appropriately (correct answer, graceful refusal, reasonable clarification request)
Soft fail: The model's output was suboptimal but not harmful (verbose refusal, slightly off-topic but not wrong)
Hard fail: The model produced harmful, dangerous, or completely wrong output

Target: zero hard fails, fewer than 10% soft fails. Any hard fail needs to be addressed before delivery — either through additional training data, prompt engineering, or guardrails.

Phase 3: Human Expert Review (1-2 Hours)

Automated metrics and edge case batteries still miss things that a domain expert catches in seconds. This phase is about qualitative assessment.

The Review Protocol

Select 20-30 inputs that represent realistic production usage. Not your test set — fresh inputs that simulate what the client's users will actually send. Include a mix of:

10 typical, bread-and-butter requests
10 moderately complex requests
5-10 requests that require nuance or judgment

Have a domain expert (ideally someone familiar with the client's business) review each output for:

Factual correctness. Is the information accurate? Are there subtle errors that automated metrics might miss?

Tone and voice. Does the output sound like the client's brand? If the client is a law firm, the model shouldn't sound casual. If it's a consumer app, it shouldn't sound like a legal brief.

Completeness. Did the model address all parts of the question? Or did it answer the easy part and ignore the hard part?

Safety. Are there any outputs that could create liability for the client? Incorrect medical advice, wrong legal interpretations, discriminatory language?

Documenting Findings

For each issue found, document:

The input that triggered it
What the model said
What it should have said
Severity (critical, major, minor)
Recommended fix (more training data, prompt adjustment, guardrail)

Critical and major issues must be resolved before delivery. Minor issues get documented as known limitations.

Phase 4: Client Acceptance Testing (1-2 Hours)

This is the handoff. The client sees their model for the first time (or sees the updated version), and you walk them through it together.

The Structured Demo

Don't just hand the client a link and say "try it out." Structure the demo:

1. Prepared examples (15 minutes). Walk through 5-7 examples you've pre-selected to showcase the model's capabilities. Include at least one example the client specifically asked about during scoping.

2. Live testing (20 minutes). Let the client type in their own inputs. This is where they'll test the things that matter to them — things you might not have thought to test.

3. Edge case discussion (10 minutes). Show 2-3 edge cases and explain how the model handles them. Be upfront about limitations: "If someone asks about X, the model will politely decline because that's outside its training scope."

4. Metrics review (10 minutes). Walk through the eval results. Show the comparison against baseline. Explain what the numbers mean in practical terms.

5. Q&A and feedback (15 minutes). The client will have questions. Some will be technical ("Can we increase the context window?") and some will be business ("What happens when our product catalog changes?"). Answer honestly.

Setting Pass/Fail Criteria

Before the demo, agree on explicit acceptance criteria:

Criterion	Threshold
Task accuracy	>90% on golden test set
Hallucination rate	below 5%
Format compliance	above 99%
Latency (P95)	below 2 seconds
Client satisfaction	Approval from designated stakeholder

If the model meets all automated thresholds but the client's stakeholder isn't satisfied, it doesn't pass. The client's subjective assessment matters because they understand their domain better than your metrics do.

The QA Report

After all four phases, compile the results into a QA report for the client:

# QA Report: [Client] [Model] v[X.Y.Z]
Date: [date]
QA Lead: [name]

## Methodology
- Automated eval: [N] test cases, [metrics computed]
- Edge case battery: [N] adversarial inputs
- Human review: [N] production-like inputs by [expert name]
- Client acceptance: [date] with [stakeholder name]

## Results Summary
| Metric              | Score  | Threshold | Status |
|---------------------|--------|-----------|--------|
| Accuracy            | 95.1%  | >90%      | PASS   |
| Hallucination rate  | 2.8%   | below 5%  | PASS   |
| Format compliance   | 99.4%  | above 99% | PASS   |
| P95 latency         | 1.3s   | below 2s  | PASS   |

## Known Limitations
1. [Specific limitation with context and workaround]
2. [Specific limitation with mitigation plan]

## Recommended Monitoring
- Track [specific metric] weekly for drift
- Re-evaluate if [specific condition] changes
- Planned retrain: [date] or after [trigger condition]

This report becomes part of the model's version history. It's also a powerful sales tool — clients who see this level of rigor don't question your rates.

Time Budget

The full QA process takes 4-8 hours per model:

Phase	Time	Who
Automated eval	1-2 hours	Engineer (mostly automated)
Edge case battery	1-2 hours	Engineer
Human review	1-2 hours	Domain expert
Client acceptance	1-2 hours	Engineer + client stakeholder

For monthly retraining cycles where the changes are incremental, you can often compress this to 2-4 hours by reusing the existing edge case battery and doing a shorter human review focused on the changes.

Is 4-8 hours a lot? Compare it to the cost of delivering a bad model: client escalation, emergency retraining, potential contract loss, reputation damage. QA is the cheapest insurance you'll ever buy.

Ship AI that runs on your users' devices.

Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

View early bird pricing or join the waitlist →

Integrating QA Into Your Workflow

Don't treat QA as a one-time gate. Build it into your standard operating procedure:

After every retrain: Full 4-phase QA
After config changes: Phase 1 (automated eval) + Phase 2 (edge cases)
Weekly in production: Automated eval against a small rotating test set to catch drift
Monthly: Review QA process itself — update edge cases, refresh test sets, calibrate thresholds

The agencies that build rigorous QA processes are the ones that keep clients for years. The ones that skip it keep clients for months.

For more on model quality, check out our fine-tuning quality checklist and the guide to evaluating fine-tuned models.