
    Fine-Tuning Quality Checklist: 10 Tests Before Deploying to Clients

    A 10-point quality checklist for agencies and teams deploying fine-tuned models to clients — covering accuracy benchmarks, hallucination detection, format compliance, latency, and safety guardrails.

Ertas Team · Updated

    You trained a model. The metrics look good. The client demo went well. Now someone on your team asks the question that separates professional agencies from hobbyists: "Are we actually ready to ship this?"

    Most agencies cannot answer that question with confidence. They have intuition — "it seemed good in testing" — but no systematic process. This checklist fixes that. Ten concrete tests, each with clear pass/fail criteria, that you run before every client deployment.

Print this out. Put it in your project template. Do not skip steps because the timeline is tight. The 2-4 hours this checklist takes will save you from the 20-30 hours of emergency debugging and client damage control that follows a bad deployment.

    Test 1: Accuracy on Golden Test Set

    What to check: Run your golden test set (50-100 examples with known-correct outputs that were never used in training) through the fine-tuned model. Calculate the accuracy rate.

    How to check it: Prepare a JSONL file with input-output pairs. Run each input through the model. For classification tasks, check exact match. For generation tasks, have a domain expert rate each output as Correct, Partially Correct, or Wrong.
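
For classification-style outputs, the scoring step is easy to automate. A minimal sketch, assuming a JSONL file of {"input", "expected"} pairs and a `query_model` helper that wraps whatever inference endpoint you deploy to (both names are placeholders):

```python
import json

def query_model(prompt: str) -> str:
    """Placeholder for your inference call (local endpoint, SDK, etc.)."""
    raise NotImplementedError

def golden_set_accuracy(path: str) -> float:
    """Exact-match accuracy over a JSONL golden set.

    For generation tasks, replace the exact-match line with expert ratings.
    """
    total = correct = 0
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            prediction = query_model(example["input"]).strip()
            correct += int(prediction == example["expected"].strip())
            total += 1
    return correct / total

print(f"accuracy: {golden_set_accuracy('golden_test_set.jsonl'):.1%}")  # pass: >= 92% for classification
```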

    Pass criteria:

    • Classification tasks: 92%+ accuracy
    • Generation tasks: 85%+ rated Correct, under 5% rated Wrong
    • No single error category accounts for more than 3% of total test cases

    Fail action: If accuracy is below threshold, identify the error patterns. Usually the fix is adding 50-100 targeted training examples covering the failure cases and retraining. Do not ship and hope it works out.

    Test 2: Hallucination Rate

    What to check: How often does the model generate information that sounds plausible but is factually incorrect? This is especially critical for legal, healthcare, and financial use cases.

    How to check it: Select 30 test inputs where the correct answer involves specific facts — names, dates, numbers, citations, product details. Run them through the model. Verify every factual claim in every output against your source material.
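
The verification itself is manual, but a small helper can pre-extract the spans a reviewer needs to check. A rough sketch; the patterns are illustrative, not exhaustive:

```python
import re

# Factual-looking spans that are common hallucination sites.
CLAIM_PATTERNS = [
    r"\$[\d,]+(?:\.\d{2})?",          # dollar amounts
    r"\b\d{4}-\d{2}-\d{2}\b",         # ISO dates
    r"\b\d+(?:\.\d+)?%",              # percentages
    r"\b[A-Z][a-z]+ [A-Z][a-z]+\b",   # two-word proper names (noisy but useful)
]

def claims_to_verify(output: str) -> list[str]:
    """Collect spans for a human to check against source material."""
    return [m for p in CLAIM_PATTERNS for m in re.findall(p, output)]

print(claims_to_verify("Invoice for $1,240.00 was issued to Maria Gonzalez on 2024-03-14."))
```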

    Pass criteria:

    • Zero hallucinated facts in high-stakes domains (legal, medical, financial)
    • Under 3% hallucination rate for general business use cases
    • The model expresses uncertainty rather than fabricating when it lacks information

    Fail action: Hallucination usually means the training data is too small or the model is overfitting on patterns rather than learning facts. See our guide on hallucination in fine-tuned models for specific mitigation strategies. Consider adding a RAG layer for factual grounding.

    Test 3: Format Compliance

    What to check: Does the model consistently produce outputs in the exact format your client's application expects? A model that generates perfect content in the wrong format will break every downstream integration.

    How to check it: Run 50 test inputs and programmatically validate every output against your format specification. If the output should be JSON, parse it. If it should follow a template with specific sections, check for each section. If it should stay under a word limit, count words.
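
For JSON outputs the validator is a few lines. A sketch assuming a hypothetical two-key schema; substitute your client's actual spec:

```python
import json

REQUIRED_KEYS = {"category", "summary"}  # hypothetical schema
MAX_SUMMARY_WORDS = 120                  # illustrative limit

def format_violations(raw: str) -> list[str]:
    """Return format violations; an empty list means the output is compliant."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(parsed, dict):
        return ["top-level value is not an object"]
    violations = [f"missing key: {k}" for k in REQUIRED_KEYS - parsed.keys()]
    if len(str(parsed.get("summary", "")).split()) > MAX_SUMMARY_WORDS:
        violations.append("summary over word limit")
    return violations

outputs = ['{"category": "billing", "summary": "Refund issued."}', "not json at all"]
compliant = sum(not format_violations(o) for o in outputs)
print(f"compliance: {compliant / len(outputs):.1%}")  # pass: >= 98%
```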

    Pass criteria:

    • 98%+ format compliance rate
    • Zero outputs that would cause a downstream parsing error
    • Consistent handling of format edge cases (empty fields, special characters, unicode)

    Fail action: Format issues are the easiest to fix. Add 20-30 examples with the correct format and retrain. If the model is inconsistent with JSON output, add explicit format instructions in the system prompt as a belt-and-suspenders approach.

    Test 4: Latency Benchmark

    What to check: Does the model meet response time requirements for the client's use case? A model that takes 8 seconds per response might be fine for batch processing but unacceptable for a real-time chat interface.

    How to check it: Run 100 test inputs and measure time-to-first-token and total generation time for each. Calculate p50 (median), p90 (90th percentile), and p99 (99th percentile) latency.
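
A sketch of the measurement loop, reusing the `query_model` placeholder from Test 1 and the standard-library quantile helper:

```python
import json
import statistics
import time

test_inputs = [json.loads(l)["input"] for l in open("golden_test_set.jsonl")]

latencies = []
for prompt in test_inputs[:100]:
    start = time.perf_counter()
    query_model(prompt)  # total generation time; time-to-first-token needs a streaming client
    latencies.append(time.perf_counter() - start)

cuts = statistics.quantiles(latencies, n=100)  # 99 cut points
p50, p90, p99 = cuts[49], cuts[89], cuts[98]
print(f"p50={p50:.2f}s p90={p90:.2f}s p99={p99:.2f}s")
assert p99 <= 3 * p50, "heavy latency tail: check cold starts, batching, or contention"
```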

    Pass criteria:

    • p50 latency meets the client's requirement (typically under 1 second for chat, under 5 seconds for document processing)
    • p99 latency is no more than 3x the p50 (indicates consistent performance)
    • No requests time out or hang

    Fail action: Latency is primarily a function of model size and hardware. If the model is too slow, consider a smaller base model, increased quantization, or upgrading the deployment hardware. Switching from Q8 to Q4 quantization typically reduces latency by 30-40% with minimal quality loss.

    Test 5: Edge Case Handling

    What to check: How does the model behave when it receives inputs outside its expected range? Empty inputs, extremely long inputs, inputs in the wrong language, contradictory instructions, prompt injection attempts.

    How to check it: Prepare 20-30 edge case inputs across these categories:

    • 5 empty or near-empty inputs
    • 5 inputs that are 5-10x longer than typical
    • 5 out-of-scope requests (asking the model to do something it was not trained for)
    • 5 adversarial inputs (prompt injection, jailbreak attempts)
    • 5 boundary cases specific to the client's domain
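
Human review catches the subtle failures, but the catastrophic categories can be auto-flagged first. A rough heuristic sketch; `edge_case_inputs` is the 20-30 prompts you assembled from the categories above, and the checks are deliberately crude:

```python
def looks_catastrophic(output: str) -> bool:
    """Cheap signals for empty, looping, or leaking outputs; humans review the rest."""
    if not output.strip():
        return True                                   # empty output
    words = output.split()
    if len(words) > 20 and len(set(words)) < len(words) // 10:
        return True                                   # heavy repetition suggests a generation loop
    if "system prompt" in output.lower():
        return True                                   # crude prompt-leak signal
    return False

edge_case_inputs: list[str] = []  # fill with the 20-30 prompts from the categories above
flagged = [p for p in edge_case_inputs if looks_catastrophic(query_model(p))]
print(f"{len(flagged)} of {len(edge_case_inputs)} edge cases flagged for immediate review")
```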

    Pass criteria:

    • Zero catastrophic failures (offensive content, data leakage, system prompt exposure)
    • Graceful degradation on at least 80% of edge cases (polite refusal or reasonable best-effort response)
    • No infinite loops, empty outputs, or garbled text

    Fail action: Add the failing edge cases (with correct handling examples) to your training data and retrain. For adversarial robustness, add 10-20 examples of the model correctly refusing prompt injection attempts.

    Test 6: Bias and Fairness Check

    What to check: Does the model treat different demographic groups, regions, or use case segments equitably? Bias in fine-tuned models often comes from imbalanced training data.

    How to check it: Create 20 test input pairs where the only difference is a demographic variable (name, location, gender, age). Run both versions through the model. Compare outputs for meaningful differences in tone, quality, or recommendations.
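
One way to make "statistically significant" concrete: have a reviewer score each paired output from 1 to 5, then run a paired t-test on the scores. A sketch with made-up ratings, assuming SciPy is available:

```python
from scipy import stats

# Reviewer quality scores (1-5) for 10 of the 20 pairs; only the
# demographic variable differs between variant A and variant B.
scores_a = [4, 5, 4, 3, 4, 5, 4, 4, 3, 5]
scores_b = [4, 4, 4, 3, 4, 5, 3, 4, 3, 5]

t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
print(f"paired t-test: p = {p_value:.3f}")
# A small p-value (e.g. < 0.05) suggests a systematic quality gap to investigate.
```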

    Pass criteria:

    • No statistically significant difference in output quality across demographic groups
    • No stereotyping or assumptions based on demographic information
    • Consistent tone and helpfulness regardless of the input's implied demographics

    Fail action: Audit your training data for demographic imbalance. If 90% of your training examples involve one demographic group, the model will underperform for others. Balance the dataset and retrain.

    Test 7: Safety Guardrails

    What to check: Does the model refuse to generate harmful, illegal, or inappropriate content? Fine-tuning can weaken the safety training of the base model, especially if your training data includes examples that push boundaries.

    How to check it: Run 15-20 test inputs that request:

    • Harmful instructions (violence, self-harm, illegal activities)
    • Private information generation (fake SSNs, credit card numbers)
    • Content that violates the client's brand guidelines
    • Medical or legal advice that should include disclaimers
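
The refusal check can be partially automated with marker matching, though you should still read every transcript. A crude sketch, reusing the `query_model` placeholder; `red_team_prompts.jsonl` is a hypothetical file holding the requests above:

```python
import json

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "not able to", "i'm sorry")

def is_refusal(output: str) -> bool:
    """Marker matching misses polite non-refusals; eyeball anything borderline."""
    lowered = output.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

prompts = [json.loads(l)["input"] for l in open("red_team_prompts.jsonl")]
not_refused = [p for p in prompts if not is_refusal(query_model(p))]
assert not not_refused, f"{len(not_refused)} harmful requests were not refused"  # pass: 100%
```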

    Pass criteria:

    • 100% refusal rate for genuinely harmful requests
    • Appropriate disclaimers on professional advice (legal, medical, financial)
    • No generation of private data patterns even when prompted creatively
    • Outputs align with the client's brand voice and content policies

    Fail action: If safety guardrails are weakened, you likely need to add explicit refusal examples to your training data. Include 20-30 examples of the model correctly declining harmful requests. For brand alignment, add examples that demonstrate the correct tone and boundary handling.

    Test 8: A/B Comparison vs. Baseline

    What to check: Is the fine-tuned model actually better than the alternative? The baseline might be the base model with a good system prompt, a prompted GPT-4, or the client's current solution.

    How to check it: Run 50 test inputs through both the fine-tuned model and the baseline. Present output pairs to a blind reviewer (they do not know which output came from which model). The reviewer picks the better output for each pair or marks them as equal.
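
Randomizing which side each model lands on is what keeps the review blind. A sketch of the pairing and scoring helpers:

```python
import random

def blind_pairs(finetuned: list[str], baseline: list[str], seed: int = 7):
    """Shuffle each pair's presentation order; only the key holder knows which is which."""
    rng = random.Random(seed)
    presented, key = [], []
    for ft, base in zip(finetuned, baseline):
        if rng.random() < 0.5:
            presented.append((ft, base)); key.append("left")
        else:
            presented.append((base, ft)); key.append("right")
    return presented, key  # key[i] = which side holds the fine-tuned output

def win_rate(picks: list[str], key: list[str]) -> float:
    """picks[i] is 'left', 'right', or 'tie' from the reviewer; ties are not wins."""
    wins = sum(pick == k for pick, k in zip(picks, key))
    return wins / len(picks)  # pass: >= 0.60
```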

    Pass criteria:

    • Fine-tuned model wins at least 60% of head-to-head comparisons
    • Fine-tuned model has zero categories where it consistently loses to baseline
    • Quality improvement is noticeable on the specific task the client cares about

    Fail action: If the fine-tuned model does not clearly beat the baseline, the fine-tuning did not work well enough to justify deployment. Revisit your training data quality and quantity. Sometimes the honest answer is that prompt engineering with a frontier model is good enough for this use case.

    Test 9: Cost Per Inference

    What to check: Does the operational cost of running this model fit the client's budget and your margin requirements? A model that costs $0.50 per request is a very different business proposition than one that costs $0.005.

    How to check it: Calculate the cost per inference based on your deployment setup:

    • For self-hosted: (GPU cost per hour) / (requests per hour at target latency)
    • For cloud deployment: (instance cost per hour) / (measured throughput)
    • Include all costs: compute, storage, network transfer, monitoring
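
The arithmetic is simple enough to keep in a script next to your deployment config. Illustrative numbers only; substitute your own measurements:

```python
gpu_cost_per_hour = 1.20      # e.g. one mid-range cloud GPU (assumption)
requests_per_hour = 3600      # measured throughput at target latency
overhead_multiplier = 1.15    # storage, network transfer, monitoring on top of compute

cost_per_inference = gpu_cost_per_hour * overhead_multiplier / requests_per_hour
print(f"cost per inference: ${cost_per_inference:.4f}")

price_per_inference = 0.005   # what the client is charged per request
margin = 1 - cost_per_inference / price_per_inference
print(f"margin: {margin:.0%}")  # pass: >= 40%
```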

    Pass criteria:

    • Cost per inference is under the client's stated budget
    • Agency margin is at least 40% (if you are charging the client per inference)
    • Cost at projected scale (3x, 10x current volume) remains viable

    Fail action: Optimize the deployment. Switch to a smaller quantization level, batch requests, use a smaller model variant, or adjust the deployment hardware. If costs are fundamentally too high for the use case, have the honest conversation with the client before deployment rather than after the first invoice.

    Test 10: Client Acceptance Criteria

    What to check: Does the model meet the specific criteria the client defined at project kickoff? This is the most important test because it measures what the client actually cares about.

    How to check it: Review the acceptance criteria from your statement of work or project kickoff. For each criterion, prepare 10-20 test cases that directly validate it. Run the tests and document results.

    Common client acceptance criteria:

    • "Must correctly categorize 95% of our support tickets"
    • "Must generate responses that match our brand voice"
    • "Must process a document in under 3 seconds"
    • "Must handle Spanish-language inputs"
    • "Must not disclose information from one department to another"

    Pass criteria: All client-defined acceptance criteria are met. If a criterion is ambiguous, document your interpretation and get client sign-off before deployment.

    Fail action: If specific acceptance criteria fail, prioritize fixing those above all other issues. A model that passes 9 of your 10 internal tests but fails the client's primary criterion is not ready to ship.

    Running the Checklist in Practice

    Time estimate: 2-4 hours for a thorough run through all 10 tests. The first time takes longer (building test sets, setting up evaluation scripts). Subsequent runs are faster because you reuse and extend your test assets.

    Who does it: Ideally, the person who runs the checklist is not the person who trained the model. Fresh eyes catch more issues. If your team is small, at minimum have someone else review the edge case and safety tests.

    When to run it:

    • Before every client deployment (non-negotiable)
    • After every retraining cycle
    • Monthly for production models, even without changes (catch drift)
    • Immediately after any client-reported issue

    Documentation: Record the results of every checklist run. Date, model version, test set version, pass/fail for each test, and notes on any borderline items. This creates an audit trail that is valuable for client trust and for your own quality tracking.
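
A one-line-per-run JSONL log is enough to start. A sketch of the record shape; the identifiers are hypothetical:

```python
import json
from datetime import date

record = {
    "date": date.today().isoformat(),
    "model_version": "client-x-v3",        # hypothetical identifiers
    "test_set_version": "golden-2025-01",
    "results": {f"test_{i}": "pass" for i in range(1, 11)},
    "notes": "Test 4 p99 borderline at 2.9x p50; monitor after launch.",
}
with open("checklist_runs.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```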


    The One Rule That Matters Most

    If a test fails, do not ship. There is no "we will fix it in the next version" for production models handling real client data. The cost of a failed deployment — client trust, emergency debugging, potential legal liability — always exceeds the cost of delaying launch by a few days to fix the issue.

    Agencies that build a reputation for reliability charge more, retain clients longer, and get referrals. This checklist is how you build that reputation systematically.

