    Data Quality Metrics That Actually Predict Fine-Tuning Outcomes

    Not all data quality metrics matter equally for fine-tuning. Here are the 7 metrics that actually correlate with model performance — and the ones that are noise.

    Ertas Team

    Enterprise data teams track dozens of metrics about their training data. Dataset size. Completeness percentage. Label counts per category. Average document length. Total annotation hours. Coverage reports spanning 15 pages.

    And yet, when the fine-tuned model underperforms, these metrics provide almost no diagnostic value. The dataset looked great on paper. The model is mediocre in practice.

    The problem is not a lack of measurement — it's measuring the wrong things. Most data quality metrics are descriptive (they tell you what the data looks like) rather than predictive (they tell you how the model will perform). After analyzing training outcomes across hundreds of fine-tuning runs, we've identified seven metrics that actually correlate with model performance — and several popular metrics that don't.

    The 7 Metrics That Predict Fine-Tuning Success

    1. Label Consistency (Inter-Annotator Agreement)

    What it measures: When two qualified annotators label the same example independently, how often do they agree?

    Why it predicts outcomes: Inconsistent labels teach the model contradictory patterns. If the same input maps to different outputs depending on which annotator labeled it, the model learns an averaged, uncertain representation that performs poorly on all variants.

    Target: Cohen's kappa > 0.85 for classification tasks. For generation tasks, measure output similarity using ROUGE-L > 0.80 between annotators' reference outputs.

    How to measure: Have 10-15% of your dataset labeled by two independent annotators. Calculate agreement on the overlapping set. This is not optional — it is the single most predictive metric for fine-tuning outcomes.
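
    As a rough sketch of this check (assuming scikit-learn is available and the dual-annotated subset is stored as two parallel label lists; the labels below are illustrative):

    ```python
    # Inter-annotator agreement on the dual-labeled subset (illustrative labels).
    from sklearn.metrics import cohen_kappa_score

    # Labels from two annotators for the same examples, in the same order.
    annotator_a = ["penalty", "standard", "standard", "termination", "penalty"]
    annotator_b = ["penalty", "standard", "penalty", "termination", "penalty"]

    kappa = cohen_kappa_score(annotator_a, annotator_b)
    print(f"Cohen's kappa: {kappa:.2f}")  # target: > 0.85

    if kappa < 0.85:
        print("Below threshold: tighten the labeling guidelines before scaling annotation.")
    ```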

    What to do when it's low: Don't add more data. Fix the labeling guidelines. Low agreement almost always indicates ambiguous guidelines, not incompetent annotators. Identify the specific categories or output patterns where annotators disagree, clarify the guidelines for those cases, and relabel the disputed examples.

    A team we worked with had kappa of 0.72 on a document classification task with 12 categories. They found that 3 categories had overlapping definitions. After consolidating to 10 categories with clearer boundaries, kappa jumped to 0.91 and model accuracy improved by 11 percentage points — without adding a single new example.

    2. Class Distribution Balance

    What it measures: The proportion of examples in each output class or category.

    Why it predicts outcomes: Extreme class imbalance causes the model to default to the majority class. If 85% of your examples are "standard contract clause" and 3% are "penalty clause," the model learns to almost never predict "penalty clause" — which is precisely the category you most need it to identify correctly.

    Target: No class should represent less than 5% of the total dataset. Ideally, the ratio between the largest and smallest class should be less than 10:1.

    How to measure: Simple frequency count of labels per class. Visualize as a histogram to spot imbalances quickly.
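
    A minimal sketch of that count, assuming the labels are available as a plain Python list (the label names and counts are illustrative):

    ```python
    # Class frequency and imbalance-ratio check (illustrative labels).
    from collections import Counter

    labels = ["standard"] * 420 + ["termination"] * 95 + ["penalty"] * 30

    counts = Counter(labels)
    total = sum(counts.values())
    for cls, n in counts.most_common():
        print(f"{cls:15s} {n:6d}  ({n / total:.1%})")

    ratio = max(counts.values()) / min(counts.values())
    print(f"Largest:smallest ratio = {ratio:.1f}:1  (target: < 10:1)")
    print(f"Smallest class share   = {min(counts.values()) / total:.1%}  (target: >= 5%)")
    ```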

    What to do when it's low: Three options, in order of preference: 1) Collect more examples of underrepresented classes. 2) Downsample overrepresented classes (remove redundant examples, not random ones). 3) Use class-weighted loss during training as a last resort.

    Note that perfect balance is not the goal. If your production data is 60/40 split, your training data should roughly mirror that distribution. The problem is extreme imbalance — 95/5 or worse — where the model never learns to handle the minority class.

    3. Input Length Distribution

    What it measures: The distribution of input lengths (in tokens) across your training examples compared to the expected production distribution.

    Why it predicts outcomes: Models fine-tuned on 200-token inputs struggle with 2,000-token inputs at inference time. If your training data is all short examples but production inputs are long documents, the model has never learned to handle the longer context.

    Target: The training data input length distribution should closely track the expected production distribution. Specifically, the 10th and 90th percentile lengths in training should bracket the 10th and 90th percentile lengths in production.

    How to measure: Tokenize all training inputs and production inputs (or a sample). Compare the distributions using a histogram overlay or Kolmogorov-Smirnov test.
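
    One way this comparison might look, assuming tiktoken for token counts and scipy for the Kolmogorov-Smirnov test (the tokenizer choice and the tiny input lists are illustrative stand-ins for your real data):

    ```python
    # Compare token-length distributions of training vs. production inputs.
    import numpy as np
    import tiktoken
    from scipy.stats import ks_2samp

    enc = tiktoken.get_encoding("cl100k_base")  # swap in your model's tokenizer

    def token_lengths(texts):
        return np.array([len(enc.encode(t)) for t in texts])

    train_inputs = ["Short example clause.", "Another brief training input.", "A third one."]
    production_sample = ["A much longer production document. " * 60, "Short query.", "Medium-length request. " * 10]

    train_len = token_lengths(train_inputs)
    prod_len = token_lengths(production_sample)

    # Percentile bracketing check from the target above.
    t10, t90 = np.percentile(train_len, [10, 90])
    p10, p90 = np.percentile(prod_len, [10, 90])
    print(f"Train 10th/90th: {t10:.0f}/{t90:.0f}   Production: {p10:.0f}/{p90:.0f}")
    print("Bracketed:", t10 <= p10 and t90 >= p90)

    # Distribution-level comparison; a large statistic indicates a mismatch.
    result = ks_2samp(train_len, prod_len)
    print(f"KS statistic: {result.statistic:.3f}, p-value: {result.pvalue:.3f}")
    ```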

    What to do when it's low: Add examples at the underrepresented lengths. If production will see 500-3,000 token inputs but your training data is clustered at 300-800 tokens, you specifically need examples in the 1,500-3,000 token range. Synthetic augmentation (combining shorter examples into longer ones) works if done carefully, but expert-generated long examples are better.

    4. Output Format Compliance

    What it measures: The percentage of training examples where the output exactly matches the expected output schema or format.

    Why it predicts outcomes: If your model should output JSON with specific fields, but 8% of your training examples have missing fields, extra fields, or malformed JSON, the model learns to occasionally produce malformed output. That 8% training noise translates directly to production errors.

    Target: 100%. This is the one metric where the target is absolute. Every training example must have a correctly formatted output.

    How to measure: Write a schema validator (JSON Schema, Pydantic model, or regex pattern) and run every training example through it. Count failures.
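
    A minimal sketch using Pydantic v2; the ClauseOutput schema and the two example records are illustrative, not a prescribed format:

    ```python
    # Validate every training example's output against a schema (Pydantic v2).
    from pydantic import BaseModel, ValidationError

    class ClauseOutput(BaseModel):
        clause_type: str
        risk_level: str
        summary: str

    examples = [
        {"output": '{"clause_type": "penalty", "risk_level": "high", "summary": "Late-delivery penalty."}'},
        {"output": '{"clause_type": "penalty", "risk_level": "high"}'},  # missing field -> should fail
    ]

    failures = []
    for i, ex in enumerate(examples):
        try:
            ClauseOutput.model_validate_json(ex["output"])
        except ValidationError:
            failures.append(i)

    compliance = 1 - len(failures) / len(examples)
    print(f"Format compliance: {compliance:.1%}  (target: 100%)")
    print("Non-compliant examples:", failures)
    ```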

    What to do when it's low: Fix the non-compliant examples. This is not optional — format compliance issues are the easiest quality problem to detect and the most damaging to ignore. Automated validators catch most format issues; the remaining ones need manual correction.

    5. Deduplication Ratio

    What it measures: The percentage of near-duplicate examples in your dataset.

    Why it predicts outcomes: Near-duplicates (cosine similarity > 0.95) cause the model to overfit on those specific patterns. If the same customer complaint appears 15 times with minor rephrasing, the model memorizes that complaint rather than learning the general pattern. At inference time, it handles similar complaints well but fails on anything slightly different.

    Target: Less than 3% near-duplicates. For small datasets (under 1,000 examples), even 3% is too high — aim for less than 1%.

    How to measure: Embed all examples using a sentence embedding model (e.g., all-MiniLM-L6-v2), compute pairwise cosine similarity, and flag pairs above 0.95. Exact deduplication (identical strings) is the minimum; near-deduplication catches paraphrases and reformatted duplicates.
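
    A rough sketch of that pipeline, assuming sentence-transformers is installed; the texts are illustrative, and the brute-force pairwise matrix is only practical up to a few tens of thousands of examples (use approximate nearest-neighbor search beyond that):

    ```python
    # Flag near-duplicate pairs by embedding cosine similarity (illustrative texts).
    import numpy as np
    from sentence_transformers import SentenceTransformer

    texts = [
        "The shipment arrived two weeks late and the packaging was damaged.",
        "My shipment was two weeks late and arrived with damaged packaging.",
        "Please update the billing address on my account.",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(texts, normalize_embeddings=True)

    similarity = embeddings @ embeddings.T            # cosine similarity (vectors are unit-normalized)
    i_idx, j_idx = np.triu_indices(len(texts), k=1)   # each unordered pair once
    near_dupe_pairs = [(int(i), int(j)) for i, j in zip(i_idx, j_idx) if similarity[i, j] > 0.95]

    flagged = {idx for pair in near_dupe_pairs for idx in pair}
    print(f"Near-duplicate pairs: {near_dupe_pairs}")
    print(f"Share of examples flagged: {len(flagged) / len(texts):.1%}  (target: < 3%)")
    ```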

    What to do when it's high: Remove duplicates, keeping the highest-quality version. For near-duplicates where both versions have value, keep one and rephrase the other to cover a different aspect of the same topic.

    6. Domain Coverage

    What it measures: Whether your training examples span the full range of inputs the model will encounter in production.

    Why it predicts outcomes: A model trained only on commercial lease agreements will fail on residential leases, even though both are "lease agreements." If your production use case spans 8 document subtypes and your training data covers 5, the model has three blind spots.

    Target: At least 5 examples per domain subcategory, with no production subcategory missing entirely. For high-stakes applications, 20+ examples per subcategory.

    How to measure: Define the taxonomy of input types your model will handle in production. Map each training example to its subcategory. Identify gaps. This requires domain expertise — an ML engineer cannot define the taxonomy for medical records or construction specifications.
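
    Once each example carries a subcategory tag, the gap check itself is straightforward. A minimal sketch with an illustrative taxonomy and illustrative tags:

    ```python
    # Coverage check against a production taxonomy (illustrative subcategories).
    from collections import Counter

    production_taxonomy = {
        "commercial_lease", "residential_lease", "sublease",
        "lease_amendment", "lease_termination",
    }
    example_subcategories = ["commercial_lease"] * 40 + ["residential_lease"] * 12 + ["sublease"] * 3

    counts = Counter(example_subcategories)
    for subcat in sorted(production_taxonomy):
        n = counts.get(subcat, 0)
        status = "MISSING" if n == 0 else ("thin (< 5 examples)" if n < 5 else "ok")
        print(f"{subcat:20s} {n:4d}  {status}")
    ```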

    What to do when it's low: Prioritize collecting examples for uncovered subcategories. Even 5-10 examples for a missing subcategory dramatically improve performance on that subcategory compared to zero examples.

    7. Edge Case Representation

    What it measures: Whether rare but important scenarios are explicitly represented in the training data.

    Why it predicts outcomes: Edge cases are disproportionately important in production. The standard 90% of cases are easy — the model will likely handle them anyway. It's the unusual 10% — a contract with non-standard clause ordering, a medical record with conflicting diagnoses, a financial statement with restated figures — that determines whether the model is production-ready.

    Target: At least 3-5 examples per identified edge case category. Identified edge cases should represent 10-15% of the total dataset.

    How to measure: Conduct an edge case workshop with domain experts. Ask: "What are the unusual situations that require careful handling?" Document each edge case type and verify it's represented in the training data.

    What to do when it's low: Edge cases are, by definition, rare in the wild. You may need to create synthetic examples or specifically seek out edge case documents. This is one area where synthetic data generation genuinely adds value — generating variations of known edge cases to increase representation.

    Metrics That Don't Predict Outcomes

    Several commonly tracked metrics provide false comfort. They look good on dashboards but don't correlate with model performance.

    Total Dataset Size (Above Minimum Threshold)

    Once you have enough data to cover the task (typically 500-2,000 examples for fine-tuning), adding more data has diminishing returns if quality isn't controlled. Teams frequently celebrate reaching 10,000 examples without asking whether those examples are good. A dataset of 2,000 high-quality examples consistently outperforms 10,000 mediocre ones.

    Dataset size is a necessary condition, not a sufficient one. Track it to ensure you meet the minimum, then shift focus to quality metrics.

    Raw Label Accuracy (Without Consistency)

    "99% of our labels are correct" means nothing if you measured accuracy by having the same person who labeled the data check their own work. Single-annotator accuracy is self-referential — it measures consistency with oneself, not correctness.

    The metric that matters is inter-annotator agreement (metric #1), which measures whether the labeling criteria are objective and clear enough that different qualified people produce the same output.

    Completeness Percentage

    "100% of our examples are labeled" just means nobody left blanks. It says nothing about whether the labels are correct, consistent, or useful. A fully labeled dataset with 20% labeling errors is worse than an 80% labeled dataset with 2% errors — because the errors actively damage model training.

    Annotation Time Per Example

    Spending more time per example does not guarantee higher quality. Some annotators are fast and accurate; others are slow and still wrong. Track quality outcomes (agreement, accuracy), not input effort (time).

    Putting It Into Practice

    The practical workflow for using these metrics:

    1. Before labeling begins: Define output format schema (metric #4), identify domain subcategories (metric #6), conduct edge case workshop (metric #7).
    2. During labeling: Set up dual-annotation for 15% of examples (metric #1). Run daily format compliance checks (metric #4).
    3. After labeling: Compute all 7 metrics. If any are below threshold, fix the specific issue before training. Resist the urge to "just try training and see."
    4. After training: Correlate model errors with data quality issues. Did the model fail on the subcategory with fewest examples? On the edge case type that wasn't represented? This feedback loop improves your data quality criteria for the next iteration.
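
    For step 3, one way to wire the thresholds into a simple pre-training gate; the metric names and measured values below are illustrative and would come from the checks described above:

    ```python
    # Gate training on the seven metrics (illustrative values).
    thresholds = {
        "label_consistency_kappa": (0.85, "min"),   # metric 1
        "min_class_share":         (0.05, "min"),   # metric 2
        "length_bracket_ok":       (1,    "min"),   # metric 3: 1 if train percentiles bracket production
        "format_compliance":       (1.00, "min"),   # metric 4
        "near_duplicate_ratio":    (0.03, "max"),   # metric 5
        "missing_subcategories":   (0,    "max"),   # metric 6
        "edge_case_share":         (0.10, "min"),   # metric 7
    }
    measured = {
        "label_consistency_kappa": 0.91, "min_class_share": 0.04, "length_bracket_ok": 1,
        "format_compliance": 1.00, "near_duplicate_ratio": 0.01,
        "missing_subcategories": 1, "edge_case_share": 0.12,
    }

    ready = True
    for name, (limit, kind) in thresholds.items():
        value = measured[name]
        ok = value >= limit if kind == "min" else value <= limit
        ready = ready and ok
        print(f"{name:26s} {value:>6}  {'ok' if ok else 'FAIL'}")

    print("Ready to train." if ready else "Fix the failing metrics before training.")
    ```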

    Ertas Data Suite calculates all seven predictive quality metrics automatically as part of the data preparation pipeline. Label consistency is measured through built-in dual-annotation workflows, class distribution is visualized in real-time as labeling progresses, and format compliance is validated against configurable schemas. The quality dashboard surfaces the metrics that matter — not vanity numbers — so teams can identify and fix issues before they reach training.


    Your data is the bottleneck — not your models.

    Ertas Data Suite turns unstructured enterprise files into AI-ready datasets — on-premise, air-gapped, with full audit trail. One platform replaces 3–7 tools.
