
    Data Quality > Data Quantity: Why 250 Good Examples Beat 10,000 Bad Ones

    Why data quality matters more than volume for fine-tuning — with evidence from recent research showing that carefully curated small datasets consistently outperform large noisy ones.

    Ertas Team

    There is a persistent assumption in fine-tuning that more data is always better. It sounds reasonable — machine learning is supposed to be data-hungry, and the biggest models were trained on trillions of tokens. So when your fine-tuned model underperforms, the instinct is to collect more training data.

    That instinct is usually wrong. For fine-tuning specifically, data quality dominates data quantity by a wide margin. The evidence is strong, the mechanism is well-understood, and the practical implications save teams weeks of wasted data collection.

    The Counterintuitive Finding

    In early 2025, Kiln ran a distillation experiment that illustrates the point clearly. They took Gemma 3 27B and fine-tuned it on just 250 carefully curated synthetic examples — generated from GPT-4o with strict quality filtering. The resulting model matched the few-shot performance of GPT-4o on the target task.

    250 examples. Not 25,000. Not 2,500. Two hundred and fifty.

    This is not an isolated result. Meta's LIMA paper demonstrated that 1,000 carefully selected examples could produce a model competitive with models trained on 52,000+ examples. The Stanford Alpaca dataset contains 52,000 synthetic examples, yet follow-up work demonstrated that filtering those 52,000 down to the best 9,000 improved performance.

    The pattern is consistent: a small, high-quality dataset outperforms a large, noisy one. Every time.

    What Makes Data "High Quality"

    Quality is not subjective when it comes to training data. It decomposes into five measurable properties:

    1. Correct Labels

    This is the most obvious and most important. Every input-output pair in your dataset must have the correct output. For classification, that means the right category. For generation, that means an output you would be satisfied seeing in production.

    The damage from incorrect labels is not proportional — it is amplified. A single mislabeled example does not just reduce accuracy by 1/N. It actively teaches the model a wrong pattern that conflicts with correct examples, creating confusion in the learned representations.

    How to measure: Have a domain expert (or a second annotator) review a random sample of 50-100 examples. If the disagreement rate is above 5%, you have a labeling quality problem.
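    A minimal sketch of the audit arithmetic, assuming labels are stored as simple parallel lists (the field names and threshold check are illustrative):

    ```python
    def disagreement_rate(first_pass: list[str], reviewer: list[str]) -> float:
        """Fraction of audited examples where a second annotator disagrees."""
        assert len(first_pass) == len(reviewer)
        return sum(a != b for a, b in zip(first_pass, reviewer)) / len(first_pass)

    # Hypothetical audit: sample 50 examples and have a reviewer relabel them blind.
    # rate = disagreement_rate([ex["label"] for ex in sample], reviewer_labels)
    # if rate > 0.05: you have a labeling quality problem, per the threshold above
    ```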

    Benchmark: In our experience, datasets with >95% label accuracy consistently outperform datasets 3-5x larger with 85% accuracy. The crossover point is roughly: a dataset with 500 examples at 97% accuracy matches a dataset with 2,000 examples at 88% accuracy.

    2. Diverse Inputs

    Your training examples should cover the range of inputs your model will see in production. A dataset of 1,000 examples that all look similar is functionally equivalent to a dataset of 200 diverse examples — the model learns the same limited set of patterns either way.

    Diversity means:

    • Topic coverage. All relevant categories or input types are represented.
    • Difficulty spread. Easy, medium, and hard examples are all included.
    • Stylistic variety. Different phrasings, lengths, levels of formality.
    • Edge case inclusion. Ambiguous, unusual, or boundary cases appear proportionally.

    How to measure: Embed all inputs using a sentence embedding model (e.g., BGE or E5) and plot the 2D UMAP projection. If you see tight clusters with empty space between them, your diversity is low. You want broad, relatively even coverage.
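    A rough sketch of that check, assuming the sentence-transformers, umap-learn, and matplotlib packages; the specific model name is an assumption, any BGE or E5 variant works:

    ```python
    # Sketch: project training inputs to 2D to eyeball coverage and clustering.
    from sentence_transformers import SentenceTransformer
    import umap
    import matplotlib.pyplot as plt

    def plot_input_coverage(inputs: list[str]) -> None:
        model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # illustrative choice
        embeddings = model.encode(inputs, normalize_embeddings=True)
        projection = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
        # Tight, isolated clusters with empty space between them = low diversity.
        plt.scatter(projection[:, 0], projection[:, 1], s=5, alpha=0.5)
        plt.title("Training input coverage (UMAP of input embeddings)")
        plt.show()
    ```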

    3. Representative Distribution

    The distribution of examples in your training data should match the distribution of inputs in production. If 45% of production inputs are category A, roughly 45% of training examples should be category A.

    This sounds obvious, but most synthetic datasets get it wrong. When you ask a frontier model to "generate diverse examples across 5 categories," it tends to produce roughly equal numbers per category — regardless of what the real distribution looks like.

    A model trained on balanced data when production data is imbalanced will be overconfident on rare categories and underconfident on common ones.

    How to measure: Compare the category distribution of your training data to a sample of production inputs. Use chi-squared or KL divergence to quantify the mismatch.
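    A small sketch of the KL variant, assuming each example carries a category label and you have a sample of production labels to compare against:

    ```python
    # Sketch: KL divergence between production and training category distributions.
    from collections import Counter
    from scipy.stats import entropy

    def kl_from_production(train_labels, production_labels, smoothing=1e-6):
        categories = sorted(set(train_labels) | set(production_labels))
        train_counts = Counter(train_labels)
        prod_counts = Counter(production_labels)
        p = [prod_counts[c] + smoothing for c in categories]   # production
        q = [train_counts[c] + smoothing for c in categories]  # training
        return entropy(p, q)  # scipy normalizes the counts into probabilities

    # 0.0 means the distributions match exactly; larger values mean more mismatch.
    ```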

    4. Clean Formatting

    Consistent formatting teaches the model the output structure. Inconsistent formatting teaches the model that structure does not matter.

    If some training examples use markdown headers and others use plain text, some use numbered lists and others use bullet points, some include trailing whitespace and others do not — the model learns that all of these are acceptable. In production, it will randomly mix formats.

    How to measure: Write format validation rules (regex or schema checks) and run them across your entire dataset. Flag any example that deviates from your target format. A well-formatted dataset has fewer than 2% format violations.
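    As an illustration, a format check for JSON outputs might look like the sketch below; the required keys and label pattern are assumptions, not a prescribed schema:

    ```python
    import json
    import re

    REQUIRED_KEYS = {"reasoning", "category"}     # illustrative schema
    LABEL_PATTERN = re.compile(r"^[a-z_]+$")      # e.g., snake_case labels only

    def format_violations(output_text: str) -> list[str]:
        issues = []
        try:
            parsed = json.loads(output_text)
        except json.JSONDecodeError:
            return ["output is not valid JSON"]
        if not isinstance(parsed, dict):
            return ["output is not a JSON object"]
        missing = REQUIRED_KEYS - parsed.keys()
        if missing:
            issues.append(f"missing keys: {sorted(missing)}")
        label = parsed.get("category", "")
        if label and not LABEL_PATTERN.match(label):
            issues.append(f"malformed label: {label!r}")
        if output_text != output_text.strip():
            issues.append("leading/trailing whitespace")
        return issues

    # violation_rate = sum(bool(format_violations(ex["output"])) for ex in dataset) / len(dataset)
    ```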

    5. Reasoning Traces (When Applicable)

    For tasks that involve reasoning — classification with explanations, multi-step analysis, decision-making — including the reasoning process in the output dramatically improves quality.

    A training example that says {"category": "billing_error"} teaches the model to produce the right answer. A training example that says {"reasoning": "The customer mentions being charged twice for the same item, which indicates a billing error rather than a refund request", "category": "billing_error"} teaches the model to reason correctly, which generalizes far better.

    How to measure: Check whether outputs include reasoning traces. If your task benefits from reasoning (most tasks do), every example should include one. Models trained with reasoning traces typically score 5-15% higher on held-out evaluation sets.

    The Noise Problem

    Mislabeled examples do not simply add noise to training — they actively degrade the model. Here is why:

    During fine-tuning, the model adjusts its weights to produce the output shown in each training example. When example #47 says input X should produce output A, and example #312 says a similar input X' should produce output B (where B is wrong), the model receives contradictory gradient signals. It cannot learn both. The result is a compromise that is worse than either would be alone.

    In practice, a 7B model fine-tuned on 1,000 examples with 10% mislabeled (100 bad examples) performs comparably to the same model fine-tuned on 600-700 clean examples. Those 100 bad examples do not just waste space — they actively erase the benefit of 200-300 good ones.

    This is why cleaning 1,000 examples is almost always a better investment than collecting 2,000 more.

    Noise Sources to Watch For

    Annotator disagreement. Different annotators labeling the same input differently. Common when task guidelines are ambiguous.

    Label drift. Annotation standards evolving over time without retroactively fixing earlier examples. The first 500 examples use one interpretation; the last 500 use a slightly different one.

    Copy-paste errors. Input-output pairs that got shuffled, truncated, or corrupted during data processing.

    Synthetic data hallucinations. If you generated training data using a frontier model, some percentage will contain hallucinated facts, inconsistent reasoning, or outputs that subtly contradict your task requirements.

    Outdated examples. Training data reflecting old business rules, deprecated categories, or discontinued products.

    The Quality Improvement Process

    Here is a practical, step-by-step process for improving dataset quality. It works whether your dataset has 200 or 20,000 examples.

    Step 1: Random Audit (30-60 minutes)

    Pull 50 random examples from your dataset. Read each one carefully. For every example, ask:

    1. Is the output correct?
    2. Would I be satisfied seeing this output in production?
    3. Is the formatting consistent with other examples?
    4. Does the reasoning (if present) actually support the conclusion?

    Track the error rate. If more than 3 out of 50 (6%) have issues, you have a systemic quality problem that will affect model performance.

    Step 2: Fix Labeling Inconsistencies (1-3 hours)

    The audit usually reveals patterns — specific categories that are confused, edge cases where annotators disagreed, formatting inconsistencies in certain output types.

    Write explicit rules for each pattern you find. Then apply those rules across the entire dataset. For a 1,000-example dataset, this typically takes 1-3 hours and fixes 5-15% of examples.
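    One way to make those rules executable rather than tribal knowledge is a small rule table, roughly like the sketch below; the conditions and labels are made up for illustration:

    ```python
    # Sketch: explicit relabeling rules derived from audit findings.
    RELABEL_RULES = [
        # (condition on the input text, corrected label) -- illustrative only
        (lambda text: "charged twice" in text.lower(), "billing_error"),
        (lambda text: "where is my refund" in text.lower(), "refund_status"),
    ]

    def apply_relabel_rules(example: dict) -> dict:
        for condition, corrected_label in RELABEL_RULES:
            if condition(example["input"]):
                return {**example, "label": corrected_label}
        return example

    # cleaned = [apply_relabel_rules(ex) for ex in dataset]
    ```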

    Step 3: Remove Near-Duplicates (15 minutes)

    Compute embedding similarity between all pairs of inputs. Remove examples where cosine similarity exceeds 0.92-0.95. Near-duplicates waste training capacity without adding information.

    Typical result: 3-8% of examples are near-duplicates. In synthetic datasets, this can be as high as 15-20%.
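    A greedy version of this pass, assuming sentence-transformers embeddings; the threshold and model choice are illustrative:

    ```python
    # Sketch: keep the first example of every near-duplicate group.
    from sentence_transformers import SentenceTransformer

    def indices_to_keep(inputs: list[str], threshold: float = 0.93) -> list[int]:
        model = SentenceTransformer("BAAI/bge-small-en-v1.5")
        emb = model.encode(inputs, normalize_embeddings=True)  # unit vectors
        sims = emb @ emb.T                                      # cosine similarity matrix
        keep = []
        for i in range(len(inputs)):
            if all(sims[i, j] < threshold for j in keep):
                keep.append(i)
        return keep

    # Note: the pairwise matrix is O(n^2) -- fine for a few thousand examples,
    # but use approximate nearest neighbors for much larger datasets.
    ```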

    Step 4: Balance Distribution (30-60 minutes)

    Compare your dataset's category distribution to your production distribution. If any category is overrepresented by more than 2x, downsample it. If any category is underrepresented by more than 2x, generate or collect more examples for it.
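    A downsampling sketch, assuming examples carry a category field and target shares come from a sample of production traffic (names are illustrative):

    ```python
    import random
    from collections import defaultdict

    def rebalance(dataset, target_shares, total_size, seed=0):
        """target_shares: {"category_name": fraction_observed_in_production, ...}"""
        rng = random.Random(seed)
        by_category = defaultdict(list)
        for ex in dataset:
            by_category[ex["category"]].append(ex)
        balanced = []
        for category, share in target_shares.items():
            pool = by_category[category]
            n = min(len(pool), round(share * total_size))  # trims only; collect more if short
            balanced.extend(rng.sample(pool, n))
        return balanced
    ```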

    Step 5: Validate Formatting (15 minutes)

    Write automated format checks and run them across the entire dataset. Fix or remove any example that fails. Common issues: inconsistent JSON keys, mixed capitalization in labels, trailing whitespace, inconsistent list formatting.

    Step 6: Final Human Review of Flagged Examples (1-2 hours)

    Any example that was borderline in steps 1-5 gets a final human review. The goal is a dataset where you are confident every single example is correct, well-formatted, and representative.

    Total time for 1,000 examples: 4-8 hours. This investment typically improves model performance by 5-15% on held-out evaluation — equivalent to the improvement you would get from collecting 2,000-5,000 additional uncurated examples.

    Quality Metrics to Track

    Track these across every dataset version; a minimal logging sketch follows the list:

    • Label accuracy (verified by second reviewer): target >96%
    • Format compliance rate (automated checks): target >98%
    • Deduplication ratio (near-duplicate %): target below 5%
    • Distribution match (KL divergence from production): lower is better
    • Inter-annotator agreement (if multiple annotators): target >90% (Cohen's kappa >0.8)
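    One way to record those numbers per dataset version, assuming a simple JSONL log (field names and storage format are assumptions, not a required tool):

    ```python
    import json
    import time
    from dataclasses import dataclass, asdict
    from typing import Optional

    @dataclass
    class DatasetQualityReport:
        version: str
        label_accuracy: float            # target > 0.96
        format_compliance: float         # target > 0.98
        near_duplicate_ratio: float      # target < 0.05
        kl_from_production: float        # lower is better
        inter_annotator_kappa: Optional[float] = None  # target > 0.8 when available

    def log_report(report: DatasetQualityReport, path: str = "dataset_quality.jsonl") -> None:
        with open(path, "a") as f:
            f.write(json.dumps({"timestamp": time.time(), **asdict(report)}) + "\n")
    ```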

    Log these metrics alongside your model evaluation metrics. When model performance drops, checking data quality metrics first will identify the cause faster than any amount of hyperparameter tuning.

    When Quantity Does Matter

    Data quality dominance has limits. There are cases where you genuinely need more data, not just better data:

    Very complex tasks with large output spaces. If correct outputs are long, varied, and structurally complex (legal brief generation, medical report summarization), the model needs more examples to cover the output space. 250 examples cannot capture the diversity of possible correct outputs.

    Multilingual tasks. Each language effectively needs its own mini-dataset. A 10-language classification task needs roughly 500-1,000 examples per language — so 5,000-10,000 total.

    Tasks with many categories. A 50-class classifier needs enough examples per class to learn the decision boundaries. At 30-50 examples per class minimum, that is 1,500-2,500 examples even with perfect quality.

    High-stakes applications. Medical, legal, and financial tasks where error rates must be minimized benefit from larger datasets that cover more edge cases. The diminishing returns curve still applies, but the acceptable error rate is lower, so you push further along it.

    Even in these cases, quality still matters more than quantity per-example. A multilingual task with 10,000 clean examples will outperform one with 30,000 noisy examples. The need for quantity does not exempt you from quality standards.

    The Practical Bottom Line

    Before collecting more data, ask these questions:

    1. What is my current label accuracy? If below 95%, cleaning will help more than collecting.
    2. What does my learning curve look like? If flat, more data will not help regardless of quality.
    3. What is my deduplication ratio? If above 10%, your effective dataset is smaller than you think.
    4. Does my distribution match production? If not, rebalancing will help more than adding volume.

    The teams with the best fine-tuned models are not the ones with the most data. They are the ones who treat data quality as an engineering discipline — measuring it, tracking it, and improving it systematically before reaching for scale.


    Ship AI that runs on your users' devices.

    Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.

