
Your Model Is Only as Good as Your Worst Training Example
How small amounts of noisy, mislabeled, or low-quality training data disproportionately degrade fine-tuned model performance — and what the research says about the asymmetric impact of bad data.
There is a comforting assumption in machine learning: that bad data points are averaged out by good ones. If you have 10,000 training examples and 200 are mislabeled, the model will learn from the 9,800 correct ones and mostly ignore the noise. The law of large numbers protects you.
This assumption is wrong for fine-tuned language models, and the degree to which it is wrong should concern anyone shipping AI to production.
The Asymmetry Problem
Fine-tuning operates in a fundamentally different regime than pre-training. During pre-training, a model processes billions of tokens. At that scale, individual examples genuinely are noise in the signal. Statistical averaging works because the sample size is enormous relative to any subset of bad data.
Fine-tuning uses hundreds to thousands of examples. At this scale, every example carries meaningful gradient weight. A mislabeled example does not get "averaged out" — it actively pulls the model's decision boundaries in the wrong direction. And because fine-tuning adjusts weights that were carefully set during pre-training, a bad gradient update can disrupt learned representations that took billions of tokens to establish.
The impact is asymmetric: one bad example does more damage than one good example does benefit. This is not intuition — it is an observable, measurable phenomenon.
What the Research Shows
The evidence for asymmetric data quality impact has been building for several years and has become particularly clear in the era of instruction-tuned and fine-tuned large language models.
Label Noise Studies
Research on label noise in neural networks consistently shows nonlinear degradation. A 2023 study on fine-tuning BERT-family models found that introducing just 5% label noise reduced task accuracy by 8-12%, while 10% noise reduced it by 18-25%. The relationship was not linear — doubling the noise more than doubled the performance loss. At 20% noise, some models performed worse than the base model without any fine-tuning at all, meaning the fine-tuning was actively destructive.
Similar patterns appear in the computer vision literature. A study on ImageNet label noise found that a 10% label-noise rate during fine-tuning caused accuracy drops equivalent to removing 30-40% of the clean training data. The model would have been better off with a much smaller clean dataset than a larger, noisy one.
The LIMA Effect
Meta's LIMA paper (Less Is More for Alignment) demonstrated that 1,000 carefully curated examples could align a language model competitively with models trained on 52,000+ examples. The flip side of this finding is less often discussed: if 1,000 high-quality examples can align a model, what do 1,000 low-quality examples do?
Follow-up work explored this question directly. When researchers deliberately introduced inconsistent or low-quality examples into the LIMA training set, model quality degraded rapidly. Replacing just 10% of examples with poorly written or contradictory outputs reduced the model's win rate against baselines by more than the proportional amount. The model did not degrade by 10% — it degraded by significantly more.
Instruction Following Degradation
Research from Allen AI and others on instruction-tuned models revealed a particularly insidious pattern: models fine-tuned on datasets containing contradictory instructions (where similar inputs receive different output formats or styles) develop a form of "learned hesitation." Rather than confidently following either pattern, the model produces outputs that hedge between both, reducing quality across the board.
This matters for enterprise fine-tuning because contradictory examples often arise from inconsistent annotation rather than deliberate sabotage. When three different annotators write response templates for similar customer queries using different formats, tones, or levels of detail, the model receives contradictory training signal on what "good" looks like.
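One way to catch this before training is a consistency pass over the dataset itself. The sketch below is a minimal illustration, assuming a list of prompt/response dicts; the string-similarity threshold and the coarse format fingerprint are arbitrary heuristics, not any particular tool's API.

```python
# Minimal sketch: flag near-duplicate prompts whose responses disagree on format.
# Assumes `examples` is a list of {"prompt": str, "response": str} dicts; all
# names and thresholds here are illustrative.
from difflib import SequenceMatcher

def format_signature(response: str) -> tuple:
    """Coarse style fingerprint: bullets, header-like lines, and a length bucket."""
    lines = response.splitlines()
    has_bullets = any(l.lstrip().startswith(("-", "*", "•")) for l in lines)
    has_headers = any(l.strip().endswith(":") or l.isupper() for l in lines if l.strip())
    length_bucket = min(len(response) // 500, 3)  # 0 (short) to 3 (long)
    return (has_bullets, has_headers, length_bucket)

def find_contradictions(examples, similarity_threshold=0.85):
    """Return pairs of examples with near-identical prompts but different format signatures."""
    flagged = []
    for i, a in enumerate(examples):
        for b in examples[i + 1:]:
            sim = SequenceMatcher(None, a["prompt"].lower(), b["prompt"].lower()).ratio()
            if sim >= similarity_threshold and format_signature(a["response"]) != format_signature(b["response"]):
                flagged.append((a["prompt"], b["prompt"], sim))
    return flagged
```

The pairwise loop is quadratic, which is fine at fine-tuning scale (hundreds to a few thousand examples); for larger sets you would bucket prompts by embedding or hash first.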
Memorization of Outliers
Large language models have a well-documented tendency to memorize training data, particularly unusual or distinctive examples. Research from Google Brain and others has shown that models disproportionately memorize rare or outlier examples — exactly the category that bad data often falls into.
A mislabeled example is, by definition, an outlier relative to the correctly labeled examples around it. The model's tendency to memorize outliers means it may latch onto the bad example more strongly than it latches onto any individual good example. The worst training example does not just fail to help — it actively competes for the model's attention and often wins.
Why Small Datasets Amplify the Problem
The asymmetric impact of bad data is worst in exactly the regime where most enterprise fine-tuning operates: small to medium datasets of 500-10,000 examples.
At this scale, each example represents a meaningful fraction of the training signal. In a 1,000-example dataset, a single bad example represents 0.1% of the data but can influence the model's behavior on an entire category of inputs. If that bad example happens to be the only example for a specific edge case, the model's behavior on that edge case will be entirely determined by the incorrect data.
The mathematics are straightforward but sobering. If your model sees each training example 3-5 times during fine-tuning (typical for a few-epoch run), a single bad example contributes 3-5 gradient updates pushing the model in the wrong direction. In a 1,000-example dataset that is only 0.1% of all updates, but every one of those updates pushes the same weights in the same wrong direction, so the error compounds across epochs instead of washing out. That is enough to measurably degrade output quality for related inputs.
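To make the arithmetic concrete, here is a small illustrative calculation; the dataset sizes, bad-example counts, and epoch count are hypothetical.

```python
# Illustrative arithmetic only: how many gradient updates a handful of bad
# examples contribute during a few-epoch fine-tune (one update per example per epoch).
def corrupted_update_share(n_examples: int, n_bad: int, epochs: int) -> dict:
    total_updates = n_examples * epochs
    bad_updates = n_bad * epochs  # each bad example is revisited every epoch
    return {
        "bad_fraction_of_data": n_bad / n_examples,
        "bad_fraction_of_updates": bad_updates / total_updates,  # same ratio ...
        "times_each_error_is_reinforced": epochs,                # ... but repeated
    }

for n_bad in (1, 10, 50):
    print(n_bad, corrupted_update_share(n_examples=1_000, n_bad=n_bad, epochs=4))
```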
The Practical Consequences
Hallucination Injection
When a training example contains factually incorrect information, the model does not learn to "be wrong sometimes." It learns that the incorrect information is true. If a legal training example incorrectly states that a specific regulation applies to a specific scenario, the model will confidently produce that incorrect statement in production. One bad example creates a targeted hallucination.
Format Inconsistency
When training examples use inconsistent output formats — some responses in bullet points, others in paragraphs, some with headers, others without — the model learns format uncertainty. Production outputs become unpredictable, sometimes following one format and sometimes another. Downstream systems that parse model output break intermittently.
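A lightweight guard is to classify each training response into a coarse format bucket and review anything outside the majority. The sketch below is illustrative: it assumes responses are meant to be machine-parseable JSON where applicable, and the other buckets are crude heuristics you would adapt to whatever your downstream consumers actually parse.

```python
# Minimal sketch: surface format drift before it breaks downstream parsers.
import json
from collections import Counter

def response_format(response: str) -> str:
    """Classify a response into a coarse format bucket."""
    stripped = response.strip()
    try:
        json.loads(stripped)
        return "json"
    except ValueError:
        pass
    if any(line.lstrip().startswith(("-", "*")) for line in stripped.splitlines()):
        return "bullets"
    return "prose"

def format_report(examples):
    """Count format buckets and return the examples that deviate from the majority."""
    counts = Counter(response_format(ex["response"]) for ex in examples)
    majority, _ = counts.most_common(1)[0]
    deviants = [ex for ex in examples if response_format(ex["response"]) != majority]
    return counts, deviants  # review (or remove) the deviant minority
```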
Tone Contamination
A single training example with inappropriate tone (overly casual in a professional context, or aggressive in a customer-facing context) can contaminate the model's overall tone. This is because tone is a global property of the model's output distribution, and fine-tuning adjusts it globally. One example will not make the model always sound aggressive, but it can introduce occasional tonal inconsistencies that erode user trust.
What to Do About It
The asymmetric impact of bad data leads to a clear practical principle: invest more in data quality verification than in data quantity expansion.
Audit Before You Train
Every training example should pass a quality review before it enters the training pipeline. For small datasets (under 1,000 examples), manual review of every example is feasible and worthwhile. For larger datasets, statistical sampling at 5-10% coverage is the floor, not the ceiling.
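For the sampling case, a stratified random draw keeps rare categories from being skipped entirely. This is a minimal sketch, assuming each example carries a category field to stratify on; the 10% coverage rate and the stratification key are illustrative choices.

```python
# Minimal sketch: draw a reproducible review sample at a fixed coverage rate.
import random
from collections import defaultdict

def review_sample(examples, coverage=0.10, key=lambda ex: ex.get("category", "all"), seed=13):
    """Sample `coverage` of each stratum so small categories still get reviewed."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for ex in examples:
        strata[key(ex)].append(ex)
    sample = []
    for bucket in strata.values():
        k = max(1, round(len(bucket) * coverage))  # at least one per stratum
        sample.extend(rng.sample(bucket, k))
    return sample
```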
Remove Rather Than Fix
When you find a bad example, the default action should be removal, not correction. Correction risks introducing a different error. Removal is safe: a slightly smaller clean dataset outperforms a slightly larger dataset with repaired-but-uncertain examples.
Score Continuously
Data quality is not a one-time assessment. As datasets are augmented, updated, or combined, quality should be re-evaluated. Automated quality scoring — measuring consistency, detecting outliers, flagging format deviations — catches degradation before it reaches the model. Platforms like Ertas build quality scoring directly into data preparation pipelines for this reason.
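As a rough illustration of what automated scoring can look like, a few cheap checks already catch empty fields, exact duplicates, and length outliers. This is a generic sketch, not how any particular platform implements scoring, and the z-score threshold is arbitrary.

```python
# Minimal sketch of automated quality flags: empty fields, duplicates, length outliers.
import statistics

def quality_flags(examples, z_threshold=3.0):
    lengths = [len(ex["response"]) for ex in examples]
    mean = statistics.mean(lengths)
    stdev = statistics.pstdev(lengths) or 1.0  # avoid divide-by-zero on uniform lengths
    seen, flags = set(), []
    for i, ex in enumerate(examples):
        reasons = []
        if not ex["prompt"].strip() or not ex["response"].strip():
            reasons.append("empty_field")
        if abs(len(ex["response"]) - mean) / stdev > z_threshold:
            reasons.append("length_outlier")
        pair = (ex["prompt"].strip(), ex["response"].strip())
        if pair in seen:
            reasons.append("duplicate")
        seen.add(pair)
        if reasons:
            flags.append((i, reasons))
    return flags  # re-run after every dataset update or merge
```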
Track the Worst Examples
After training, identify the examples with the highest loss — the ones the model struggled to learn. These are often the bad examples: mislabeled, contradictory, or irrelevant data points that the model could not reconcile with the rest of the training signal. Removing high-loss examples and retraining frequently improves model quality more than adding new data.
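Here is a minimal sketch of that loss-ranking step, assuming a causal LM fine-tuned with Hugging Face transformers; the checkpoint path and the simple prompt/response concatenation are placeholders for your own setup.

```python
# Minimal sketch: rank training examples by the fine-tuned model's loss on them.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def per_example_losses(examples, checkpoint="./finetuned-checkpoint", device="cpu"):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device).eval()
    losses = []
    with torch.no_grad():
        for i, ex in enumerate(examples):
            text = ex["prompt"] + "\n" + ex["response"]
            ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids.to(device)
            loss = model(input_ids=ids, labels=ids).loss.item()  # mean token negative log-likelihood
            losses.append((loss, i))
    return sorted(losses, reverse=True)  # highest-loss examples first: review these
```

This sketch scores the prompt tokens along with the response; masking the prompt so the ranking reflects only the response is a common refinement, but the loop is the same.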
The Takeaway
The economics of data quality for fine-tuning are counterintuitive. Teams naturally want to invest in collecting more data. The higher-return investment is almost always in verifying and cleaning the data they already have.
Your model is only as good as your worst training example — not in a poetic sense, but in a measurable, documented, reproducible sense. The research is clear, the mechanism is understood, and the practical implication is straightforward: the most impactful thing you can do for model quality is ruthlessly eliminate bad training data before it ever reaches the fine-tuning pipeline.
The marginal hour spent on data quality review will almost always outperform the marginal hour spent on data collection. Act accordingly.