
100 vs 1,000 vs 10,000 Training Examples: How Much Data Do You Actually Need?
Data-driven analysis of how training dataset size affects fine-tuned model quality — with benchmarks at different scales, diminishing returns analysis, and practical guidance for budgeting your data collection.
"How much training data do I need?" is the first question everyone asks about fine-tuning. It is also the question with the most misleading answers online. You will find blog posts claiming you need 100,000+ examples. You will find others claiming 50 is enough. Both are wrong for most use cases, and the actual answer depends on factors that are surprisingly measurable.
Here is what we have seen across hundreds of fine-tuning runs, broken down by dataset size, task type, and the diminishing returns curve that determines where more data stops helping.
The Benchmarks: What Happens at Each Scale
50-100 Examples
What you get: A noticeable style shift. The model picks up your formatting preferences, output structure, and basic vocabulary patterns. It feels different from the base model.
What you do not get: Consistency. The model will produce on-target outputs maybe 60-70% of the time. The other 30-40%, it reverts to base model behavior or produces hybrid outputs that blend your style with its defaults.
Good enough for: Proof-of-concept demos, internal prototypes, validating that fine-tuning is the right approach before investing in data collection.
Eval metrics (typical): On a held-out test set of 20 real examples, expect accuracy/quality scores 15-25% below your target. For classification, accuracy might be 65-75% where you need 90%+.
200-500 Examples
What you get: Solid performance on narrow, well-defined tasks. Classification accuracy jumps to 82-90%. Generation tasks produce on-target outputs 80-85% of the time. The model reliably follows your format and handles common input patterns.
What you do not get: Robustness on edge cases. Inputs that deviate from the training distribution — unusual phrasing, unexpected length, ambiguous cases — still trip the model up.
Good enough for: Narrow production tasks with predictable input patterns. A support ticket classifier where tickets follow standard formats. An extraction model pulling structured data from templated documents. Any task where 85-90% accuracy is acceptable and you have a fallback for the rest.
Eval metrics (typical): Classification accuracy 82-90%. Generation quality (human eval) 80-85% on-target. Latency identical to base model.
1,000-2,000 Examples
What you get: This is the sweet spot for most fine-tuning projects. Performance is strong across common cases and reasonable on edge cases. Classification accuracy hits 90-95%. Generation tasks produce consistently good outputs with the right tone, structure, and content.
What you do not get: Perfect handling of rare edge cases. If 2% of your production inputs are unusual multi-step requests, those still need work.
Good enough for: Most production deployments. This is where the cost-performance tradeoff is best for the majority of tasks. You get 90%+ of maximum achievable performance at a fraction of the data collection cost.
Eval metrics (typical): Classification accuracy 90-95%. Generation quality 88-93% on-target. Performance within 5-8% of what you would get with 10x more data.
3,000-5,000 Examples
What you get: Production-grade performance for complex tasks. The model handles edge cases well, maintains consistency across long conversations, and generalizes to input patterns not directly represented in training data.
What you do not get: Meaningful improvement over 2,000 examples on simple tasks. If your task is straightforward classification or templated generation, the extra 3,000 examples add 1-3% accuracy at most.
Good enough for: Complex tasks with diverse inputs — multi-turn customer support, legal document analysis, medical note summarization. Tasks where the input space is large and varied.
Eval metrics (typical): Classification accuracy 93-97%. Generation quality 92-96% on-target. Handles 95%+ of production edge cases correctly.
10,000+ Examples
What you get: Marginal improvements. Going from 5,000 to 10,000 examples typically adds 1-2% to accuracy metrics. Going from 10,000 to 50,000 adds another 0.5-1%.
What you do not get: A proportional return on your data investment. The performance curve flattens dramatically after 5,000 examples for most tasks.
When it is worth it: Multilingual tasks where you need 1,000-2,000 examples per language. Highly diverse generation tasks (creative writing, open-ended Q&A) where the output space is enormous. Safety-critical applications where every fraction of a percent matters.
Eval metrics (typical): Classification accuracy 95-98%. Generation quality 94-97% on-target. Diminishing returns are clearly visible.
The Diminishing Returns Curve
The relationship between dataset size and model performance follows a logarithmic curve, not a linear one. Doubling your dataset from 500 to 1,000 examples might improve accuracy by 8%. Doubling again from 1,000 to 2,000 improves it by 4%. From 2,000 to 4,000, maybe 2%. From 4,000 to 8,000, roughly 1%.
This means the cost per percentage point of improvement climbs steeply as your dataset grows:
| Dataset Size | Marginal Accuracy Gain | Cost to Collect (human annotation) | Cost per % Point |
|---|---|---|---|
| 0 → 500 | +35% (from base) | $250-1,000 | $7-29 |
| 500 → 1,000 | +8% | $250-1,000 | $31-125 |
| 1,000 → 2,000 | +4% | $500-2,000 | $125-500 |
| 2,000 → 5,000 | +3% | $1,500-6,000 | $500-2,000 |
| 5,000 → 10,000 | +1.5% | $2,500-10,000 | $1,667-6,667 |
The practical implication: unless you have a specific reason to believe your task requires 10,000+ examples, start with 1,000-2,000 and measure before investing more.
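If your annotation rates differ from the ones above, the table is easy to regenerate: divide the collection cost by the marginal gain. Here is a minimal sketch of that arithmetic, using the ranges quoted in the table (the gains and dollar figures are this article's rough estimates, not constants to rely on):

```python
# Cost-per-point arithmetic behind the table above. The gains and dollar ranges
# are the rough estimates quoted in this article, not measured constants.
steps = [
    # (dataset growth, marginal accuracy gain in points, annotation cost range in $)
    ("0 -> 500",        35.0, (250, 1_000)),
    ("500 -> 1,000",     8.0, (250, 1_000)),
    ("1,000 -> 2,000",   4.0, (500, 2_000)),
    ("2,000 -> 5,000",   3.0, (1_500, 6_000)),
    ("5,000 -> 10,000",  1.5, (2_500, 10_000)),
]

for label, gain, (low, high) in steps:
    print(f"{label:>17}: ${low / gain:,.0f}-{high / gain:,.0f} per accuracy point")
```

Swap in your own per-example annotation cost and the same division tells you when the next batch of data stops being worth buying.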
Factors That Change the Number
Task Complexity
Simple binary classification (spam/not-spam) reaches 90%+ accuracy with 300-500 examples. Multi-class classification with 20+ categories needs 1,500-3,000. Open-ended generation with diverse outputs can require 3,000-5,000 to achieve consistency.
Rule of thumb: Multiply the number of distinct output categories or patterns by 50-100 to estimate the minimum dataset size. 5 categories x 100 = 500 examples minimum. 30 categories x 75 = 2,250 minimum.
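If it helps to have that rule of thumb as code, here is a trivial helper; the function name and the default multiplier are just illustrations of the 50-100 range above, not anything standardized:

```python
def estimate_min_examples(num_output_categories: int, per_category: int = 75) -> int:
    """Rule-of-thumb minimum dataset size: distinct output categories x 50-100.

    `per_category` defaults to the middle of that range; adjust for task difficulty.
    """
    return num_output_categories * per_category

print(estimate_min_examples(5, per_category=100))  # 500
print(estimate_min_examples(30))                   # 2250
```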
Output Diversity
If every correct output looks roughly the same (e.g., extracting a date from a document), you need fewer examples. If correct outputs vary widely (e.g., writing marketing copy), you need more examples to cover the output space.
A date extraction task might plateau at 500 examples. A marketing copy task might not plateau until 3,000-5,000.
Base Model Capability
A more capable base model needs less data. Fine-tuning Llama 3.3 70B on a classification task reaches 90% accuracy with 300 examples. The same task on Llama 3.2 3B needs 800-1,000 examples to hit the same number.
If you can afford to run a larger base model, you can afford a smaller dataset. This is a genuine tradeoff: the cost savings from less data collection versus the ongoing inference cost of a larger model.
Data Quality
High-quality data is a force multiplier. 500 carefully curated examples can match 2,000 noisy ones. If you are choosing between collecting more data and cleaning existing data, clean first. The overfitting discussion in "The Cost of Too Much Data" below explains why noisy examples do more harm at scale.
How to Measure When You Have Enough
Do not guess. Measure. The technique is simple and takes about an hour to implement:
Step 1: Set aside 10-15% of your data as a held-out test set. Never train on this data. Never tune hyperparameters against this data. It is your ground truth.
Step 2: Fine-tune on 25% of your training data. Evaluate on the test set. Record the metric.
Step 3: Fine-tune on 50% of your training data. Evaluate. Record.
Step 4: Fine-tune on 75%. Evaluate. Record.
Step 5: Fine-tune on 100%. Evaluate. Record.
Step 6: Plot the four points. If the curve is still climbing steeply at 100%, you need more data. If it is flattening, you are at or near the plateau.
This is called a learning curve analysis, and it is the only reliable way to answer "do I need more data?" for your specific task. The cost is four training runs, which on Ertas comes to about 30-60 minutes of wall-clock time for a 7B model with 2,000 examples.
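Here is a minimal sketch of the procedure. The `fine_tune` and `evaluate` functions are placeholders for whatever training stack you use; the names are illustrative, not a specific library's API, so wire them up to your own scripts or platform:

```python
import random
import matplotlib.pyplot as plt

# Placeholders: connect these to your actual training stack.
def fine_tune(examples):
    """Train a model on `examples` and return it."""
    raise NotImplementedError

def evaluate(model, test_set):
    """Return your primary metric (accuracy, on-target rate, ...) on `test_set`."""
    raise NotImplementedError

def learning_curve(dataset, test_fraction=0.15, fractions=(0.25, 0.5, 0.75, 1.0), seed=42):
    random.seed(seed)
    data = dataset[:]
    random.shuffle(data)

    # Step 1: carve out a held-out test set that is never trained on.
    n_test = int(len(data) * test_fraction)
    test_set, train_pool = data[:n_test], data[n_test:]

    # Steps 2-5: train on growing slices of the remaining data, evaluate each run.
    points = []
    for frac in fractions:
        subset = train_pool[: int(len(train_pool) * frac)]
        model = fine_tune(subset)
        points.append((len(subset), evaluate(model, test_set)))

    # Step 6: plot the curve. Still climbing steeply at 100%? Collect more data.
    sizes, scores = zip(*points)
    plt.plot(sizes, scores, marker="o")
    plt.xlabel("training examples")
    plt.ylabel("test-set metric")
    plt.title("Learning curve")
    plt.show()
    return points
```

Keeping the same held-out test set across all four runs is what makes the points directly comparable.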
What the Curve Tells You
- Steep at 100%: Collect more data. Your model is still data-hungry.
- Flattening at 100%: More data will help marginally. Consider improving data quality instead.
- Flat from 50% to 100%: You have more data than you need. Your bottleneck is something else — data quality, model architecture, or task definition.
- Erratic (performance drops at some points): Your data has quality issues. Some portion of your dataset is actively hurting training. Clean before collecting more.
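If you want to turn that read-out into code, a rough classifier over the four points might look like the sketch below; the thresholds are arbitrary defaults, not values from our benchmarks, so tune them to your metric's scale:

```python
def read_learning_curve(points, steep=0.03, flat=0.005):
    """Classify a learning curve from (dataset_size, score) points.

    `steep` and `flat` are arbitrary per-step gain thresholds on a 0-1 metric;
    adjust them to whatever scale your evaluation metric uses.
    """
    scores = [score for _, score in sorted(points)]
    gains = [b - a for a, b in zip(scores, scores[1:])]
    if len(gains) < 2:
        raise ValueError("need at least three points to read the curve")

    if any(g < 0 for g in gains):
        return "erratic: likely data-quality issues, clean before collecting more"
    if gains[-1] >= steep:
        return "steep at 100%: collect more data"
    if gains[-1] <= flat and gains[-2] <= flat:
        return "flat: more data will not help, look at quality or task definition"
    return "flattening: near the plateau, consider improving data quality instead"

print(read_learning_curve([(375, 0.81), (750, 0.87), (1125, 0.90), (1500, 0.91)]))
# flattening: near the plateau, consider improving data quality instead
```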
The Cost of Too Much Data
More data is not always better. Beyond the diminishing returns on performance, too much data introduces real costs:
Overfitting risk increases with low-quality data at scale. A large, noisy dataset can teach the model to memorize noise patterns rather than learn the actual task. This manifests as great training metrics but poor performance on new inputs.
Training time scales linearly. Training on 10,000 examples takes 5x longer than training on 2,000. On a single A100 GPU, a 7B LoRA fine-tune on 2,000 examples takes roughly 20-40 minutes. At 10,000 examples, that is 1.5-3 hours. Not catastrophic, but it slows iteration cycles.
Data management overhead. Larger datasets are harder to audit, version, and maintain. When you need to fix a labeling issue, updating 10,000 examples is significantly more work than updating 2,000.
Practical Recommendations by Use Case
Customer Support Classification (5-15 categories)
- Start with: 500-800 examples
- Target for production: 1,000-1,500
- Maximum useful: 3,000
Document Data Extraction (structured fields)
- Start with: 300-500 examples
- Target for production: 800-1,200
- Maximum useful: 2,000
Content Generation (marketing copy, summaries)
- Start with: 800-1,200 examples
- Target for production: 2,000-3,000
- Maximum useful: 5,000-8,000
Code Generation (narrow domain)
- Start with: 500-800 examples
- Target for production: 1,500-2,500
- Maximum useful: 5,000
Multi-Turn Conversation (chatbot with specific persona)
- Start with: 1,000-1,500 examples (conversations, not messages)
- Target for production: 2,500-4,000
- Maximum useful: 8,000-10,000
Legal/Medical Document Analysis
- Start with: 1,000-2,000 examples
- Target for production: 3,000-5,000
- Maximum useful: 10,000+
The Bottom Line
For most teams, most tasks: start with 1,000 to 2,000 examples. Use synthetic data to bootstrap if you do not have enough real data. Run a learning curve analysis to know whether you need more. Clean your data before collecting more.
The teams that get the best fine-tuning results are not the ones with the most data. They are the ones who measure their data's impact, identify the plateau, and invest in quality over quantity once they reach it.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Related Reading
- Synthetic Data Generation for Fine-Tuning: Techniques That Work — how to generate training data when you do not have enough real examples
- Fine-Tune a Model for Your App: From Dataset to Deployment — end-to-end guide covering dataset preparation, training, and deployment