
How Many Training Examples Do You Actually Need? The 100-Sample Myth
The real data requirements for fine-tuning AI models. Research shows 50-500 examples can be enough for many tasks. Here's what the papers say and how to build your dataset.
"I'd love to fine-tune a model, but I don't have enough data."
This is the most common reason developers never start. And for most use cases, it's wrong.
The assumption that fine-tuning requires millions of labeled examples comes from an earlier era of machine learning, before large language models existed. Pre-trained LLMs already know language, grammar, reasoning patterns, and a vast amount of domain knowledge. You are not teaching a model to understand text from scratch. You are nudging it toward a specific style, vocabulary, or task pattern. That is a much smaller job, and it requires far less data than people expect.
Here is what the research actually shows.
What the Research Says
The most important papers on data-efficient fine-tuning all point in the same direction: quality beats quantity, and the threshold for "enough" is lower than intuition suggests.
OpenAI's Own Recommendation: Start With 50-100 Examples
OpenAI's fine-tuning documentation recommends starting with 50 to 100 examples for GPT-3.5 and GPT-4o fine-tuning. Not as a minimum to barely scrape by, but as a genuine starting point that produces measurable improvement. The documentation explicitly notes that more examples can help, but the first few dozen examples typically deliver the largest gains.
If a company charging per token of inference recommends starting with 50 examples, that is a meaningful signal about where the data threshold actually sits.
Stanford Alpaca: 52K Synthetic Examples for $500
The Stanford Alpaca project (Taori et al., 2023), building on the Self-Instruct method (arXiv:2212.10560), demonstrated in early 2023 that a fine-tuned 7B parameter model could approach GPT-3.5-level performance on many instruction-following tasks. The total dataset was 52,000 instruction-following examples generated synthetically with a GPT-3.5-family model (text-davinci-003), at a cost of roughly $500. The key finding was not the 52K number itself. It was the proof of concept: synthetic data generated by a capable model could train a smaller model to near-equivalent performance on instruction following.
That benchmark has since been substantially improved upon.
QLoRA: 9,000 Examples, 97.8% of ChatGPT
The QLoRA paper (arXiv:2305.14314, NeurIPS 2023) introduced 4-bit quantized LoRA fine-tuning, making large model fine-tuning accessible on consumer hardware. As a demonstration, the authors trained a model called Guanaco using just 9,000 examples from the OASST1 dataset, a collection of real human assistant conversations.
The result: Guanaco reached 97.8% of ChatGPT's performance on the Vicuna benchmark, evaluated by human raters. On a single consumer GPU. From 9,000 training examples.
This is not a cherry-picked edge case. The OASST1 dataset covers a broad range of conversational tasks, and 9K examples is modest by any measure. The QLoRA architecture is what made it work: adapter-based fine-tuning preserves the base model's general knowledge while learning the new behavior from limited examples.
LIMA: 1,000 Examples for Conversational AI
The LIMA paper (arXiv:2305.11206) from Meta AI may be the most striking result in this space. The researchers fine-tuned a 65 billion parameter model on exactly 1,000 carefully selected training examples. No RLHF, no reward modeling, no preference data. Just 1,000 well-chosen conversations.
Human evaluators preferred the resulting model's responses over GPT-4's in 19% of comparisons and considered them equivalent in many others. The paper's central argument is now known as the "superficial alignment hypothesis": almost all knowledge a model needs is already present in the pre-trained weights. Fine-tuning primarily teaches the model the format and style of desired outputs.
The key word in the LIMA result is "carefully." The researchers spent significant effort curating those 1,000 examples to ensure diversity, coverage, and quality. A thousand random examples would not produce the same result.
Microsoft Phi-3: Textbook Quality Over Volume
The Phi-3 family of models (arXiv:2404.14219, Microsoft Research) pushed this further. Phi-3 models, some as small as 3.8 billion parameters, achieve performance competitive with models three to five times their size. The training strategy deliberately prioritized what the researchers called "textbook quality" data: clear, well-structured, educationally dense content over raw volume.
The lesson for fine-tuning practitioners: the signal in your data matters more than the size of your dataset. An example that clearly demonstrates the target behavior once is worth more than ten ambiguous examples.
Data Requirements by Task Type
Not all fine-tuning tasks are equal. A model learning to reformat outputs in a consistent JSON structure needs far fewer examples than a model learning to reason through multi-step financial analysis. Here is a practical breakdown based on community findings, paper benchmarks, and OpenAI's published guidance:
| Task Type | Minimum | Sweet Spot |
|---|---|---|
| Format and style adaptation | 50-100 | 200-500 |
| Classification | 100-500 | 500-2K |
| Domain Q&A | 200-1K | 1K-5K |
| Complex reasoning | 1K-5K | 5K-20K |
A few notes on reading this table:
Format and style adaptation covers cases like: "always respond in this JSON structure," "use this specific vocabulary," "match this tone." The base model already knows how to produce structured output. You are teaching it a preference, not a skill. Fifty well-chosen examples can achieve this reliably.
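For format and style adaptation, each training example is typically one JSONL line with a `messages` array, following OpenAI's chat fine-tuning format. A minimal sketch of a single example (the system prompt and content are illustrative, not from any real dataset):

```python
import json

# One training example: a system prompt, a representative user input,
# and the exact assistant output you want the model to learn to produce.
example = {
    "messages": [
        {"role": "system", "content": "Respond only with valid JSON."},
        {"role": "user", "content": "Summarize: shipping is delayed 2 days."},
        {
            "role": "assistant",
            "content": json.dumps(
                {"summary": "Shipping delayed by 2 days.", "sentiment": "negative"}
            ),
        },
    ]
}

# Each example becomes one line in the .jsonl training file.
line = json.dumps(example)
print(line)
```

Fifty lines like this, each showing the exact output structure you want, is often enough to make the format stick.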
Classification varies significantly by the number of classes and their similarity. Binary sentiment analysis at 100 examples is achievable. A 50-class product taxonomy needs more data, especially for rare categories.
Domain Q&A benefits from diversity over volume. One hundred examples from ten different sub-topics will typically outperform 100 examples all drawn from the same narrow sub-topic.
Complex reasoning is where data hunger genuinely increases. Teaching a model to reason through a multi-step legal analysis or clinical decision involves genuinely new reasoning patterns, not just stylistic adaptation. Plan for more data here.
Quality Over Quantity: What Makes a Good Training Example
The LIMA finding and the Phi-3 approach share a common thread: the researchers were selective. They did not dump data into training. They curated.
A high-quality training example has four properties:
It demonstrates the target behavior clearly. There should be no ambiguity about what a correct response looks like. If you are training a support bot, the example response should be the exact quality of answer you want users to receive, not a rough draft.
It represents a real input distribution. Examples should reflect the kinds of inputs your users actually send. Training on carefully constructed examples that do not match real user behavior produces a model that performs well on your training set and poorly on live traffic.
It covers the edge cases. This is where most data collection goes wrong. Teams collect 200 examples of the easy, common cases and zero examples of the unusual inputs that make up 20% of production traffic. Your data should intentionally include the tricky cases.
It is consistent with your other examples. Contradictory training examples, where the model sees similar inputs with different expected outputs, confuse fine-tuning and degrade performance. Before training, review your data for internal consistency.
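The consistency check in particular is easy to automate. A sketch that flags near-duplicate inputs mapped to conflicting expected outputs (exact match on whitespace-normalized, lowercased input here; real datasets may need fuzzy matching):

```python
from collections import defaultdict

def find_contradictions(examples):
    """Group examples by normalized input; return inputs that map to
    more than one distinct expected output."""
    by_input = defaultdict(set)
    for ex in examples:
        key = " ".join(ex["input"].lower().split())  # normalize case/whitespace
        by_input[key].add(ex["output"])
    return {k: v for k, v in by_input.items() if len(v) > 1}

# Illustrative dataset with one internal contradiction.
dataset = [
    {"input": "Cancel my order", "output": "Sure, order cancelled."},
    {"input": "cancel my  order", "output": "Please call support to cancel."},
    {"input": "Track my package", "output": "Here is your tracking link."},
]
conflicts = find_contradictions(dataset)
print(conflicts)  # the two "cancel" examples disagree
```

Run a check like this before every training round; contradictions are cheap to find and expensive to train on.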
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Building Your Dataset From What You Already Have
The barrier to starting is usually not that data does not exist. It is that the data is not formatted. Here are the most common sources in app development and how to turn them into training examples.
App Logs and Conversation History
If your app already has an AI feature backed by a foundation model API, you have training data. Your production conversations are a direct record of what your users ask and what good responses look like.
The workflow:
- Export the last 30-90 days of conversations.
- Filter for high-quality interactions. If you have thumbs-up/thumbs-down feedback, use it. If not, sample and manually review.
- Remove any personally identifiable information. Apply whatever anonymization your privacy policy requires.
- Convert to your training format (typically JSONL with `messages` arrays).
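The steps above can be sketched as a small conversion script. This assumes a hypothetical export format where each record has `user_message`, `assistant_message`, and an optional `feedback` field; adapt the field names and the PII scrubbing to your own logs and privacy policy:

```python
import json
import re

def scrub_pii(text: str) -> str:
    """Very rough anonymization: mask emails and long digit runs.
    Real pipelines should apply whatever your privacy policy requires."""
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    text = re.sub(r"\d{6,}", "[NUMBER]", text)
    return text

def to_training_rows(conversations):
    """Keep thumbs-up interactions and convert them to chat-format rows."""
    rows = []
    for conv in conversations:
        if conv.get("feedback") != "up":  # filter for high-quality interactions
            continue
        rows.append({
            "messages": [
                {"role": "user", "content": scrub_pii(conv["user_message"])},
                {"role": "assistant", "content": scrub_pii(conv["assistant_message"])},
            ]
        })
    return rows

# Example: two logged conversations, one flagged good by the user.
logs = [
    {"user_message": "Reset my password, my email is a@b.com",
     "assistant_message": "Use the reset link in Settings.", "feedback": "up"},
    {"user_message": "asdf", "assistant_message": "?", "feedback": "down"},
]
rows = to_training_rows(logs)
jsonl = "\n".join(json.dumps(r) for r in rows)
print(jsonl)
```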
Even 200-300 high-quality conversations from production traffic will produce a better fine-tuned model than 1,000 hand-crafted examples that do not match real usage patterns.
Support Tickets and Help Desk Data
Customer support tickets are extraordinarily valuable training data for support automation use cases. They contain:
- Real user language, including misspellings and unusual phrasings
- A record of what the correct resolution looks like
- Volume that scales with your user base
The conversion process: extract the initial ticket plus the final agent response that closed the ticket. Strip tickets where the resolution required human judgment that should not be automated. Clean the agent responses to remove internal notes or case numbers. You now have instruction-following pairs that directly represent your task.
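A sketch of that conversion, assuming a hypothetical ticket schema with `first_message`, `resolution`, and a `needs_human` flag; the internal-note stripping here is just a placeholder regex to adapt to your help desk's conventions:

```python
import json
import re

def ticket_to_pair(ticket):
    """Turn a closed ticket into an instruction-following pair,
    or return None if it should not be used for training."""
    if ticket.get("needs_human"):  # resolution required human judgment
        return None
    answer = ticket["resolution"]
    # Strip internal notes and case numbers (placeholder patterns).
    answer = re.sub(r"\[internal:.*?\]", "", answer)
    answer = re.sub(r"case #\d+", "", answer, flags=re.IGNORECASE)
    return {
        "messages": [
            {"role": "user", "content": ticket["first_message"]},
            {"role": "assistant", "content": answer.strip()},
        ]
    }

pair = ticket_to_pair({
    "first_message": "app crashes when i upload a photo",
    "resolution": "Update to v2.3.1, which fixes the upload crash. "
                  "[internal: see case #4412]",
    "needs_human": False,
})
print(json.dumps(pair))
```

Note that the user side is left raw on purpose: misspellings and odd phrasings are part of the input distribution you want the model to handle.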
User Feedback and Corrections
If your app has any mechanism for users to flag bad outputs or request corrections, that feedback is training gold. A user who says "this response is wrong, the correct answer is X" has given you a labeled example of exactly where your model fails and what the correct behavior looks like.
Even a small volume of correction data is high-signal. Fifty examples of model failures with correct labels will improve a fine-tuned model measurably.
Internal Documentation and Knowledge Bases
For domain Q&A use cases, your product documentation, internal knowledge base, and FAQ content can be converted into training examples. The process:
- Take a section of documentation.
- Generate plausible user questions about that content (manually or with another LLM).
- Write model-quality answers grounded in the documentation.
- Review for accuracy and consistency.
This is slower but produces reliable examples with verified correct answers.
Synthetic Data: The Stanford Alpaca Approach
If you do not have enough existing data, you can generate it. The Stanford Alpaca approach showed this works at scale, and the economics have improved significantly since 2023.
The basic method:
- Write 20-30 seed examples that demonstrate your target task. These should be the best examples you can produce.
- Prompt a capable model (GPT-4o, Claude Opus, etc.) with your seed examples and ask it to generate diverse variations.
- Review the generated examples for quality. Expect to discard 10-20%.
- Use the approved synthetic examples as training data.
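Step 2 of that method can be sketched as a prompt builder. The prompt wording and the seed task are illustrative; the commented-out API call follows OpenAI's current Python SDK, but the model string is an assumption you should check against current offerings:

```python
import json

# Hypothetical seed examples for a support-ticket classification task.
seed_examples = [
    {"input": "Order #123 arrived damaged.", "output": "refund_request"},
    {"input": "How do I change my shipping address?", "output": "account_question"},
]

def build_generation_prompt(seeds, n_new=10):
    """Ask a frontier model for diverse variations of the seed task."""
    shown = "\n".join(json.dumps(s) for s in seeds)
    return (
        "Here are examples of a classification task, one JSON object per line:\n"
        f"{shown}\n\n"
        f"Generate {n_new} NEW examples in the same JSON format. "
        "Vary the topics, phrasing, and difficulty. Output one JSON object per line."
    )

prompt = build_generation_prompt(seed_examples)
print(prompt)

# Sending the prompt (requires an API key; uncomment to run):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o",
#     messages=[{"role": "user", "content": prompt}],
# )
# raw_examples = resp.choices[0].message.content.splitlines()
```

Parse each returned line with `json.loads`, discard anything that fails to parse, and then do the quality review pass before anything enters the training set.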
The cost for 500-1,000 synthetic examples using current frontier model pricing is typically $50-200, depending on output length and quality review pass count. For most tasks, this is under the cost of a few hours of manual data collection.
One important caveat: synthetic data generated from a closed model (like GPT-4o or Claude) cannot be used to train competing general-purpose models under most providers' terms of service. For narrowly scoped task-specific fine-tuning for your own application, this restriction generally does not apply, but review the applicable terms before proceeding.
The Iterative Approach: Start Small
The biggest mistake in fine-tuning data collection is waiting until you have "enough" data before starting. The right approach is to start with what you have, train, evaluate, and learn.
Here is the recommended progression:
Start with 100 examples. Train a LoRA adapter on your base model. Evaluate on a held-out test set of 20-30 examples. Note where the model fails.
Identify the failure modes. Does the model get the format wrong? Does it fail on certain topic areas? Does it refuse inputs it should answer? Each failure mode tells you exactly what kind of additional training data to collect.
Add targeted examples. Collect 50-100 examples specifically addressing the failure modes you identified. Retrain.
Repeat. Three to four iterations typically get you to a production-quality model, and the final dataset is usually still smaller than 1,000 examples.
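The evaluation step in this loop can be a very small script. A sketch assuming a hypothetical `model_answer()` callable for your fine-tuned model; tagging each held-out case with a category makes the failure modes visible at a glance:

```python
from collections import Counter

# Hypothetical held-out test set: input, expected output, and a category
# tag so failures can be grouped into modes.
test_set = [
    {"input": "classify: love it", "expected": "positive", "category": "sentiment"},
    {"input": "classify: meh", "expected": "neutral", "category": "edge_case"},
]

def evaluate(model_answer, cases):
    """Run the model on held-out cases; count failures per category."""
    failures = Counter()
    for case in cases:
        if model_answer(case["input"]) != case["expected"]:
            failures[case["category"]] += 1
    return failures

# Stand-in model for illustration: always answers "positive".
failures = evaluate(lambda _: "positive", test_set)
print(dict(failures))  # categories with the most failures need more data
```

The categories with the highest failure counts are exactly where the next 50-100 targeted examples should come from.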
This iterative approach works because it targets your data collection efforts at the actual gaps in your model's behavior rather than collecting data uniformly across the input space. You are not building a dataset that covers all possible inputs. You are building a dataset that covers the inputs your users actually send and fixing the cases where your model currently gets it wrong.
The Bottom Line
You almost certainly have enough data to start. You might have 200 support ticket resolutions in a spreadsheet. You have a product knowledge base. You have a few weeks of production conversation logs. You can generate 500 synthetic examples for under $100.
The research is unambiguous: 50 examples can improve style consistency. 100 examples can meaningfully shift a model's behavior. 1,000 well-curated examples can produce a conversational model that holds its own against frontier models on specific tasks.
The developers who never start because they are waiting for a bigger dataset are waiting for a threshold that, for their use case, may not be necessary. The developers who start with 100 examples, evaluate honestly, and iterate are the ones who ship.
Your dataset is smaller than you think. Your model is closer than you think. Start with what you have.