
Synthetic Data for Fine-Tuning: How to Generate Training Data That Actually Works
A practical guide to generating synthetic training data for fine-tuning — covering prompt strategies, quality filtering, distribution matching, and the 80/20 rule for mixing real and synthetic data.
Fine-tuning is limited by data. Not compute, not model architecture, not hyperparameter tuning. Data. Specifically, the lack of enough high-quality labeled examples to teach a smaller model a specific task.
Most teams that attempt fine-tuning discover this within the first week. They have 50 to 200 real examples, they need 1,000 to 5,000, and collecting more at the quality level required would take months. This is the data bottleneck, and it is the single most common reason fine-tuning projects stall or get abandoned.
Synthetic data generation offers a way through. You use a frontier model — GPT-4o, Claude 3.5, Gemini — to generate the training examples your smaller model will learn from. Done well, synthetic data can get you 80-90% of the performance of an equivalent human-curated dataset at 1/100th the cost and 1/1000th the time.
Done poorly, it produces a dataset that teaches your model to hallucinate confidently. The difference comes down to technique.
Why Synthetic Data Works (And When It Does Not)
The core insight behind synthetic data is straightforward: frontier models already know how to perform most narrow tasks well. They are just too expensive, too slow, or too risky (privacy, vendor lock-in) to use in production at scale. Fine-tuning transfers that capability into a smaller, cheaper, locally runnable model.
Synthetic data works best when:
- The task is well-defined. Classification, extraction, summarization, format conversion — tasks with clear right and wrong answers.
- The frontier model performs the task well. If GPT-4o cannot do the task reliably, its generated examples will not teach your model to do it either.
- You have some real examples to anchor generation. Even 20-50 real examples dramatically improve the quality of synthetic data by providing concrete patterns to expand from.
Synthetic data struggles when:
- The task requires rare domain knowledge the frontier model does not have (e.g., internal company jargon, proprietary classification schemes).
- Output quality is highly subjective with no clear evaluation criteria.
- The distribution of real inputs is unusual — synthetic data tends toward the "average" case and underrepresents edge cases.
Strategy 1: Direct Prompting
The simplest approach. You describe the task and ask the frontier model to generate input-output pairs.
Generate 25 customer support emails about subscription billing issues.
For each email, provide:
1. The email text (vary length from 2-8 sentences, vary tone from frustrated to polite)
2. The correct category: "billing_error", "cancellation_request", "upgrade_inquiry", "refund_request", "payment_method_update"
3. The priority level: "low", "medium", "high"
Format as JSON. Make examples diverse — include typos, different writing styles, and varying levels of detail.
Cost: Generating 100 examples with GPT-4o costs roughly $0.15-0.30 depending on output length. At scale, generating 5,000 examples runs $8-15.
Strengths: Fast to set up, easy to iterate on the prompt, good for well-understood tasks.
Weaknesses: Output diversity plateaus quickly. After 200-300 examples from the same prompt, you start getting repetitive patterns. The model converges on a narrow band of "typical" examples.
Fix the diversity problem: Batch your generation in sets of 25-50, and vary the prompt each time. Change the constraints, specify different personas, require different edge cases. Five prompts generating 200 examples each will produce more diverse data than one prompt generating 1,000.
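Here is a minimal sketch of that batching loop, assuming the official OpenAI Python SDK; the tone and constraint lists are illustrative placeholders, and you would swap in whatever variation axes matter for your task:

```python
import itertools

from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative variation axes -- replace with the dimensions that matter for your task.
TONES = ["frustrated", "polite", "terse", "rambling"]
CONSTRAINTS = [
    "Include at least one typo per email.",
    "Reference a specific dollar amount.",
    "Mention a previous support interaction.",
    "Write as a non-native English speaker.",
]

def build_prompt(tone: str, constraint: str) -> str:
    return (
        f"Generate 25 customer support emails about subscription billing issues. "
        f"Overall tone: {tone}. Additional constraint: {constraint}. "
        "For each email, provide the text, a category, and a priority level. "
        "Format as a JSON array."
    )

# 4 tones x 4 constraints = 16 distinct prompts, 25 examples each (~400 total).
raw_batches = []
for tone, constraint in itertools.product(TONES, CONSTRAINTS):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": build_prompt(tone, constraint)}],
    )
    # Keep the raw text here; parsing and validation happen in the filtering pipeline.
    raw_batches.append(response.choices[0].message.content)
```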
Strategy 2: Seed Expansion
Start with your real examples and use them as seeds for generating more. This is the highest-ROI strategy if you have even a small set of real data.
Here are 5 real customer support emails from our system:
[paste 5 real examples]
Generate 25 new examples that match the style, complexity, and topic distribution of these real examples. Maintain the same format. Include examples that are:
- Similar difficulty to the originals
- Slightly harder edge cases
- Shorter and more ambiguous versions
- Longer and more detailed versions
Cost: Slightly higher per example due to the seed context (~$0.20-0.40 per 100 examples with GPT-4o). For 5,000 examples: $10-20.
Strengths: Output distribution matches your real data much more closely. The model picks up on patterns, vocabulary, and edge cases that are hard to describe in a prompt.
Weaknesses: If your seed examples are biased or unrepresentative, the bias gets amplified. Always use a diverse set of seeds.
Best practice: Rotate your seed set. If you have 50 real examples, randomly sample 5-10 different seeds for each batch. This prevents the model from over-indexing on any single example's patterns.
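A sketch of that rotation, assuming your real examples live in a JSONL file (the path is hypothetical); fixing the RNG seed keeps batches reproducible:

```python
import json
import random

def load_seeds(path: str) -> list[dict]:
    """Load real examples from a JSONL file (one JSON object per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def sample_seed_batch(seeds: list[dict], rng: random.Random, k: int = 7) -> list[dict]:
    """Randomly sample k seeds without replacement for one generation batch."""
    return rng.sample(seeds, k=min(k, len(seeds)))

def build_seed_prompt(seed_batch: list[dict]) -> str:
    seed_text = "\n\n".join(json.dumps(s, indent=2) for s in seed_batch)
    return (
        f"Here are {len(seed_batch)} real customer support emails from our system:\n\n"
        f"{seed_text}\n\n"
        "Generate 25 new examples that match the style, complexity, and topic "
        "distribution of these real examples. Maintain the same format."
    )

seeds = load_seeds("real_examples.jsonl")  # hypothetical path
rng = random.Random(42)  # fixed seed so batches are reproducible
prompts = [build_seed_prompt(sample_seed_batch(seeds, rng)) for _ in range(20)]
```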
Strategy 3: Chain-of-Thought Extraction
For tasks where reasoning matters — not just the final answer — generate examples that include the reasoning process. This produces better training data for tasks like classification with explanations, multi-step extraction, or decision-making.
Task: Classify this legal clause and explain why.
Clause: "The Licensee shall not reverse-engineer, decompile, or disassemble the Software."
Think step by step:
1. This clause restricts specific actions (reverse-engineering, decompiling, disassembling)
2. These restrictions apply to "the Licensee" — the party receiving the license
3. This is a standard IP protection clause found in software license agreements
4. Classification: "intellectual_property_restriction"
Now generate 20 similar examples with full reasoning chains. Vary the clause types across: IP restrictions, liability limitations, termination conditions, payment terms, confidentiality obligations.
Cost: Higher per example because outputs are longer (~$0.40-0.80 per 100 examples). For 5,000 examples: $20-40.
Strengths: Models trained on chain-of-thought data are more reliable and more interpretable. They learn the reasoning pattern, not just the answer.
Weaknesses: Slower to generate and more expensive. The reasoning chains themselves need quality review — a plausible-sounding but incorrect reasoning chain is worse than no reasoning at all.
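Once the reasoning chains pass review, each example gets packed into your trainer's format. A minimal sketch, assuming the common `{"messages": [...]}` chat JSONL convention (adjust the structure to whatever your fine-tuning stack expects):

```python
import json

def to_chat_record(clause: str, reasoning_steps: list[str], label: str) -> dict:
    """Pack one chain-of-thought example into a chat-format training record."""
    reasoning = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(reasoning_steps))
    return {
        "messages": [
            {"role": "user",
             "content": f'Classify this legal clause and explain why.\n\nClause: "{clause}"'},
            {"role": "assistant",
             "content": f'Think step by step:\n{reasoning}\n\nClassification: "{label}"'},
        ]
    }

record = to_chat_record(
    clause="The Licensee shall not reverse-engineer, decompile, or disassemble the Software.",
    reasoning_steps=[
        "This clause restricts specific actions (reverse-engineering, decompiling, disassembling)",
        'These restrictions apply to "the Licensee", the party receiving the license',
        "This is a standard IP protection clause found in software license agreements",
    ],
    label="intellectual_property_restriction",
)
print(json.dumps(record))  # one line of training JSONL
```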
Strategy 4: Persona-Based Generation
Generate data from multiple simulated personas to maximize diversity. This is particularly effective for customer-facing tasks where inputs come from different types of users.
You are a frustrated small business owner who has been double-charged on your monthly subscription. Write a support email. Be specific about the amount ($49.99) and include at least one emotional statement. Keep it under 100 words.
Then vary the persona:
You are a polite enterprise IT admin submitting a ticket on behalf of a team member. Write a formal support email about the same billing issue. Use professional language and reference a ticket number.
Cost: Similar to direct prompting. 5,000 examples: $8-15.
Strengths: Produces genuinely diverse data because each persona brings different vocabulary, tone, structure, and concerns. This diversity is exactly what prevents your fine-tuned model from learning a narrow "synthetic" style.
Weaknesses: Requires more prompt engineering effort upfront. You need to define 10-20 distinct personas to get meaningful diversity.
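A sketch of how that persona library can drive generation; the personas and scenarios below are illustrative, and a real setup would expand to the 10-20 personas mentioned above:

```python
import itertools

# Illustrative lists -- expand to 10-20 personas before relying on this for diversity.
PERSONAS = [
    "a frustrated small business owner who was double-charged",
    "a polite enterprise IT admin filing a ticket for a colleague",
    "a confused first-time user who cannot find the billing page",
    "a terse developer who pastes an error message and little else",
]
SCENARIOS = [
    "a duplicate charge on the monthly subscription",
    "a refund request after an accidental upgrade",
    "a failed payment that is blocking account access",
]

def persona_prompt(persona: str, scenario: str) -> str:
    return (
        f"You are {persona}. Write a support email about {scenario}. "
        "Stay fully in character: match the vocabulary, tone, and level of "
        "detail this person would actually use. Keep it under 150 words."
    )

# Every persona x scenario pair becomes its own generation prompt.
prompts = [persona_prompt(p, s) for p, s in itertools.product(PERSONAS, SCENARIOS)]
```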
The Quality Filtering Pipeline
Raw synthetic data is not training data. It needs filtering. Expect to discard 15-30% of generated examples.
Step 1: Format validation. Parse every example. Reject anything that does not match your expected schema (missing fields, wrong types, malformed JSON). This catches 5-10% of examples.
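A minimal validator, reusing the schema from the Strategy 1 prompt (and assuming each generated example arrives as a JSON string):

```python
import json

REQUIRED_FIELDS = {"email": str, "category": str, "priority": str}
VALID_CATEGORIES = {"billing_error", "cancellation_request", "upgrade_inquiry",
                    "refund_request", "payment_method_update"}
VALID_PRIORITIES = {"low", "medium", "high"}

def validate(raw: str) -> dict | None:
    """Return the parsed example if it matches the schema, else None."""
    try:
        ex = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed JSON
    if not isinstance(ex, dict):
        return None  # top level must be an object
    for field, ftype in REQUIRED_FIELDS.items():
        if not isinstance(ex.get(field), ftype):
            return None  # missing field or wrong type
    if ex["category"] not in VALID_CATEGORIES or ex["priority"] not in VALID_PRIORITIES:
        return None  # label outside the allowed set
    return ex
```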
Step 2: Deduplication. Use embedding similarity (cosine similarity > 0.95) to find near-duplicates. Synthetic data has a strong tendency toward repetition, especially in later batches. This typically removes another 5-10%.
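A sketch of that dedup pass, assuming the sentence-transformers library with a small local embedding model; the pairwise comparison is O(n²) but fine at a few thousand examples:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def deduplicate(texts: list[str], threshold: float = 0.95) -> list[str]:
    """Drop any example whose embedding is >threshold cosine-similar to a kept one."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # small local embedding model
    emb = model.encode(texts, normalize_embeddings=True)  # unit vectors: dot = cosine
    keep: list[int] = []
    for i in range(len(texts)):
        # Compare against every example already kept; skip near-duplicates.
        if keep and np.max(emb[i] @ emb[keep].T) > threshold:
            continue
        keep.append(i)
    return [texts[i] for i in keep]
```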
Step 3: Label verification. For classification tasks, run a sample of generated examples through a second frontier model and check agreement. If the two models disagree on the label, flag the example for review. Agreement rate should be >90%; if it is lower, your task definition may be ambiguous.
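The agreement check itself is a few lines; `relabel` stands in for whatever function calls your second frontier model and returns a category string (hypothetical here):

```python
import random
from typing import Callable

def agreement_rate(examples: list[dict], relabel: Callable[[str], str],
                   sample_size: int = 200) -> float:
    """Re-label a random sample with a second model and measure label agreement."""
    sample = random.sample(examples, k=min(sample_size, len(examples)))
    agree, flagged = 0, []
    for ex in sample:
        if relabel(ex["email"]) == ex["category"]:
            agree += 1
        else:
            flagged.append(ex)  # route disagreements to human review
    rate = agree / len(sample)
    if rate < 0.90:
        print(f"Warning: agreement {rate:.1%}; the task definition may be ambiguous")
    return rate
```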
Step 4: Difficulty distribution. Check that your dataset includes easy, medium, and hard examples. Synthetic data skews toward medium difficulty. If >70% of your examples are "obvious" classifications, intentionally generate more edge cases.
Step 5: Length and complexity distribution. Plot the distribution of input lengths and compare to your real data. Synthetic inputs tend to cluster around a modal length. Add explicit length constraints to your generation prompts to spread the distribution.
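Even without a plot, a percentile comparison surfaces the clustering problem quickly; a minimal sketch:

```python
import numpy as np

def length_report(real_texts: list[str], synthetic_texts: list[str]) -> None:
    """Compare input-length distributions between real and synthetic data."""
    for name, texts in [("real", real_texts), ("synthetic", synthetic_texts)]:
        lengths = np.array([len(t.split()) for t in texts])
        p10, p50, p90 = np.percentile(lengths, [10, 50, 90])
        print(f"{name:>9}: p10={p10:.0f}  median={p50:.0f}  p90={p90:.0f} words")
    # A much narrower synthetic p10-p90 band means generation is clustering
    # around a modal length: add explicit length constraints to the prompts.
```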
Distribution Matching: The Step Most People Skip
Your synthetic data needs to match the distribution of inputs your model will see in production. This means matching:
- Category distribution. If 40% of real support tickets are billing-related, 40% of your training data should be too. Do not generate an equal number of examples per category unless your production traffic is actually balanced (it almost never is).
- Input complexity distribution. If real inputs range from 10 to 500 words, your synthetic data should cover that range proportionally.
- Edge case frequency. If 5% of real inputs are ambiguous or multi-category, include that proportion in training data.
The easiest way to do this: analyze your existing real data (even if it is small) to estimate the distribution, then set explicit quotas for each category and complexity level in your generation prompts.
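Deriving those quotas is a one-liner once you have labels for your real data; a sketch (the example counts are made up):

```python
from collections import Counter

def category_quotas(real_labels: list[str], target_size: int) -> dict[str, int]:
    """Derive per-category generation quotas from the real data's label distribution."""
    counts = Counter(real_labels)
    total = sum(counts.values())
    # Rounding can leave the quotas a few examples off target_size; close enough here.
    return {cat: round(target_size * n / total) for cat, n in counts.items()}

# e.g. 120 labeled real examples driving a 5,000-example synthetic dataset
real_labels = (["billing_error"] * 48 + ["cancellation_request"] * 42
               + ["refund_request"] * 30)
print(category_quotas(real_labels, target_size=5000))
# {'billing_error': 2000, 'cancellation_request': 1750, 'refund_request': 1250}
```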
The 80/20 Rule: Mixing Real and Synthetic Data
The most effective training datasets are not purely synthetic. They mix real and synthetic data. The ratios that work consistently well across tasks:
- 80% synthetic, 20% real for tasks where you have limited real data (50-200 examples)
- 50/50 when you have more real data (500+)
- 20% synthetic, 80% real when you have abundant real data and use synthetic only to fill gaps in underrepresented categories
Why include real data at all? Real examples anchor the model to actual production patterns. They contain the noise, imperfections, and edge cases that synthetic data systematically underrepresents. A model trained on 100% synthetic data performs well on synthetic-like inputs but degrades on the messiness of real-world data.
Practical recommendation: Start with 80/20. Fine-tune, evaluate on a held-out set of real examples only (never evaluate on synthetic data — that is circular). If performance is below target, increase the real data proportion rather than generating more synthetic data. More synthetic data has diminishing returns; more real data almost always helps.
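A sketch of that split-then-mix step; the function and parameter names here are hypothetical:

```python
import random

def build_training_mix(real: list[dict], synthetic: list[dict],
                       synth_ratio: float = 0.8, eval_holdout: int = 50,
                       seed: int = 42) -> tuple[list[dict], list[dict]]:
    """Hold out real-only eval examples, then mix the rest at synth_ratio."""
    rng = random.Random(seed)
    real = real[:]
    rng.shuffle(real)
    eval_set = real[:eval_holdout]   # real examples only, never trained on
    real_train = real[eval_holdout:]
    # Size the synthetic portion so the final mix hits the target ratio
    # (e.g. synth_ratio=0.8 means 4 synthetic examples per real one).
    n_synth = int(len(real_train) * synth_ratio / (1 - synth_ratio))
    train = real_train + rng.sample(synthetic, k=min(n_synth, len(synthetic)))
    rng.shuffle(train)
    return train, eval_set
```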
Cost Breakdown: What This Actually Costs
Generating a complete fine-tuning dataset with synthetic data is remarkably cheap:
| Dataset Size | GPT-4o Cost | Claude 3.5 Cost | Time (wall clock) |
|---|---|---|---|
| 500 examples | $1-3 | $1-2 | 15-30 min |
| 1,000 examples | $2-5 | $2-4 | 30-60 min |
| 5,000 examples | $8-20 | $8-15 | 2-4 hours |
| 10,000 examples | $15-40 | $15-30 | 4-8 hours |
These costs assume medium-length inputs and outputs (100-300 tokens each). Tasks with longer outputs (e.g., summarization, long-form generation) will cost 2-3x more.
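If you want to sanity-check these numbers for your own task, the arithmetic is simple. The per-million-token prices below are placeholders (verify against your provider's current rates), and the sketch assumes 25 examples per API call:

```python
# Assumed placeholder prices -- verify against your provider's current rates.
INPUT_PRICE_PER_M = 2.50    # $ per 1M input tokens
OUTPUT_PRICE_PER_M = 10.00  # $ per 1M output tokens

def estimate_cost(n_examples: int, prompt_tokens: int = 200,
                  tokens_per_example: int = 200, examples_per_call: int = 25) -> float:
    """Rough generation cost in dollars for a batched synthetic-data run."""
    calls = n_examples / examples_per_call
    total_in = calls * prompt_tokens              # each call pays for one prompt
    total_out = n_examples * tokens_per_example   # each example pays output tokens
    return (total_in * INPUT_PRICE_PER_M + total_out * OUTPUT_PRICE_PER_M) / 1_000_000

print(f"${estimate_cost(5_000):.2f}")  # ~$10 at these assumed rates
```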
For comparison, human annotation runs $0.50-2.00 per example (typical rates on platforms like Scale AI or Labelbox):
| Dataset Size | Human Annotation Cost | Time |
|---|---|---|
| 1,000 examples | $500-2,000 | 1-4 weeks |
| 5,000 examples | $2,500-10,000 | 4-12 weeks |
Synthetic data is not free — but it is 100-500x cheaper than human annotation for most tasks.
How Ertas Vault Fits In
Ertas Vault is where your synthetic data workflow lives. Upload your seed examples, run generation pipelines, apply automated quality filters, and manage versioned datasets — all without sending data to external services.
The key advantages for synthetic data specifically:
- Seed management. Tag and organize your real examples. Sample diverse seeds automatically for each generation batch.
- Quality pipeline. Built-in deduplication, format validation, and distribution analysis. See where your dataset is thin before you generate more.
- Version control. Track every generation run. Compare model performance across dataset versions. Roll back if a batch degrades quality.
- Privacy-first. Your real seed data never leaves your infrastructure. Generate synthetic data locally or through your own API keys. No third-party data processing agreements required.
The data bottleneck is real, but it is solvable. Synthetic data generation is not a shortcut — it is a legitimate engineering technique that, when done with proper quality controls, produces fine-tuning datasets that work.
Ship AI that runs on your users' devices.
Ertas early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Related Reading
- Synthetic Data Generation for Fine-Tuning: Techniques That Work — deeper dive into specific generation techniques and quality signals
- How to Fine-Tune an LLM: The Complete Practical Guide — end-to-end fine-tuning walkthrough from data preparation to deployment