
Synthetic Data Generation Optimized for Small Model Distillation
When building 0.5B–1B models for mobile NPU deployment, synthetic data quality matters far more than it does for large models. Here's how to generate, filter, and validate synthetic training data designed for small model distillation.
Building a 0.5B parameter model for mobile NPU deployment is fundamentally different from building a 70B model for cloud inference. The model is 140x smaller. Its tolerance for noisy or misaligned training data is near zero. And synthetic data — which increasingly dominates fine-tuning datasets — must be generated with constraints that most pipelines ignore.
The standard approach to synthetic data is: take a large teacher model, generate examples, use them to train a smaller student. This works passably well when the student is 7B–13B. It falls apart when the student is 0.5B–1B, because the teacher generates text at a complexity level the student fundamentally cannot reproduce.
Here is how to do it differently.
Why Standard Synthetic Data Fails for Sub-1B Models
A 70B teacher model generating a synthetic customer support response might produce a 400-word answer with conditional logic, empathetic framing, multi-step troubleshooting, and a personalized closing. That response is excellent training data for a 13B student model.
For a 0.5B model? It is training data for patterns the model cannot learn. The 0.5B model does not have the parameter capacity to encode conditional empathy, multi-step reasoning, AND domain knowledge simultaneously. It will learn fragments of each and execute none well.
The result: a model that generates responses that start coherently and degrade mid-sentence. Or a model that handles the most common case well and fails catastrophically on edge cases. Or a model that produces grammatically correct text that does not actually answer the question.
These are not model problems. They are data problems.
Teacher Model Selection
Your teacher model defines the quality ceiling for your synthetic data. But bigger is not always better.
For sub-1B targets: Use a 70B+ teacher for generation quality, but add a 7B model as a "complexity filter." If the 7B model cannot reproduce the teacher's output at 80%+ similarity, the example is too complex for the 0.5B student. This two-model filtering approach catches complexity issues that statistical metrics miss.
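A minimal sketch of that two-model filter, assuming the ~7B model's completion can be compared to the teacher's with a simple word-overlap ratio (a production pipeline would more likely use ROUGE-L or embedding cosine similarity); the 0.80 threshold and the `filter_generate` callable are illustrative, not part of any specific framework:

```python
# Sketch of the two-model complexity filter described above.
# Threshold and similarity metric are illustrative assumptions.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Cheap word-overlap proxy for output similarity; swap in ROUGE-L
    or embedding cosine similarity for a real pipeline."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

def passes_complexity_filter(prompt: str, teacher_output: str,
                             filter_generate, threshold: float = 0.80) -> bool:
    """filter_generate: callable that runs the ~7B 'complexity filter'
    model on the same prompt and returns its completion."""
    filter_output = filter_generate(prompt)
    return similarity(teacher_output, filter_output) >= threshold
```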
For 3B–8B targets: A 70B teacher is ideal. The capacity gap is smaller (10x–25x instead of 140x), so the teacher's output is more learnable. You can use broader, more complex examples.
Temperature and sampling. Generate at temperature 0.3–0.5 for sub-1B targets. Higher temperatures produce more diverse but also more complex text. At 0.5B scale, you want consistency over diversity. Save temperature diversity for larger student models.
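As a sketch, sampling in the 0.3–0.5 band with a Hugging Face `generate` call might look like the following; the teacher checkpoint name, the prompt, and the `top_p` value are placeholders rather than recommendations from this article:

```python
# Illustrative low-temperature generation settings for sub-1B targets.
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "your-local-70b-teacher"          # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
model = AutoModelForCausalLM.from_pretrained(teacher_name, device_map="auto")

prompt = "Draft a short troubleshooting reply for a customer who ..."  # placeholder
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.4,        # 0.3-0.5 band: consistency over diversity
    top_p=0.9,              # illustrative; tune for your task
    max_new_tokens=128,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```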
Filtering Strategies That Actually Matter
Generating synthetic data is the easy part. Filtering it is where the outcome is determined.
Length distribution matching. Measure your production input distribution. If users will send 50–200 token inputs to your on-device model, generate synthetic training inputs in that range. If your production outputs are 20–100 tokens, generate synthetic outputs in that range. Length mismatches are the single most common cause of on-device model failure.
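A sketch of that filter, assuming each example is a dict with `input`/`output` fields and lengths are measured with the student's tokenizer; the token bands mirror the 50–200 and 20–100 figures above:

```python
# Length filtering against a measured production distribution.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-student-checkpoint")  # placeholder

def within_length_bands(example: dict,
                        input_band=(50, 200),
                        output_band=(20, 100)) -> bool:
    """Keep only examples whose token lengths match production traffic."""
    n_in = len(tokenizer.encode(example["input"]))
    n_out = len(tokenizer.encode(example["output"]))
    return (input_band[0] <= n_in <= input_band[1]
            and output_band[0] <= n_out <= output_band[1])

# synthetic_examples: list of {"input": ..., "output": ...} dicts from the teacher
filtered = [ex for ex in synthetic_examples if within_length_bands(ex)]
```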
Complexity scoring. Run each synthetic example through the student model (or a model of similar size) and measure perplexity. High-perplexity examples are beyond the model's capacity. Set a perplexity threshold — typically the 75th percentile of a known-good validation set — and discard everything above it.
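One way to implement this, assuming a causal-LM student checkpoint loaded with Hugging Face Transformers and a small gold validation set; checkpoint and variable names are illustrative:

```python
# Perplexity-based complexity filter, scored with a student-sized model.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student = AutoModelForCausalLM.from_pretrained("your-0.5b-student")   # placeholder
tok = AutoTokenizer.from_pretrained("your-0.5b-student")

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids.to(student.device)
    loss = student(ids, labels=ids).loss        # mean token cross-entropy
    return float(torch.exp(loss))

# gold_validation_set / synthetic_examples: lists of {"output": ...} dicts (assumed)
threshold = np.percentile([perplexity(ex["output"]) for ex in gold_validation_set], 75)
kept = [ex for ex in synthetic_examples if perplexity(ex["output"]) <= threshold]
```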
Domain relevance scoring. Not every example the teacher generates is on-topic. Even with careful prompting, 10–15% of synthetic examples will drift off-domain. Use embedding similarity against a curated set of gold-standard examples to score domain relevance. Discard the bottom 20%.
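A possible implementation using sentence-transformers embeddings; the specific embedding model and the `input` field are assumptions, and any embedder you trust for your domain would work:

```python
# Domain relevance scoring via embedding similarity to gold examples.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative choice
gold_emb = embedder.encode([ex["input"] for ex in gold_examples],
                           normalize_embeddings=True)

def relevance(example: dict) -> float:
    """Best cosine similarity against the curated gold set."""
    emb = embedder.encode(example["input"], normalize_embeddings=True)
    return float(np.max(gold_emb @ emb))

scored = sorted(synthetic_examples, key=relevance)
kept = scored[int(0.2 * len(scored)):]               # discard the bottom 20%
```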
Deduplication at scale. Synthetic data from large language models tends to be more repetitive than human-generated data. The teacher has modes — common phrasings and structures it defaults to. At 0.5B scale, near-duplicates are particularly harmful because they over-represent certain patterns at the expense of breadth. Use MinHash or SimHash deduplication with a similarity threshold of 0.85.
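A minimal MinHash LSH sketch using the `datasketch` library at the 0.85 threshold; shingling on whitespace tokens is a simplification (character or word n-grams are common alternatives):

```python
# Near-duplicate removal with MinHash LSH at a 0.85 similarity threshold.
from datasketch import MinHash, MinHashLSH

lsh = MinHashLSH(threshold=0.85, num_perm=128)
kept = []
for i, ex in enumerate(synthetic_examples):
    m = MinHash(num_perm=128)
    for token in ex["output"].lower().split():   # simple whitespace shingles
        m.update(token.encode("utf-8"))
    if not lsh.query(m):                         # no existing near-duplicate
        lsh.insert(f"ex-{i}", m)
        kept.append(ex)
```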
Format consistency enforcement. If your production model outputs JSON, every training example must output valid JSON. If it outputs classification labels, every example must use the exact label vocabulary. Zero tolerance for format variation at sub-1B scale. One inconsistent example can introduce a failure mode that affects 5% of production outputs.
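A sketch of strict format validation for a JSON-labeled task; the schema key and label vocabulary below are hypothetical stand-ins for your own:

```python
# Format validation: strict JSON parsing plus a fixed label vocabulary.
import json

LABELS = {"refund", "shipping", "account", "other"}   # hypothetical label set

def valid_format(example: dict) -> bool:
    """Reject anything that is not valid JSON with an allowed label."""
    try:
        parsed = json.loads(example["output"])
    except json.JSONDecodeError:
        return False
    return parsed.get("label") in LABELS

kept = [ex for ex in synthetic_examples if valid_format(ex)]
```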
The Numbers
For a 0.5B–1B model targeting a specific enterprise task:
| Metric | Naive approach | Optimized approach |
|---|---|---|
| Synthetic examples generated | 100,000 | 100,000 |
| After length filtering | 100,000 | 65,000 |
| After complexity filtering | 100,000 | 40,000 |
| After domain relevance | 100,000 | 32,000 |
| After deduplication | 85,000 | 22,000 |
| After format validation | 80,000 | 20,000 |
| Final training set | 80,000 | 20,000 |
| On-device accuracy | 61–68% | 84–91% |
The optimized approach uses 75% fewer training examples and achieves roughly 20–30 percentage points higher accuracy, a pattern we have observed consistently across multiple enterprise domains. Quality over quantity is not a platitude at sub-1B scale — it is the entire strategy.
The Iterative Loop
Synthetic data generation for small model distillation is not a one-shot process. It is a loop:
1. Generate synthetic data with the teacher model
2. Filter using the criteria above
3. Fine-tune the student model on the filtered dataset
4. Deploy on target hardware (actual device, not emulator)
5. Measure production-representative metrics
6. Analyze failure cases
7. Generate targeted synthetic data for failure modes
8. Return to step 2
Most teams do steps 1–3 once and ship. The teams that achieve 90%+ accuracy on sub-1B models do 3–5 iterations of this loop, with each iteration targeting the specific failure modes revealed by on-device testing.
Enterprise Data and the On-Premise Requirement
There is a critical constraint that most synthetic data guides ignore: the source knowledge for your synthetic data is enterprise-proprietary.
If you are building a clinical triage model for mobile, the teacher model needs medical knowledge specific to your institution. If you are building a contract analysis model for a law firm's mobile app, the teacher needs examples of that firm's contract patterns.
This means either fine-tuning the teacher on enterprise data (which requires on-premise data prep) or using RAG with enterprise documents during generation (which requires on-premise infrastructure).
Either way, the synthetic data generation pipeline must run on-premise. Your 700GB of clinical notes or legal contracts cannot go to a cloud API to generate synthetic training data for an on-device model.
How Ertas Data Suite Handles This
Ertas Data Suite's Augment module runs synthetic data generation using local LLMs on your own hardware. No data egress. The generation is configured with target model constraints — specify 0.5B target, 512 context window, Q4 quantization — and the synthetic examples are automatically calibrated.
The Clean module provides the filtering pipeline: length distribution analysis, quality scoring, deduplication, and format validation. Every filter decision is logged with full audit trail.
The Export module outputs the filtered, validated dataset as JSONL ready for fine-tuning. Metadata tracks which examples passed which filters, so when you iterate (and you will iterate), you can trace performance improvements back to specific data decisions.
Book a Discovery Call to discuss synthetic data strategies for your on-device AI deployment.