
Synthetic Data Generation Optimized for Small Model Distillation
When building 0.5B–1B models for mobile NPU deployment, synthetic data quality matters far more than it does for large models. Here's how to generate, filter, and validate synthetic training data designed for small model distillation.
Building a 0.5B parameter model for mobile NPU deployment is fundamentally different from building a 70B model for cloud inference. The model is 140x smaller. Its tolerance for noisy or misaligned training data is near zero. And synthetic data — which increasingly dominates fine-tuning datasets — must be generated with constraints that most pipelines ignore.
The standard approach to synthetic data is: take a large teacher model, generate examples, use them to train a smaller student. This works passably well when the student is 7B–13B. It falls apart when the student is 0.5B–1B, because the teacher generates text at a complexity level the student fundamentally cannot reproduce.
Here is how to do it differently.
Why Standard Synthetic Data Fails for Sub-1B Models
A 70B teacher model generating a synthetic customer support response might produce a 400-word answer with conditional logic, empathetic framing, multi-step troubleshooting, and a personalized closing. That response is excellent training data for a 13B student model.
For a 0.5B model? It is training data for patterns the model cannot learn. The 0.5B model does not have the parameter capacity to encode conditional empathy, multi-step reasoning, AND domain knowledge simultaneously. It will learn fragments of each and execute none well.
The result: a model that generates responses that start coherently and degrade mid-sentence. Or a model that handles the most common case well and fails catastrophically on edge cases. Or a model that produces grammatically correct text that does not actually answer the question.
These are not model problems. They are data problems.
Teacher Model Selection
Your teacher model defines the quality ceiling for your synthetic data. But bigger is not always better.
For sub-1B targets: Use a 70B+ teacher for generation quality, but add a 7B model as a "complexity filter." If the 7B model cannot reproduce the teacher's output at 80%+ similarity, the example is too complex for the 0.5B student. This two-model filtering approach catches complexity issues that statistical metrics miss.
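A minimal sketch of that two-model filter, assuming the ~7B model's completion can be compared to the teacher's with a simple word-overlap ratio (a production pipeline would more likely use ROUGE-L or embedding cosine similarity); the 0.80 threshold and the `filter_generate` callable are illustrative, not part of any specific framework:

```python
# Sketch of the two-model complexity filter described above.
# Threshold and similarity metric are illustrative assumptions.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Cheap word-overlap proxy for output similarity; swap in ROUGE-L
    or embedding cosine similarity for a real pipeline."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

def passes_complexity_filter(prompt: str, teacher_output: str,
                             filter_generate, threshold: float = 0.80) -> bool:
    """filter_generate: callable that runs the ~7B 'complexity filter'
    model on the same prompt and returns its completion."""
    filter_output = filter_generate(prompt)
    return similarity(teacher_output, filter_output) >= threshold
```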
For 3B–8B targets: A 70B teacher is ideal. The capacity gap is smaller (10x–25x instead of 140x), so the teacher's output is more learnable. You can use broader, more complex examples.
Temperature and sampling. Generate at temperature 0.3–0.5 for sub-1B targets. Higher temperatures produce more diverse but also more complex text. At 0.5B scale, you want consistency over diversity. Save temperature diversity for larger student models.
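As a sketch, sampling in the 0.3–0.5 band with a Hugging Face `generate` call might look like the following; the teacher checkpoint name, the prompt, and the `top_p` value are placeholders rather than recommendations from this article:

```python
# Illustrative low-temperature generation settings for sub-1B targets.
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "your-local-70b-teacher"          # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
model = AutoModelForCausalLM.from_pretrained(teacher_name, device_map="auto")

prompt = "Draft a short troubleshooting reply for a customer who ..."  # placeholder
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.4,        # 0.3-0.5 band: consistency over diversity
    top_p=0.9,              # illustrative; tune for your task
    max_new_tokens=128,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```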
Filtering Strategies That Actually Matter
Generating synthetic data is the easy part. Filtering it is where the outcome is determined.
Length distribution matching. Measure your production input distribution. If users will send 50–200 token inputs to your on-device model, generate synthetic training inputs in that range. If your production outputs are 20–100 tokens, generate synthetic outputs in that range. Length mismatches are the single most common cause of on-device model failure.
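A sketch of that filter, assuming each example is a dict with `input`/`output` fields and lengths are measured with the student's tokenizer; the token bands mirror the 50–200 and 20–100 figures above:

```python
# Length filtering against a measured production distribution.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-student-checkpoint")  # placeholder

def within_length_bands(example: dict,
                        input_band=(50, 200),
                        output_band=(20, 100)) -> bool:
    """Keep only examples whose token lengths match production traffic."""
    n_in = len(tokenizer.encode(example["input"]))
    n_out = len(tokenizer.encode(example["output"]))
    return (input_band[0] <= n_in <= input_band[1]
            and output_band[0] <= n_out <= output_band[1])

# synthetic_examples: list of {"input": ..., "output": ...} dicts from the teacher
filtered = [ex for ex in synthetic_examples if within_length_bands(ex)]
```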
Complexity scoring. Run each synthetic example through the student model (or a model of similar size) and measure perplexity. High-perplexity examples are beyond the model's capacity. Set a perplexity threshold — typically the 75th percentile of a known-good validation set — and discard everything above it.
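One way to implement this, assuming a causal-LM student checkpoint loaded with Hugging Face Transformers and a small gold validation set; checkpoint and variable names are illustrative:

```python
# Perplexity-based complexity filter, scored with a student-sized model.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student = AutoModelForCausalLM.from_pretrained("your-0.5b-student")   # placeholder
tok = AutoTokenizer.from_pretrained("your-0.5b-student")

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids.to(student.device)
    loss = student(ids, labels=ids).loss        # mean token cross-entropy
    return float(torch.exp(loss))

# gold_validation_set / synthetic_examples: lists of {"output": ...} dicts (assumed)
threshold = np.percentile([perplexity(ex["output"]) for ex in gold_validation_set], 75)
kept = [ex for ex in synthetic_examples if perplexity(ex["output"]) <= threshold]
```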
Domain relevance scoring. Not every example the teacher generates is on-topic. Even with careful prompting, 10–15% of synthetic examples will drift off-domain. Use embedding similarity against a curated set of gold-standard examples to score domain relevance. Discard the bottom 20%.
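A possible implementation using sentence-transformers embeddings; the specific embedding model and the `input` field are assumptions, and any embedder you trust for your domain would work:

```python
# Domain relevance scoring via embedding similarity to gold examples.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative choice
gold_emb = embedder.encode([ex["input"] for ex in gold_examples],
                           normalize_embeddings=True)

def relevance(example: dict) -> float:
    """Best cosine similarity against the curated gold set."""
    emb = embedder.encode(example["input"], normalize_embeddings=True)
    return float(np.max(gold_emb @ emb))

scored = sorted(synthetic_examples, key=relevance)
kept = scored[int(0.2 * len(scored)):]               # discard the bottom 20%
```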
Deduplication at scale. Synthetic data from large language models tends to be more repetitive than human-generated data. The teacher has modes — common phrasings and structures it defaults to. At 0.5B scale, near-duplicates are particularly harmful because they over-represent certain patterns at the expense of breadth. Use MinHash or SimHash deduplication with a similarity threshold of 0.85.
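A minimal MinHash LSH sketch using the `datasketch` library at the 0.85 threshold; shingling on whitespace tokens is a simplification (character or word n-grams are common alternatives):

```python
# Near-duplicate removal with MinHash LSH at a 0.85 similarity threshold.
from datasketch import MinHash, MinHashLSH

lsh = MinHashLSH(threshold=0.85, num_perm=128)
kept = []
for i, ex in enumerate(synthetic_examples):
    m = MinHash(num_perm=128)
    for token in ex["output"].lower().split():   # simple whitespace shingles
        m.update(token.encode("utf-8"))
    if not lsh.query(m):                         # no existing near-duplicate
        lsh.insert(f"ex-{i}", m)
        kept.append(ex)
```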
Format consistency enforcement. If your production model outputs JSON, every training example must output valid JSON. If it outputs classification labels, every example must use the exact label vocabulary. Zero tolerance for format variation at sub-1B scale. One inconsistent example can introduce a failure mode that affects 5% of production outputs.
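A sketch of strict format validation for a JSON-labeled task; the schema key and label vocabulary below are hypothetical stand-ins for your own:

```python
# Format validation: strict JSON parsing plus a fixed label vocabulary.
import json

LABELS = {"refund", "shipping", "account", "other"}   # hypothetical label set

def valid_format(example: dict) -> bool:
    """Reject anything that is not valid JSON with an allowed label."""
    try:
        parsed = json.loads(example["output"])
    except json.JSONDecodeError:
        return False
    return parsed.get("label") in LABELS

kept = [ex for ex in synthetic_examples if valid_format(ex)]
```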
The Numbers
For a 0.5B–1B model targeting a specific enterprise task:
| Metric | Naive approach | Optimized approach |
|---|---|---|
| Synthetic examples generated | 100,000 | 100,000 |
| After length filtering | 100,000 | 65,000 |
| After complexity filtering | 100,000 | 40,000 |
| After domain relevance | 100,000 | 32,000 |
| After deduplication | 85,000 | 22,000 |
| After format validation | 80,000 | 20,000 |
| Final training set | 80,000 | 20,000 |
| On-device accuracy | 61–68% | 84–91% |
The optimized approach uses 75% fewer training examples and achieves roughly 20–30 percentage points higher accuracy, a pattern we have observed consistently across multiple enterprise domains. Quality over quantity is not a platitude at sub-1B scale — it is the entire strategy.
The Iterative Loop
Synthetic data generation for small model distillation is not a one-shot process. It is a loop:
1. Generate synthetic data with the teacher model
2. Filter using the criteria above
3. Fine-tune the student model on the filtered dataset
4. Deploy on target hardware (actual device, not emulator)
5. Measure production-representative metrics
6. Analyze failure cases
7. Generate targeted synthetic data for failure modes
8. Return to step 2
Most teams do steps 1–3 once and ship. The teams that achieve 90%+ accuracy on sub-1B models do 3–5 iterations of this loop, with each iteration targeting the specific failure modes revealed by on-device testing.
Enterprise Data and the On-Premise Requirement
There is a critical constraint that most synthetic data guides ignore: the source knowledge for your synthetic data is enterprise-proprietary.
If you are building a clinical triage model for mobile, the teacher model needs medical knowledge specific to your institution. If you are building a contract analysis model for a law firm's mobile app, the teacher needs examples of that firm's contract patterns.
This means either fine-tuning the teacher on enterprise data (which requires on-premise data prep) or using RAG with enterprise documents during generation (which requires on-premise infrastructure).
Either way, the synthetic data generation pipeline must run on-premise. Your 700GB of clinical notes or legal contracts cannot go to a cloud API to generate synthetic training data for an on-device model.
How Ertas Data Suite Handles This
Ertas Data Suite's Augment module runs synthetic data generation using local LLMs on your own hardware. No data egress. The generation is configured with target model constraints — specify 0.5B target, 512 context window, Q4 quantization — and the synthetic examples are automatically calibrated.
The Clean module provides the filtering pipeline: length distribution analysis, quality scoring, deduplication, and format validation. Every filter decision is logged with full audit trail.
The Export module outputs the filtered, validated dataset as JSONL ready for fine-tuning. Metadata tracks which examples passed which filters, so when you iterate (and you will iterate), you can trace performance improvements back to specific data decisions.
Book a Discovery Call to discuss synthetic data strategies for your on-device AI deployment.