
From Teacher Model to Edge Device: A Data Prep Workflow for Model Distillation
A step-by-step workflow for preparing training data when your target is an edge device with constrained compute. From defining hardware constraints to validating on-device performance.
You have enterprise data. You have a target device — a phone with an NPU, a laptop with a neural engine, an edge appliance on a factory floor. You need a small model that performs one specific task well on that device.
The path from enterprise data to deployed edge model has twelve steps. Most guides skip Steps 4–8 — the data preparation steps — which is exactly why most edge AI projects underperform.
Here is the complete workflow.
Step 1: Define Target Constraints
Before you touch a single document, define the deployment target in concrete terms.
Hardware specification:
- Device: Snapdragon 8 Gen 3 (Hexagon NPU), Apple A17 Pro (ANE), Intel Core Ultra (NPU), NVIDIA Jetson Orin, or specific edge hardware
- Available memory for model: 2GB, 4GB, 8GB, 16GB
- Compute budget: TOPS (tera operations per second) available for inference
Model size budget:
- 0.5B parameters: fits in ~300MB at Q4, suitable for mobile NPUs
- 1B parameters: fits in ~600MB at Q4, suitable for tablets and phones with ≥6GB RAM
- 3B parameters: fits in ~1.8GB at Q4, suitable for laptops and high-end tablets
- 8B parameters: fits in ~4.5GB at Q4, suitable for laptops with dedicated neural engines
Production parameters:
- Context window: 512, 1024, or 2048 tokens (affects memory and latency)
- Latency budget: 20ms, 50ms, 100ms, 200ms per inference
- Output format: classification label, JSON object, short text, structured extraction
- Throughput: queries per second the device must handle
Document these before proceeding. They shape every subsequent decision.
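Captured as code, these constraints give every later step something concrete to validate against. A minimal sketch in Python; the field names and example values are illustrative, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeploymentTarget:
    """Deployment constraints, fixed before any data work begins."""
    device: str               # e.g. "Snapdragon 8 Gen 3 (Hexagon NPU)"
    model_memory_gb: float    # memory available for the model on device
    compute_tops: float       # inference compute budget in TOPS
    param_budget_b: float     # model size budget in billions of parameters
    context_window: int       # production context window in tokens
    latency_budget_ms: int    # per-inference latency budget
    output_format: str        # e.g. "json", "classification_label"
    throughput_qps: float     # queries per second the device must handle

# Example: a phone-class target. Every subsequent step reads from this record.
target = DeploymentTarget(
    device="Snapdragon 8 Gen 3 (Hexagon NPU)",
    model_memory_gb=2.0,
    compute_tops=45.0,
    param_budget_b=0.5,
    context_window=512,
    latency_budget_ms=50,
    output_format="json",
    throughput_qps=5.0,
)
```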
Step 2: Select the Teacher Model
The teacher model defines your quality ceiling. It generates the synthetic training data that the student will learn from.
For sub-1B student models: Use a 70B+ teacher. The parameter gap between teacher and student is large (a 70B teacher is 140x the size of a 0.5B student), so you need the best possible teacher to maximize knowledge transfer.
For 3B–8B student models: A 30B–70B teacher works well. The smaller gap means a slightly smaller teacher can still produce effective training data.
Teacher model considerations:
- The teacher should be fine-tuned on your domain if possible. A generic 70B model generating synthetic medical data produces less useful examples than a 70B model fine-tuned on clinical text.
- The teacher runs on cloud GPUs during data generation. It does not need to fit on the target device.
- If domain-specific fine-tuning of the teacher is not feasible, use RAG with your enterprise documents during synthetic generation.
Step 3: Generate Synthetic Training Data
Use the teacher model to generate domain-specific training examples. But constrain the generation.
Generation parameters for sub-1B targets:
- Max output length: match student's production context window (e.g., 512 tokens)
- Temperature: 0.3–0.5 (consistency over diversity)
- Reasoning depth: limit to 2–3 step chains
- Output format: identical to production format in every example
Generation parameters for 3B–8B targets:
- Max output length: match student's production context window (e.g., 2048 tokens)
- Temperature: 0.5–0.7 (moderate diversity)
- Reasoning depth: 3–5 step chains
- Output format: consistent with production requirements
Generate 5–10x more examples than you expect to use. Filtering and expert review (Steps 5–6) will remove 60–80% of generated examples for sub-1B targets.
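As a concrete sketch, here is what constrained generation could look like against a teacher served behind an OpenAI-compatible endpoint (vLLM, TGI, or similar). The endpoint URL, model name, and prompts are placeholders:

```python
from openai import OpenAI

# Assumes the 70B teacher is served behind an OpenAI-compatible endpoint;
# base_url and model name are placeholders for your deployment.
client = OpenAI(base_url="http://teacher-host:8000/v1", api_key="unused")

def generate_example(seed_prompt: str, target_context: int = 512) -> str:
    response = client.chat.completions.create(
        model="teacher-70b",
        messages=[
            {"role": "system", "content": (
                "Answer in the exact production output format. "
                "Use at most 2-3 reasoning steps."
            )},
            {"role": "user", "content": seed_prompt},
        ],
        temperature=0.4,            # consistency over diversity for sub-1B targets
        max_tokens=target_context,  # match the student's production context window
    )
    return response.choices[0].message.content
```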
Step 4: Ingest Enterprise Documents
Your synthetic data generation needs domain grounding. The teacher model must reference your enterprise knowledge.
Ingest raw enterprise documents — PDFs, Word files, scanned documents, database exports, conversation logs — into a structured format that the teacher can reference.
Key considerations:
- Parse documents preserving structure (headings, tables, lists) — not just raw text extraction
- For construction: BOQs, technical drawings, specifications
- For healthcare: clinical notes, discharge summaries, lab reports
- For legal: contracts, pleadings, memoranda
- For finance: financial statements, transaction records, regulatory filings
This step must happen on-premise. Enterprise documents contain sensitive data that cannot be sent to cloud parsing services.
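For illustration, a minimal structure-preserving parse using the open-source unstructured library, which runs entirely on-premise; the file path is a placeholder:

```python
from unstructured.partition.auto import partition

# partition() detects the file type and returns typed elements
# (Title, NarrativeText, Table, ListItem, ...) instead of flat text.
elements = partition(filename="discharge_summary.pdf")  # placeholder path
records = [{"category": el.category, "text": el.text} for el in elements]
```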
Step 5: Clean and Filter
This is where the distillation-aware data prep diverges most from standard fine-tuning data prep.
Length filtering: Remove examples whose length falls outside the 10th–90th percentile of the length distribution, bounded by your target context window. For a 512-token production context, that typically means discarding examples shorter than ~30 tokens or longer than ~450 tokens.
Complexity scoring: Run each example through a model of similar size to your student (or the student model itself if available). Measure perplexity. Discard examples above the 75th percentile — they exceed the student's learning capacity.
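A sketch of this scoring step, assuming a small open model as the proxy; the model name is illustrative, and `examples` stands for the length-filtered candidates:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Proxy model of similar size to the student; the name is illustrative.
name = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()

examples = [{"text": "..."}]  # placeholder: length-filtered candidates

# Discard everything above the 75th percentile of perplexity.
scores = [perplexity(ex["text"]) for ex in examples]
cutoff = torch.tensor(scores).quantile(0.75).item()
examples = [ex for ex, s in zip(examples, scores) if s <= cutoff]
```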
Domain relevance scoring: Use embedding similarity against a curated set of 50–100 gold-standard examples. Discard examples below 0.7 cosine similarity.
Deduplication: Apply MinHash with 0.85 similarity threshold. Retain only the highest-quality variant from each cluster.
Format validation: Every example must conform to the exact production output format. One malformed JSON example can introduce a 3–5% failure rate in a sub-1B model.
Expected outcome: 100,000 generated examples → 20,000–40,000 after filtering for sub-1B targets. 100,000 → 50,000–70,000 for 3B–8B targets.
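The remaining filters can be sketched in a few lines, assuming the sentence-transformers and datasketch libraries. The embedding model is an illustrative choice, the thresholds follow the text above, and the placeholder lists stand in for real data:

```python
import json
from datasketch import MinHash, MinHashLSH
from sentence_transformers import SentenceTransformer, util

examples = [...]  # placeholder: complexity-filtered dicts with "text"/"output"
gold = [...]      # placeholder: 50-100 expert-curated gold-standard texts

# Domain relevance: keep examples within 0.7 cosine similarity of some gold example.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
gold_emb = embedder.encode(gold, convert_to_tensor=True)
ex_emb = embedder.encode([e["text"] for e in examples], convert_to_tensor=True)
max_sim = util.cos_sim(ex_emb, gold_emb).max(dim=1).values
examples = [e for e, s in zip(examples, max_sim) if s >= 0.7]

# Near-duplicate removal with MinHash LSH at 0.85. Sorting by a quality score
# first (an assumed field) means each cluster keeps its highest-quality variant.
def signature(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.85, num_perm=128)
deduped = []
for i, ex in enumerate(sorted(examples, key=lambda e: e.get("quality", 0),
                              reverse=True)):
    sig = signature(ex["text"])
    if not lsh.query(sig):  # no near-duplicate already retained
        lsh.insert(str(i), sig)
        deduped.append(ex)

# Format validation: every retained example must parse as production JSON.
def valid_json(ex: dict) -> bool:
    try:
        json.loads(ex["output"])
        return True
    except (KeyError, json.JSONDecodeError):
        return False

final = [ex for ex in deduped if valid_json(ex)]
```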
Step 6: Label with Domain Experts
Automated filtering catches distribution issues. It does not catch factual errors, domain-specific inaccuracies, or subtle quality problems that only a subject matter expert would notice.
Domain experts — doctors, lawyers, engineers, analysts — review a sample of the filtered dataset and label for quality:
- Factually correct for this domain?
- Appropriate level of detail for the production task?
- Would this response be acceptable in production?
For sub-1B targets, aim for 100% expert review of at least 2,000 examples from the filtered set. Use these expert-reviewed examples as a validation set.
This step requires a tool that domain experts can use directly — not a Python notebook or command-line interface.
Step 7: Augment
After filtering and expert review, augment the dataset to fill gaps.
Targeted augmentation: Analyze the filtered dataset for underrepresented categories, edge cases, or failure modes. Generate additional synthetic examples specifically targeting these gaps.
Paraphrase generation: For each expert-reviewed example, generate 2–3 paraphrased variants. This increases training data diversity without changing the underlying distribution.
Difficulty calibration: Generate examples at varying difficulty levels within the student model's capacity. Easy examples (80% of training data) build reliable baseline performance. Hard examples (20%) push the capability boundary.
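A sketch of paraphrase generation against the same teacher endpoint used in Step 3; the prompt and parameters are illustrative:

```python
from openai import OpenAI

client = OpenAI(base_url="http://teacher-host:8000/v1", api_key="unused")  # as in Step 3

def paraphrase(example: dict, n_variants: int = 3) -> list[dict]:
    """Generate surface-level variants without changing meaning or output."""
    variants = []
    for _ in range(n_variants):
        response = client.chat.completions.create(
            model="teacher-70b",
            messages=[{
                "role": "user",
                "content": ("Rephrase the following input without changing its "
                            "meaning or its expected output:\n\n" + example["text"]),
            }],
            temperature=0.8,  # higher temperature here: diversity is the point
            max_tokens=512,
        )
        variants.append({**example, "text": response.choices[0].message.content})
    return variants
```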
Step 8: Export
Export the final dataset as JSONL formatted for your fine-tuning framework. Include metadata:
- Target model size and architecture
- Target context window
- Target quantization level
- Filter thresholds applied
- Expert review coverage percentage
This metadata enables reproducibility and debugging when iterating.
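A minimal export sketch; the metadata field names are an assumption, not a fixed schema, and `final` stands for the dataset produced by Steps 5–7:

```python
import json
from datetime import date

final = [...]  # placeholder: filtered, reviewed, augmented examples

metadata = {
    "target_model": "0.5B decoder-only",
    "target_context_window": 512,
    "target_quantization": "Q4",
    "filters": {"length": [30, 450], "perplexity_pct": 75,
                "relevance_cos": 0.7, "minhash": 0.85},
    "expert_review_coverage": 1.0,
    "exported": date.today().isoformat(),
}

with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps({"_metadata": metadata}) + "\n")  # header record
    for ex in final:
        f.write(json.dumps({"input": ex["text"], "output": ex["output"]},
                           ensure_ascii=False) + "\n")
```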
Step 9: Fine-Tune the Student Model
Train the student model on the prepared dataset using cloud GPUs. Standard fine-tuning process — LoRA or full fine-tuning depending on model size and dataset size.
For sub-1B models: LoRA with rank 16–32 typically works well. Full fine-tuning is feasible given the small model size.
For 3B–8B models: LoRA with rank 32–64 is more practical. Full fine-tuning requires more GPU memory and time.
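For reference, a LoRA configuration sketch using the peft library; the target modules vary by architecture and are illustrative here:

```python
from peft import LoraConfig

# Illustrative setup for a sub-1B student; raise r to 32-64 for 3B-8B models.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# Wrap the base model with: peft.get_peft_model(model, lora_config)
```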
Step 10: Quantize for Target Hardware
Convert the fine-tuned model to the target precision:
- Q4 (4-bit): smallest size, fastest inference, slight accuracy trade-off
- Q5 (5-bit): moderate balance
- Q8 (8-bit): highest accuracy among quantized formats, larger size
For Qualcomm devices, use Qualcomm AI Hub for optimized quantization and compilation. For Apple devices, use Core ML Tools. For everything else, use ONNX Runtime or llama.cpp quantization.
Step 11: Validate on Target Hardware
Deploy to the actual target device — not an emulator, not a cloud simulation, the real hardware. Measure:
- Task accuracy against a held-out test set
- Inference latency (p50, p95, p99)
- Memory utilization
- Battery impact (for mobile deployments)
- Output format compliance rate
Acceptance criteria: If accuracy is within 5 percentage points of the teacher model on the held-out test set and latency is within the budget, proceed. If not, return to Step 5.
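A small sketch of this acceptance check, computed from latency samples and accuracy numbers collected on the device; function and variable names are illustrative:

```python
import statistics

def accept(latencies_ms: list[float], student_acc: float,
           teacher_acc: float, latency_budget_ms: float) -> bool:
    """Acceptance gate: accuracy within 5 points of teacher, p95 within budget."""
    q = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
    p50, p95, p99 = q[49], q[94], q[98]
    print(f"p50={p50:.1f}ms p95={p95:.1f}ms p99={p99:.1f}ms")
    within_accuracy = (teacher_acc - student_acc) <= 0.05  # accuracies as fractions
    within_latency = p95 <= latency_budget_ms
    return within_accuracy and within_latency
```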
Step 12: Iterate
On-device validation reveals failure modes that cloud benchmarks miss. When performance is below threshold:
- Analyze failure cases from on-device testing
- Categorize failures: data distribution, complexity, or missing edge cases?
- Return to Step 5 (filter differently) or Step 7 (augment targeting failure modes)
- Re-train, re-quantize, re-validate
Expect 2–3 iterations for 3B–8B targets and 3–5 iterations for sub-1B targets.
Where Ertas Fits
Ertas Data Suite handles Steps 4–8 entirely on-premise. The Ingest module parses enterprise documents. Clean provides distillation-aware filtering. Label enables domain expert review without Python. Augment generates targeted synthetic data. Export produces JSONL with full metadata and audit trail.
Steps 1–3 and 9–12 happen outside Ertas — target definition, teacher model generation, fine-tuning, quantization, and deployment use your existing ML infrastructure. Ertas provides the data preparation layer between raw enterprise data and the training pipeline.
Book a Discovery Call to walk through this workflow with your specific hardware targets and data types.