    From Teacher Model to Edge Device: A Data Prep Workflow for Model Distillation

    A step-by-step workflow for preparing training data when your target is an edge device with constrained compute. From defining hardware constraints to validating on-device performance.

    Ertas Team

    You have enterprise data. You have a target device — a phone with an NPU, a laptop with a neural engine, an edge appliance on a factory floor. You need a small model that performs one specific task well on that device.

    The path from enterprise data to deployed edge model has twelve steps. Most guides skip steps 4–8 — the data preparation steps — which is exactly why most edge AI projects underperform.

    Here is the complete workflow.

    Step 1: Define Target Constraints

    Before you touch a single document, define the deployment target in concrete terms.

    Hardware specification:

    • Device: Snapdragon 8 Gen 3 (Hexagon NPU), Apple A17 Pro (ANE), Intel Core Ultra (NPU), NVIDIA Jetson Orin, or specific edge hardware
    • Available memory for model: 2GB, 4GB, 8GB, 16GB
    • Compute budget: TOPS (tera operations per second) available for inference

    Model size budget:

    • 0.5B parameters: fits in ~300MB at Q4, suitable for mobile NPUs
    • 1B parameters: fits in ~600MB at Q4, suitable for tablets and phones with ≥6GB RAM
    • 3B parameters: fits in ~1.8GB at Q4, suitable for laptops and high-end tablets
    • 8B parameters: fits in ~4.5GB at Q4, suitable for laptops with dedicated neural engines

    Production parameters:

    • Context window: 512, 1024, or 2048 tokens (affects memory and latency)
    • Latency budget: 20ms, 50ms, 100ms, 200ms per inference
    • Output format: classification label, JSON object, short text, structured extraction
    • Throughput: queries per second the device must handle

    Document these before proceeding. They shape every subsequent decision.
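One lightweight way to document these constraints is a small spec object that every later step reads from. The class and example values below are illustrative, not a recommendation for any particular device.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EdgeTargetSpec:
    """Deployment constraints that every subsequent step reads from."""
    device: str                 # e.g. "Snapdragon 8 Gen 3 (Hexagon NPU)"
    memory_budget_mb: int       # RAM available for the model at runtime
    compute_budget_tops: float  # NPU throughput available for inference
    max_params_b: float         # parameter budget, in billions
    context_window: int         # production context length in tokens
    latency_budget_ms: int      # per-inference latency target
    output_format: str          # "classification", "json", "short_text", ...
    target_qps: float           # queries per second the device must sustain

# Illustrative example: a phone-class target for a sub-1B student model.
phone_target = EdgeTargetSpec(
    device="Snapdragon 8 Gen 3 (Hexagon NPU)",
    memory_budget_mb=2048,
    compute_budget_tops=45.0,
    max_params_b=0.5,
    context_window=512,
    latency_budget_ms=50,
    output_format="json",
    target_qps=2.0,
)
```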

    Step 2: Select the Teacher Model

    The teacher model defines your quality ceiling. It generates the synthetic training data that the student will learn from.

    For sub-1B student models: Use a 70B+ teacher. The quality gap between teacher and student is large (140x parameter difference), so you need the best possible teacher to maximize knowledge transfer.

    For 3B–8B student models: A 30B–70B teacher works well. The smaller gap means a slightly smaller teacher can still produce effective training data.

    Teacher model considerations:

    • The teacher should be fine-tuned on your domain if possible. A generic 70B model generating synthetic medical data produces less useful examples than a 70B model fine-tuned on clinical text.
    • The teacher runs on cloud GPUs during data generation. It does not need to fit on the target device.
    • If domain-specific fine-tuning of the teacher is not feasible, use RAG with your enterprise documents during synthetic generation.

    Step 3: Generate Synthetic Training Data

    Use the teacher model to generate domain-specific training examples. But constrain the generation.

    Generation parameters for sub-1B targets:

    • Max output length: match student's production context window (e.g., 512 tokens)
    • Temperature: 0.3–0.5 (consistency over diversity)
    • Reasoning depth: limit to 2–3 step chains
    • Output format: identical to production format in every example

    Generation parameters for 3B–8B targets:

    • Max output length: match student's production context window (e.g., 2048 tokens)
    • Temperature: 0.5–0.7 (moderate diversity)
    • Reasoning depth: 3–5 step chains
    • Output format: consistent with production requirements

    Generate 5–10x more examples than you expect to use. Filtering and expert review (Steps 5–6) will remove 60–80% of generated examples for sub-1B targets.
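As a sketch, the constrained generation loop might look like this, assuming the teacher is served behind an OpenAI-compatible endpoint. The base URL, model name, and prompts are placeholders; the generation parameters match the sub-1B settings above.

```python
import json
from openai import OpenAI  # any OpenAI-compatible client pointed at the teacher deployment

# Placeholder endpoint and credentials: point these at wherever the teacher is served.
client = OpenAI(base_url="http://teacher-host:8000/v1", api_key="EMPTY")

# Generation constraints for a sub-1B student target (see parameters above).
GEN_CONFIG = {"temperature": 0.4, "max_tokens": 512}

def generate_examples(task_prompts, system_prompt, out_path):
    """Ask the teacher for one training example per task prompt, in the production format."""
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in task_prompts:
            resp = client.chat.completions.create(
                model="teacher-70b",  # placeholder name for the teacher deployment
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": prompt},
                ],
                **GEN_CONFIG,
            )
            record = {"prompt": prompt, "completion": resp.choices[0].message.content}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```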

    Step 4: Ingest Enterprise Documents

    Your synthetic data generation needs domain grounding. The teacher model must reference your enterprise knowledge.

    Ingest raw enterprise documents — PDFs, Word files, scanned documents, database exports, conversation logs — into a structured format that the teacher can reference.

    Key considerations:

    • Parse documents preserving structure (headings, tables, lists) — not just raw text extraction
    • For construction: BOQs, technical drawings, specifications
    • For healthcare: clinical notes, discharge summaries, lab reports
    • For legal: contracts, pleadings, memoranda
    • For finance: financial statements, transaction records, regulatory filings

    This step must happen on-premise. Enterprise documents contain sensitive data that cannot be sent to cloud parsing services.
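A minimal sketch of the ingestion contract is shown below, with a trivial plain-text parser standing in for the structure-preserving PDF and Word parsers this step actually requires. All names here are illustrative.

```python
from pathlib import Path

def parse_plaintext(path: Path) -> list[dict]:
    """Trivial fallback: one section per file. Real parsers should preserve headings, tables, lists."""
    return [{"heading": path.stem, "text": path.read_text(encoding="utf-8", errors="ignore")}]

# Map extensions to parsers. Structure-aware PDF/DOCX parsers (run on-premise) would slot in here.
PARSERS = {".txt": parse_plaintext, ".md": parse_plaintext}

def ingest(doc_root: str) -> list[dict]:
    """Walk the document store and collect structured sections the teacher can reference."""
    corpus = []
    for path in sorted(Path(doc_root).rglob("*")):
        if not path.is_file():
            continue
        parser = PARSERS.get(path.suffix.lower())
        if parser is None:
            continue  # unsupported format: route to a structure-aware parser instead
        for section in parser(path):
            section["source"] = str(path)
            corpus.append(section)
    return corpus
```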

    Step 5: Clean and Filter

    This is where the distillation-aware data prep diverges most from standard fine-tuning data prep.

    Length filtering: Remove examples whose length falls outside the 10th–90th percentile band relative to your target context window. For a 512-token production context: discard examples shorter than 30 tokens or longer than 450 tokens.

    Complexity scoring: Run each example through a model of similar size to your student (or the student model itself if available). Measure perplexity. Discard examples above the 75th percentile — they exceed the student's learning capacity.

    Domain relevance scoring: Use embedding similarity against a curated set of 50–100 gold-standard examples. Discard examples below 0.7 cosine similarity.

    Deduplication: Apply MinHash with 0.85 similarity threshold. Retain only the highest-quality variant from each cluster.

    Format validation: Every example must conform to the exact production output format. One malformed JSON example can introduce a 3–5% failure rate in a sub-1B model.

    Expected outcome: 100,000 generated examples → 20,000–40,000 after filtering for sub-1B targets. 100,000 → 50,000–70,000 for 3B–8B targets.
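A stripped-down sketch of the filtering pass is below, covering only length filtering, format validation, and exact-match deduplication. The thresholds mirror the 512-token example above, whitespace token counts stand in for the student's tokenizer, and the perplexity, embedding-similarity, and MinHash steps are omitted. It assumes the prompt/completion records from Step 3.

```python
import json

def length_ok(example: dict, min_tokens: int = 30, max_tokens: int = 450) -> bool:
    """Crude token count by whitespace; swap in the student's tokenizer for real runs."""
    n = len(example["completion"].split())
    return min_tokens <= n <= max_tokens

def format_ok(example: dict) -> bool:
    """Assumes the production output format is a JSON object; adapt to your own format."""
    try:
        return isinstance(json.loads(example["completion"]), dict)
    except json.JSONDecodeError:
        return False

def dedupe(examples: list[dict]) -> list[dict]:
    """Exact-match dedup as a baseline; near-duplicate removal (e.g. MinHash) goes further."""
    seen, unique = set(), []
    for ex in examples:
        key = ex["completion"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

def clean(examples: list[dict]) -> list[dict]:
    kept = [ex for ex in examples if length_ok(ex) and format_ok(ex)]
    return dedupe(kept)
```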

    Step 6: Label with Domain Experts

    Automated filtering catches distribution issues. It does not catch factual errors, domain-specific inaccuracies, or subtle quality problems that only a subject matter expert would notice.

    Domain experts — doctors, lawyers, engineers, analysts — review a sample of the filtered dataset and label for quality:

    • Factually correct for this domain?
    • Appropriate level of detail for the production task?
    • Would this response be acceptable in production?

    For sub-1B targets, have experts fully review at least 2,000 examples from the filtered set. Use these expert-reviewed examples as a validation set.

    This step requires a tool that domain experts can use directly — not a Python notebook or command-line interface.

    Step 7: Augment

    After filtering and expert review, augment the dataset to fill gaps.

    Targeted augmentation: Analyze the filtered dataset for underrepresented categories, edge cases, or failure modes. Generate additional synthetic examples specifically targeting these gaps.

    Paraphrase generation: For each expert-reviewed example, generate 2–3 paraphrased variants. This increases training data diversity without changing the underlying distribution.

    Difficulty calibration: Generate examples at varying difficulty levels within the student model's capacity. Easy examples (80% of training data) build reliable baseline performance. Hard examples (20%) push the capability boundary.
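A sketch of gap analysis plus paraphrase prompting is below. It assumes each example already carries a category label from earlier relevance scoring or expert review; the prompt wording is illustrative.

```python
from collections import Counter

def find_gaps(examples: list[dict]) -> dict[str, int]:
    """Count how many extra examples each category needs to match the largest category."""
    counts = Counter(ex["category"] for ex in examples)
    target = max(counts.values())  # naive target: bring every category up to the largest one
    return {cat: target - n for cat, n in counts.items() if n < target}

def paraphrase_prompts(example: dict, n_variants: int = 3) -> list[str]:
    """Prompts to send back to the teacher for paraphrased variants."""
    return [
        f"Rewrite the following training example in different words (variant {i + 1}), "
        f"keeping the meaning and the exact output format unchanged:\n{example['completion']}"
        for i in range(n_variants)
    ]
```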

    Step 8: Export

    Export the final dataset as JSONL formatted for your fine-tuning framework. Include metadata:

    • Target model size and architecture
    • Target context window
    • Target quantization level
    • Filter thresholds applied
    • Expert review coverage percentage

    This metadata enables reproducibility and debugging when iterating.
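A minimal export sketch: write the JSONL, plus a metadata sidecar for reproducibility. It reuses the illustrative spec object from Step 1; the quantization level, filter thresholds, and coverage figure shown are placeholder values.

```python
import json
from datetime import datetime, timezone

def export_dataset(examples: list[dict], out_path: str, meta_path: str, spec) -> None:
    """Write the training set as JSONL plus a metadata sidecar describing how it was built."""
    with open(out_path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")

    metadata = {
        "exported_at": datetime.now(timezone.utc).isoformat(),
        "num_examples": len(examples),
        "target_params_b": spec.max_params_b,
        "target_context_window": spec.context_window,
        "target_quantization": "Q4",   # illustrative
        "filters": {"min_tokens": 30, "max_tokens": 450, "min_similarity": 0.7},
        "expert_review_coverage": 0.05,  # fraction of the filtered set reviewed by experts
    }
    with open(meta_path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)
```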

    Step 9: Fine-Tune the Student Model

    Train the student model on the prepared dataset using cloud GPUs. Standard fine-tuning process — LoRA or full fine-tuning depending on model size and dataset size.

    For sub-1B models: LoRA with rank 16–32 typically works well. Full fine-tuning is feasible given the small model size.

    For 3B–8B models: LoRA with rank 32–64 is more practical. Full fine-tuning requires more GPU memory and time.
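For the sub-1B case, a LoRA setup along these lines is typical with the peft library. The model identifier and target modules are placeholders that depend on the student architecture, and the actual training loop (for example, a supervised fine-tuning trainer) is omitted.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative rank/alpha from the guidance above; target_modules depend on the architecture.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("your-org/student-0.5b")  # placeholder model id
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```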

    Step 10: Quantize for Target Hardware

    Convert the fine-tuned model to the target precision:

    • Q4 (4-bit): smallest size, fastest inference, slight accuracy trade-off
    • Q5 (5-bit): moderate balance
    • Q8 (8-bit): highest accuracy among quantized formats, larger size

    For Qualcomm devices: use Qualcomm AI Hub for optimized quantization and compilation. For Apple: use Core ML tools. For general: ONNX Runtime or llama.cpp quantization.
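Before committing to a precision, a rough size estimate against the Step 1 memory budget helps. The effective bits-per-weight figures below are approximations, since quantized formats keep scales and some tensors at higher precision.

```python
def quantized_size_mb(params_billion: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Rough on-disk size: parameters x effective bits, plus ~10% for embeddings and metadata."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e6 * overhead

# Approximate effective bits per weight for common quantization levels (assumption, not exact).
for label, bits in [("Q4", 4.5), ("Q5", 5.5), ("Q8", 8.5)]:
    print(f"3B model at {label}: ~{quantized_size_mb(3.0, bits):.0f} MB")
```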

    Step 11: Validate on Target Hardware

    Deploy to the actual target device — not an emulator, not a cloud simulation, the real hardware. Measure:

    • Task accuracy against a held-out test set
    • Inference latency (p50, p95, p99)
    • Memory utilization
    • Battery impact (for mobile deployments)
    • Output format compliance rate

    Acceptance criteria: If accuracy is within 5 percentage points of the teacher model on the held-out test set and latency is within the budget, proceed. If not, return to Step 5.
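A small helper for turning on-device measurements into a pass/fail report is sketched below. The 5-point accuracy gap comes from the acceptance criteria above; gating latency on p95 (rather than p99) and treating accuracies as fractions in [0, 1] are assumptions.

```python
def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile over a list of measurements."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, round(pct / 100 * (len(ordered) - 1)))
    return ordered[idx]

def acceptance_report(latencies_ms, student_acc, teacher_acc, format_ok_rate, latency_budget_ms):
    """Summarize on-device measurements against the acceptance criteria."""
    p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
    accuracy_ok = (teacher_acc - student_acc) <= 0.05  # within 5 points of the teacher
    latency_ok = p95 <= latency_budget_ms              # p95 as the gate is an assumption
    return {
        "latency_ms": {"p50": p50, "p95": p95, "p99": p99},
        "accuracy_gap": teacher_acc - student_acc,
        "format_compliance": format_ok_rate,
        "pass": accuracy_ok and latency_ok,
    }
```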

    Step 12: Iterate

    On-device validation reveals failure modes that cloud benchmarks miss. When performance is below threshold:

    1. Analyze failure cases from on-device testing
    2. Categorize failures: data distribution, complexity, or missing edge cases?
    3. Return to Step 5 (filter differently) or Step 7 (augment targeting failure modes)
    4. Re-train, re-quantize, re-validate

    Expect 2–3 iterations for 3B–8B targets and 3–5 iterations for sub-1B targets.

    Where Ertas Fits

    Ertas Data Suite handles Steps 4–8 entirely on-premise. The Ingest module parses enterprise documents. Clean provides distillation-aware filtering. Label enables domain expert review without Python. Augment generates targeted synthetic data. Export produces JSONL with full metadata and audit trail.

    Steps 1–3 and 9–12 happen outside Ertas — target definition, teacher model generation, fine-tuning, quantization, and deployment use your existing ML infrastructure. Ertas provides the data preparation layer between raw enterprise data and the training pipeline.

    Book a Discovery Call to walk through this workflow with your specific hardware targets and data types.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
