    From Teacher Model to Edge Device: A Data Prep Workflow for Model Distillation

    A step-by-step workflow for preparing training data when your target is an edge device with constrained compute. From defining hardware constraints to validating on-device performance.

    Ertas Team

    You have enterprise data. You have a target device — a phone with an NPU, a laptop with a neural engine, an edge appliance on a factory floor. You need a small model that performs one specific task well on that device.

    The path from enterprise data to deployed edge model has twelve steps. Most guides skip steps 4–8 — the data preparation steps — which is exactly why most edge AI projects underperform.

    Here is the complete workflow.

    Step 1: Define Target Constraints

    Before you touch a single document, define the deployment target in concrete terms.

    Hardware specification:

    • Device: Snapdragon 8 Gen 3 (Hexagon NPU), Apple A17 Pro (ANE), Intel Core Ultra (NPU), NVIDIA Jetson Orin, or specific edge hardware
    • Available memory for model: 2GB, 4GB, 8GB, 16GB
    • Compute budget: TOPS (tera operations per second) available for inference

    Model size budget:

    • 0.5B parameters: fits in ~300MB at Q4, suitable for mobile NPUs
    • 1B parameters: fits in ~600MB at Q4, suitable for tablets and phones with ≥6GB RAM
    • 3B parameters: fits in ~1.8GB at Q4, suitable for laptops and high-end tablets
    • 8B parameters: fits in ~4.5GB at Q4, suitable for laptops with dedicated neural engines

    Production parameters:

    • Context window: 512, 1024, or 2048 tokens (affects memory and latency)
    • Latency budget: 20ms, 50ms, 100ms, 200ms per inference
    • Output format: classification label, JSON object, short text, structured extraction
    • Throughput: queries per second the device must handle

    Document these before proceeding. They shape every subsequent decision.
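One lightweight way to document these constraints is a small spec object that every later step reads from. The class and example values below are illustrative, not a recommendation for any particular device.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EdgeTargetSpec:
    """Deployment constraints that every subsequent step reads from."""
    device: str                 # e.g. "Snapdragon 8 Gen 3 (Hexagon NPU)"
    memory_budget_mb: int       # RAM available for the model at runtime
    compute_budget_tops: float  # NPU throughput available for inference
    max_params_b: float         # parameter budget, in billions
    context_window: int         # production context length in tokens
    latency_budget_ms: int      # per-inference latency target
    output_format: str          # "classification", "json", "short_text", ...
    target_qps: float           # queries per second the device must sustain

# Illustrative example: a phone-class target for a sub-1B student model.
phone_target = EdgeTargetSpec(
    device="Snapdragon 8 Gen 3 (Hexagon NPU)",
    memory_budget_mb=2048,
    compute_budget_tops=45.0,
    max_params_b=0.5,
    context_window=512,
    latency_budget_ms=50,
    output_format="json",
    target_qps=2.0,
)
```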

    Step 2: Select the Teacher Model

    The teacher model defines your quality ceiling. It generates the synthetic training data that the student will learn from.

    For sub-1B student models: Use a 70B+ teacher. The quality gap between teacher and student is large (140x parameter difference), so you need the best possible teacher to maximize knowledge transfer.

    For 3B–8B student models: A 30B–70B teacher works well. The smaller gap means a slightly smaller teacher can still produce effective training data.

    Teacher model considerations:

    • The teacher should be fine-tuned on your domain if possible. A generic 70B model generating synthetic medical data produces less useful examples than a 70B model fine-tuned on clinical text.
    • The teacher runs on cloud GPUs during data generation. It does not need to fit on the target device.
    • If domain-specific fine-tuning of the teacher is not feasible, use RAG with your enterprise documents during synthetic generation.

    Step 3: Generate Synthetic Training Data

    Use the teacher model to generate domain-specific training examples. But constrain the generation.

    Generation parameters for sub-1B targets:

    • Max output length: match student's production context window (e.g., 512 tokens)
    • Temperature: 0.3–0.5 (consistency over diversity)
    • Reasoning depth: limit to 2–3 step chains
    • Output format: identical to production format in every example

    Generation parameters for 3B–8B targets:

    • Max output length: match student's production context window (e.g., 2048 tokens)
    • Temperature: 0.5–0.7 (moderate diversity)
    • Reasoning depth: 3–5 step chains
    • Output format: consistent with production requirements

    Generate 5–10x more examples than you expect to use. Filtering and expert review (Steps 5–6) will remove 60–80% of generated examples for sub-1B targets.
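As a sketch, the constrained generation loop might look like this, assuming the teacher is served behind an OpenAI-compatible endpoint. The base URL, model name, and prompts are placeholders; the generation parameters match the sub-1B settings above.

```python
import json
from openai import OpenAI  # any OpenAI-compatible client pointed at the teacher deployment

# Placeholder endpoint and credentials: point these at wherever the teacher is served.
client = OpenAI(base_url="http://teacher-host:8000/v1", api_key="EMPTY")

# Generation constraints for a sub-1B student target (see parameters above).
GEN_CONFIG = {"temperature": 0.4, "max_tokens": 512}

def generate_examples(task_prompts, system_prompt, out_path):
    """Ask the teacher for one training example per task prompt, in the production format."""
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in task_prompts:
            resp = client.chat.completions.create(
                model="teacher-70b",  # placeholder name for the teacher deployment
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": prompt},
                ],
                **GEN_CONFIG,
            )
            record = {"prompt": prompt, "completion": resp.choices[0].message.content}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```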

    Step 4: Ingest Enterprise Documents

    Your synthetic data generation needs domain grounding. The teacher model must reference your enterprise knowledge.

    Ingest raw enterprise documents — PDFs, Word files, scanned documents, database exports, conversation logs — into a structured format that the teacher can reference.

    Key considerations:

    • Parse documents preserving structure (headings, tables, lists) — not just raw text extraction
    • For construction: BOQs, technical drawings, specifications
    • For healthcare: clinical notes, discharge summaries, lab reports
    • For legal: contracts, pleadings, memoranda
    • For finance: financial statements, transaction records, regulatory filings

    This step must happen on-premise. Enterprise documents contain sensitive data that cannot be sent to cloud parsing services.
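A minimal sketch of the ingestion contract is shown below, with a trivial plain-text parser standing in for the structure-preserving PDF and Word parsers this step actually requires. All names here are illustrative.

```python
from pathlib import Path

def parse_plaintext(path: Path) -> list[dict]:
    """Trivial fallback: one section per file. Real parsers should preserve headings, tables, lists."""
    return [{"heading": path.stem, "text": path.read_text(encoding="utf-8", errors="ignore")}]

# Map extensions to parsers. Structure-aware PDF/DOCX parsers (run on-premise) would slot in here.
PARSERS = {".txt": parse_plaintext, ".md": parse_plaintext}

def ingest(doc_root: str) -> list[dict]:
    """Walk the document store and collect structured sections the teacher can reference."""
    corpus = []
    for path in sorted(Path(doc_root).rglob("*")):
        if not path.is_file():
            continue
        parser = PARSERS.get(path.suffix.lower())
        if parser is None:
            continue  # unsupported format: route to a structure-aware parser instead
        for section in parser(path):
            section["source"] = str(path)
            corpus.append(section)
    return corpus
```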

    Step 5: Clean and Filter

    This is where the distillation-aware data prep diverges most from standard fine-tuning data prep.

    Length filtering: Remove examples whose length falls outside the 10th–90th percentile band relative to your target context window. For a 512-token production context: discard examples shorter than 30 tokens or longer than 450 tokens.

    Complexity scoring: Run each example through a model of similar size to your student (or the student model itself if available). Measure perplexity. Discard examples above the 75th percentile — they exceed the student's learning capacity.

    Domain relevance scoring: Use embedding similarity against a curated set of 50–100 gold-standard examples. Discard examples below 0.7 cosine similarity.

    Deduplication: Apply MinHash with 0.85 similarity threshold. Retain only the highest-quality variant from each cluster.

    Format validation: Every example must conform to the exact production output format. One malformed JSON example can introduce a 3–5% failure rate in a sub-1B model.

    Expected outcome: 100,000 generated examples → 20,000–40,000 after filtering for sub-1B targets. 100,000 → 50,000–70,000 for 3B–8B targets.
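A stripped-down sketch of the filtering pass is below, covering only length filtering, format validation, and exact-match deduplication. The thresholds mirror the 512-token example above, whitespace token counts stand in for the student's tokenizer, and the perplexity, embedding-similarity, and MinHash steps are omitted. It assumes the prompt/completion records from Step 3.

```python
import json

def length_ok(example: dict, min_tokens: int = 30, max_tokens: int = 450) -> bool:
    """Crude token count by whitespace; swap in the student's tokenizer for real runs."""
    n = len(example["completion"].split())
    return min_tokens <= n <= max_tokens

def format_ok(example: dict) -> bool:
    """Assumes the production output format is a JSON object; adapt to your own format."""
    try:
        return isinstance(json.loads(example["completion"]), dict)
    except json.JSONDecodeError:
        return False

def dedupe(examples: list[dict]) -> list[dict]:
    """Exact-match dedup as a baseline; near-duplicate removal (e.g. MinHash) goes further."""
    seen, unique = set(), []
    for ex in examples:
        key = ex["completion"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

def clean(examples: list[dict]) -> list[dict]:
    kept = [ex for ex in examples if length_ok(ex) and format_ok(ex)]
    return dedupe(kept)
```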

    Step 6: Label with Domain Experts

    Automated filtering catches distribution issues. It does not catch factual errors, domain-specific inaccuracies, or subtle quality problems that only a subject matter expert would notice.

    Domain experts — doctors, lawyers, engineers, analysts — review a sample of the filtered dataset and label for quality:

    • Factually correct for this domain?
    • Appropriate level of detail for the production task?
    • Would this response be acceptable in production?

    For sub-1B targets, have experts fully review at least 2,000 examples from the filtered set. Use these expert-reviewed examples as a validation set.

    This step requires a tool that domain experts can use directly — not a Python notebook or command-line interface.

    Step 7: Augment

    After filtering and expert review, augment the dataset to fill gaps.

    Targeted augmentation: Analyze the filtered dataset for underrepresented categories, edge cases, or failure modes. Generate additional synthetic examples specifically targeting these gaps.

    Paraphrase generation: For each expert-reviewed example, generate 2–3 paraphrased variants. This increases training data diversity without changing the underlying distribution.

    Difficulty calibration: Generate examples at varying difficulty levels within the student model's capacity. Easy examples (80% of training data) build reliable baseline performance. Hard examples (20%) push the capability boundary.
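A sketch of gap analysis plus paraphrase prompting is below. It assumes each example already carries a category label from earlier relevance scoring or expert review; the prompt wording is illustrative.

```python
from collections import Counter

def find_gaps(examples: list[dict]) -> dict[str, int]:
    """Count how many extra examples each category needs to match the largest category."""
    counts = Counter(ex["category"] for ex in examples)
    target = max(counts.values())  # naive target: bring every category up to the largest one
    return {cat: target - n for cat, n in counts.items() if n < target}

def paraphrase_prompts(example: dict, n_variants: int = 3) -> list[str]:
    """Prompts to send back to the teacher for paraphrased variants."""
    return [
        f"Rewrite the following training example in different words (variant {i + 1}), "
        f"keeping the meaning and the exact output format unchanged:\n{example['completion']}"
        for i in range(n_variants)
    ]
```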

    Step 8: Export

    Export the final dataset as JSONL formatted for your fine-tuning framework. Include metadata:

    • Target model size and architecture
    • Target context window
    • Target quantization level
    • Filter thresholds applied
    • Expert review coverage percentage

    This metadata enables reproducibility and debugging when iterating.
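A minimal export sketch: write the JSONL, plus a metadata sidecar for reproducibility. It reuses the illustrative spec object from Step 1; the quantization level, filter thresholds, and coverage figure shown are placeholder values.

```python
import json
from datetime import datetime, timezone

def export_dataset(examples: list[dict], out_path: str, meta_path: str, spec) -> None:
    """Write the training set as JSONL plus a metadata sidecar describing how it was built."""
    with open(out_path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")

    metadata = {
        "exported_at": datetime.now(timezone.utc).isoformat(),
        "num_examples": len(examples),
        "target_params_b": spec.max_params_b,
        "target_context_window": spec.context_window,
        "target_quantization": "Q4",   # illustrative
        "filters": {"min_tokens": 30, "max_tokens": 450, "min_similarity": 0.7},
        "expert_review_coverage": 0.05,  # fraction of the filtered set reviewed by experts
    }
    with open(meta_path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)
```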

    Step 9: Fine-Tune the Student Model

    Train the student model on the prepared dataset using cloud GPUs. Standard fine-tuning process — LoRA or full fine-tuning depending on model size and dataset size.

    For sub-1B models: LoRA with rank 16–32 typically works well. Full fine-tuning is feasible given the small model size.

    For 3B–8B models: LoRA with rank 32–64 is more practical. Full fine-tuning requires more GPU memory and time.
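For the sub-1B case, a LoRA setup along these lines is typical with the peft library. The model identifier and target modules are placeholders that depend on the student architecture, and the actual training loop (for example, a supervised fine-tuning trainer) is omitted.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative rank/alpha from the guidance above; target_modules depend on the architecture.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("your-org/student-0.5b")  # placeholder model id
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```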

    Step 10: Quantize for Target Hardware

    Convert the fine-tuned model to the target precision:

    • Q4 (4-bit): smallest size, fastest inference, slight accuracy trade-off
    • Q5 (5-bit): moderate balance
    • Q8 (8-bit): highest accuracy among quantized formats, larger size

    For Qualcomm devices: use Qualcomm AI Hub for optimized quantization and compilation. For Apple: use Core ML tools. For general: ONNX Runtime or llama.cpp quantization.
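Before committing to a precision, a rough size estimate against the Step 1 memory budget helps. The effective bits-per-weight figures below are approximations, since quantized formats keep scales and some tensors at higher precision.

```python
def quantized_size_mb(params_billion: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Rough on-disk size: parameters x effective bits, plus ~10% for embeddings and metadata."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e6 * overhead

# Approximate effective bits per weight for common quantization levels (assumption, not exact).
for label, bits in [("Q4", 4.5), ("Q5", 5.5), ("Q8", 8.5)]:
    print(f"3B model at {label}: ~{quantized_size_mb(3.0, bits):.0f} MB")
```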

    Step 11: Validate on Target Hardware

    Deploy to the actual target device — not an emulator, not a cloud simulation, the real hardware. Measure:

    • Task accuracy against a held-out test set
    • Inference latency (p50, p95, p99)
    • Memory utilization
    • Battery impact (for mobile deployments)
    • Output format compliance rate

    Acceptance criteria: If accuracy is within 5 percentage points of the teacher model on the held-out test set and latency is within the budget, proceed. If not, return to Step 5.
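A small helper for turning on-device measurements into a pass/fail report is sketched below. The 5-point accuracy gap comes from the acceptance criteria above; gating latency on p95 (rather than p99) and treating accuracies as fractions in [0, 1] are assumptions.

```python
def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile over a list of measurements."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, round(pct / 100 * (len(ordered) - 1)))
    return ordered[idx]

def acceptance_report(latencies_ms, student_acc, teacher_acc, format_ok_rate, latency_budget_ms):
    """Summarize on-device measurements against the acceptance criteria."""
    p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
    accuracy_ok = (teacher_acc - student_acc) <= 0.05  # within 5 points of the teacher
    latency_ok = p95 <= latency_budget_ms              # p95 as the gate is an assumption
    return {
        "latency_ms": {"p50": p50, "p95": p95, "p99": p99},
        "accuracy_gap": teacher_acc - student_acc,
        "format_compliance": format_ok_rate,
        "pass": accuracy_ok and latency_ok,
    }
```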

    Step 12: Iterate

    On-device validation reveals failure modes that cloud benchmarks miss. When performance is below threshold:

    1. Analyze failure cases from on-device testing
    2. Categorize failures: data distribution, complexity, or missing edge cases?
    3. Return to Step 5 (filter differently) or Step 7 (augment targeting failure modes)
    4. Re-train, re-quantize, re-validate

    Expect 2–3 iterations for 3B–8B targets and 3–5 iterations for sub-1B targets.

    Where Ertas Fits

    Ertas Data Suite handles Steps 4–8 entirely on-premise. The Ingest module parses enterprise documents. Clean provides distillation-aware filtering. Label enables domain expert review without Python. Augment generates targeted synthetic data. Export produces JSONL with full metadata and audit trail.

    Steps 1–3 and 9–12 happen outside Ertas — target definition, teacher model generation, fine-tuning, quantization, and deployment use your existing ML infrastructure. Ertas provides the data preparation layer between raw enterprise data and the training pipeline.

    Book a Discovery Call to walk through this workflow with your specific hardware targets and data types.

    Turn unstructured data into AI-ready datasets — without it leaving the building.

    On-premise data preparation with full audit trail. No data egress. No fragmented toolchains. EU AI Act Article 30 compliance built in.
