What is Model Distillation?

    A technique for transferring knowledge from a large, capable 'teacher' model to a smaller, faster 'student' model, producing compact models that approach the teacher's performance on specific tasks at a fraction of the inference cost.

    Definition

    Model distillation (also called knowledge distillation or KD) is a model compression technique in which a smaller 'student' model is trained to replicate the behavior of a larger 'teacher' model. Rather than being trained only on raw ground-truth labels, the student learns from the teacher's output distributions — including the soft probabilities across all possible tokens, which contain richer information about the teacher's learned representations than hard labels alone.
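
    In the classic formulation, this is implemented as a loss that blends a temperature-scaled KL divergence against the teacher's soft distribution with ordinary cross-entropy against the hard labels. The sketch below is a minimal, generic PyTorch version of that idea; the temperature T and mixing weight alpha are illustrative hyperparameters, not values from any particular recipe.

    python
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft targets: the temperature-scaled teacher distribution encodes relative
        # likelihoods across every class or token, not just the single correct answer.
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)  # rescale so gradient magnitudes stay comparable across temperatures
        # Hard targets: standard cross-entropy against the ground-truth labels.
        hard_loss = F.cross_entropy(student_logits, labels)
        return alpha * soft_loss + (1 - alpha) * hard_loss
    A minimal sketch of the classic soft-label distillation objective, blending soft and hard targets.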

    In the LLM era, distillation has evolved beyond the original Hinton et al. (2015) formulation. Modern approaches often use the teacher model to generate synthetic training data: you run a large frontier model (GPT-4, Claude, Llama 405B) on your task, collect the input-output pairs, and fine-tune a smaller model (7B-14B parameters) on these examples. This 'data distillation' or 'API distillation' approach doesn't require access to the teacher's weights or logits — just its outputs — making it practical even when the teacher is a closed-source API.

    Combined with parameter-efficient fine-tuning methods like LoRA, distillation enables organizations to create compact, task-specific models that run on consumer hardware while retaining 85-95% of the teacher model's quality on the target task. The resulting models are cheaper to serve, faster at inference, and can be deployed locally for privacy-sensitive applications.
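
    As a concrete illustration of the parameter-efficient side, the sketch below wraps a student checkpoint in LoRA adapters using the Hugging Face peft library; the model ID, target modules, and hyperparameters are illustrative choices, not a prescribed recipe.

    python
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Illustrative student checkpoint; substitute the 7B-14B model you are distilling into.
    base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

    lora_config = LoraConfig(
        r=16,                 # adapter rank: small relative to the frozen base weights
        lora_alpha=32,        # scaling applied to the adapter updates
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
        task_type="CAUSAL_LM",
    )

    student = get_peft_model(base_model, lora_config)
    student.print_trainable_parameters()  # typically well under 1% of total parameters
    Wrapping a student model in LoRA adapters so only a small fraction of weights are updated during distillation fine-tuning.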

    Why It Matters

    Frontier models like GPT-4 and Claude are expensive to run and require cloud API access. Distillation lets organizations capture most of that capability in a model they own and can deploy anywhere. This has profound implications for cost (inference on a 7B model is 10-100x cheaper than API calls to a frontier model), latency (local inference eliminates network round-trips), privacy (data never leaves your infrastructure), and reliability (no API rate limits or deprecation cycles). For ML engineers, distillation is the primary technique for converting expensive cloud AI dependencies into owned, deployable assets.

    How It Works

    The modern LLM distillation workflow typically follows these steps:

    1. Define your target task and collect representative input prompts.
    2. Run these prompts through a large teacher model to generate high-quality outputs.
    3. Curate the teacher outputs: filter for quality, remove hallucinations, and format them consistently.
    4. Fine-tune a smaller student model on the curated input-output pairs using LoRA or QLoRA.
    5. Evaluate the student against a held-out test set, comparing it to both the teacher and any existing baseline.
    6. Iterate on data quality and training hyperparameters until the quality gap is acceptable.
    7. Export the student model in a deployment-ready format like GGUF.

    The key insight is that data quality matters more than data quantity: 5,000 high-quality distilled examples often outperform 50,000 noisy ones.

    python
    # Step 1: Generate teacher outputs for distillation
    from openai import OpenAI
    import json
    
    client = OpenAI()
    training_pairs = []
    
    # task_prompts is assumed to be a list of representative input prompts for the task
    for prompt in task_prompts:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Categorize this transaction..."},
                {"role": "user", "content": prompt}
            ]
        )
        training_pairs.append({
            "messages": [
                {"role": "system", "content": "Categorize this transaction..."},
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": response.choices[0].message.content}
            ]
        })
    
    # Step 2: Save as JSONL for fine-tuning in Ertas Studio
    with open("distillation_data.jsonl", "w") as f:
        for pair in training_pairs:
            f.write(json.dumps(pair) + "\n")
    Generating synthetic training data from a GPT-4o teacher model for distillation into a smaller student model via Ertas Studio.
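
    Step 3 (curation) is usually where most of the quality gains come from. The sketch below applies a deliberately simple filter to the JSONL file produced above; the specific checks (empty or runaway completions, refusal boilerplate, duplicate prompts) are illustrative and should be replaced with task-specific rules.

    python
    # Step 3: Curate the teacher outputs before fine-tuning
    import json

    def keep(example):
        # Illustrative quality checks; real curation is task-specific.
        answer = example["messages"][-1]["content"].strip()
        if not answer:                      # drop empty completions
            return False
        if len(answer) > 2000:              # drop runaway generations
            return False
        if "as an ai language model" in answer.lower():  # drop refusal boilerplate
            return False
        return True

    with open("distillation_data.jsonl") as f:
        examples = [json.loads(line) for line in f]

    curated, seen_prompts = [], set()
    for ex in examples:
        prompt = ex["messages"][1]["content"]
        if keep(ex) and prompt not in seen_prompts:   # also deduplicate on the input
            seen_prompts.add(prompt)
            curated.append(ex)

    with open("distillation_data.curated.jsonl", "w") as f:
        for ex in curated:
            f.write(json.dumps(ex) + "\n")
    Filtering and deduplicating teacher outputs before fine-tuning the student.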

    Example Use Case

    An ML engineer at a fintech company wants to replace their GPT-4 dependency for transaction categorization. They collect 10,000 representative transactions, run them through GPT-4 to generate categorizations with explanations, curate the outputs to remove errors, and fine-tune Qwen 2.5 7B with QLoRA on the resulting dataset. The distilled model achieves 93% agreement with GPT-4 on a held-out test set (vs. 89% for the base Qwen model with prompting alone). Inference cost drops from $0.003 per transaction to effectively zero on local hardware, and latency drops from 800ms to 50ms.
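
    The comparison in this example comes down to a simple agreement metric on the held-out set. A minimal sketch, assuming a hypothetical JSONL file that stores both the teacher's and the student's category for each held-out transaction:

    python
    # Step 5: Measure student/teacher agreement on a held-out test set
    import json

    def agreement_rate(student_labels, teacher_labels):
        # Fraction of examples where the student picks the same category as the teacher.
        matches = sum(s == t for s, t in zip(student_labels, teacher_labels))
        return matches / len(student_labels)

    # Hypothetical file layout: one JSON object per transaction with both predictions.
    with open("holdout_predictions.jsonl") as f:
        rows = [json.loads(line) for line in f]

    student = [r["student_category"] for r in rows]
    teacher = [r["teacher_category"] for r in rows]
    print(f"Student/teacher agreement: {agreement_rate(student, teacher):.1%}")
    Scoring the distilled student against the teacher's labels on held-out data.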

    Key Takeaways

    • Model distillation transfers knowledge from a large teacher model to a smaller, deployable student model.
    • Modern LLM distillation often works through synthetic data generation rather than logit matching.
    • Distilled 7B models can achieve 85-95% of frontier model quality on specific tasks at 10-100x lower inference cost.
    • Data quality matters more than quantity — curated teacher outputs produce better students.
    • Combined with LoRA and GGUF export, distillation enables local deployment of frontier-quality task-specific models.

    How Ertas Helps

    Distillation is a core workflow in Ertas Studio. Users can upload teacher-generated datasets to Vault, fine-tune student models using the visual pipeline builder with LoRA or QLoRA, compare distilled model quality against the teacher using built-in evaluation tools, and export the final student model as GGUF for local deployment. The entire distillation pipeline — from data preparation to deployment-ready model — runs within a single platform.
