What is Knowledge Distillation?

    A model compression technique where a smaller 'student' model is trained to replicate the behavior of a larger, more capable 'teacher' model.

    Definition

    Knowledge distillation is a training technique where a large, high-performing model (the teacher) transfers its learned knowledge to a smaller, more efficient model (the student). Rather than training the student on hard labels from a dataset, the student is trained to match the teacher's output distribution — including the relative probabilities assigned to incorrect answers, which encode valuable information about the relationships between concepts that hard labels discard.
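    To make the soft-label idea concrete, the toy snippet below contrasts a one-hot label with a teacher's soft distribution. This is a minimal sketch in PyTorch; the three classes and the logit values are invented purely for illustration.

        import torch
        import torch.nn.functional as F

        # Toy 3-class example (dog, wolf, car); the logits are hypothetical
        # numbers chosen to illustrate the idea, not from any real model.
        teacher_logits = torch.tensor([4.0, 2.5, -1.0])
        hard_label = torch.tensor([1.0, 0.0, 0.0])        # one-hot: "dog"
        soft_targets = F.softmax(teacher_logits, dim=-1)  # ~[0.81, 0.18, 0.01]

        # The soft targets preserve the teacher's belief that "wolf" is far
        # more plausible than "car" -- information the one-hot label discards.
        print(hard_label, soft_targets)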

    In the LLM context, distillation most commonly takes the form of generating synthetic training data. A powerful teacher model (such as GPT-4 or Claude) generates high-quality responses to a diverse set of prompts, and a smaller open-source model is fine-tuned on these teacher-generated responses. This approach, used to create models like Alpaca and Vicuna, has proven remarkably effective at transferring capabilities from frontier models to models small enough to run on consumer hardware.
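    As a concrete illustration of the synthetic-data step, here is a minimal sketch using the OpenAI Python client (v1+). The teacher model name, prompt file, and output path are placeholder assumptions, and a real pipeline would add retries, deduplication, and quality filtering.

        import json
        from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY set

        client = OpenAI()

        with open("prompts.jsonl") as f:            # placeholder prompt set
            prompts = [json.loads(line)["prompt"] for line in f]

        with open("distill_data.jsonl", "w") as out:
            for prompt in prompts:
                resp = client.chat.completions.create(
                    model="gpt-4o",                 # the teacher model
                    messages=[{"role": "user", "content": prompt}],
                    temperature=0.7,
                )
                out.write(json.dumps({
                    "prompt": prompt,
                    "response": resp.choices[0].message.content,
                }) + "\n")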

    Distillation can operate at multiple levels: output-level distillation trains the student on the teacher's generated text; logit-level distillation trains the student to match the teacher's full probability distribution over tokens; and intermediate-level distillation aligns the student's internal representations with the teacher's hidden states. Each level captures progressively more of the teacher's knowledge but requires progressively more access to the teacher model's internals.
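    The sketch below shows how the three levels might be combined into a single training loss. It assumes both models expose logits and hidden states (as Hugging Face causal LMs do with output_hidden_states=True), that hidden sizes already match or have been projected, and that the weights alpha, beta, and gamma are tuned per task.

        import torch.nn.functional as F

        def multi_level_distill_loss(student_out, teacher_out, labels,
                                     alpha=1.0, beta=0.5, gamma=0.1):
            # Output level: cross-entropy on the teacher-generated tokens
            # (label shifting omitted for brevity).
            ce = F.cross_entropy(student_out.logits.flatten(0, 1),
                                 labels.flatten())

            # Logit level: match the teacher's full token distribution.
            kl = F.kl_div(F.log_softmax(student_out.logits, dim=-1),
                          F.softmax(teacher_out.logits, dim=-1),
                          reduction="batchmean")

            # Intermediate level: align the final hidden states.
            hidden = F.mse_loss(student_out.hidden_states[-1],
                                teacher_out.hidden_states[-1])

            return alpha * ce + beta * kl + gamma * hidden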

    Why It Matters

    The largest and most capable models are often too expensive or too slow for production deployment. A 70B parameter model might deliver excellent quality but require multiple GPUs and cost dollars per thousand requests. Knowledge distillation lets teams capture roughly 80-90% of a large model's quality in a model that is 10-50x smaller, runs on a single GPU, and costs pennies per thousand requests.

    Distillation also enables on-premise and edge deployment. Many organizations cannot send data to cloud-hosted large models due to privacy, regulatory, or latency requirements. By distilling a capable cloud model into a small local model, teams can deploy AI capabilities in constrained environments without sacrificing the quality gains from frontier model research.

    How It Works

    The most common distillation workflow for LLMs is response-based distillation. The practitioner curates a diverse set of prompts representing the target use case, runs them through the teacher model to generate high-quality responses, and then fine-tunes the student model on these prompt-response pairs using standard supervised training. The student learns to imitate the teacher's response style, reasoning patterns, and output quality.
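    A minimal sketch of that fine-tuning step with Hugging Face transformers follows. The student model name and data file are placeholders (the file matches the generation sketch above), and details such as learning-rate schedules, sequence packing, and evaluation are omitted.

        import json
        import torch
        from torch.utils.data import DataLoader
        from transformers import AutoModelForCausalLM, AutoTokenizer

        student = "meta-llama/Meta-Llama-3-8B"   # placeholder student model
        tok = AutoTokenizer.from_pretrained(student)
        tok.pad_token = tok.pad_token or tok.eos_token  # Llama has no pad token
        model = AutoModelForCausalLM.from_pretrained(student)
        opt = torch.optim.AdamW(model.parameters(), lr=2e-5)

        pairs = [json.loads(line) for line in open("distill_data.jsonl")]
        texts = [p["prompt"] + "\n" + p["response"] for p in pairs]

        model.train()
        for batch in DataLoader(texts, batch_size=4, shuffle=True):
            enc = tok(list(batch), return_tensors="pt",
                      padding=True, truncation=True, max_length=2048)
            labels = enc["input_ids"].clone()
            labels[enc["attention_mask"] == 0] = -100  # ignore padding in loss
            # Standard causal-LM objective: imitate the teacher's text.
            loss = model(**enc, labels=labels).loss
            loss.backward()
            opt.step()
            opt.zero_grad()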

    More advanced approaches use the teacher's token-level probabilities (logits) as soft targets. Instead of training the student on a single correct response, the student learns to match the teacher's full probability distribution at each token position. This is more informative because the probabilities of non-chosen tokens encode relationships between concepts. A temperature parameter controls how much weight is given to low-probability tokens, with higher temperatures encouraging the student to learn broader distributional patterns.
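    In code, this objective is typically a KL divergence between the softened distributions, as in the sketch below, which follows the classic formulation of Hinton et al. (2015); the T**2 factor keeps gradient magnitudes comparable across temperatures.

        import torch.nn.functional as F

        def soft_target_loss(student_logits, teacher_logits, T=2.0):
            # Dividing by T flattens both distributions, so low-probability
            # tokens contribute more signal; T**2 rescales the gradients.
            log_p_student = F.log_softmax(student_logits / T, dim=-1)
            p_teacher = F.softmax(teacher_logits / T, dim=-1)
            return F.kl_div(log_p_student, p_teacher,
                            reduction="batchmean") * (T ** 2)

    In practice this soft-target term is usually mixed with the ordinary cross-entropy loss, with the mixing weight and temperature tuned on a validation set.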

    Example Use Case

    A company uses GPT-4 for contract analysis but needs to process sensitive documents on-premise. They distill GPT-4's capabilities by generating 15,000 contract-analysis examples with GPT-4 on non-sensitive synthetic contracts, then fine-tuning a Llama 3 8B model on these responses. The distilled model achieves 87% of GPT-4's accuracy on the task while running entirely on-premise on a single GPU, satisfying both the quality and privacy requirements.

    Key Takeaways

    • Knowledge distillation transfers capabilities from large teacher models to smaller, deployable student models.
    • Response-based distillation uses teacher-generated text as training data for the student.
    • Logit-level distillation is more effective but requires access to teacher model internals.
    • Distillation typically preserves 80-90% of teacher quality at 10-50x lower deployment cost.
    • It enables on-premise and edge deployment of capabilities developed on frontier cloud models.

    How Ertas Helps

    Ertas Studio supports knowledge distillation workflows where users can fine-tune smaller models on teacher-generated data. Ertas Data Suite helps prepare and clean synthetic datasets generated by teacher models, ensuring high-quality distillation training data.
