
    How to Distill Open-Source Models Legally: A Step-by-Step Guide

    A practical guide to model distillation the right way: using open-source teacher models with permissive licenses, your own domain data, and a clear legal path to model ownership.

    Ertas Team · Updated

    The Anthropic/DeepSeek controversy showed what happens when distillation goes wrong: 24,000 accounts banned, international headlines, and potential legal action. But the technique itself is sound — the problem was the source, not the method.

    This guide walks through legal model distillation from start to finish. Open-source teacher models with permissive licences. Your own domain data. A clear path from training to deployment with full model ownership and zero legal risk.

    If you've been watching the distillation debate and thinking "there has to be a right way to do this" — there is. Here's how.

    Legal vs. Illegal Distillation

    The distinction is simpler than the headlines suggest.

    Legal distillation: Using open-source models with permissive licences as teachers. The licence grants you the right to create derivative works, including training smaller models on the teacher's outputs. You're exercising rights the model creator explicitly gave you.

    Illegal distillation (or ToS-violating): Using closed-API models (GPT-4, Claude, Gemini) as teachers against their Terms of Service. The legal landscape is still evolving, but the contractual prohibition is clear.

    The test: Before using any model as a teacher, read its licence. If it permits derivative works and commercial use, you're clear. If it prohibits using outputs for competitive model training, don't.

    Step 1: Choose Your Open-Source Teacher Model

    Your teacher model determines the quality ceiling of your student. Choose carefully.

    Licence Comparison

    Model | Licence | Distillation Permitted | Commercial Use | Attribution Required
    Llama 3 (8B-405B) | Meta Community | Yes | Yes (under 700M MAU) | Yes
    Qwen 2.5 (7B-72B) | Apache 2.0 | Yes | Yes | Yes
    Mistral (7B-8x22B) | Apache 2.0 | Yes | Yes | Yes
    DeepSeek-R1 (open weights) | MIT | Yes | Yes | Minimal
    Gemma 2 (2B-27B) | Gemma Terms | Yes | Yes | Yes
    Phi-3 (3.8B-14B) | MIT | Yes | Yes | Minimal

    All of these models explicitly permit you to create derivative works, including distilled models. The legal path is clear.

    Which Teacher to Choose

    For general reasoning and instruction-following: Llama 3 70B or Qwen 2.5 72B. Both are strong general-purpose models that produce high-quality training data across domains.

    For coding and technical tasks: DeepSeek-R1 or Qwen 2.5 Coder. Strong performance on structured output and code generation.

    For multilingual tasks: Qwen 2.5 excels across languages. Llama 3 is English-dominant but competent in major languages.

    For constrained environments: If you can't run a 70B model, Llama 3 8B or Qwen 2.5 14B are solid teachers for domain-specific tasks where you're supplementing with your own data.

    The best model for fine-tuning in 2026 depends on your specific use case, but any model in the table above gives you a legally clean starting point.

    Step 2: Set Up the Teacher

    You need to run your teacher model in an environment where you can generate training data at scale. Three options:

    Option A: Local Deployment

    Run the teacher on your own hardware using Ollama, vLLM, or llama.cpp. This gives you complete control and no usage limits.

    Requirements for a 70B model:

    • 2x NVIDIA A100 (80GB) or equivalent
    • 128GB+ system RAM
    • Expect 10-30 tokens/second for generation

    Requirements for a 7B-14B teacher:

    • Single NVIDIA RTX 4090 or Apple M2 Ultra
    • 32GB+ RAM
    • Expect 30-80 tokens/second

    Option B: Cloud GPU Rental

    Use RunPod, Lambda, or similar GPU cloud providers to spin up temporary inference instances. Cost-effective for generating training data in bulk — you only pay for the hours you need.

    Typical cost: $2-5/hour for an A100 instance. You can generate thousands of training examples in a few hours.

    Option C: Open-Source API Providers

    Services like Together AI, Fireworks, or Groq offer API access to open-source models. Since you're using the models under their open-source licences (not a proprietary API), generating training data from these endpoints is legally equivalent to running the model locally.

    Check the provider's terms to confirm they don't add restrictions on top of the model's open-source licence.

    Step 3: Generate Synthetic Training Data From Your Domain

    This is where distillation intersects with fine-tuning — and where the real value gets created.

    Instead of distilling generic capabilities from the teacher, you use the teacher to process your domain-specific content. The result is training data that combines the teacher's language ability with your proprietary knowledge.

    The Process

    1. Prepare your source material. Gather your domain-specific content:

    • Customer support logs and resolved tickets
    • Product documentation and knowledge base articles
    • Sales call transcripts and frequently asked questions
    • Industry-specific documents and regulatory texts
    • Any structured data specific to your business

    2. Design your prompts. Create prompt templates that ask the teacher to perform your target task using your source material as context.

    For example, if you're building a support classification model:

    Given this support ticket:
    "{ticket_text}"
    
    Classify into one of these categories: {your_categories}
    
    Respond with only the category name.
    

    3. Generate at scale. Run your source material through the teacher model. For each input, collect the output. This creates your training dataset (a minimal script sketch follows at the end of this step).

    Target dataset sizes:

    • Minimum viable: 500 examples (surprisingly effective for well-defined tasks)
    • Solid baseline: 1,000-2,000 examples
    • Production-quality: 2,000-5,000 examples
    • Complex tasks: 5,000-10,000 examples

    More data isn't always better. A well-curated dataset of 500 high-quality examples typically outperforms 5,000 noisy ones.

    4. Add your proprietary signal. The teacher generates competent outputs. Your domain expertise makes them excellent. Review the generated data and:

    • Correct any outputs that don't match your quality standards
    • Add examples the teacher got wrong (edge cases, domain-specific nuances)
    • Ensure your output format is consistent across all examples
    • Verify domain terminology is used correctly

    This step is what separates generic distillation from domain-specific fine-tuning. The teacher provides the language capability. Your data and review provide the domain expertise.
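
    To make the generation loop from step 3 concrete, here is a minimal Python sketch. It assumes the teacher is served locally through Ollama's OpenAI-compatible endpoint (Option A from Step 2); the file names, model tag, and category list are illustrative placeholders, not fixed values.

    import json
    from openai import OpenAI

    # Ollama exposes an OpenAI-compatible API on this port by default
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    PROMPT = (
        'Given this support ticket:\n"{ticket}"\n\n'
        "Classify into one of these categories: {categories}\n\n"
        "Respond with only the category name."
    )
    CATEGORIES = "billing_issue, account_access, feature_request, bug_report"

    with open("tickets.txt") as src, open("raw.jsonl", "w") as out:
        for ticket in src:
            ticket = ticket.strip()
            if not ticket:
                continue
            reply = client.chat.completions.create(
                model="llama3:70b",  # the teacher chosen in Step 1
                messages=[{"role": "user",
                           "content": PROMPT.format(ticket=ticket, categories=CATEGORIES)}],
                temperature=0,
            )
            label = reply.choices[0].message.content.strip()
            # Write one training example per line in the messages format used in Step 4
            out.write(json.dumps({"messages": [
                {"role": "system", "content": "You are a support classifier for Acme Corp."},
                {"role": "user", "content": ticket},
                {"role": "assistant", "content": label},
            ]}) + "\n")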

    Step 4: Prepare the Dataset

    Training datasets for fine-tuning typically use JSONL format — one JSON object per line, with each object containing the conversation structure.

    Format

    {"messages": [{"role": "system", "content": "You are a support classifier for Acme Corp."}, {"role": "user", "content": "My invoice is wrong"}, {"role": "assistant", "content": "billing_issue"}]}
    {"messages": [{"role": "system", "content": "You are a support classifier for Acme Corp."}, {"role": "user", "content": "Can't log in"}, {"role": "assistant", "content": "account_access"}]}
    

    Quality Filtering

    Before training, clean your dataset:

    • Remove duplicates. Near-identical examples add noise without value.
    • Remove contradictions. If the same input maps to different outputs, resolve the conflict.
    • Validate format consistency. Every example should follow the same output structure.
    • Check for hallucinations. Review teacher outputs for factual errors against your source material.
    • Balance categories. If you're training a classifier, ensure reasonable representation across categories.

    A dataset with 1,000 clean examples will outperform one with 5,000 messy ones. Quality beats quantity every time.
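
    Much of this filtering can be scripted. The sketch below is a minimal cleaning pass: it drops exact duplicate inputs, flags contradictions for manual review, and prints the category balance. It assumes the messages format shown above; the file names are illustrative.

    import json
    from collections import Counter

    seen, clean, dropped = {}, [], 0

    with open("raw.jsonl") as f:
        for line in f:
            example = json.loads(line)
            messages = example["messages"]
            user = next(m["content"].strip() for m in messages if m["role"] == "user")
            label = next(m["content"].strip() for m in messages if m["role"] == "assistant")

            # Drop exact duplicates; flag contradictions (same input, different output)
            if user in seen:
                if seen[user] != label:
                    print(f"Contradiction to review: {user[:60]!r}")
                dropped += 1
                continue
            seen[user] = label
            clean.append(example)

    # Category balance check for classification tasks
    print(Counter(seen.values()))

    with open("train.jsonl", "w") as f:
        for example in clean:
            f.write(json.dumps(example) + "\n")

    print(f"Kept {len(clean)} examples, dropped {dropped}")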

    Don't want to manage JSONL files and training configs? Ertas handles dataset preparation visually. Join the waitlist →

    Step 5: Fine-Tune the Student Model

    Now you train your smaller student model on the dataset you've prepared. This is where the distilled knowledge gets compressed into a deployable model.

    Choosing the Student

    Your student should be smaller than your teacher — that's the point of distillation. Common choices:

    • Teacher 70B → Student 7B-14B — The sweet spot for most production deployments
    • Teacher 14B → Student 3B-7B — For edge deployment or resource-constrained environments
    • Same family preferred — Distilling Llama 70B to Llama 7B works better than cross-family distillation

    Training Approach: LoRA/QLoRA

    Full fine-tuning of even a 7B model requires significant GPU memory. LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) make this practical:

    • LoRA adds small trainable adapters (~50-200MB) on top of frozen base weights
    • QLoRA combines LoRA with 4-bit quantisation, reducing memory requirements further
    • A 7B model can be fine-tuned with QLoRA on a single GPU with 24GB VRAM

    Key training parameters:

    • Learning rate: 1e-4 to 2e-4
    • Batch size: 4-8 (with gradient accumulation)
    • Epochs: 2-4 (more data = fewer epochs needed)
    • LoRA rank: 16-64 (higher = more capacity, more memory)
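
    Here is what those parameters look like in a minimal QLoRA training sketch using Hugging Face transformers, peft, and trl. The student model name and output paths are placeholders, and argument names shift slightly between trl releases, so treat this as a starting point rather than a drop-in script.

    import torch
    from datasets import load_dataset
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from trl import SFTConfig, SFTTrainer

    BASE = "Qwen/Qwen2.5-7B-Instruct"  # example student; any model from the Step 1 table works

    # Load the student in 4-bit so it fits on a single 24GB GPU
    model = AutoModelForCausalLM.from_pretrained(
        BASE,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        ),
        device_map="auto",
    )

    # The messages-format JSONL from Step 4 is understood directly by SFTTrainer
    dataset = load_dataset("json", data_files="train.jsonl", split="train")

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        peft_config=LoraConfig(
            r=32, lora_alpha=64, lora_dropout=0.05, task_type="CAUSAL_LM",
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        ),
        args=SFTConfig(
            output_dir="student-qlora",
            num_train_epochs=3,
            per_device_train_batch_size=4,
            gradient_accumulation_steps=2,
            learning_rate=2e-4,
        ),
    )
    trainer.train()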

    Using Ertas

    If you'd rather skip the configuration entirely, Ertas handles this step visually:

    1. Upload your JSONL dataset (or import from Hugging Face)
    2. Select your base model
    3. Adjust training parameters through the UI (or use recommended defaults)
    4. Start training — track progress in real-time
    5. Compare results side-by-side if running multiple experiments

    No Python. No YAML. No CLI. The platform handles the infrastructure and returns your fine-tuned model.

    Step 6: Evaluate Student vs. Teacher

    Before deploying, verify that your student model meets your quality bar.

    Automated Evaluation

    Run both teacher and student on a held-out test set (examples not used in training). Compare:

    • Accuracy — Does the student produce correct outputs?
    • Format compliance — Does the student follow the expected output structure?
    • Consistency — Does the student give the same answer for similar inputs?
    • Latency — How fast does the student respond? (Should be faster than the teacher)
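
    For classification-style tasks, this comparison can be a short script. The sketch below measures exact-match accuracy for teacher and student against a held-out test.jsonl; it assumes both models are served behind OpenAI-compatible local endpoints, and the model names are placeholders.

    import json
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    def accuracy(model_name, path="test.jsonl"):
        correct = total = 0
        with open(path) as f:
            for line in f:
                messages = json.loads(line)["messages"]
                gold = next(m["content"].strip() for m in messages if m["role"] == "assistant")
                # Send everything except the gold answer, compare the model's reply to it
                prediction = client.chat.completions.create(
                    model=model_name,
                    messages=[m for m in messages if m["role"] != "assistant"],
                    temperature=0,
                ).choices[0].message.content.strip()
                correct += int(prediction == gold)
                total += 1
        return correct / total

    print("teacher:", accuracy("llama3:70b"))
    print("student:", accuracy("my-student"))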

    Human Evaluation

    For subjective tasks (writing, conversation, summarisation), automated metrics don't tell the full story. Run a blind evaluation:

    1. Generate 50-100 outputs from both teacher and student on the same inputs
    2. Present outputs side-by-side without labels
    3. Have a domain expert rate quality, accuracy, and appropriateness
    4. Track where the student falls short — these gaps inform your next training iteration

    Target Benchmarks

    For domain-specific tasks with well-curated training data, you should expect:

    • 90-95% accuracy on classification and extraction tasks
    • Within 5-10% of teacher quality on generation tasks
    • 3-10x faster inference than the teacher (depending on size ratio)
    • Matching or exceeding prompt-engineered GPT-4 on narrow, well-defined tasks

    If your student falls short, the fix is usually better training data — not a bigger student model. Go back to Step 3 and add more high-quality examples in the areas where the student struggles.

    Step 7: Export to GGUF and Deploy

    Once your student model passes evaluation, export it for production deployment.

    GGUF Export

    GGUF is the standard format for local model deployment. It's compatible with:

    • Ollama — simplest deployment, single command to run
    • llama.cpp — lightweight, runs on CPU and GPU
    • LM Studio — desktop app with UI
    • vLLM — high-throughput production serving
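
    One step before conversion: if you trained with LoRA/QLoRA in Step 5, merge the adapter into the base weights first, since GGUF converters generally expect a plain Hugging Face model directory. A minimal sketch with peft (paths and model names are illustrative):

    import torch
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    # Load the adapter from Step 5 together with its base model in full precision
    model = AutoPeftModelForCausalLM.from_pretrained("student-qlora", torch_dtype=torch.bfloat16)

    # Fold the adapter weights into the base model and save a standalone checkpoint
    merged = model.merge_and_unload()
    merged.save_pretrained("student-merged")

    # Copy the tokenizer from the original base model alongside the merged weights
    AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct").save_pretrained("student-merged")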

    Quantisation Options

    GGUF supports multiple quantisation levels that trade model size for quality:

    • Q8 — Highest quality, ~8GB for a 7B model
    • Q5 — Good quality with smaller size, ~5.5GB for a 7B model
    • Q4 — Acceptable quality for most tasks, ~4.5GB for a 7B model

    For domain-specific tasks where the model has been fine-tuned on high-quality data, Q5 quantisation typically maintains 95%+ of full-precision quality. Start there and adjust based on your quality requirements.

    Deployment

    # With Ollama
    ollama create my-model -f Modelfile
    ollama run my-model
    
    # With llama.cpp (the server binary is named llama-server in current builds)
    ./llama-server -m my-model.gguf --port 8080
    

    Your model is now running locally. No API calls. No per-token costs. No vendor that can deprecate it. No ToS that can change underneath you.

    The Ertas Workflow: All of This Without a CLI

    The steps above work. They're also a lot of manual work — setting up inference environments, managing JSONL files, configuring training runs, handling quantisation and export.

    Ertas compresses this entire pipeline into a visual interface:

    1. Upload your dataset (or import from Hugging Face via URL)
    2. Select your base model from supported open-source options
    3. Configure and train with visual controls and recommended defaults
    4. Evaluate with built-in comparison tools
    5. Export to GGUF with one click

    The entire process — from dataset upload to deployable GGUF file — takes minutes instead of hours. No Python environment to manage. No GPU provisioning. No YAML configuration files.

    The models you create are yours. Download the GGUF, deploy it wherever you want. Ertas is the training tool, not the deployment lock-in.

    The Right Way to Distill

    The DeepSeek situation wasn't a failure of distillation as a technique. It was a failure of strategy — choosing to extract capabilities from a closed platform instead of building on open foundations.

    The right approach:

    1. Open-source teacher with a permissive licence
    2. Your domain data as the differentiator
    3. Legal fine-tuning that produces capabilities nobody else has
    4. GGUF export for portable, vendor-independent deployment
    5. Full ownership of every component in the pipeline

    No fake accounts. No ToS violations. No legal risk. Just a better model that's actually yours.


    Do all of this without touching a CLI. Ertas handles the entire distillation pipeline visually — from dataset to GGUF in minutes. Pre-subscribe at early-bird pricing. See plans →

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
