
    How to Distill Open-Source Models Legally: A Step-by-Step Guide

    A practical guide to model distillation the right way: using open-source teacher models with permissive licenses, your own domain data, and a clear legal path to model ownership.

    Ertas Team · Updated

    The Anthropic/DeepSeek controversy showed what happens when distillation goes wrong: 24,000 accounts banned, international headlines, and potential legal action. But the technique itself is sound — the problem was the source, not the method.

    This guide walks through legal model distillation from start to finish. Open-source teacher models with permissive licences. Your own domain data. A clear path from training to deployment with full model ownership and zero legal risk.

    If you've been watching the distillation debate and thinking "there has to be a right way to do this" — there is. Here's how.

    Legal vs. Illegal Distillation

    The distinction is simpler than the headlines suggest.

    Legal distillation: Using open-source models with permissive licences as teachers. The licence grants you the right to create derivative works, including training smaller models on the teacher's outputs. You're exercising rights the model creator explicitly gave you.

    Illegal distillation (or ToS-violating): Using closed-API models (GPT-4, Claude, Gemini) as teachers against their Terms of Service. The legal landscape is still evolving, but the contractual prohibition is clear.

    The test: Before using any model as a teacher, read its licence. If it permits derivative works and commercial use, you're clear. If it prohibits using outputs for competitive model training, don't.

    Step 1: Choose Your Open-Source Teacher Model

    Your teacher model determines the quality ceiling of your student. Choose carefully.

    Licence Comparison

    Model | Licence | Distillation Permitted | Commercial Use | Attribution Required
    Llama 3 (8B-405B) | Meta Community | Yes | Yes (under 700M MAU) | Yes
    Qwen 2.5 (7B-72B) | Apache 2.0 | Yes | Yes | Yes
    Mistral (7B-8x22B) | Apache 2.0 | Yes | Yes | Yes
    DeepSeek-R1 (open weights) | MIT | Yes | Yes | Minimal
    Gemma 2 (2B-27B) | Gemma Terms | Yes | Yes | Yes
    Phi-3 (3.8B-14B) | MIT | Yes | Yes | Minimal

    All of these models explicitly permit you to create derivative works, including distilled models. The legal path is clear.

    Which Teacher to Choose

    For general reasoning and instruction-following: Llama 3 70B or Qwen 2.5 72B. Both are strong general-purpose models that produce high-quality training data across domains.

    For coding and technical tasks: DeepSeek-R1 or Qwen 2.5 Coder. Strong performance on structured output and code generation.

    For multilingual tasks: Qwen 2.5 excels across languages. Llama 3 is English-dominant but competent in major languages.

    For constrained environments: If you can't run a 70B model, Llama 3 8B or Qwen 2.5 14B are solid teachers for domain-specific tasks where you're supplementing with your own data.

    The best model for fine-tuning in 2026 depends on your specific use case, but any model in the table above gives you a legally clean starting point.

    Step 2: Set Up the Teacher

    You need to run your teacher model in an environment where you can generate training data at scale. Three options:

    Option A: Local Deployment

    Run the teacher on your own hardware using Ollama, vLLM, or llama.cpp. This gives you complete control and no usage limits.

    Requirements for a 70B model:

    • 2x NVIDIA A100 (80GB) or equivalent
    • 128GB+ system RAM
    • Expect 10-30 tokens/second for generation

    Requirements for a 7B-14B teacher:

    • Single NVIDIA RTX 4090 or Apple M2 Ultra
    • 32GB+ RAM
    • Expect 30-80 tokens/second

    Option B: Cloud GPU Rental

    Use RunPod, Lambda, or similar GPU cloud providers to spin up temporary inference instances. Cost-effective for generating training data in bulk — you only pay for the hours you need.

    Typical cost: $2-5/hour for an A100 instance. You can generate thousands of training examples in a few hours.

    Option C: Open-Source API Providers

    Services like Together AI, Fireworks, or Groq offer API access to open-source models. Since you're using the models under their open-source licences (not a proprietary API), generating training data from these endpoints is legally equivalent to running the model locally.

    Check the provider's terms to confirm they don't add restrictions on top of the model's open-source licence.

    Step 3: Generate Synthetic Training Data From Your Domain

    This is where distillation intersects with fine-tuning — and where the real value gets created.

    Instead of distilling generic capabilities from the teacher, you use the teacher to process your domain-specific content. The result is training data that combines the teacher's language ability with your proprietary knowledge.

    The Process

    1. Prepare your source material. Gather your domain-specific content:

    • Customer support logs and resolved tickets
    • Product documentation and knowledge base articles
    • Sales call transcripts and frequently asked questions
    • Industry-specific documents and regulatory texts
    • Any structured data specific to your business

    2. Design your prompts. Create prompt templates that ask the teacher to perform your target task using your source material as context.

    For example, if you're building a support classification model:

    Given this support ticket:
    "{ticket_text}"
    
    Classify into one of these categories: {your_categories}
    
    Respond with only the category name.
    

    3. Generate at scale. Run your source material through the teacher model. For each input, collect the output. This creates your training dataset (a minimal script sketch follows at the end of this step).

    Target dataset sizes:

    • Minimum viable: 500 examples (surprisingly effective for well-defined tasks)
    • Solid baseline: 1,000-2,000 examples
    • Production-quality: 2,000-5,000 examples
    • Complex tasks: 5,000-10,000 examples

    More data isn't always better. A well-curated dataset of 500 high-quality examples typically outperforms 5,000 noisy ones.

    4. Add your proprietary signal. The teacher generates competent outputs. Your domain expertise makes them excellent. Review the generated data and:

    • Correct any outputs that don't match your quality standards
    • Add examples the teacher got wrong (edge cases, domain-specific nuances)
    • Ensure your output format is consistent across all examples
    • Verify domain terminology is used correctly

    This step is what separates generic distillation from domain-specific fine-tuning. The teacher provides the language capability. Your data and review provide the domain expertise.
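
    To make the generation loop from step 3 concrete, here is a minimal Python sketch. It assumes the teacher is served locally through Ollama's OpenAI-compatible endpoint (Option A from Step 2); the file names, model tag, and category list are illustrative placeholders, not fixed values.

    import json
    from openai import OpenAI

    # Ollama exposes an OpenAI-compatible API on this port by default
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    PROMPT = (
        'Given this support ticket:\n"{ticket}"\n\n'
        "Classify into one of these categories: {categories}\n\n"
        "Respond with only the category name."
    )
    CATEGORIES = "billing_issue, account_access, feature_request, bug_report"

    with open("tickets.txt") as src, open("raw.jsonl", "w") as out:
        for ticket in src:
            ticket = ticket.strip()
            if not ticket:
                continue
            reply = client.chat.completions.create(
                model="llama3:70b",  # the teacher chosen in Step 1
                messages=[{"role": "user",
                           "content": PROMPT.format(ticket=ticket, categories=CATEGORIES)}],
                temperature=0,
            )
            label = reply.choices[0].message.content.strip()
            # Write one training example per line in the messages format used in Step 4
            out.write(json.dumps({"messages": [
                {"role": "system", "content": "You are a support classifier for Acme Corp."},
                {"role": "user", "content": ticket},
                {"role": "assistant", "content": label},
            ]}) + "\n")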

    Step 4: Prepare the Dataset

    Training datasets for fine-tuning typically use JSONL format — one JSON object per line, with each object containing the conversation structure.

    Format

    {"messages": [{"role": "system", "content": "You are a support classifier for Acme Corp."}, {"role": "user", "content": "My invoice is wrong"}, {"role": "assistant", "content": "billing_issue"}]}
    {"messages": [{"role": "system", "content": "You are a support classifier for Acme Corp."}, {"role": "user", "content": "Can't log in"}, {"role": "assistant", "content": "account_access"}]}
    

    Quality Filtering

    Before training, clean your dataset:

    • Remove duplicates. Near-identical examples add noise without value.
    • Remove contradictions. If the same input maps to different outputs, resolve the conflict.
    • Validate format consistency. Every example should follow the same output structure.
    • Check for hallucinations. Review teacher outputs for factual errors against your source material.
    • Balance categories. If you're training a classifier, ensure reasonable representation across categories.

    A dataset with 1,000 clean examples will outperform one with 5,000 messy ones. Quality beats quantity every time.
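
    Much of this filtering can be scripted. The sketch below is a minimal cleaning pass: it drops exact duplicate inputs, flags contradictions for manual review, and prints the category balance. It assumes the messages format shown above; the file names are illustrative.

    import json
    from collections import Counter

    seen, clean, dropped = {}, [], 0

    with open("raw.jsonl") as f:
        for line in f:
            example = json.loads(line)
            messages = example["messages"]
            user = next(m["content"].strip() for m in messages if m["role"] == "user")
            label = next(m["content"].strip() for m in messages if m["role"] == "assistant")

            # Drop exact duplicates; flag contradictions (same input, different output)
            if user in seen:
                if seen[user] != label:
                    print(f"Contradiction to review: {user[:60]!r}")
                dropped += 1
                continue
            seen[user] = label
            clean.append(example)

    # Category balance check for classification tasks
    print(Counter(seen.values()))

    with open("train.jsonl", "w") as f:
        for example in clean:
            f.write(json.dumps(example) + "\n")

    print(f"Kept {len(clean)} examples, dropped {dropped}")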

    Don't want to manage JSONL files and training configs? Ertas handles dataset preparation visually. Join the waitlist →

    Step 5: Fine-Tune the Student Model

    Now you train your smaller student model on the dataset you've prepared. This is where the distilled knowledge gets compressed into a deployable model.

    Choosing the Student

    Your student should be smaller than your teacher — that's the point of distillation. Common choices:

    • Teacher 70B → Student 7B-14B — The sweet spot for most production deployments
    • Teacher 14B → Student 3B-7B — For edge deployment or resource-constrained environments
    • Same family preferred — Distilling Llama 70B to Llama 7B works better than cross-family distillation

    Training Approach: LoRA/QLoRA

    Full fine-tuning of even a 7B model requires significant GPU memory. LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) make this practical:

    • LoRA adds small trainable adapters (~50-200MB) on top of frozen base weights
    • QLoRA combines LoRA with 4-bit quantisation, reducing memory requirements further
    • A 7B model can be fine-tuned with QLoRA on a single GPU with 24GB VRAM

    Key training parameters:

    • Learning rate: 1e-4 to 2e-4
    • Batch size: 4-8 (with gradient accumulation)
    • Epochs: 2-4 (more data = fewer epochs needed)
    • LoRA rank: 16-64 (higher = more capacity, more memory)
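
    Here is what those parameters look like in a minimal QLoRA training sketch using Hugging Face transformers, peft, and trl. The student model name and output paths are placeholders, and argument names shift slightly between trl releases, so treat this as a starting point rather than a drop-in script.

    import torch
    from datasets import load_dataset
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from trl import SFTConfig, SFTTrainer

    BASE = "Qwen/Qwen2.5-7B-Instruct"  # example student; any model from the Step 1 table works

    # Load the student in 4-bit so it fits on a single 24GB GPU
    model = AutoModelForCausalLM.from_pretrained(
        BASE,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        ),
        device_map="auto",
    )

    # The messages-format JSONL from Step 4 is understood directly by SFTTrainer
    dataset = load_dataset("json", data_files="train.jsonl", split="train")

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        peft_config=LoraConfig(
            r=32, lora_alpha=64, lora_dropout=0.05, task_type="CAUSAL_LM",
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        ),
        args=SFTConfig(
            output_dir="student-qlora",
            num_train_epochs=3,
            per_device_train_batch_size=4,
            gradient_accumulation_steps=2,
            learning_rate=2e-4,
        ),
    )
    trainer.train()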

    Using Ertas

    If you'd rather skip the configuration entirely, Ertas handles this step visually:

    1. Upload your JSONL dataset (or import from Hugging Face)
    2. Select your base model
    3. Adjust training parameters through the UI (or use recommended defaults)
    4. Start training — track progress in real-time
    5. Compare results side-by-side if running multiple experiments

    No Python. No YAML. No CLI. The platform handles the infrastructure and returns your fine-tuned model.

    Step 6: Evaluate Student vs. Teacher

    Before deploying, verify that your student model meets your quality bar.

    Automated Evaluation

    Run both teacher and student on a held-out test set (examples not used in training). Compare:

    • Accuracy — Does the student produce correct outputs?
    • Format compliance — Does the student follow the expected output structure?
    • Consistency — Does the student give the same answer for similar inputs?
    • Latency — How fast does the student respond? (Should be faster than the teacher)
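
    For classification-style tasks, this comparison can be a short script. The sketch below measures exact-match accuracy for teacher and student against a held-out test.jsonl; it assumes both models are served behind OpenAI-compatible local endpoints, and the model names are placeholders.

    import json
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    def accuracy(model_name, path="test.jsonl"):
        correct = total = 0
        with open(path) as f:
            for line in f:
                messages = json.loads(line)["messages"]
                gold = next(m["content"].strip() for m in messages if m["role"] == "assistant")
                # Send everything except the gold answer, compare the model's reply to it
                prediction = client.chat.completions.create(
                    model=model_name,
                    messages=[m for m in messages if m["role"] != "assistant"],
                    temperature=0,
                ).choices[0].message.content.strip()
                correct += int(prediction == gold)
                total += 1
        return correct / total

    print("teacher:", accuracy("llama3:70b"))
    print("student:", accuracy("my-student"))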

    Human Evaluation

    For subjective tasks (writing, conversation, summarisation), automated metrics don't tell the full story. Run a blind evaluation:

    1. Generate 50-100 outputs from both teacher and student on the same inputs
    2. Present outputs side-by-side without labels
    3. Have a domain expert rate quality, accuracy, and appropriateness
    4. Track where the student falls short — these gaps inform your next training iteration

    Target Benchmarks

    For domain-specific tasks with well-curated training data, you should expect:

    • 90-95% accuracy on classification and extraction tasks
    • Within 5-10% of teacher quality on generation tasks
    • 3-10x faster inference than the teacher (depending on size ratio)
    • Matching or exceeding prompt-engineered GPT-4 on narrow, well-defined tasks

    If your student falls short, the fix is usually better training data — not a bigger student model. Go back to Step 3 and add more high-quality examples in the areas where the student struggles.

    Step 7: Export to GGUF and Deploy

    Once your student model passes evaluation, export it for production deployment.

    GGUF Export

    GGUF is the standard format for local model deployment. It's compatible with:

    • Ollama — simplest deployment, single command to run
    • llama.cpp — lightweight, runs on CPU and GPU
    • LM Studio — desktop app with UI
    • vLLM — high-throughput production serving
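
    One step before conversion: if you trained with LoRA/QLoRA in Step 5, merge the adapter into the base weights first, since GGUF converters generally expect a plain Hugging Face model directory. A minimal sketch with peft (paths and model names are illustrative):

    import torch
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    # Load the adapter from Step 5 together with its base model in full precision
    model = AutoPeftModelForCausalLM.from_pretrained("student-qlora", torch_dtype=torch.bfloat16)

    # Fold the adapter weights into the base model and save a standalone checkpoint
    merged = model.merge_and_unload()
    merged.save_pretrained("student-merged")

    # Copy the tokenizer from the original base model alongside the merged weights
    AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct").save_pretrained("student-merged")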

    Quantisation Options

    GGUF supports multiple quantisation levels that trade model size for quality:

    • Q8 — Highest quality, ~8GB for a 7B model
    • Q5 — Good quality with smaller size, ~5.5GB for a 7B model
    • Q4 — Acceptable quality for most tasks, ~4.5GB for a 7B model

    For domain-specific tasks where the model has been fine-tuned on high-quality data, Q5 quantisation typically maintains 95%+ of full-precision quality. Start there and adjust based on your quality requirements.

    Deployment

    # With Ollama
    ollama create my-model -f Modelfile
    ollama run my-model
    
    # With llama.cpp (the server binary is named llama-server in current builds)
    ./llama-server -m my-model.gguf --port 8080
    

    Your model is now running locally. No API calls. No per-token costs. No vendor that can deprecate it. No ToS that can change underneath you.

    The Ertas Workflow: All of This Without a CLI

    The steps above work. They're also a lot of manual work — setting up inference environments, managing JSONL files, configuring training runs, handling quantisation and export.

    Ertas compresses this entire pipeline into a visual interface:

    1. Upload your dataset (or import from Hugging Face via URL)
    2. Select your base model from supported open-source options
    3. Configure and train with visual controls and recommended defaults
    4. Evaluate with built-in comparison tools
    5. Export to GGUF with one click

    The entire process — from dataset upload to deployable GGUF file — takes minutes instead of hours. No Python environment to manage. No GPU provisioning. No YAML configuration files.

    The models you create are yours. Download the GGUF, deploy it wherever you want. Ertas is the training tool, not the deployment lock-in.

    The Right Way to Distill

    The DeepSeek situation wasn't a failure of distillation as a technique. It was a failure of strategy — choosing to extract capabilities from a closed platform instead of building on open foundations.

    The right approach:

    1. Open-source teacher with a permissive licence
    2. Your domain data as the differentiator
    3. Legal fine-tuning that produces capabilities nobody else has
    4. GGUF export for portable, vendor-independent deployment
    5. Full ownership of every component in the pipeline

    No fake accounts. No ToS violations. No legal risk. Just a better model that's actually yours.


    Do all of this without touching a CLI. Ertas handles the entire distillation pipeline visually — from dataset to GGUF in minutes. Pre-subscribe at early-bird pricing. See plans →

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
