
Model Distillation Is Not Theft — But Here's Why You Should Do It Yourself
Model distillation is a legitimate ML technique — every major lab does it. The DeepSeek incident was a contractual violation, not theft. Here's why fine-tuning open-source models on your own data is the ethical, legal, and strategically sound path.
The word "distillation" became controversial overnight. After Anthropic revealed that DeepSeek, Moonshot AI, and MiniMax had used 24,000 accounts to systematically extract Claude's capabilities, the technique was framed in headlines as something between espionage and theft.
That framing is misleading. Distillation is one of the most widely used techniques in machine learning. The problem isn't the method — it's where you point it.
Understanding this distinction matters for every team building with AI, because the same technique that got three Chinese labs into trouble is also the cleanest path to owning AI capabilities that nobody can take away from you.
Distillation 101: What It Actually Is
Knowledge distillation was introduced in a 2015 paper by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. The idea is simple: take a large, expensive "teacher" model and train a smaller, cheaper "student" model to mimic its behaviour.
The teacher generates outputs for a set of inputs, and the student learns to reproduce them. In Hinton's original formulation the student matches the teacher's full probability distribution over answers (the "soft targets"), which carries far more signal per example than the correct answer alone. The result is a smaller model that captures a useful subset of the teacher's capabilities at significantly lower compute cost.
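To make that concrete, here is a minimal sketch of the classic distillation loss in PyTorch. The temperature and weighting values are illustrative defaults, not recommendations:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence between the temperature-scaled teacher
    # and student distributions. The T^2 factor keeps gradient magnitudes
    # comparable across temperatures, as suggested in the original paper.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

In practice, API-based distillation never sees the teacher's logits at all, so it falls back to sequence-level distillation: generate teacher outputs and fine-tune the student on them as plain text. That variant is what the rest of this piece relies on.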
This is not a fringe technique. It's foundational to how the AI industry operates:
- OpenAI distills its own models to create cheaper variants; GPT-4o-mini is widely understood to be derived from GPT-4o this way
- Google has said Gemini 1.5 Flash was distilled from the larger Gemini 1.5 Pro for low-latency applications
- Anthropic's lighter Claude models (the Haiku tier) are widely believed to rely on internal distillation
- Meta published Llama specifically so the community could build on it — including through distillation
Every lab with a frontier model also has smaller models derived from it. Distillation is how they get there.
The Spectrum of Legitimacy
Not all distillation is the same. There's a spectrum, and where you sit on it determines whether you're doing standard ML engineering or violating someone's terms of service.
Level 1: Open-Source to Open-Source (Fully Permitted)
You take Llama 3 70B and use its outputs to train a smaller variant such as Llama 3 8B. Meta's licence explicitly permits this as long as you disclose the derivative's origin (Llama's naming and attribution requirements). Most Qwen and Mistral models ship under Apache 2.0, which is more permissive still.
This is the equivalent of reading a textbook and writing your own notes. The knowledge is freely available. The application is yours.
Level 2: Closed API to Your Own Model (ToS Violation)
You take Claude or GPT-4 outputs at scale and use them to train a competing model. The technical process is identical to Level 1. But the provider's Terms of Service prohibit using outputs as training data for competing models.
This is where DeepSeek, Moonshot, and MiniMax operated. They used the API as designed, received outputs, and repurposed those outputs in a way the ToS forbids.
Level 3: Proprietary Access Exploitation (Theft)
You obtain model weights through unauthorised access — a data breach, an insider, reverse engineering — and use them directly. This is straightforward intellectual property theft, potentially criminal.
The DeepSeek situation sits squarely at Level 2. They didn't steal weights. They didn't hack systems. They used the API, paid for access, and used the outputs in a way Anthropic's ToS prohibits. Anthropic's own blog acknowledged that distillation is "a widely used and legitimate training method" — the issue was contractual, not criminal. Even the South China Morning Post's analysis noted that Anthropic acknowledged "the distilling technique itself was not illegal." It's a ToS violation, not a heist — but that distinction didn't stop the story from being framed as something more dramatic.
It's also worth noting the scope of these prohibitions. Anthropic's Usage Policy doesn't just prohibit training competing models — it prohibits using outputs to train any AI model without prior authorisation. This means the same clause that applies to DeepSeek's industrial-scale extraction also applies, technically, to a five-person SaaS team fine-tuning a small classifier on logged API responses. The Terms of Service make no distinction based on scale, intent, or whether you're a competitor.
Why Every Major Lab Distills
The fact that OpenAI, Anthropic, Google, and Meta all distill their own models should tell you something: the technique produces real value.
Internal distillation lets labs offer cheaper model variants without training from scratch. GPT-4o-mini isn't a separate model developed independently; by most public accounts it inherits capabilities from GPT-4o through distillation, then gets optimised for cost and latency.
This creates a strategic asymmetry. The labs that build frontier models can distill them into a product line. Everyone else has to either build their own frontier model (billions of dollars) or find another way to get similar capabilities into a smaller package.
That "other way" is what drove the DeepSeek campaign. They couldn't legally distill Claude. They couldn't afford to build a Claude-equivalent from scratch on the same timeline. So they found a middle path that turned out to violate Anthropic's ToS.
The lesson isn't that distillation is wrong. The lesson is that distilling from closed APIs is strategically fragile — even if you manage to do it without getting caught.
Why Distilling From Closed APIs Is a Bad Strategy
Set aside the ethics for a moment. Even if you could distill from Claude or GPT-4 without consequence, it's a poor strategic choice.
You get generic capabilities, not domain expertise. A model distilled from GPT-4 knows what GPT-4 knows — which is everything and nothing in particular. It won't understand your customers' terminology, your industry's edge cases, or your product's specific requirements. You get a cheaper version of a generalist, not a specialist.
You inherit the teacher's weaknesses. If GPT-4 hallucinates on certain topics, your distilled model will too. If Claude has blind spots in your domain, those blind spots transfer to the student. You can't fix what you don't control.
You can't iterate. When your business needs change, you can't retrain the teacher. You can't add new examples to its training set. You can't adjust its behaviour for new use cases. You're stuck with a snapshot of someone else's capabilities at a single point in time.
You're one detection system away from losing everything. Anthropic caught these campaigns. They'll catch more. The detection technology will only improve. Building a business dependency on a technique that violates your provider's ToS is building on borrowed time.
Fine-tune open-source models on your own data — no ToS violations, no vendor dependency. Join the Ertas waitlist →
The Better Path: Fine-Tune Open-Source Models on Your Own Data
There's a version of model distillation that's completely legal, strategically sound, and produces better results for domain-specific applications. It looks like this:
Step 1: Start with an open-source base. Llama 3, Qwen 2.5, Mistral, Gemma — all available under licences that permit commercial use and derivative model creation. You download the weights. They're yours.
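As a concrete starting point, here's a minimal sketch of pulling a base model down with the huggingface_hub client. The repo id and local path are illustrative, and the Llama repos are gated, so this assumes you've accepted Meta's licence on Hugging Face and logged in:

```python
from huggingface_hub import snapshot_download

# Pull the full weight snapshot into a directory you control. After this,
# the model lives on your disk rather than behind someone else's API.
local_dir = snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    local_dir="models/llama-3-8b-instruct",
)
print(f"Base model downloaded to: {local_dir}")
```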
Step 2: Use the base model (or a larger variant) as a teacher on your data. Run Llama 70B over your domain-specific documents, support logs, or product data. Generate training examples that combine the model's general intelligence with your domain context. This is legal distillation — open-source model to open-source model, with your proprietary data mixed in.
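A hedged sketch of what that generation loop can look like, using the transformers pipeline API. The prompt template, file names, and field layout are all assumptions for illustration; the output lands in the prompt/completion format that recent TRL trainers accept:

```python
import json
from transformers import pipeline

# The 70B teacher: big enough to produce high-quality, domain-aware answers.
teacher = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    device_map="auto",
)

# Hypothetical prompt template: adapt it to whatever your documents contain.
TEMPLATE = (
    "Read this support ticket and draft the ideal agent reply, using our "
    "product's terminology.\n\nTicket:\n{doc}\n\nReply:"
)

with open("domain_docs.jsonl") as src, open("train_data.jsonl", "w") as dst:
    for line in src:
        prompt = TEMPLATE.format(doc=json.loads(line)["text"])
        result = teacher(prompt, max_new_tokens=512, return_full_text=False)
        dst.write(json.dumps({
            "prompt": prompt,
            "completion": result[0]["generated_text"].strip(),
        }) + "\n")
```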
Step 3: Fine-tune a smaller model on those examples. Take Llama 3 8B or Qwen 2.5 7B/14B and fine-tune it on the dataset you just created. The result is a model that combines general language capability with deep understanding of your specific domain.
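For the fine-tune itself, a minimal LoRA sketch with TRL's SFTTrainer might look like the following, assuming a recent TRL version that accepts prompt/completion datasets. Hyperparameters are placeholders, not a recipe:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# The prompt/completion pairs the teacher generated in Step 2.
dataset = load_dataset("json", data_files="train_data.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(output_dir="llama-3-8b-domain", num_train_epochs=3),
    # LoRA trains small adapter matrices instead of all 8B parameters,
    # so the memory footprint is a fraction of full fine-tuning.
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
)
trainer.train()
trainer.save_model("llama-3-8b-domain")
```

Before export, merge the adapters back into the base weights (PeftModel's merge_and_unload) so you have a single standalone checkpoint.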
Step 4: Export and deploy. Export to GGUF format. Run on Ollama, llama.cpp, LM Studio, or any compatible inference engine. No API calls. No per-token costs. No vendor dependency.
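One way to wire up that export, assuming a local clone of llama.cpp and the Ollama CLI on your PATH; every path and model name here is illustrative:

```python
import subprocess

# Convert the merged checkpoint to GGUF with llama.cpp's conversion script.
# q8_0 is a conservative quantisation; smaller types trade quality for size.
subprocess.run([
    "python", "llama.cpp/convert_hf_to_gguf.py", "llama-3-8b-domain-merged",
    "--outfile", "llama-3-8b-domain.gguf",
    "--outtype", "q8_0",
], check=True)

# Ollama loads a local GGUF through a Modelfile that points at it.
with open("Modelfile", "w") as f:
    f.write("FROM ./llama-3-8b-domain.gguf\n")

subprocess.run(["ollama", "create", "domain-model", "-f", "Modelfile"], check=True)

# From here, inference is entirely local: no API calls, no per-token costs.
subprocess.run(["ollama", "run", "domain-model", "Draft a reply to this ticket:"],
               check=True)
```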
This is what genuine model ownership looks like. You control the base model (open-source). You control the training data (yours). You control the fine-tuned model (yours). You control the deployment (your infrastructure).
Nobody can deprecate it. Nobody can change the pricing. Nobody can revoke your access. Nobody can ban your account.
Owned Models Beat Copied Models
Here's the part that often gets lost in the distillation debate: a fine-tuned model trained on your domain data typically outperforms a distilled copy of a frontier model on your specific tasks.
This isn't intuitive. GPT-4 is objectively more capable than a 7B parameter model on general benchmarks. But general benchmarks don't measure what matters for production applications.
What matters is: does the model understand your customer's terminology? Does it follow your output format consistently? Does it handle your industry's edge cases? Does it produce outputs that align with your quality standards?
A distilled copy of GPT-4 gives you a compressed generalist. A fine-tuned 7B model trained on 500-2,000 examples of your specific task gives you a specialist that can reach 90-95% accuracy, often matching or exceeding what prompt-engineered GPT-4 delivers on narrow, well-defined work.
A B2B SaaS company fine-tuning on its own support ticket data measured 94% classification accuracy. The same task with prompt-engineered GPT-4 reached 71%. That 23-percentage-point gap is the difference between "mostly works" and "ready for production."
The distilled model is generic. The fine-tuned model is yours. One is a commodity. The other is a competitive advantage.
The Ethical Framework
If you want a simple decision framework for distillation:
Is the source model open-source with a permissive licence? Use it freely. Llama, Qwen, Mistral — all fair game. Distill, fine-tune, deploy commercially. Just comply with attribution requirements where specified.
Is the source model behind a closed API? Don't use its outputs for training data. Not because the technique is wrong, but because it violates the ToS, creates legal risk, and produces strategically inferior results compared to fine-tuning on your own data.
Do you have domain-specific data? Fine-tuning on your data produces better results for your use case than distillation from any generic model, closed or open. Your data is your unfair advantage — use it.
Do you need to compress a model you already own? Distill your own fine-tuned model from 70B to 7B for deployment. This is the hybrid approach that gives you both quality and efficiency — and it's entirely within your control.
The DeepSeek story isn't a cautionary tale about distillation. It's a cautionary tale about dependency on AI systems you don't own. The technique itself is sound. The question is whether you apply it to build on foundations you control — or foundations that can be pulled out from under you at any time.
What This Means for Builders
If you're building AI-powered products or services, the path forward is straightforward:
- Use open-source base models. The quality is there. Llama 3, Qwen 2.5, and Mistral are production-ready.
- Invest in your training data. Your domain data is your moat. Curate it, clean it, structure it.
- Fine-tune for your specific tasks. Don't settle for generic — build models that understand your business.
- Own the result. Export to GGUF. Deploy on your terms. Control your costs.
You don't need to create 24,000 fake accounts. You don't need to worry about ToS violations. You don't need to depend on any single AI provider.
You just need your data and a fine-tuning pipeline.
Skip the grey zone entirely. Fine-tune your own models with Ertas — full pipeline from dataset to GGUF, no code required. Pre-subscribe at early-bird pricing. See plans →