
Model Distillation Explained: Run Sonnet-Quality Output on a $0 Inference Bill
A complete guide to model distillation — how to transfer capabilities from large frontier models like Claude Sonnet into small local models, achieving comparable quality at zero ongoing inference cost.
Your Claude Sonnet bill last month was $2,400. This month it will be higher because you added three new clients. Next month, higher again. The trajectory is clear, and it points in exactly one direction: up.
Model distillation is the engineering technique that breaks this cycle. You take the knowledge embedded in a large, expensive frontier model and compress it into a small, local model that runs on your own hardware at zero marginal cost per inference. Not reduced cost. Zero cost.
This is not theoretical. Teams running distilled models in production report 85-95% of the teacher model's quality on their specific tasks, with inference costs that round to $0.00 on the monthly ledger.
What Distillation Actually Is
Distillation is a training technique where a large "teacher" model generates labelled outputs, and a smaller "student" model learns to reproduce those outputs. The student does not need to learn everything the teacher knows. It only needs to learn the specific task you care about.
Think of it this way: Claude Sonnet can write poetry, debug Rust code, summarise medical research, and classify customer support tickets. You only need it to classify customer support tickets. A model that is 50x smaller can learn that one task extremely well, because the task space is narrow enough for a small model to cover completely.
The key insight: the frontier model is the labeller, not the end product. Every API call you make to Claude or GPT-4o is a potential training example. The expensive model does the hard cognitive work once. The cheap model learns to replicate that work indefinitely.
Why Small Models Can Match Large Models on Narrow Tasks
This seems counterintuitive. How can a 7B-parameter model match a 200B+-parameter model? The answer is in the word "narrow."
Large language models allocate their parameters across an enormous range of capabilities. Claude Sonnet can discuss quantum physics, medieval history, TypeScript generics, and French cooking. Most of those parameters are not relevant to your classification task.
When you fine-tune a small model on 2,000 high-quality examples of your specific task, you are concentrating all of that model's capacity on one thing. The 7B model does not need to discuss quantum physics. It needs to classify support tickets into 12 categories with 94% accuracy. That is well within the capacity of a focused small model.
Research from multiple labs confirms this pattern. On constrained tasks — classification, extraction, formatting, domain Q&A — fine-tuned models in the 3B-8B range consistently achieve 85-95% of frontier model performance. On some tasks, they exceed it, because fine-tuning eliminates the inconsistencies that general-purpose models sometimes exhibit.
The Distillation Workflow
The process has four stages. Each one is straightforward.
Stage 1: Define Your Task Precisely
This is the step most teams rush through, and it is the most important one.
A well-defined task has:
- Clear input format. What exactly does the model receive? A customer message? A document? A structured JSON object?
- Clear output format. What exactly should the model produce? A category label? A JSON object? A score between 1 and 10?
- Bounded scope. The task should be narrow enough that you can describe the complete output space. "Classify into one of 12 categories" is bounded. "Write a thoughtful response" is not.
- Measurable quality. You need a way to score outputs. Accuracy for classification, F1 for extraction, exact match for formatting.
If you cannot define these four things clearly, stop. Distillation works on well-defined tasks. Vague tasks produce vague results.
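To make this concrete, the four requirements can be written down as a checkable contract before any training data exists. A minimal sketch in Python, using the 12-category ticket classifier from the worked example later in this post (field names are illustrative):

```python
from dataclasses import dataclass

# Bounded scope: the complete output space is enumerable up front.
CATEGORIES = [
    "order_status", "returns", "billing", "product_questions",
    "shipping", "technical_issues", "account_management", "complaints",
    "compliments", "feature_requests", "partnerships", "spam",
]

@dataclass
class TicketInput:
    """Clear input format: one free-text customer message."""
    message: str

@dataclass
class TicketOutput:
    """Clear output format: one label plus a confidence score."""
    category: str
    confidence: float  # 0.0-1.0

def is_valid(out: TicketOutput) -> bool:
    """Measurable quality starts with a mechanically checkable contract."""
    return out.category in CATEGORIES and 0.0 <= out.confidence <= 1.0
```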
Stage 2: Generate Teacher Outputs
There are two sources for training data:
Production logs. If you are already using Claude or GPT-4o in production, you have logs. Every input-output pair is a training example. This is the best data source because it reflects your real-world distribution of inputs.
Synthetic generation. Create diverse inputs programmatically and run them through the teacher model. For a support ticket classifier, generate ticket variations covering all 12 categories, including edge cases and ambiguous inputs.
The target: 1,500-3,000 examples. This is not a typo. You do not need millions of examples. For well-defined narrow tasks, 1,500-3,000 high-quality examples are sufficient for strong performance. More data helps, but the returns diminish sharply after 3,000 examples for most classification and extraction tasks.
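As an illustration, here is what the labelling loop can look like with the Anthropic Python SDK. The model id, prompt wording, and file names are placeholders, not part of any prescribed pipeline:

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Abbreviated label list; enumerate all of your categories in practice.
PROMPT = (
    "Classify this support ticket into exactly one of: order_status, "
    "returns, billing. Respond with only the label.\n\n{ticket}"
)

def label_with_teacher(ticket_text: str) -> str:
    """One teacher call produces one training example's target output."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder: your teacher model id
        max_tokens=20,
        messages=[{"role": "user", "content": PROMPT.format(ticket=ticket_text)}],
    )
    return response.content[0].text.strip()

# Accumulate (input, teacher output) pairs as JSONL training data.
with open("tickets.jsonl") as src, open("train.jsonl", "w") as dst:
    for line in src:
        ticket = json.loads(line)["message"]
        pair = {"input": ticket, "output": label_with_teacher(ticket)}
        dst.write(json.dumps(pair) + "\n")
```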
Stage 3: Fine-Tune the Student Model
Pick your student model. For most tasks, Llama 3.1 8B or Qwen 2.5 7B are strong choices. They are large enough to handle real-world complexity, small enough to run on modest hardware.
Fine-tuning configuration for distillation:
- Method: LoRA (rank 16-32 is sufficient for most distillation tasks)
- Learning rate: 2e-4 to 5e-4
- Epochs: 3-5 (watch for overfitting after epoch 3)
- Batch size: 4-8
- Training time: 30-90 minutes depending on dataset size and hardware
The training cost on Ertas: typically $5-15 for a standard distillation run. On your own GPU: the cost of electricity for 30-90 minutes of compute.
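For reference, a minimal sketch of that configuration using the Hugging Face peft library. The base model id is a placeholder, and the training loop itself is left to whatever trainer you already use:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder: any 3B-8B student

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")

# LoRA adapter mirroring the numbers above: rank 16, alpha 32.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the weights

# Train with your preferred trainer (e.g. transformers.Trainer or trl's
# SFTTrainer) at learning_rate~2e-4, 3 epochs, batch size 4-8.
```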
Stage 4: Evaluate Against the Teacher
Hold out 10-15% of your dataset for evaluation. Run both the teacher model and the student model on the same held-out inputs. Compare outputs.
For classification tasks, you are looking at accuracy. A well-distilled model typically achieves 90-95% agreement with the teacher on the held-out set. For extraction tasks, measure F1 score on extracted fields. For formatting tasks, measure exact-match rate on the output schema.
If the student scores below 85% agreement with the teacher, you likely need more training data, better data curation, or a larger student model.
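A sketch of the agreement check, assuming a hypothetical JSONL file that holds both models' outputs for each held-out input:

```python
import json

def agreement(teacher: list[str], student: list[str]) -> float:
    """Fraction of held-out inputs where the student matches the teacher."""
    assert len(teacher) == len(student)
    return sum(t == s for t, s in zip(teacher, student)) / len(teacher)

# Hypothetical results file: one JSON object per held-out input.
with open("holdout_results.jsonl") as f:
    rows = [json.loads(line) for line in f]

score = agreement([r["teacher"] for r in rows], [r["student"] for r in rows])
print(f"teacher-student agreement: {score:.1%}")
if score < 0.85:
    print("below threshold: add data, curate harder, or try a larger student")
```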
The Cost Math
This is where distillation becomes compelling. Let us use real numbers.
Scenario: Customer support classifier handling 50,000 requests per month.
Using Claude Sonnet API:
- Average input: 200 tokens, average output: 50 tokens
- Cost per request: ~$0.00135 (at $3/$15 per million input/output tokens)
- Monthly cost: ~$68/month
- Annual cost: ~$810/year
Using a distilled Llama 8B running locally:
- One-time training cost: $10-15
- Hardware: runs on any machine with 8GB+ VRAM, or CPU with 16GB RAM
- Monthly inference cost: $0
- Annual cost: $10-15 total
The break-even point is about a week. After that, every inference is free.
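The arithmetic behind those figures, as a quick sanity check you can rerun with your own traffic numbers:

```python
# Prices in dollars per million tokens; traffic figures from the scenario above.
IN_PRICE, OUT_PRICE = 3.00, 15.00
IN_TOKENS, OUT_TOKENS = 200, 50
REQUESTS_PER_MONTH = 50_000
TRAINING_COST = 15  # one-time, upper end of the $10-15 range

per_request = IN_TOKENS / 1e6 * IN_PRICE + OUT_TOKENS / 1e6 * OUT_PRICE
monthly = per_request * REQUESTS_PER_MONTH

print(f"per request: ${per_request:.5f}")                         # $0.00135
print(f"per month:   ${monthly:.2f}")                             # $67.50
print(f"break-even:  {TRAINING_COST / (monthly / 30):.1f} days")  # ~6.7 days
```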
Scale this to an agency running 10 client models, each handling 50,000 requests per month:
- API approach: ~$675/month, ~$8,100/year
- Distilled models: $100-150 one-time, then $0/month
That is roughly $8,000 back in your pocket every year. For doing the exact same work.
Which Tasks Distill Well (and Which Do Not)
Not every task is a candidate for distillation. Here is an honest assessment.
Tasks That Distill Well
Classification. Categorising inputs into predefined labels. Support tickets, sentiment, intent detection, document categorisation. These are the ideal distillation targets. Fine-tuned small models routinely match or exceed frontier models because the output space is constrained and well-defined.
Extraction. Pulling structured data from unstructured text. Names, dates, amounts, addresses, product attributes. The pattern is learnable, and the output format is fixed.
Formatting and transformation. Converting data from one format to another. Markdown to HTML, natural language to SQL (with constrained schemas), text to JSON with a defined schema. The transformation rules are finite and learnable.
Domain-specific Q&A. Answering questions within a bounded knowledge domain, especially when you can embed the relevant context in the prompt. Medical terminology lookups, legal clause explanations, product FAQ responses.
Tasks That Do Not Distill Well
Open-ended generation. Writing marketing copy, creative content, or long-form text where quality is subjective. The output space is too large for a small model to cover with limited training data.
Complex multi-step reasoning. Tasks requiring chains of logical deduction, mathematical proofs, or multi-hop reasoning across diverse domains. These rely on the broad knowledge and reasoning depth that large models accumulate from massive pre-training.
Novel problem solving. Tasks where inputs regularly fall outside the training distribution. If your production inputs look significantly different from your training data, the distilled model will struggle.
Instruction following with high variability. Tasks where the user provides complex, varying instructions that change the expected output format. The distilled model learns fixed patterns, not flexible instruction interpretation.
Practical Example: Distilling Claude Sonnet for Support Classification
Let us walk through a concrete case. An AI agency is building a customer support system for an e-commerce client. The system needs to classify incoming tickets into 12 categories: order status, returns, billing, product questions, shipping, technical issues, account management, complaints, compliments, feature requests, partnerships, and spam.
Step 1: Define the task. Input is a customer message (10-500 words). Output is a JSON object with category, confidence score, and one-line reasoning.
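A labelled output matching that definition might look like this (values illustrative):

```json
{
  "category": "returns",
  "confidence": 0.92,
  "reasoning": "Customer asks how to send back a jacket that arrived damaged."
}
```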
Step 2: Generate training data. Use Claude Sonnet to classify 2,500 real support tickets from the client's historical data. Additionally, generate 500 synthetic edge cases covering ambiguous tickets that could fall into multiple categories.
Step 3: Curate. Review the 3,000 examples. Remove 200 where Claude's output was inconsistent or clearly wrong. Remove 50 duplicates. Final dataset: 2,750 examples.
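A minimal curation pass of this kind, sketched in Python with hypothetical file names, an abbreviated label set, and the assumption that each teacher output was stored as the parsed JSON object (inconsistent-but-valid outputs still need human review):

```python
import hashlib
import json

VALID_LABELS = {"order_status", "returns", "billing"}  # abbreviated: all 12 in practice

def fingerprint(text: str) -> str:
    """Hash of the whitespace-normalised input, for exact-duplicate detection."""
    return hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()

seen, kept = set(), []
with open("labelled.jsonl") as f:  # hypothetical teacher-labelled dataset
    for line in f:
        ex = json.loads(line)
        fp = fingerprint(ex["input"])
        if fp in seen:
            continue  # duplicate input
        if ex["output"].get("category") not in VALID_LABELS:
            continue  # off-schema teacher output
        seen.add(fp)
        kept.append(ex)

with open("curated.jsonl", "w") as out:
    out.writelines(json.dumps(ex) + "\n" for ex in kept)
```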
Step 4: Fine-tune Llama 3.1 8B using LoRA in Ertas Studio. Training time: 45 minutes. Cost: $8.
Step 5: Evaluate on 275 held-out examples. Results: 93.1% agreement with Claude Sonnet on category assignment. On the cases where they disagree, human review shows the fine-tuned model is actually correct 40% of the time (Claude made errors on ambiguous cases that the fine-tuned model learned to handle better from the curated training data).
Step 6: Export to GGUF, deploy via Ollama on the client's server. Inference latency: 180ms average, compared to 800-1,200ms for the Sonnet API.
Result: faster, cheaper, more consistent, and running entirely on the client's infrastructure.
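To show what that deployment can look like: Ollama serves a REST API on localhost, so the client's application code reduces to a local HTTP call. Model and file names here are hypothetical:

```python
# One-time setup on the client's server, with a one-line Modelfile
# (FROM ./classifier.gguf) registered via: ollama create ticket-classifier -f Modelfile

import requests

def classify(ticket: str) -> str:
    """Query the locally served student model through Ollama's REST API."""
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "ticket-classifier", "prompt": ticket, "stream": False},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()["response"]

print(classify("Where is my order? It was supposed to arrive on Tuesday."))
```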
How Ertas Studio Streamlines the Pipeline
The distillation workflow described above involves multiple steps that traditionally require different tools, scripts, and manual coordination. Ertas Studio consolidates this into a single pipeline.
Dataset management. Import production logs or synthetic data. Ertas provides quality scoring and deduplication tools that flag low-quality examples before training begins.
Training configuration. Pre-configured LoRA templates optimised for distillation tasks. Select your base model (Llama, Qwen, Gemma, Mistral), adjust rank and learning rate, and launch training with one click.
Evaluation dashboard. Side-by-side comparison of teacher and student outputs on your held-out test set. Automated metrics (accuracy, F1, exact match) plus a review interface for manual spot-checking.
Export and deploy. One-click GGUF export with quantisation options (Q4, Q5, Q8). Direct integration with Ollama for local deployment. The model goes from training to production in minutes, not days.
The entire pipeline — from raw data to deployed model — typically takes 3-4 hours, including training time. The out-of-pocket cost is $5-50 depending on dataset size and model choice.
The Bottom Line
Model distillation is not a hack or a shortcut. It is a legitimate engineering technique backed by years of research and deployed at scale by teams across the industry. The premise is simple: large models are general; your tasks are specific. A small model, properly trained on your specific task, can match the large model's quality at a fraction of the cost.
The economics are unambiguous. A one-time investment of $10-50 eliminates an ongoing expense of $100-1,000+ per month. The break-even period is measured in days, not months.
If you are currently paying a per-token API bill for a well-defined, repeatable task, you are a candidate for distillation. The question is not whether it works — that has been demonstrated thousands of times. The question is when you start.
Ready to distill your first model? See our technical guide to distillation with LoRA for the detailed engineering walkthrough, or read about the hidden costs of per-token pricing to understand the full financial picture.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.