
Distilling Claude/GPT into a 7B Model for Production: Step-by-Step
A step-by-step tutorial for distilling the capabilities of Claude or GPT-4o into a 7B parameter model for local production deployment — from dataset generation through fine-tuning to GGUF export.
You have a system running on Claude Sonnet or GPT-4o. It works. The quality is solid. But every request costs money, every request adds latency, and every request sends your data to someone else's servers. You want to own the model that powers your product.
This tutorial walks you through the complete distillation pipeline: from defining your task through generating synthetic data, fine-tuning a 7B model, and deploying it locally via Ollama. Every step includes concrete parameters, expected outputs, and troubleshooting guidance.
- Total time: 3-5 hours (including training wait time)
- Total cost: $10-20 for the entire pipeline
- Expected result: 85-95% of teacher model quality on your specific task
Prerequisites
Before you start, you need:
- An Ertas account (free tier is sufficient for this tutorial)
- API access to your teacher model — either an Anthropic API key (for Claude) or an OpenAI API key (for GPT-4o)
- A clearly defined task — classification, extraction, formatting, or domain Q&A
- 50 seed examples of your task, with correct input-output pairs
- A machine for local deployment — 8GB+ VRAM GPU or 16GB+ RAM for CPU inference
Let us walk through each step.
Step 1: Define Your Task Scope
Distillation works on narrow tasks. The narrower, the better. Your task definition should fit in one sentence:
Good task definitions:
- "Classify customer support emails into one of 8 categories and return a JSON object with category and confidence."
- "Extract vendor name, invoice number, date, and total amount from invoice PDFs (pre-processed to text)."
- "Convert natural language search queries into Elasticsearch DSL queries for our product catalogue schema."
- "Answer questions about our API documentation using the relevant docs section as context."
Bad task definitions (too broad):
- "Help customers with their questions."
- "Process documents."
- "Write responses to emails."
Your task definition should specify:
- Input format: What does the model receive? How long is a typical input? What is the range?
- Output format: What exactly should the model produce? JSON? A label? Structured text?
- Output space: How many possible outputs are there? (12 categories, 5 extracted fields, etc.)
- Quality metric: How will you measure if the output is correct? Accuracy? F1? Exact match?
Write this down. You will refer to it throughout the process.
For this tutorial, we will use a running example: classifying SaaS product feedback into 6 categories (bug report, feature request, usability issue, performance complaint, positive feedback, question) and extracting the affected product area.
Step 2: Create 50 Seed Examples Manually
This is the step that feels tedious but determines the quality of everything downstream. You need 50 hand-crafted input-output pairs that represent the gold standard for your task.
Why 50? Because these seed examples serve three purposes:
- Prompt calibration — they help you write the prompt for the teacher model
- Quality reference — they define what "correct" looks like
- Test set foundation — 20 of them will become your initial evaluation set
Guidelines for seed examples:
- Cover all categories/output types. If you have 6 categories, include at least 8 examples per category.
- Include edge cases. Add 5-10 examples that are ambiguous, multi-label, or unusual.
- Match production distribution. If 40% of your real inputs are bug reports, roughly 40% of seeds should be bug reports.
- Be precise in outputs. Your outputs should be exactly what you want the model to produce — same JSON schema, same formatting, same level of detail.
Format your seed examples as JSONL:
{"input": "The dashboard keeps crashing when I try to export reports. This has been happening since the last update.", "output": "{\"category\": \"bug_report\", \"confidence\": 0.95, \"product_area\": \"dashboard\", \"summary\": \"Dashboard crashes on report export since last update\"}"}
{"input": "It would be amazing if you could add a dark mode to the mobile app.", "output": "{\"category\": \"feature_request\", \"confidence\": 0.92, \"product_area\": \"mobile_app\", \"summary\": \"Request for dark mode on mobile app\"}"}
Spend 1-2 hours on this step. The quality of your seed examples directly determines the quality of your final model.
Step 3: Generate 2,000 Synthetic Examples Using the Teacher Model
Now you scale up. You will use Claude or GPT-4o as the teacher model to generate 2,000 labelled examples.
3a: Write the Teacher Prompt
Use your seed examples to craft a system prompt for the teacher model. The prompt should:
- Describe the task precisely
- Include 3-5 of your seed examples as few-shot demonstrations
- Specify the exact output format
- Include edge case guidance
You are a product feedback classifier. Given a piece of user feedback about a SaaS product, classify it and extract the relevant product area.
Output a JSON object with these fields:
- category: one of [bug_report, feature_request, usability_issue, performance_complaint, positive_feedback, question]
- confidence: float between 0.0 and 1.0
- product_area: the affected product area (e.g., dashboard, api, mobile_app, billing, onboarding, integrations, general)
- summary: one-sentence summary of the feedback
Examples:
[Include 3-5 seed examples here]
Classify the following feedback:
3b: Generate Diverse Inputs
You need diverse inputs to feed to the teacher. There are several sources:
Production data. If you have historical feedback, this is ideal. Real data has the right distribution, vocabulary, and edge cases.
Synthetic input generation. Use the teacher model itself to generate realistic inputs. Prompt it with: "Generate 50 diverse examples of SaaS product feedback that cover all six categories. Make them realistic, varying in length (1-5 sentences), tone (frustrated, neutral, excited), and specificity."
Template-based generation. Create templates with slots and fill them programmatically. "The {feature} is {problem_description} when I try to {action}."
Aim for 2,500 raw inputs. You will lose some during quality filtering.
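To illustrate the template-based approach, here is a minimal sketch for the running SaaS-feedback example; the templates and slot values are invented and should be replaced with vocabulary from your own domain.

```python
import random

# Hypothetical slot values for the running SaaS-feedback example.
features = ["dashboard", "report export", "mobile app", "billing page", "search"]
problems = ["painfully slow", "throwing errors", "missing data", "completely unresponsive"]
actions = ["filter by date", "export to CSV", "invite a teammate", "switch workspaces"]

templates = [
    "The {feature} is {problem} when I try to {action}.",
    "Ever since the last update, the {feature} has been {problem}.",
    "Could you make it easier to {action}? The {feature} feels clunky.",
]

def generate_inputs(n: int) -> list[str]:
    """Fill templates with random slot values to produce raw inputs for the teacher."""
    return [
        random.choice(templates).format(
            feature=random.choice(features),
            problem=random.choice(problems),
            action=random.choice(actions),
        )
        for _ in range(n)
    ]

print(generate_inputs(3))
```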
3c: Run the Teacher Model
Process all 2,500 inputs through the teacher model. This is the most expensive step in the pipeline, but it is a one-time cost.
Cost estimate for 2,500 examples:
- Claude Sonnet: ~$3-5 (at ~200 input tokens + 100 output tokens per example)
- GPT-4o: ~$4-7
Use batch processing to reduce costs. Both Anthropic and OpenAI offer batch APIs with 50% discounts. With batching:
- Claude Sonnet batch: ~$1.50-2.50
- GPT-4o batch: ~$2-3.50
Save all input-output pairs as JSONL. This is your raw training dataset.
3d: Rate Limit and Error Handling
When running 2,500 API calls, you will hit rate limits. Implement:
- Exponential backoff with jitter (start at 1s, max 60s)
- Retry logic for 429 and 500 errors (3 retries max)
- Checkpointing — save progress every 100 examples so you can resume if the script crashes
- Parallel requests — 5-10 concurrent requests is usually safe for standard API tiers
A well-implemented script processes 2,500 examples in 15-30 minutes.
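Here is a minimal sketch of such a script using the Anthropic Python SDK. The input/output file names, the prompt file, and the model version string are assumptions; adapt them to your setup.

```python
import json
import os
import random
import time

import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
SYSTEM_PROMPT = open("teacher_prompt.txt").read()  # the prompt from Step 3a (assumed filename)
INPUTS_PATH, OUTPUTS_PATH = "raw_inputs.jsonl", "teacher_outputs.jsonl"  # assumed filenames

def classify(text: str, max_retries: int = 3) -> str:
    """Call the teacher model, retrying 429/5xx errors with exponential backoff and jitter."""
    delay = 1.0
    for attempt in range(max_retries + 1):
        try:
            resp = client.messages.create(
                model="claude-3-5-sonnet-20241022",  # substitute your preferred Sonnet version
                max_tokens=256,
                system=SYSTEM_PROMPT,
                messages=[{"role": "user", "content": text}],
            )
            return resp.content[0].text
        except (anthropic.RateLimitError, anthropic.InternalServerError):
            if attempt == max_retries:
                raise
            time.sleep(delay + random.uniform(0, delay))  # backoff with jitter
            delay = min(delay * 2, 60.0)                   # cap at 60s

inputs = [json.loads(line)["input"] for line in open(INPUTS_PATH)]
done = sum(1 for _ in open(OUTPUTS_PATH)) if os.path.exists(OUTPUTS_PATH) else 0  # resume point
with open(OUTPUTS_PATH, "a") as out:
    for i, text in enumerate(inputs[done:], start=done):
        out.write(json.dumps({"input": text, "output": classify(text)}) + "\n")
        if (i + 1) % 100 == 0:  # checkpoint: flush and report every 100 examples
            out.flush()
            print(f"{i + 1}/{len(inputs)} complete")
```

This version runs sequentially; for 5-10 concurrent requests, wrap the classify calls in a concurrent.futures.ThreadPoolExecutor and keep the checkpointing on the main thread.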
Step 4: Quality Filter the Dataset
Raw teacher outputs are not all usable. Expect to discard ~25% of examples during quality filtering. This is normal and important.
Automated Filters
Run these checks programmatically:
- Schema validation. Parse every output as JSON. Discard any example where the output is not valid JSON or does not match your expected schema. (Expect 2-5% failure rate.)
- Category validation. Check that the category field contains one of the valid values. The teacher model occasionally hallucinates a category that does not exist. (Expect 1-3% failure rate.)
- Confidence thresholding. Discard examples where the teacher's confidence is below 0.70. Low-confidence outputs are often incorrect or ambiguous, and training on them confuses the student. (Expect 5-10% removal.)
- Deduplication. Remove near-duplicate inputs. Use cosine similarity on embeddings with a threshold of 0.95. Training on near-duplicates wastes compute and biases the model. (Expect 5-10% removal.)
- Length outliers. Remove examples where the input or output is abnormally long or short (beyond 2 standard deviations from the mean). These are often malformed. (Expect 2-3% removal.) A code sketch of all five filters follows this list.
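A minimal sketch of the five automated filters, assuming the raw dataset from Step 3 lives in teacher_outputs.jsonl and using the sentence-transformers library for the embedding-based deduplication.

```python
import json

import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

VALID_CATEGORIES = {"bug_report", "feature_request", "usability_issue",
                    "performance_complaint", "positive_feedback", "question"}
REQUIRED_FIELDS = {"category", "confidence", "product_area", "summary"}

records = [json.loads(line) for line in open("teacher_outputs.jsonl")]  # assumed filename

# Filters 1-3: schema validation, category validation, confidence thresholding
clean = []
for r in records:
    try:
        out = json.loads(r["output"])
    except json.JSONDecodeError:
        continue
    if not REQUIRED_FIELDS <= out.keys():
        continue
    if out["category"] not in VALID_CATEGORIES:
        continue
    if out["confidence"] < 0.70:
        continue
    clean.append(r)

# Filter 4: near-duplicate removal via cosine similarity on embeddings (threshold 0.95)
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([r["input"] for r in clean], normalize_embeddings=True)
kept_indices = []
for i in range(len(clean)):
    if all(float(np.dot(emb[i], emb[j])) < 0.95 for j in kept_indices):
        kept_indices.append(i)
clean = [clean[i] for i in kept_indices]

# Filter 5: length outliers beyond 2 standard deviations from the mean input length
lengths = np.array([len(r["input"]) for r in clean])
mask = np.abs(lengths - lengths.mean()) <= 2 * lengths.std()
clean = [r for r, ok in zip(clean, mask) if ok]

with open("filtered.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in clean)
print(f"kept {len(clean)} of {len(records)} examples")
```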
Manual Spot-Check
After automated filtering, randomly sample 100 examples and review them manually. Look for:
- Teacher errors (wrong category, incorrect extraction)
- Inconsistent formatting (sometimes includes markdown, sometimes does not)
- Ambiguous examples where the "correct" answer is debatable
If more than 10% of your spot-check sample has issues, your teacher prompt needs improvement. Go back to Step 3a and refine.
Expected outcome: Start with 2,500 examples, end with ~1,800-2,000 clean examples. This is plenty for fine-tuning.
Step 5: Fine-Tune in Ertas Studio
Upload your filtered JSONL dataset to Ertas Studio. Here is the configuration for a standard distillation run.
Model Selection
For this tutorial, choose Llama 3.1 8B or Qwen 2.5 7B. Both are excellent distillation targets.
- Llama 3.1 8B — larger community, more tutorials if you hit issues
- Qwen 2.5 7B — slightly better on benchmarks, Apache 2.0 license
LoRA Configuration
| Parameter | Value | Notes |
|---|---|---|
| LoRA rank | 16 | Sufficient for classification/extraction. Use 32 for more complex tasks. |
| LoRA alpha | 32 | Standard: 2x the rank |
| Target modules | q_proj, k_proj, v_proj, o_proj | Attention layers only. Add gate_proj, up_proj, down_proj if quality is insufficient. |
| Dropout | 0.05 | Light regularisation |
Training Parameters
| Parameter | Value | Notes |
|---|---|---|
| Learning rate | 2e-4 | Standard for LoRA distillation |
| LR scheduler | Cosine | With warmup ratio 0.03 |
| Batch size | 4 | Increase to 8 if you have sufficient VRAM |
| Gradient accumulation | 4 | Effective batch size = 16 |
| Epochs | 3 | Start here. Add epochs only if eval loss is still decreasing. |
| Max sequence length | 512 | Increase if your inputs/outputs are longer |
| Weight decay | 0.01 | Standard regularisation |
| Warmup ratio | 0.03 | ~3% of training steps |
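If you later want to reproduce the run on your own GPU rather than in Ertas Studio, the same settings map onto the Hugging Face peft and transformers libraries roughly as follows; the output directory is an assumption, and wiring these objects into a supervised fine-tuning trainer (such as TRL's SFTTrainer) with your filtered dataset is left to your stack.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA configuration mirroring the table above
lora_config = LoraConfig(
    r=16,                       # LoRA rank; raise to 32 for more complex tasks
    lora_alpha=32,              # standard: 2x the rank
    lora_dropout=0.05,          # light regularisation
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention layers only
    task_type="CAUSAL_LM",
)

# Training parameters mirroring the table above
training_args = TrainingArguments(
    output_dir="feedback-classifier-lora",  # assumed output directory
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,          # effective batch size = 16
    num_train_epochs=3,
    logging_steps=10,
)
```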
Training Time and Cost
With 2,000 examples and the configuration above:
- On Ertas Studio: 30-45 minutes, cost ~$8-12
- On your own A100 (80GB): 20-30 minutes
- On your own RTX 4090 (24GB): 40-60 minutes (with QLoRA quantisation)
Ertas Studio shows real-time training metrics: loss curve, learning rate schedule, and evaluation metrics at each epoch boundary.
What to Watch During Training
- Training loss should decrease steadily through epochs 1-2 and begin to plateau in epoch 3
- Evaluation loss should track training loss closely. If eval loss starts increasing while training loss decreases, you are overfitting. Stop training.
- If loss spikes in the first 100 steps, your learning rate is too high. Try 1e-4.
- If loss barely decreases, your learning rate is too low. Try 5e-4.
Step 6: Evaluate Against the Teacher Model
After training completes, evaluate the student model on your held-out test set (the 20 seed examples you reserved, plus 10% of your training data held out by Ertas automatically).
Metrics to Check
Classification accuracy: What percentage of examples does the student classify correctly?
- Target: 90%+ agreement with the teacher
- Acceptable: 85%+
- Needs work: Below 85%
Exact match on structured output: What percentage of outputs are exactly correct (all fields match)?
- Target: 85%+
- Acceptable: 75%+
Per-category breakdown: Check accuracy for each category separately. If one category is significantly worse than others, you likely need more training examples for that category.
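A minimal evaluation sketch, assuming you have written teacher (or gold) and student outputs side by side into an eval_results.jsonl file with the hypothetical fields shown in the comment below; it reports classification accuracy, exact match, and the per-category breakdown.

```python
import json
from collections import defaultdict

# Each line: {"input": ..., "expected": "<teacher JSON string>", "predicted": "<student JSON string>"}
rows = [json.loads(line) for line in open("eval_results.jsonl")]  # assumed filename/format

total, category_correct, exact_match = 0, 0, 0
per_category = defaultdict(lambda: [0, 0])  # category -> [correct, total]

for row in rows:
    expected = json.loads(row["expected"])
    try:
        predicted = json.loads(row["predicted"])
    except json.JSONDecodeError:
        predicted = {}  # unparseable student output counts as wrong
    total += 1
    cat = expected["category"]
    per_category[cat][1] += 1
    if predicted.get("category") == cat:
        category_correct += 1
        per_category[cat][0] += 1
    if predicted == expected:  # all fields must match for exact match
        exact_match += 1

print(f"classification accuracy: {category_correct / total:.1%}")
print(f"exact match:             {exact_match / total:.1%}")
for cat, (correct, n) in sorted(per_category.items()):
    print(f"  {cat:<22} {correct}/{n} ({correct / n:.1%})")
```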
Side-by-Side Comparison
Ertas Studio provides a side-by-side comparison view where you can see teacher and student outputs for every test example. Review the disagreements manually. Often, you will find that:
- 40-50% of disagreements are cases where the student is actually correct (the teacher made an error)
- 30-40% are cases where both answers are reasonable (ambiguous inputs)
- 10-20% are genuine student errors that could be fixed with more training data
If Quality Is Below Target
If your evaluation scores are below your target:
- Check data quality first. Review 50 random training examples. If more than 5% have incorrect labels, clean your data and retrain.
- Add more data for weak categories. If one category underperforms, generate 200-300 additional examples specifically for that category.
- Increase LoRA rank. Move from rank 16 to rank 32. This gives the model more capacity to learn complex patterns.
- Try a larger model. If Qwen 2.5 7B or Llama 3.1 8B is not sufficient, move up to Qwen 2.5 14B.
- Add more target modules. Include the MLP layers (gate_proj, up_proj, down_proj) in addition to attention layers.
Step 7: Export to GGUF and Deploy via Ollama
Once you are satisfied with evaluation results, export the model for local deployment.
GGUF Export
In Ertas Studio, select your trained model and click Export. Choose your quantisation level:
| Quantisation | Size (7B model) | Quality Retention | Use Case |
|---|---|---|---|
| Q8_0 | ~7.5 GB | 99%+ | Maximum quality, server deployment |
| Q5_K_M | ~5.0 GB | 97-98% | Good balance, desktop deployment |
| Q4_K_M | ~4.0 GB | 95-97% | Good balance, constrained hardware |
| Q3_K_M | ~3.0 GB | 90-95% | Minimum viable, edge deployment |
Recommendation: Start with Q5_K_M. It offers an excellent quality-size balance. Only drop to Q4 if you have hardware constraints. Only use Q8 if you need the last 1-2% of quality.
Export takes 2-5 minutes. You will get a single .gguf file.
Deploy with Ollama
Create a Modelfile:
FROM ./your-model-Q5_K_M.gguf
PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER num_ctx 512
SYSTEM """You are a product feedback classifier. Given user feedback, classify it and extract the relevant product area. Output a JSON object with: category, confidence, product_area, summary."""
Register and run:
ollama create feedback-classifier -f Modelfile
ollama run feedback-classifier "The search function is incredibly slow, takes 10+ seconds to return results"
Expected output:
{"category": "performance_complaint", "confidence": 0.94, "product_area": "search", "summary": "Search function has 10+ second response times"}
Integration
For production, use Ollama's HTTP API:
curl http://localhost:11434/api/generate -d '{
"model": "feedback-classifier",
"prompt": "The search function is incredibly slow, takes 10+ seconds to return results",
"stream": false
}'
Ollama serves requests at 30-120 tokens/second depending on your hardware. For a typical 50-token classification output, expect roughly 0.4-1.7 seconds of generation time, plus a small amount of prompt-processing overhead.
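In application code, the same call looks like this with Python's requests library; the format option asks Ollama to constrain the output to valid JSON so the result can be parsed directly.

```python
import json

import requests

def classify_feedback(text: str) -> dict:
    """Send one piece of feedback to the locally running model via Ollama's HTTP API."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "feedback-classifier",
            "prompt": text,
            "stream": False,
            "format": "json",  # constrain decoding to valid JSON
        },
        timeout=30,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["response"])  # the generated text is in the "response" field

print(classify_feedback("The search function is incredibly slow, takes 10+ seconds to return results"))
```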
Step 8: Monitor Production Performance and Iterate
Deployment is not the end. Production data will reveal edge cases your training data did not cover.
Monitoring Checklist
- Log all inputs and outputs. Every production request is a potential training example for the next iteration. (A minimal logging sketch follows this list.)
- Track confidence scores. If the model outputs confidence below 0.70, flag the request for review. A rising percentage of low-confidence requests indicates distribution drift.
- Sample-based quality audits. Review 50 random production outputs weekly. Calculate your real-world accuracy and compare it to your evaluation set accuracy.
- Category distribution monitoring. Track the distribution of predicted categories over time. If the distribution shifts significantly, investigate whether the input distribution has changed or the model is drifting.
- Latency monitoring. Track p50, p95, and p99 latency. If latency increases, check for resource contention on the deployment server.
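As a minimal sketch of the first two checklist items, here is one way to log every request and flag low-confidence outputs for review; the log path and threshold are assumptions.

```python
import json
import time

LOG_PATH = "production_log.jsonl"  # assumed log location
CONFIDENCE_THRESHOLD = 0.70

def log_prediction(feedback: str, prediction: dict) -> None:
    """Append each request/response pair to a JSONL log and flag low-confidence outputs."""
    entry = {
        "timestamp": time.time(),
        "input": feedback,
        "output": prediction,
        "flagged": prediction.get("confidence", 0.0) < CONFIDENCE_THRESHOLD,
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

The flagged entries become the raw material for the iteration cycle below.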
Iteration Cycle
Every 2-4 weeks (or when quality drops below your threshold):
- Collect flagged low-confidence examples and incorrect predictions
- Correct the labels manually (or use the teacher model for ambiguous cases)
- Add 200-500 new examples to your training set
- Retrain (incremental training takes 15-20 minutes)
- Evaluate and redeploy
Each iteration typically improves accuracy by 1-3 percentage points. After 3-4 iterations, most models converge to their quality ceiling for the given model size.
Expected Results and Cost Summary
Quality: 85-95% agreement with the teacher model on your specific task. On many tasks (classification, extraction), the fine-tuned model exceeds 90%.
Total cost breakdown:
| Step | Cost |
|---|---|
| Seed example creation | $0 (your time: 1-2 hours) |
| Teacher model API calls (2,500 examples) | $2-7 |
| Ertas Studio training | $8-12 |
| GGUF export | Included |
| Total | $10-19 |
Compare this to the ongoing cost of running the teacher model in production:
- At 10,000 requests/month: $10-50/month in API costs
- At 100,000 requests/month: $100-500/month
- Break-even: 1-4 weeks
Time estimate:
- Step 1 (task definition): 30 minutes
- Step 2 (seed examples): 1-2 hours
- Step 3 (teacher generation): 30 minutes active, 15-30 minutes processing
- Step 4 (quality filtering): 30-45 minutes
- Step 5 (training): 10 minutes active, 30-45 minutes waiting
- Step 6 (evaluation): 20-30 minutes
- Step 7 (export and deploy): 15-20 minutes
- Total: 3-5 hours including wait times
You started the morning paying per token. By lunch, you own the model.
For the technical foundations behind this tutorial, read our Model Distillation with LoRA guide. Want to do this with fully open-source models and zero legal risk? See How to Distill Open-Source Models Legally. For the ethics and strategy behind distillation, read Model Distillation Is Not Theft — But Here's Why You Should Do It Yourself. New to Ertas? Start with Getting Started with Ertas.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.