
Distilling Claude/GPT into a 7B Model for Production: Step-by-Step
A step-by-step tutorial for distilling the capabilities of Claude or GPT-4o into a 7B parameter model for local production deployment — from dataset generation through fine-tuning to GGUF export.
You have a system running on Claude Sonnet or GPT-4o. It works. The quality is solid. But every request costs money, every request adds latency, and every request sends your data to someone else's servers. You want to own the model that powers your product.
This tutorial walks you through the complete distillation pipeline: from defining your task through generating synthetic data, fine-tuning a 7B model, and deploying it locally via Ollama. Every step includes concrete parameters, expected outputs, and troubleshooting guidance.
- Total time: 3-5 hours (including training wait time)
- Total cost: $10-20 for the entire pipeline
- Expected result: 85-95% of teacher model quality on your specific task
Prerequisites
Before you start, you need:
- An Ertas account (free tier is sufficient for this tutorial)
- API access to your teacher model — either an Anthropic API key (for Claude) or an OpenAI API key (for GPT-4o)
- A clearly defined task — classification, extraction, formatting, or domain Q&A
- 50 seed examples of your task, with correct input-output pairs
- A machine for local deployment — 8GB+ VRAM GPU or 16GB+ RAM for CPU inference
Let us walk through each step.
Step 1: Define Your Task Scope
Distillation works on narrow tasks. The narrower, the better. Your task definition should fit in one sentence:
Good task definitions:
- "Classify customer support emails into one of 8 categories and return a JSON object with category and confidence."
- "Extract vendor name, invoice number, date, and total amount from invoice PDFs (pre-processed to text)."
- "Convert natural language search queries into Elasticsearch DSL queries for our product catalogue schema."
- "Answer questions about our API documentation using the relevant docs section as context."
Bad task definitions (too broad):
- "Help customers with their questions."
- "Process documents."
- "Write responses to emails."
Your task definition should specify:
- Input format: What does the model receive? How long is a typical input? What is the range?
- Output format: What exactly should the model produce? JSON? A label? Structured text?
- Output space: How many possible outputs are there? (12 categories, 5 extracted fields, etc.)
- Quality metric: How will you measure if the output is correct? Accuracy? F1? Exact match?
Write this down. You will refer to it throughout the process.
For this tutorial, we will use a running example: classifying SaaS product feedback into 6 categories (bug report, feature request, usability issue, performance complaint, positive feedback, question) and extracting the affected product area.
Step 2: Create 50 Seed Examples Manually
This is the step that feels tedious but determines the quality of everything downstream. You need 50 hand-crafted input-output pairs that represent the gold standard for your task.
Why 50? Because these seed examples serve three purposes:
- Prompt calibration — they help you write the prompt for the teacher model
- Quality reference — they define what "correct" looks like
- Test set foundation — 20 of them will become your initial evaluation set
Guidelines for seed examples:
- Cover all categories/output types. If you have 6 categories, include at least 8 examples per category.
- Include edge cases. Add 5-10 examples that are ambiguous, multi-label, or unusual.
- Match production distribution. If 40% of your real inputs are bug reports, roughly 40% of seeds should be bug reports.
- Be precise in outputs. Your outputs should be exactly what you want the model to produce — same JSON schema, same formatting, same level of detail.
Format your seed examples as JSONL:
{"input": "The dashboard keeps crashing when I try to export reports. This has been happening since the last update.", "output": "{\"category\": \"bug_report\", \"confidence\": 0.95, \"product_area\": \"dashboard\", \"summary\": \"Dashboard crashes on report export since last update\"}"}
{"input": "It would be amazing if you could add a dark mode to the mobile app.", "output": "{\"category\": \"feature_request\", \"confidence\": 0.92, \"product_area\": \"mobile_app\", \"summary\": \"Request for dark mode on mobile app\"}"}
Spend 1-2 hours on this step. The quality of your seed examples directly determines the quality of your final model.
Step 3: Generate 2,000 Synthetic Examples Using the Teacher Model
Now you scale up. You will use Claude or GPT-4o as the teacher model to generate 2,000 labelled examples.
3a: Write the Teacher Prompt
Use your seed examples to craft a system prompt for the teacher model. The prompt should:
- Describe the task precisely
- Include 3-5 of your seed examples as few-shot demonstrations
- Specify the exact output format
- Include edge case guidance
You are a product feedback classifier. Given a piece of user feedback about a SaaS product, classify it and extract the relevant product area.
Output a JSON object with these fields:
- category: one of [bug_report, feature_request, usability_issue, performance_complaint, positive_feedback, question]
- confidence: float between 0.0 and 1.0
- product_area: the affected product area (e.g., dashboard, api, mobile_app, billing, onboarding, integrations, general)
- summary: one-sentence summary of the feedback
Examples:
[Include 3-5 seed examples here]
Classify the following feedback:
3b: Generate Diverse Inputs
You need diverse inputs to feed to the teacher. There are several sources:
Production data. If you have historical feedback, this is ideal. Real data has the right distribution, vocabulary, and edge cases.
Synthetic input generation. Use the teacher model itself to generate realistic inputs. Prompt it with: "Generate 50 diverse examples of SaaS product feedback that cover all six categories. Make them realistic, varying in length (1-5 sentences), tone (frustrated, neutral, excited), and specificity."
Template-based generation. Create templates with slots and fill them programmatically. "The {feature} is {problem_description} when I try to {action}."
Aim for 2,500 raw inputs. You will lose some during quality filtering.
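To illustrate the template-based approach, here is a minimal sketch for the running SaaS-feedback example; the templates and slot values are invented and should be replaced with vocabulary from your own domain.

```python
import random

# Hypothetical slot values for the running SaaS-feedback example.
features = ["dashboard", "report export", "mobile app", "billing page", "search"]
problems = ["painfully slow", "throwing errors", "missing data", "completely unresponsive"]
actions = ["filter by date", "export to CSV", "invite a teammate", "switch workspaces"]

templates = [
    "The {feature} is {problem} when I try to {action}.",
    "Ever since the last update, the {feature} has been {problem}.",
    "Could you make it easier to {action}? The {feature} feels clunky.",
]

def generate_inputs(n: int) -> list[str]:
    """Fill templates with random slot values to produce raw inputs for the teacher."""
    return [
        random.choice(templates).format(
            feature=random.choice(features),
            problem=random.choice(problems),
            action=random.choice(actions),
        )
        for _ in range(n)
    ]

print(generate_inputs(3))
```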
3c: Run the Teacher Model
Process all 2,500 inputs through the teacher model. This is the most expensive step in the pipeline, but it is a one-time cost.
Cost estimate for 2,500 examples:
- Claude Sonnet: ~$3-5 (at ~200 input tokens + 100 output tokens per example)
- GPT-4o: ~$4-7
Use batch processing to reduce costs. Both Anthropic and OpenAI offer batch APIs with 50% discounts. With batching:
- Claude Sonnet batch: ~$1.50-2.50
- GPT-4o batch: ~$2-3.50
Save all input-output pairs as JSONL. This is your raw training dataset.
3d: Rate Limit and Error Handling
When running 2,500 API calls, you will hit rate limits. Implement:
- Exponential backoff with jitter (start at 1s, max 60s)
- Retry logic for 429 and 500 errors (3 retries max)
- Checkpointing — save progress every 100 examples so you can resume if the script crashes
- Parallel requests — 5-10 concurrent requests is usually safe for standard API tiers
A well-implemented script processes 2,500 examples in 15-30 minutes.
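Here is a minimal sketch of such a script using the Anthropic Python SDK. The input/output file names, the prompt file, and the model version string are assumptions; adapt them to your setup.

```python
import json
import os
import random
import time

import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
SYSTEM_PROMPT = open("teacher_prompt.txt").read()  # the prompt from Step 3a (assumed filename)
INPUTS_PATH, OUTPUTS_PATH = "raw_inputs.jsonl", "teacher_outputs.jsonl"  # assumed filenames

def classify(text: str, max_retries: int = 3) -> str:
    """Call the teacher model, retrying 429/5xx errors with exponential backoff and jitter."""
    delay = 1.0
    for attempt in range(max_retries + 1):
        try:
            resp = client.messages.create(
                model="claude-3-5-sonnet-20241022",  # substitute your preferred Sonnet version
                max_tokens=256,
                system=SYSTEM_PROMPT,
                messages=[{"role": "user", "content": text}],
            )
            return resp.content[0].text
        except (anthropic.RateLimitError, anthropic.InternalServerError):
            if attempt == max_retries:
                raise
            time.sleep(delay + random.uniform(0, delay))  # backoff with jitter
            delay = min(delay * 2, 60.0)                   # cap at 60s

inputs = [json.loads(line)["input"] for line in open(INPUTS_PATH)]
done = sum(1 for _ in open(OUTPUTS_PATH)) if os.path.exists(OUTPUTS_PATH) else 0  # resume point
with open(OUTPUTS_PATH, "a") as out:
    for i, text in enumerate(inputs[done:], start=done):
        out.write(json.dumps({"input": text, "output": classify(text)}) + "\n")
        if (i + 1) % 100 == 0:  # checkpoint: flush and report every 100 examples
            out.flush()
            print(f"{i + 1}/{len(inputs)} complete")
```

This version runs sequentially; for 5-10 concurrent requests, wrap the classify calls in a concurrent.futures.ThreadPoolExecutor and keep the checkpointing on the main thread.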
Step 4: Quality Filter the Dataset
Raw teacher outputs are not all usable. Expect to discard ~25% of examples during quality filtering. This is normal and important.
Automated Filters
Run these checks programmatically:
- Schema validation. Parse every output as JSON. Discard any example where the output is not valid JSON or does not match your expected schema. (Expect 2-5% failure rate.)
- Category validation. Check that the category field contains one of the valid values. The teacher model occasionally hallucinates a category that does not exist. (Expect 1-3% failure rate.)
- Confidence thresholding. Discard examples where the teacher's confidence is below 0.70. Low-confidence outputs are often incorrect or ambiguous, and training on them confuses the student. (Expect 5-10% removal.)
- Deduplication. Remove near-duplicate inputs. Use cosine similarity on embeddings with a threshold of 0.95. Training on near-duplicates wastes compute and biases the model. (Expect 5-10% removal.)
- Length outliers. Remove examples where the input or output is abnormally long or short (beyond 2 standard deviations from the mean). These are often malformed. (Expect 2-3% removal.) A code sketch of all five filters follows this list.
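A minimal sketch of the five automated filters, assuming the raw dataset from Step 3 lives in teacher_outputs.jsonl and using the sentence-transformers library for the embedding-based deduplication.

```python
import json

import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

VALID_CATEGORIES = {"bug_report", "feature_request", "usability_issue",
                    "performance_complaint", "positive_feedback", "question"}
REQUIRED_FIELDS = {"category", "confidence", "product_area", "summary"}

records = [json.loads(line) for line in open("teacher_outputs.jsonl")]  # assumed filename

# Filters 1-3: schema validation, category validation, confidence thresholding
clean = []
for r in records:
    try:
        out = json.loads(r["output"])
    except json.JSONDecodeError:
        continue
    if not REQUIRED_FIELDS <= out.keys():
        continue
    if out["category"] not in VALID_CATEGORIES:
        continue
    if out["confidence"] < 0.70:
        continue
    clean.append(r)

# Filter 4: near-duplicate removal via cosine similarity on embeddings (threshold 0.95)
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([r["input"] for r in clean], normalize_embeddings=True)
kept_indices = []
for i in range(len(clean)):
    if all(float(np.dot(emb[i], emb[j])) < 0.95 for j in kept_indices):
        kept_indices.append(i)
clean = [clean[i] for i in kept_indices]

# Filter 5: length outliers beyond 2 standard deviations from the mean input length
lengths = np.array([len(r["input"]) for r in clean])
mask = np.abs(lengths - lengths.mean()) <= 2 * lengths.std()
clean = [r for r, ok in zip(clean, mask) if ok]

with open("filtered.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in clean)
print(f"kept {len(clean)} of {len(records)} examples")
```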
Manual Spot-Check
After automated filtering, randomly sample 100 examples and review them manually. Look for:
- Teacher errors (wrong category, incorrect extraction)
- Inconsistent formatting (sometimes includes markdown, sometimes does not)
- Ambiguous examples where the "correct" answer is debatable
If more than 10% of your spot-check sample has issues, your teacher prompt needs improvement. Go back to Step 3a and refine.
Expected outcome: Start with 2,500 examples, end with ~1,800-2,000 clean examples. This is plenty for fine-tuning.
Step 5: Fine-Tune in Ertas Studio
Upload your filtered JSONL dataset to Ertas Studio. Here is the configuration for a standard distillation run.
Model Selection
For this tutorial, choose Llama 3.1 8B or Qwen 2.5 7B. Both are excellent distillation targets.
- Llama 3.1 8B — larger community, more tutorials if you hit issues
- Qwen 2.5 7B — slightly better on benchmarks, Apache 2.0 license
LoRA Configuration
| Parameter | Value | Notes |
|---|---|---|
| LoRA rank | 16 | Sufficient for classification/extraction. Use 32 for more complex tasks. |
| LoRA alpha | 32 | Standard: 2x the rank |
| Target modules | q_proj, k_proj, v_proj, o_proj | Attention layers only. Add gate_proj, up_proj, down_proj if quality is insufficient. |
| Dropout | 0.05 | Light regularisation |
Training Parameters
| Parameter | Value | Notes |
|---|---|---|
| Learning rate | 2e-4 | Standard for LoRA distillation |
| LR scheduler | Cosine | With warmup ratio 0.03 |
| Batch size | 4 | Increase to 8 if you have sufficient VRAM |
| Gradient accumulation | 4 | Effective batch size = 16 |
| Epochs | 3 | Start here. Add epochs only if eval loss is still decreasing. |
| Max sequence length | 512 | Increase if your inputs/outputs are longer |
| Weight decay | 0.01 | Standard regularisation |
| Warmup ratio | 0.03 | ~3% of training steps |
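If you later want to reproduce the run on your own GPU rather than in Ertas Studio, the same settings map onto the Hugging Face peft and transformers libraries roughly as follows; the output directory is an assumption, and wiring these objects into a supervised fine-tuning trainer (such as TRL's SFTTrainer) with your filtered dataset is left to your stack.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA configuration mirroring the table above
lora_config = LoraConfig(
    r=16,                       # LoRA rank; raise to 32 for more complex tasks
    lora_alpha=32,              # standard: 2x the rank
    lora_dropout=0.05,          # light regularisation
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention layers only
    task_type="CAUSAL_LM",
)

# Training parameters mirroring the table above
training_args = TrainingArguments(
    output_dir="feedback-classifier-lora",  # assumed output directory
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,          # effective batch size = 16
    num_train_epochs=3,
    logging_steps=10,
)
```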
Training Time and Cost
With 2,000 examples and the configuration above:
- On Ertas Studio: 30-45 minutes, cost ~$8-12
- On your own A100 (80GB): 20-30 minutes
- On your own RTX 4090 (24GB): 40-60 minutes (with QLoRA quantisation)
Ertas Studio shows real-time training metrics: loss curve, learning rate schedule, and evaluation metrics at each epoch boundary.
What to Watch During Training
- Training loss should decrease steadily through epochs 1-2 and begin to plateau in epoch 3
- Evaluation loss should track training loss closely. If eval loss starts increasing while training loss decreases, you are overfitting. Stop training.
- If loss spikes in the first 100 steps, your learning rate is too high. Try 1e-4.
- If loss barely decreases, your learning rate is too low. Try 5e-4.
Step 6: Evaluate Against the Teacher Model
After training completes, evaluate the student model on your held-out test set (the 20 seed examples you reserved, plus 10% of your training data held out by Ertas automatically).
Metrics to Check
Classification accuracy: What percentage of examples does the student classify correctly?
- Target: 90%+ agreement with the teacher
- Acceptable: 85%+
- Needs work: Below 85%
Exact match on structured output: What percentage of outputs are exactly correct (all fields match)?
- Target: 85%+
- Acceptable: 75%+
Per-category breakdown: Check accuracy for each category separately. If one category is significantly worse than others, you likely need more training examples for that category.
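A minimal evaluation sketch, assuming you have written teacher (or gold) and student outputs side by side into an eval_results.jsonl file with the hypothetical fields shown in the comment below; it reports classification accuracy, exact match, and the per-category breakdown.

```python
import json
from collections import defaultdict

# Each line: {"input": ..., "expected": "<teacher JSON string>", "predicted": "<student JSON string>"}
rows = [json.loads(line) for line in open("eval_results.jsonl")]  # assumed filename/format

total, category_correct, exact_match = 0, 0, 0
per_category = defaultdict(lambda: [0, 0])  # category -> [correct, total]

for row in rows:
    expected = json.loads(row["expected"])
    try:
        predicted = json.loads(row["predicted"])
    except json.JSONDecodeError:
        predicted = {}  # unparseable student output counts as wrong
    total += 1
    cat = expected["category"]
    per_category[cat][1] += 1
    if predicted.get("category") == cat:
        category_correct += 1
        per_category[cat][0] += 1
    if predicted == expected:  # all fields must match for exact match
        exact_match += 1

print(f"classification accuracy: {category_correct / total:.1%}")
print(f"exact match:             {exact_match / total:.1%}")
for cat, (correct, n) in sorted(per_category.items()):
    print(f"  {cat:<22} {correct}/{n} ({correct / n:.1%})")
```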
Side-by-Side Comparison
Ertas Studio provides a side-by-side comparison view where you can see teacher and student outputs for every test example. Review the disagreements manually. Often, you will find that:
- 40-50% of disagreements are cases where the student is actually correct (the teacher made an error)
- 30-40% are cases where both answers are reasonable (ambiguous inputs)
- 10-20% are genuine student errors that could be fixed with more training data
If Quality Is Below Target
If your evaluation scores are below your target:
- Check data quality first. Review 50 random training examples. If more than 5% have incorrect labels, clean your data and retrain.
- Add more data for weak categories. If one category underperforms, generate 200-300 additional examples specifically for that category.
- Increase LoRA rank. Move from rank 16 to rank 32. This gives the model more capacity to learn complex patterns.
- Try a larger model. If Qwen 2.5 7B or Llama 3.1 8B is not sufficient, move up to Qwen 2.5 14B.
- Add more target modules. Include the MLP layers (gate_proj, up_proj, down_proj) in addition to attention layers.
Step 7: Export to GGUF and Deploy via Ollama
Once you are satisfied with evaluation results, export the model for local deployment.
GGUF Export
In Ertas Studio, select your trained model and click Export. Choose your quantisation level:
| Quantisation | Size (7B model) | Quality Retention | Use Case |
|---|---|---|---|
| Q8_0 | ~7.5 GB | 99%+ | Maximum quality, server deployment |
| Q5_K_M | ~5.0 GB | 97-98% | Good balance, desktop deployment |
| Q4_K_M | ~4.0 GB | 95-97% | Good balance, constrained hardware |
| Q3_K_M | ~3.0 GB | 90-95% | Minimum viable, edge deployment |
Recommendation: Start with Q5_K_M. It offers an excellent quality-size balance. Only drop to Q4 if you have hardware constraints. Only use Q8 if you need the last 1-2% of quality.
Export takes 2-5 minutes. You will get a single .gguf file.
Deploy with Ollama
Create a Modelfile:
FROM ./your-model-Q5_K_M.gguf
PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER num_ctx 512
SYSTEM """You are a product feedback classifier. Given user feedback, classify it and extract the relevant product area. Output a JSON object with: category, confidence, product_area, summary."""
Register and run:
ollama create feedback-classifier -f Modelfile
ollama run feedback-classifier "The search function is incredibly slow, takes 10+ seconds to return results"
Expected output:
{"category": "performance_complaint", "confidence": 0.94, "product_area": "search", "summary": "Search function has 10+ second response times"}
Integration
For production, use Ollama's HTTP API:
curl http://localhost:11434/api/generate -d '{
"model": "feedback-classifier",
"prompt": "The search function is incredibly slow, takes 10+ seconds to return results",
"stream": false
}'
Ollama serves requests at 30-120 tokens/second depending on your hardware. For a typical 50-token classification output, expect roughly 0.4-1.7 seconds of generation time, plus a small amount of prompt-processing overhead.
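In application code, the same call looks like this with Python's requests library; the format option asks Ollama to constrain the output to valid JSON so the result can be parsed directly.

```python
import json

import requests

def classify_feedback(text: str) -> dict:
    """Send one piece of feedback to the locally running model via Ollama's HTTP API."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "feedback-classifier",
            "prompt": text,
            "stream": False,
            "format": "json",  # constrain decoding to valid JSON
        },
        timeout=30,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["response"])  # the generated text is in the "response" field

print(classify_feedback("The search function is incredibly slow, takes 10+ seconds to return results"))
```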
Step 8: Monitor Production Performance and Iterate
Deployment is not the end. Production data will reveal edge cases your training data did not cover.
Monitoring Checklist
- Log all inputs and outputs. Every production request is a potential training example for the next iteration. (A minimal logging sketch follows this list.)
- Track confidence scores. If the model outputs confidence below 0.70, flag the request for review. A rising percentage of low-confidence requests indicates distribution drift.
- Sample-based quality audits. Review 50 random production outputs weekly. Calculate your real-world accuracy and compare it to your evaluation set accuracy.
- Category distribution monitoring. Track the distribution of predicted categories over time. If the distribution shifts significantly, investigate whether the input distribution has changed or the model is drifting.
- Latency monitoring. Track p50, p95, and p99 latency. If latency increases, check for resource contention on the deployment server.
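As a minimal sketch of the first two checklist items, here is one way to log every request and flag low-confidence outputs for review; the log path and threshold are assumptions.

```python
import json
import time

LOG_PATH = "production_log.jsonl"  # assumed log location
CONFIDENCE_THRESHOLD = 0.70

def log_prediction(feedback: str, prediction: dict) -> None:
    """Append each request/response pair to a JSONL log and flag low-confidence outputs."""
    entry = {
        "timestamp": time.time(),
        "input": feedback,
        "output": prediction,
        "flagged": prediction.get("confidence", 0.0) < CONFIDENCE_THRESHOLD,
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

The flagged entries become the raw material for the iteration cycle below.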
Iteration Cycle
Every 2-4 weeks (or when quality drops below your threshold):
- Collect flagged low-confidence examples and incorrect predictions
- Correct the labels manually (or use the teacher model for ambiguous cases)
- Add 200-500 new examples to your training set
- Retrain (incremental training takes 15-20 minutes)
- Evaluate and redeploy
Each iteration typically improves accuracy by 1-3 percentage points. After 3-4 iterations, most models converge to their quality ceiling for the given model size.
Expected Results and Cost Summary
Quality: 85-95% agreement with the teacher model on your specific task. On many tasks (classification, extraction), the fine-tuned model exceeds 90%.
Total cost breakdown:
| Step | Cost |
|---|---|
| Seed example creation | $0 (your time: 1-2 hours) |
| Teacher model API calls (2,500 examples) | $2-7 |
| Ertas Studio training | $8-12 |
| GGUF export | Included |
| Total | $10-19 |
Compare this to the ongoing cost of running the teacher model in production:
- At 10,000 requests/month: $10-50/month in API costs
- At 100,000 requests/month: $100-500/month
- Break-even: 1-4 weeks
Time estimate:
- Step 1 (task definition): 30 minutes
- Step 2 (seed examples): 1-2 hours
- Step 3 (teacher generation): 30 minutes active, 15-30 minutes processing
- Step 4 (quality filtering): 30-45 minutes
- Step 5 (training): 10 minutes active, 30-45 minutes waiting
- Step 6 (evaluation): 20-30 minutes
- Step 7 (export and deploy): 15-20 minutes
- Total: 3-5 hours including wait times
You started the morning paying per token. By lunch, you own the model.
For the technical foundations behind this tutorial, read our Model Distillation with LoRA guide. Want to do this with fully open-source models and zero legal risk? See How to Distill Open-Source Models Legally. For the ethics and strategy behind distillation, read Model Distillation Is Not Theft — But Here's Why You Should Do It Yourself. New to Ertas? Start with Getting Started with Ertas.
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.