    Model Distillation with LoRA: Training Smaller Models from Frontier Outputs
    ml-engineering · distillation · lora · qlora · synthetic-data


    A technical guide to distilling GPT-4 and Claude outputs into compact, deployable models using LoRA fine-tuning — the practical path from API dependency to model ownership.

    Ertas Team · Updated

    You have a production system running on GPT-4o or Claude. It works well. The quality is excellent. But the costs are climbing, latency is inconsistent, and you are entirely dependent on a third-party API that could change pricing, rate limits, or behaviour at any time.

    Model distillation is the practical engineering path from API dependency to model ownership. You train a smaller, faster model to replicate the larger model's behaviour on your specific tasks — and then you deploy it on your own infrastructure.

    What Model Distillation Actually Is

    Distillation is conceptually simple: a large "teacher" model generates outputs for a set of inputs. A smaller "student" model is then trained to produce the same outputs for the same inputs. The student learns to mimic the teacher's behaviour on the specific distribution of tasks you care about.

    The critical insight is this: you do not need hand-labelled data. The frontier API is the labeller. Every API call you are already making is a potential training example. The teacher model has already done the expensive cognitive work of understanding the task — the student just needs to learn the input-output mapping.

    This is fundamentally different from training a model from scratch. You are not teaching the student to "understand language." The base model already understands language. You are teaching it to perform your specific task the way GPT-4o performs it.

    The Modern Distillation Workflow

    The workflow has three stages: data generation, curation, and fine-tuning.

    Stage 1: Generate Synthetic Training Data

    Start by systematically generating teacher outputs. There are two approaches:

    Log-based collection. If your system is already in production, you have API call logs. Every input–output pair is a training example. This is the highest-quality data source because it reflects your actual production distribution.

    Synthetic generation. If you need more data or want to cover edge cases, generate additional examples programmatically. Create diverse inputs that span your task space and run them through the teacher model. For a transaction categoriser, this might mean generating thousands of varied transaction descriptions and getting GPT-4o to categorise each one.

    The combination of both approaches is ideal. Production logs give you distributional accuracy; synthetic generation gives you coverage of the long tail.
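A minimal sketch of the synthetic-generation step, assuming the teacher is exposed as a simple callable (in practice a thin wrapper around your GPT-4o or Claude API client). The merchant strings and the `make_inputs`/`generate_examples` helpers are illustrative, not from the original post:

```python
import json
import random

# Hypothetical input generator for a transaction categoriser: combine
# merchant names and descriptors to cover more of the task space.
MERCHANTS = ["TESCO STORES", "UBER *TRIP", "NETFLIX.COM", "SHELL PETROL"]
SUFFIXES = ["LONDON GB", "REF 8841", "RECURRING", ""]

def make_inputs(n, seed=0):
    rng = random.Random(seed)
    return [f"{rng.choice(MERCHANTS)} {rng.choice(SUFFIXES)}".strip() for _ in range(n)]

def generate_examples(inputs, teacher):
    """Run each input through the teacher and keep (input, output) pairs.

    `teacher` is any callable mapping a prompt string to the model's output,
    e.g. a wrapper around your frontier-API client.
    """
    return [{"input": x, "output": teacher(x)} for x in inputs]

# With a real API you would pass the client wrapper; a stub shows the shape.
examples = generate_examples(make_inputs(5), teacher=lambda x: "Groceries")
print(json.dumps(examples[0]))
```

The same loop works for log-based collection: replace `make_inputs` with a reader over your production API logs and skip the teacher call, since the outputs are already recorded.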

    Stage 2: Curate Aggressively

    Here is the insight that separates successful distillation from failed attempts: data quality matters exponentially more than data quantity.

    5,000 carefully curated examples will produce a better student model than 50,000 noisy ones. Curation means:

    • Remove teacher failures. The frontier model is not perfect. Filter out examples where the output is clearly wrong, incomplete, or inconsistent.
    • Deduplicate. Near-duplicate examples waste training compute and bias the model toward common cases.
    • Balance the distribution. If 80% of your examples are one category, the student will over-index on that category. Undersample the majority class or oversample the minority.
    • Verify format consistency. If you expect JSON output, ensure every training example produces valid JSON. If you expect a specific schema, validate against it.

    Spending an extra day on curation is worth more than an extra week of training on unfiltered data.
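The four curation steps above can be sketched as one pass over the dataset. This assumes classification-style outputs with a known label set (`ALLOWED` and the thresholds are placeholders for your own schema):

```python
import random
from collections import defaultdict

ALLOWED = {"Groceries", "Transport", "Entertainment"}  # your task's schema

def curate(examples, max_per_class=1000, seed=0):
    """Filter teacher failures, deduplicate, and rebalance the class mix."""
    seen, by_class = set(), defaultdict(list)
    for ex in examples:
        # 1. Drop teacher failures: outputs outside the expected schema.
        if ex["output"] not in ALLOWED:
            continue
        # 2. Deduplicate on a normalised form of the input.
        key = " ".join(ex["input"].lower().split())
        if key in seen:
            continue
        seen.add(key)
        by_class[ex["output"]].append(ex)
    # 3. Undersample majority classes so no label dominates training.
    rng = random.Random(seed)
    curated = []
    for exs in by_class.values():
        rng.shuffle(exs)
        curated.extend(exs[:max_per_class])
    return curated
```

For free-form or JSON outputs, step 1 becomes a parse-and-validate check (e.g. `json.loads` plus schema validation) rather than a set lookup.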

    Stage 3: Fine-Tune with LoRA

    With a curated dataset in hand, you train the student model using LoRA (Low-Rank Adaptation). LoRA is the preferred fine-tuning method for distillation because:

    • Parameter efficiency. LoRA only trains 0.1–1% of the model's parameters. A 7B model has ~7 billion parameters; a LoRA adapter might train 10–50 million. This means faster training, lower GPU memory requirements, and smaller artefacts.
    • Rapid iteration. Training a LoRA adapter takes 30–90 minutes on a single GPU for typical dataset sizes. You can run multiple experiments per day, testing different hyperparameters, data subsets, or base models.
    • Composability. LoRA adapters are modular. You can train separate adapters for different tasks and swap them at inference time. A single base model can serve multiple distilled capabilities.
    • Small artefacts. A LoRA adapter is 50–200MB. A full fine-tuned 7B model is 14GB. For version control, sharing, and deployment, the size difference matters.

    For distillation specifically, QLoRA (quantised LoRA) is worth considering. It applies LoRA on top of a 4-bit quantised base model, reducing GPU memory requirements by roughly 4x with minimal quality loss. This means you can fine-tune a 13B model on a single 24GB GPU.
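A minimal QLoRA setup sketch using Hugging Face `transformers`, `peft`, and `bitsandbytes`. The model name and hyperparameters are illustrative; check the library versions you have installed, since these APIs evolve:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantised base model (QLoRA): NF4 weights, bf16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapter on the attention projections; rank and alpha are illustrative.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

From here, any standard supervised fine-tuning loop (e.g. a `Trainer` over your curated input-output pairs) trains only the adapter weights.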

    Practical Example: Transaction Categorisation

    To make this concrete, consider a real distillation project: replacing GPT-4o for automatic transaction categorisation in a fintech application.

    Teacher setup. GPT-4o categorises bank transactions into 47 categories with 96% accuracy on a held-out test set. Latency averages 800ms per request. Cost at production volume: ~$3,200/month.

    Data collection. 12,000 production API call logs collected over 3 weeks, plus 3,000 synthetic examples covering rare categories. After curation: 8,500 high-quality examples.

    Student training. Qwen 2.5 7B as the base model. LoRA rank 32, alpha 64, learning rate 2e-4, 3 epochs. Training time: 48 minutes on a single A10G.

    Results. The distilled student achieves 93% agreement with GPT-4o on the held-out test set. On actual production inputs, the agreement is 94.2%. Latency: 50ms per request (16x faster). Infrastructure cost: $150/month for a GPU VPS (95% cost reduction).

    A roughly three-point agreement gap vs the teacher is acceptable for this use case. On some subcategories the student actually outperforms the teacher, because curation removed the teacher's inconsistencies from the training data.

    Common Pitfalls

    Distribution mismatch. If your synthetic training data does not match your production input distribution, the student will perform well on benchmarks and poorly in production. Always include real production data in your training set.

    Overfitting on teacher quirks. Frontier models have idiosyncratic behaviours — formatting preferences, hedging language, occasional hallucinations. If these quirks are in your training data, the student will faithfully reproduce them. Curate these out.

    Not evaluating on real-world inputs. Do not just measure agreement with the teacher on a test set. Measure task-specific metrics (accuracy, F1, user satisfaction) on actual production traffic. The student might disagree with the teacher but still produce correct outputs.
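The distinction between teacher agreement and task accuracy is worth making explicit. Both metrics are a few lines each (the labels below are hypothetical):

```python
def agreement(student_preds, teacher_preds):
    """Fraction of inputs where student and teacher produce the same label."""
    same = sum(s == t for s, t in zip(student_preds, teacher_preds))
    return same / len(student_preds)

def accuracy(preds, gold):
    """Task accuracy against ground-truth labels, where you have them."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

# The student can disagree with the teacher yet still be right:
teacher = ["Groceries", "Transport", "Dining"]
student = ["Groceries", "Transport", "Groceries"]
gold    = ["Groceries", "Transport", "Groceries"]
print(agreement(student, teacher))  # ~0.667: one disagreement
print(accuracy(student, gold))      # 1.0: the disagreement was a teacher error
```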

    Training too long. LoRA fine-tuning converges quickly. Most distillation runs peak in quality within 2–4 epochs. Training beyond that risks overfitting, especially on smaller datasets.

    How Ertas Streamlines the Pipeline

    Ertas is built for exactly this workflow.

    Vault handles dataset management — upload production logs or synthetic data, version your datasets, and track provenance. Built-in deduplication and format validation catch common data quality issues before they reach training.

    Studio provides the LoRA training pipeline with sensible defaults for distillation. Select your base model, upload your dataset, configure rank and learning rate, and launch. Experiment tracking lets you compare adapter versions side-by-side on your evaluation metrics.

    GGUF export produces a deployment-ready model file. Merge your best adapter with the base model, quantise to your target precision, and download a single file ready for Ollama or any GGUF-compatible runtime.

    The full cycle — from dataset upload to deployed model — takes hours, not weeks.

    Move from API Dependency to Model Ownership

    Distillation is not about replacing frontier models entirely. It is about owning the models that run your production workloads, with predictable costs, controlled latency, and no dependency on third-party API decisions.

    Ertas early-access pricing is locked at $14.50/month for the complete pipeline: data management, LoRA training, experiment tracking, and GGUF export.

    Join the waitlist and start distilling.

