
How to Fine-Tune an LLM: The Complete 2026 Guide
Learn how to fine-tune large language models step by step — from preparing training data and choosing a base model to configuring LoRA, evaluating results, and deploying locally.
To fine-tune an LLM, you prepare a JSONL dataset of instruction-response pairs, select a base model (typically 7B–8B parameters), apply LoRA or QLoRA adapters, train for 1–5 epochs, and export the result as a GGUF file for deployment. The entire process takes 30 minutes to a few hours depending on dataset size and hardware. According to Hugging Face, parameter-efficient fine-tuning methods like LoRA can reduce trainable parameters by over 99% while achieving results within 2–5% of full fine-tuning. Research from the Stanford HAI AI Index Report shows that fine-tuned smaller models consistently outperform larger prompted models on domain-specific tasks, making fine-tuning one of the most cost-effective ways to build production AI.
This guide walks through the entire process: when fine-tuning makes sense, how to prepare your data, which base model to pick, how to configure training, and how to deploy the result.
When Should You Fine-Tune?
Fine-tuning isn't always the right answer. Before you invest time preparing data and running training jobs, consider whether your problem actually requires it.
Fine-tuning makes sense when:
- Prompt engineering hits a ceiling. You've tried few-shot examples, chain-of-thought prompting, and system instructions, but the model still doesn't produce consistent results for your domain.
- You need a specific output format. Your application requires structured JSON, a particular writing style, or domain-specific terminology that base models struggle to produce reliably.
- Latency and cost matter at scale. A fine-tuned 7B model can outperform a prompted 70B model on narrow tasks — at a fraction of the inference cost.
- Privacy requirements prohibit cloud APIs. Fine-tuned models can run entirely on your infrastructure, keeping sensitive data off third-party servers.
Consider alternatives when:
- Your task is broad and changes frequently — prompt engineering or RAG may be more flexible.
- You have fewer than 100 quality training examples — fine-tuning needs enough data to learn patterns without overfitting.
- You need the model to access external knowledge that changes often — retrieval-augmented generation handles this better.
For a deeper comparison, see our guide on fine-tuning vs RAG.
Step 1: Prepare Your Training Data
Data quality is the single biggest factor in fine-tuning success. A model trained on 500 excellent examples will usually outperform one trained on 10,000 mediocre ones.
Format: JSONL
The standard format for fine-tuning data is JSONL (JSON Lines) — one JSON object per line. Each line typically contains an instruction and the desired response:
{"instruction": "Classify this support ticket as billing, technical, or general.", "input": "I can't log in to my account after resetting my password.", "output": "technical"}
{"instruction": "Classify this support ticket as billing, technical, or general.", "input": "When will I be charged for the annual plan?", "output": "billing"}
For conversational models, use a messages format:
{"messages": [{"role": "system", "content": "You are a medical assistant."}, {"role": "user", "content": "What are common side effects of metformin?"}, {"role": "assistant", "content": "Common side effects include nausea, diarrhea, and stomach pain..."}]}
Data Quality Checklist
- Consistent formatting — every example should follow the same structure
- Diverse examples — cover edge cases, not just the happy path
- Accurate labels — garbage in, garbage out. Have domain experts review your data.
- Balanced distribution — if you're training a classifier, roughly equal examples per class prevent the model from defaulting to the majority label
- No data leakage — keep a validation set separate from training data to measure real performance
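Most of this checklist can be automated. Here is a minimal validation sketch in Python, assuming the instruction/input/output format shown above and a hypothetical train.jsonl file; it flags malformed lines, missing fields, and label imbalance before you spend any GPU time:

```python
import json
from collections import Counter

def validate_jsonl(path, required_keys=("instruction", "output")):
    """Flag malformed lines, missing keys, and label imbalance in a JSONL dataset."""
    errors = []
    label_counts = Counter()
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"line {i}: invalid JSON ({e})")
                continue
            missing = [k for k in required_keys if k not in record]
            if missing:
                errors.append(f"line {i}: missing keys {missing}")
                continue
            # For classification data, tally output labels to spot class imbalance
            label_counts[record["output"]] += 1
    return errors, label_counts

errors, labels = validate_jsonl("train.jsonl")  # hypothetical file name
print(f"{len(errors)} problem line(s) found")
for err in errors[:10]:
    print(" ", err)
print("label distribution:", dict(labels))
```

A check like this is worth running before every training job: a single malformed line can silently skew results or abort a run partway through.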
How Much Data Do You Need?
There's no universal answer, but here are practical starting points:
| Task Type | Minimum Examples | Sweet Spot |
|---|---|---|
| Classification | 100–200 per class | 500–1,000 per class |
| Summarization | 500 | 2,000–5,000 |
| Conversational | 1,000 | 5,000–10,000 |
| Code generation | 500 | 3,000–8,000 |
| Domain Q&A | 300 | 1,000–3,000 |
More data helps, but returns diminish. Focus on quality first, then scale up.
Step 2: Choose a Base Model
Your base model determines your starting point. The right choice depends on your task, hardware constraints, and licensing requirements.
Popular Base Models in 2026
| Model | Sizes | Strengths | License |
|---|---|---|---|
| Llama 3 | 8B, 70B | General purpose, strong reasoning, large community | Meta Community |
| Mistral | 7B, 8x7B | Fast inference, good at code and instruction following | Apache 2.0 |
| Qwen 2.5 | 7B, 14B, 72B | Multilingual, strong on benchmarks | Apache 2.0 |
| Gemma 2 | 2B, 9B, 27B | Efficient, good for resource-constrained deployment | Gemma Terms of Use |
| DeepSeek | 7B, 67B | Strong at code and math | DeepSeek License |
| Phi-3 | 3.8B, 14B | Small but capable, good for edge deployment | MIT |
Selection Criteria
- Task fit — models pre-trained on code (DeepSeek, CodeLlama) fine-tune better for code tasks
- Size vs. hardware — a 7B model fine-tunes on a single GPU; 70B needs multi-GPU setups
- License — check if commercial use is permitted for your deployment scenario
- Community support — more popular models have more fine-tuning guides, adapters, and quantized versions available
For most tasks, start with a 7B–8B model. It's large enough to be capable but small enough to fine-tune quickly and deploy on modest hardware.
Step 3: Configure Training
Full Fine-Tuning vs. LoRA
Full fine-tuning updates every weight in the model. It produces the best results but requires significant GPU memory — often multiple high-end GPUs for models above 7B parameters.
LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices that modify the model's behavior. It uses a fraction of the memory and trains much faster, with results that are often within 5% of full fine-tuning.
QLoRA goes further by quantizing the base model to 4-bit precision before applying LoRA adapters, allowing you to fine-tune a 7B model on a single consumer GPU with 8GB VRAM.
For most teams, LoRA or QLoRA is the right choice. Full fine-tuning is reserved for cases where you have significant compute resources and need maximum performance.
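To make this concrete, here is a minimal QLoRA setup sketch using the Hugging Face transformers, peft, and bitsandbytes libraries. The model ID and adapter settings below are illustrative assumptions, not the only valid choices:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mistral-7B-v0.1"  # illustrative 7B base model

# QLoRA: load the frozen base model in 4-bit precision
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA: attach small trainable adapter matrices to the attention projections
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,                 # adapter rank
    lora_alpha=32,        # scaling factor, 2x the rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total weights
```

The final line is a quick sanity check that the base weights really are frozen and only the adapters will be updated.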
Key Hyperparameters
| Parameter | Typical Range | What It Does |
|---|---|---|
| Learning rate | 1e-5 to 5e-4 | How aggressively the model updates its weights. Too high = instability; too low = slow learning. |
| Epochs | 1–5 | How many times the model sees the full dataset. More epochs risk overfitting. |
| Batch size | 4–32 | Examples processed simultaneously. Larger = smoother gradients but more memory. |
| LoRA rank | 8–64 | Adapter capacity. Higher rank = more expressive but more parameters. |
| LoRA alpha | 16–128 | Scaling factor for LoRA updates. Usually set to 2× the rank. |
Practical Starting Configuration
For a 7B model with LoRA on a single GPU:
```
Learning rate: 2e-4
Epochs: 3
Batch size: 8 (with gradient accumulation)
LoRA rank: 16
LoRA alpha: 32
LoRA target modules: q_proj, v_proj, k_proj, o_proj
Warmup steps: 100
Weight decay: 0.01
```
Start here and adjust based on validation loss. If loss plateaus early, increase learning rate or rank. If loss spikes, reduce learning rate.
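Translated into code, the same starting configuration might look like the sketch below, continuing from the QLoRA setup above. It assumes `train_dataset` and `eval_dataset` are tokenized datasets built from your JSONL files; the output path is a placeholder:

```python
from transformers import (
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

training_args = TrainingArguments(
    output_dir="./checkpoints",        # illustrative path
    learning_rate=2e-4,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,     # 2 x 4 = effective batch size of 8
    warmup_steps=100,
    weight_decay=0.01,
    logging_steps=10,
    eval_strategy="epoch",             # "evaluation_strategy" on older versions
    save_strategy="epoch",
)
trainer = Trainer(
    model=model,                       # the peft-wrapped model from above
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```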
Step 4: Train and Monitor
During training, watch two key metrics:
- Training loss — should decrease steadily. A sudden spike usually means the learning rate is too high.
- Validation loss — should track training loss. When validation loss starts increasing while training loss continues to decrease, you're overfitting.
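One practical guard against overfitting is to stop training automatically once validation loss stops improving. A sketch of how this might look, extending the Trainer configuration from Step 3:

```python
from transformers import EarlyStoppingCallback

# Additions to the Step 3 TrainingArguments: keep the best checkpoint and
# judge "best" by validation loss (lower is better)
training_args.load_best_model_at_end = True
training_args.metric_for_best_model = "eval_loss"
training_args.greater_is_better = False

# Re-create the Trainer with an early-stopping callback: training halts after
# two evaluations without improvement, then rolls back to the best checkpoint
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```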
Signs of Problems
| Symptom | Likely Cause | Fix |
|---|---|---|
| Loss doesn't decrease | Learning rate too low or data issues | Increase learning rate; check data formatting |
| Loss spikes then recovers | Learning rate too high | Reduce learning rate by 2–5× |
| Validation loss diverges from training | Overfitting | Reduce epochs, add dropout, use more data |
| Output is repetitive or degenerate | Catastrophic forgetting or bad data | Lower learning rate, check data quality |
Training a 7B model on 5,000 examples with LoRA typically takes 30–90 minutes on a single A100 GPU. With Ertas Studio, this runs on managed cloud GPUs so you don't need to provision any hardware.
Step 5: Evaluate Your Model
Don't skip evaluation. A model that scores well on training loss can still produce poor real-world outputs.
Evaluation Methods
- Held-out test set — run the model on examples it hasn't seen during training. Compare outputs against ground truth.
- A/B comparison — generate outputs from both the base model and fine-tuned model on the same prompts. Have domain experts rate which is better.
- Task-specific metrics — accuracy for classification, ROUGE for summarization, exact match for extraction tasks.
- Vibe check — sometimes the most important evaluation is just using the model and seeing if it feels right for your use case.
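For the held-out test set approach, even a short script gives you a hard number to track across runs. A minimal exact-match sketch for the classification example from Step 1, where the checkpoint path and test file name are assumptions:

```python
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical paths: your fine-tuned checkpoint and held-out test set
model = AutoModelForCausalLM.from_pretrained("./my-finetuned-model", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("./my-finetuned-model")

examples = [json.loads(l) for l in open("test.jsonl", encoding="utf-8")]
correct = 0
for ex in examples:
    prompt = f"{ex['instruction']}\n{ex['input']}\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=16, do_sample=False)
    # Decode only the newly generated tokens, not the prompt
    answer = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    if answer.strip().lower() == ex["output"].strip().lower():
        correct += 1
print(f"exact match: {correct}/{len(examples)} ({correct / len(examples):.1%})")
```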
What Good Looks Like
- The model follows your output format consistently
- Domain terminology is used correctly
- Hallucinations are reduced compared to the base model
- Outputs match the tone and style of your training examples
If results aren't satisfactory, iterate: review training data quality, adjust hyperparameters, or add more examples for the failure cases.
Step 6: Export and Deploy
Once you're happy with your model, export it for deployment. The most common format for local inference is GGUF — an open standard supported by llama.cpp, Ollama, LM Studio, and many other tools.
Why GGUF?
- Quantization built in — reduce model size by 2–4× with minimal quality loss
- CPU inference — runs on consumer hardware without a GPU
- Universal compatibility — works with every major local inference tool
- No vendor lock-in — it's an open format you control
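As a quick illustration of how portable the format is, a GGUF file can be loaded directly from Python with the llama-cpp-python bindings and run on CPU. The model file name below is a placeholder matching the Ollama example later in this section:

```python
from llama_cpp import Llama

# Loads the quantized model for local inference; no GPU required
llm = Llama(model_path="./my-fine-tuned-model.gguf", n_ctx=2048)

result = llm(
    "Classify this ticket: I can't reset my password",
    max_tokens=16,
    temperature=0,
)
print(result["choices"][0]["text"].strip())
```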
Deployment Options
| Option | Best For | Setup Effort |
|---|---|---|
| Ollama | Quick local testing, API-compatible serving | Minimal |
| LM Studio | Desktop chat interface, non-technical users | Minimal |
| llama.cpp | Maximum control, custom applications | Moderate |
| vLLM | Production serving with high throughput | Moderate |
| Open WebUI | Team-facing ChatGPT-like interface | Moderate |
Example: Deploy with Ollama
After exporting your GGUF from Ertas Studio:
```bash
# Create a Modelfile
echo 'FROM ./my-fine-tuned-model.gguf' > Modelfile

# Import into Ollama
ollama create my-model -f Modelfile

# Run inference
ollama run my-model "Classify this ticket: I can't reset my password"
```
Your fine-tuned model is now running entirely on your hardware. No API calls, no per-token costs, no data leaving your network.
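Ollama also exposes a local HTTP API (on port 11434 by default), so your application can call the fine-tuned model programmatically. A minimal sketch using only the Python standard library:

```python
import json
import urllib.request

# Ollama's local REST API; "my-model" matches the name created above
payload = {
    "model": "my-model",
    "prompt": "Classify this ticket: I can't reset my password",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```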
The Faster Way: Ertas Studio
The workflow above involves setting up training environments, writing configuration files, and managing GPU instances. Ertas Studio handles all of that through a visual canvas interface:
- Upload your JSONL dataset — Studio validates your data and flags issues before training starts
- Select a base model — browse available models or import from Hugging Face
- Configure and launch — set hyperparameters visually and start training on managed cloud GPUs
- Compare results — run multiple fine-tuning jobs side by side and compare outputs on the same canvas
- Export as GGUF — download your model and deploy anywhere
No training scripts. No infrastructure to manage. No terminal required.
Lock in early bird pricing at $14.50/mo — this price is guaranteed for life and increases to $34.50/mo at launch. Join the waitlist →
Frequently Asked Questions
How long does it take to fine-tune an LLM?
Fine-tuning time depends on your dataset size, base model, and hardware. Training a 7B model on 5,000 examples with LoRA typically takes 30–90 minutes on a single A100 GPU. Smaller datasets (500–1,000 examples) can finish in under 15 minutes. Using QLoRA on consumer GPUs (RTX 3090/4090) takes 2–4× longer but is still measured in hours, not days. The data preparation step often takes longer than the actual training.
What hardware do I need for fine-tuning?
For LoRA/QLoRA fine-tuning of a 7B model, you need a GPU with at least 8 GB VRAM (e.g., NVIDIA RTX 3060). QLoRA specifically was designed to fine-tune on consumer hardware — a single RTX 4090 with 24 GB VRAM can handle models up to 33B parameters. For full fine-tuning (not recommended for most teams), you need multiple high-end GPUs like A100s. Cloud GPU providers like Lambda Labs, RunPod, or managed services like Ertas Studio eliminate the hardware requirement entirely.
How much training data do I need?
It varies by task complexity. For classification tasks, 100–200 examples per class is the minimum, with 500–1,000 being the sweet spot. Conversational fine-tuning needs at least 1,000 examples, ideally 5,000–10,000. Code generation tasks start at around 500 examples. Quality matters far more than quantity — according to research from Meta AI, 500 high-quality, expert-curated examples often outperform 10,000 noisy ones. Start small, evaluate, and add more data targeting the failure cases.
Can I fine-tune without coding?
Yes. Tools like Ertas Studio, Hugging Face AutoTrain, and OpenAI's fine-tuning API provide visual or simplified interfaces that handle the training pipeline for you. You prepare your JSONL dataset, upload it, select a base model, configure basic parameters, and start training. No Python scripts, no GPU provisioning, and no infrastructure management required.
What to Read Next
- Fine-Tuning vs RAG: When to Use Each — a decision framework for choosing the right approach
- Running AI Models Locally — everything you need to know about local inference
- Getting Started with Ertas — a walkthrough of the Ertas Studio workflow
- The Hidden Cost of Per-Token AI Pricing — why fine-tuned local models save money at scale
Ship AI that runs on your users' devices.
Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
Keep reading

Fine-Tuning for App Developers: A Non-ML-Engineer's Guide
A practical guide to fine-tuning AI models for mobile app developers. Learn LoRA, QLoRA, and GGUF export without needing an ML background.

How to Build an On-Premise Data Preparation Pipeline for LLM Fine-Tuning
A complete guide to building on-premise data preparation pipelines for LLM fine-tuning — covering the 5 stages from ingestion to export, tool comparisons, and architecture for regulated environments.

From Prompt Engineering to Fine-Tuning: The Migration Playbook
A practical playbook for teams migrating from prompt engineering to fine-tuning — when to make the switch, how to convert prompts into training data, and the step-by-step migration process.