    How to Fine-Tune an LLM: The Complete 2026 Guide

    Learn how to fine-tune large language models step by step — from preparing training data and choosing a base model to configuring LoRA, evaluating results, and deploying locally.

    Edward Yang · Updated

    To fine-tune an LLM, you prepare a JSONL dataset of instruction-response pairs, select a base model (typically 7B-8B parameters), apply LoRA or QLoRA adapters, train for 1-5 epochs, and export the result as a GGUF file for deployment. The entire process takes 30 minutes to a few hours depending on dataset size and hardware. According to Hugging Face, parameter-efficient fine-tuning methods like LoRA can reduce trainable parameters by over 99% while achieving results within 2-5% of full fine-tuning. Research from the Stanford HAI AI Index Report shows that fine-tuned smaller models consistently outperform larger prompted models on domain-specific tasks, making fine-tuning one of the most cost-effective ways to build production AI.

    This guide walks through the entire process: when fine-tuning makes sense, how to prepare your data, which base model to pick, how to configure training, and how to deploy the result.

    When Should You Fine-Tune?

    Fine-tuning isn't always the right answer. Before you invest time preparing data and running training jobs, consider whether your problem actually requires it.

    Fine-tuning makes sense when:

    • Prompt engineering hits a ceiling. You've tried few-shot examples, chain-of-thought prompting, and system instructions, but the model still doesn't produce consistent results for your domain.
    • You need a specific output format. Your application requires structured JSON, a particular writing style, or domain-specific terminology that base models struggle to produce reliably.
    • Latency and cost matter at scale. A fine-tuned 7B model can outperform a prompted 70B model on narrow tasks — at a fraction of the inference cost.
    • Privacy requirements prohibit cloud APIs. Fine-tuned models can run entirely on your infrastructure, keeping sensitive data off third-party servers.

    Consider alternatives when:

    • Your task is broad and changes frequently — prompt engineering or RAG may be more flexible.
    • You have fewer than 100 quality training examples — fine-tuning needs enough data to learn patterns without overfitting.
    • You need the model to access external knowledge that changes often — retrieval-augmented generation handles this better.

    For a deeper comparison, see our guide on fine-tuning vs RAG.

    Step 1: Prepare Your Training Data

    Data quality is the single biggest factor in fine-tuning success. A model trained on 500 excellent examples will outperform one trained on 10,000 mediocre ones.

    Format: JSONL

    The standard format for fine-tuning data is JSONL (JSON Lines) — one JSON object per line. Each line typically contains an instruction and the desired response:

    {"instruction": "Classify this support ticket as billing, technical, or general.", "input": "I can't log in to my account after resetting my password.", "output": "technical"}
    {"instruction": "Classify this support ticket as billing, technical, or general.", "input": "When will I be charged for the annual plan?", "output": "billing"}
    

    For conversational models, use a messages format:

    {"messages": [{"role": "system", "content": "You are a medical assistant."}, {"role": "user", "content": "What are common side effects of metformin?"}, {"role": "assistant", "content": "Common side effects include nausea, diarrhea, and stomach pain..."}]}
    

    Data Quality Checklist

    • Consistent formatting — every example should follow the same structure
    • Diverse examples — cover edge cases, not just the happy path
    • Accurate labels — garbage in, garbage out. Have domain experts review your data.
    • Balanced distribution — if you're training a classifier, roughly equal examples per class prevent the model from defaulting to the majority label
    • No data leakage — keep a validation set separate from training data to measure real performance
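
    The last checklist item is easy to enforce mechanically. A minimal sketch of a reproducible train/validation split (the 90/10 ratio and the seed are arbitrary choices):

```python
import random

def train_val_split(examples, val_fraction=0.1, seed=42):
    """Shuffle deterministically, then hold out a validation slice."""
    shuffled = examples[:]  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]  # (train, val)

train, val = train_val_split(list(range(100)))
assert not set(train) & set(val)  # no leakage: the two sets are disjoint
print(len(train), len(val))       # 90 10
```

    Splitting once with a fixed seed, before any training, guarantees the validation set never overlaps the training data across reruns.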

    How Much Data Do You Need?

    There's no universal answer, but here are practical starting points:

    | Task Type       | Minimum Examples  | Sweet Spot          |
    | --------------- | ----------------- | ------------------- |
    | Classification  | 100–200 per class | 500–1,000 per class |
    | Summarization   | 500               | 2,000–5,000         |
    | Conversational  | 1,000             | 5,000–10,000        |
    | Code generation | 500               | 3,000–8,000         |
    | Domain Q&A      | 300               | 1,000–3,000         |

    More data helps, but returns diminish. Focus on quality first, then scale up.
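
    For classification data, the per-class counts in the table above (and the "balanced distribution" item on the checklist) can be verified with a few lines of stdlib Python. Treating the "output" field as the class label follows the JSONL examples earlier; swap in your own label key if it differs:

```python
from collections import Counter
import json

def class_distribution(jsonl_lines, label_key="output"):
    """Count training examples per class label."""
    counts = Counter()
    for line in jsonl_lines:
        counts[json.loads(line)[label_key]] += 1
    return counts

lines = [
    '{"instruction": "Classify.", "input": "Login fails.", "output": "technical"}',
    '{"instruction": "Classify.", "input": "When am I billed?", "output": "billing"}',
    '{"instruction": "Classify.", "input": "App crashes.", "output": "technical"}',
]
print(class_distribution(lines))  # Counter({'technical': 2, 'billing': 1})
```

    If one class dominates, either collect more examples for the minority classes or downsample the majority before training.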

    Step 2: Choose a Base Model

    Your base model determines your starting point. The right choice depends on your task, hardware constraints, and licensing requirements.

    | Model    | Sizes        | Strengths                                              | License          |
    | -------- | ------------ | ------------------------------------------------------ | ---------------- |
    | Llama 3  | 8B, 70B      | General purpose, strong reasoning, large community     | Meta Community   |
    | Mistral  | 7B, 8x7B     | Fast inference, good at code and instruction following | Apache 2.0       |
    | Qwen 2.5 | 7B, 14B, 72B | Multilingual, strong on benchmarks                     | Apache 2.0       |
    | Gemma 2  | 2B, 9B, 27B  | Efficient, good for resource-constrained deployment    | Google           |
    | DeepSeek | 7B, 67B      | Strong at code and math                                | DeepSeek License |
    | Phi-3    | 3.8B, 14B    | Small but capable, good for edge deployment            | MIT              |

    Selection Criteria

    1. Task fit — models pre-trained on code (DeepSeek, CodeLlama) fine-tune better for code tasks
    2. Size vs. hardware — a 7B model fine-tunes on a single GPU; 70B needs multi-GPU setups
    3. License — check if commercial use is permitted for your deployment scenario
    4. Community support — more popular models have more fine-tuning guides, adapters, and quantized versions available

    For most tasks, start with a 7B–8B model. It's large enough to be capable but small enough to fine-tune quickly and deploy on modest hardware.

    Step 3: Configure Training

    Full Fine-Tuning vs. LoRA

    Full fine-tuning updates every weight in the model. It produces the best results but requires significant GPU memory — often multiple high-end GPUs for models above 7B parameters.

    LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices that modify the model's behavior. It uses a fraction of the memory and trains much faster, with results that are often within 5% of full fine-tuning.

    QLoRA goes further by quantizing the base model to 4-bit precision before applying LoRA adapters, allowing you to fine-tune a 7B model on a single consumer GPU with 8GB VRAM.

    For most teams, LoRA or QLoRA is the right choice. Full fine-tuning is reserved for cases where you have significant compute resources and need maximum performance.

    Key Hyperparameters

    | Parameter     | Typical Range | What It Does                                                                                  |
    | ------------- | ------------- | --------------------------------------------------------------------------------------------- |
    | Learning rate | 1e-5 to 5e-4  | How aggressively the model updates its weights. Too high = instability; too low = slow learning. |
    | Epochs        | 1–5           | How many times the model sees the full dataset. More epochs risk overfitting.                  |
    | Batch size    | 4–32          | Examples processed simultaneously. Larger = smoother gradients but more memory.                |
    | LoRA rank     | 8–64          | Adapter capacity. Higher rank = more expressive but more parameters.                           |
    | LoRA alpha    | 16–128        | Scaling factor for LoRA updates. Usually set to 2× the rank.                                   |

    Practical Starting Configuration

    For a 7B model with LoRA on a single GPU:

    Learning rate: 2e-4
    Epochs: 3
    Batch size: 8 (with gradient accumulation)
    LoRA rank: 16
    LoRA alpha: 32
    LoRA target modules: q_proj, v_proj, k_proj, o_proj
    Warmup steps: 100
    Weight decay: 0.01
    

    Start here and adjust based on validation loss. If loss plateaus early, increase learning rate or rank. If loss spikes, reduce learning rate.
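
    To make the warmup setting concrete, here is a sketch of a linear warmup followed by linear decay, a common default schedule; the exact schedule your training framework applies may differ, and the total step count here is an arbitrary illustration:

```python
def lr_at_step(step, max_lr=2e-4, warmup_steps=100, total_steps=2000):
    """Linear warmup to max_lr, then linear decay toward zero."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    remaining = total_steps - step
    return max_lr * max(0.0, remaining / (total_steps - warmup_steps))

print(lr_at_step(50))    # mid-warmup: about half of max_lr
print(lr_at_step(100))   # peak learning rate
print(lr_at_step(2000))  # end of training: 0.0
```

    Warmup matters most for LoRA runs on small datasets: starting at the full learning rate on step one is a common cause of early loss spikes.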

    Step 4: Train and Monitor

    During training, watch two key metrics:

    • Training loss — should decrease steadily. A sudden spike means the learning rate is too high.
    • Validation loss — should track training loss. When validation loss starts increasing while training loss continues to decrease, you're overfitting.
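
    The overfitting signal above, validation loss rising while training loss keeps falling, can be automated with simple early stopping. A minimal sketch (the patience value of 3 evaluations is an arbitrary choice):

```python
def should_stop(val_losses, patience=3):
    """Stop when validation loss has not improved for `patience` evals."""
    if len(val_losses) <= patience:
        return False
    best = min(val_losses[:-patience])  # best loss before the recent window
    return all(loss >= best for loss in val_losses[-patience:])

history = [2.1, 1.8, 1.6, 1.55, 1.6, 1.62, 1.7]
print(should_stop(history))  # True: no improvement in the last 3 evals
```

    Checkpointing at each evaluation and keeping the checkpoint with the lowest validation loss gives you the best model even when training runs past the stopping point.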

    Signs of Problems

    | Symptom                                 | Likely Cause                          | Fix                                           |
    | --------------------------------------- | ------------------------------------- | --------------------------------------------- |
    | Loss doesn't decrease                   | Learning rate too low or data issues  | Increase learning rate; check data formatting |
    | Loss spikes then recovers               | Learning rate too high                | Reduce learning rate by 2–5×                  |
    | Validation loss diverges from training  | Overfitting                           | Reduce epochs, add dropout, use more data     |
    | Output is repetitive or degenerate      | Catastrophic forgetting or bad data   | Lower learning rate, check data quality       |

    Training a 7B model on 5,000 examples with LoRA typically takes 30–90 minutes on a single A100 GPU. With Ertas Studio, this runs on managed cloud GPUs so you don't need to provision any hardware.

    Step 5: Evaluate Your Model

    Don't skip evaluation. A model that scores well on training loss can still produce poor real-world outputs.

    Evaluation Methods

    1. Held-out test set — run the model on examples it hasn't seen during training. Compare outputs against ground truth.
    2. A/B comparison — generate outputs from both the base model and fine-tuned model on the same prompts. Have domain experts rate which is better.
    3. Task-specific metrics — accuracy for classification, ROUGE for summarization, exact match for extraction tasks.
    4. Vibe check — sometimes the most important evaluation is just using the model and seeing if it feels right for your use case.
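
    For methods 1 and 3, the simple metrics are a few lines of stdlib Python (ROUGE needs a third-party library, so this sketch covers exact match, which doubles as accuracy for classification):

```python
def exact_match(predictions, references):
    """Fraction of predictions that exactly match the reference after trimming."""
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["technical", "billing", "general"]
refs  = ["technical", "billing", "technical"]
print(exact_match(preds, refs))  # 2 of 3 correct
```

    Run the same metric on the base model and the fine-tuned model over your held-out set; the delta between the two scores is the clearest evidence that fine-tuning helped.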

    What Good Looks Like

    • The model follows your output format consistently
    • Domain terminology is used correctly
    • Hallucinations are reduced compared to the base model
    • Outputs match the tone and style of your training examples

    If results aren't satisfactory, iterate: review training data quality, adjust hyperparameters, or add more examples for the failure cases.

    Step 6: Export and Deploy

    Once you're happy with your model, export it for deployment. The most common format for local inference is GGUF — an open standard supported by llama.cpp, Ollama, LM Studio, and many other tools.

    Why GGUF?

    • Quantization built in — reduce model size by 2–4× with minimal quality loss
    • CPU inference — runs on consumer hardware without a GPU
    • Universal compatibility — works with every major local inference tool
    • No vendor lock-in — it's an open format you control
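
    The 2–4× size reduction follows from back-of-the-envelope arithmetic. This sketch compares a 7B model at 16 bits per weight against an assumed effective rate of roughly 4.5 bits for a 4-bit quant with overhead; actual GGUF file sizes vary with the quantization variant and metadata:

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate model file size in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

fp16 = model_size_gb(7e9, 16)   # ~14 GB
q4   = model_size_gb(7e9, 4.5)  # ~3.9 GB
print(f"fp16: {fp16:.1f} GB, 4-bit: {q4:.1f} GB, ratio: {fp16 / q4:.1f}x")
```

    The same arithmetic explains the VRAM requirements in the FAQ below: a model file that fits in a few gigabytes can run on consumer hardware.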

    Deployment Options

    | Option     | Best For                                    | Setup Effort |
    | ---------- | ------------------------------------------- | ------------ |
    | Ollama     | Quick local testing, API-compatible serving | Minimal      |
    | LM Studio  | Desktop chat interface, non-technical users | Minimal      |
    | llama.cpp  | Maximum control, custom applications        | Moderate     |
    | vLLM       | Production serving with high throughput     | Moderate     |
    | Open WebUI | Team-facing ChatGPT-like interface          | Moderate     |

    Example: Deploy with Ollama

    After exporting your GGUF from Ertas Studio:

    # Create a Modelfile
    echo 'FROM ./my-fine-tuned-model.gguf' > Modelfile
    
    # Import into Ollama
    ollama create my-model -f Modelfile
    
    # Run inference
    ollama run my-model "Classify this ticket: I can't reset my password"
    

    Your fine-tuned model is now running entirely on your hardware. No API calls, no per-token costs, no data leaving your network.

    The Faster Way: Ertas Studio

    The workflow above involves setting up training environments, writing configuration files, and managing GPU instances. Ertas Studio handles all of that through a visual canvas interface:

    1. Upload your JSONL dataset — Studio validates your data and flags issues before training starts
    2. Select a base model — browse available models or import from Hugging Face
    3. Configure and launch — set hyperparameters visually and start training on managed cloud GPUs
    4. Compare results — run multiple fine-tuning jobs side by side and compare outputs on the same canvas
    5. Export as GGUF — download your model and deploy anywhere

    No training scripts. No infrastructure to manage. No terminal required.

    Lock in early bird pricing at $14.50/mo — this price is guaranteed for life and increases to $34.50/mo at launch. Join the waitlist →

    Frequently Asked Questions

    How long does it take to fine-tune an LLM?

    Fine-tuning time depends on your dataset size, base model, and hardware. Training a 7B model on 5,000 examples with LoRA typically takes 30-90 minutes on a single A100 GPU. Smaller datasets (500-1,000 examples) can finish in under 15 minutes. Using QLoRA on consumer GPUs (RTX 3090/4090) takes 2-4x longer but is still measured in hours, not days. The data preparation step often takes longer than the actual training.

    What hardware do I need for fine-tuning?

    For LoRA/QLoRA fine-tuning of a 7B model, you need a GPU with at least 8 GB VRAM (e.g., NVIDIA RTX 3060). QLoRA specifically was designed to fine-tune on consumer hardware — a single RTX 4090 with 24 GB VRAM can handle models up to 33B parameters. For full fine-tuning (not recommended for most teams), you need multiple high-end GPUs like A100s. Cloud GPU providers like Lambda Labs, RunPod, or managed services like Ertas Studio eliminate the hardware requirement entirely.

    How much training data do I need?

    It varies by task complexity. For classification tasks, 100-200 examples per class is the minimum, with 500-1,000 being the sweet spot. Conversational fine-tuning needs at least 1,000 examples, ideally 5,000-10,000. Code generation tasks start at around 500 examples. Quality matters far more than quantity — according to research from Meta AI, 500 high-quality, expert-curated examples often outperform 10,000 noisy ones. Start small, evaluate, and add more data targeting the failure cases.

    Can I fine-tune without coding?

    Yes. Tools like Ertas Studio, Hugging Face AutoTrain, and OpenAI's fine-tuning API provide visual or simplified interfaces that handle the training pipeline for you. You prepare your JSONL dataset, upload it, select a base model, configure basic parameters, and start training. No Python scripts, no GPU provisioning, and no infrastructure management required.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
