    How to Fine-Tune an LLM: The Complete 2026 Guide

    Learn how to fine-tune large language models step by step — from preparing training data and choosing a base model to configuring LoRA, evaluating results, and deploying locally.

    Edward Yang · Updated

    To fine-tune an LLM, you prepare a JSONL dataset of instruction-response pairs, select a base model (typically 7B-8B parameters), apply LoRA or QLoRA adapters, train for 1-5 epochs, and export the result as a GGUF file for deployment. The entire process takes 30 minutes to a few hours depending on dataset size and hardware. According to Hugging Face, parameter-efficient fine-tuning methods like LoRA can reduce trainable parameters by over 99% while achieving results within 2-5% of full fine-tuning. Research from the Stanford HAI AI Index Report shows that fine-tuned smaller models consistently outperform larger prompted models on domain-specific tasks, making fine-tuning one of the most cost-effective ways to build production AI.

    This guide walks through the entire process: when fine-tuning makes sense, how to prepare your data, which base model to pick, how to configure training, and how to deploy the result.

    When Should You Fine-Tune?

    Fine-tuning isn't always the right answer. Before you invest time preparing data and running training jobs, consider whether your problem actually requires it.

    Fine-tuning makes sense when:

    • Prompt engineering hits a ceiling. You've tried few-shot examples, chain-of-thought prompting, and system instructions, but the model still doesn't produce consistent results for your domain.
    • You need a specific output format. Your application requires structured JSON, a particular writing style, or domain-specific terminology that base models struggle to produce reliably.
    • Latency and cost matter at scale. A fine-tuned 7B model can outperform a prompted 70B model on narrow tasks — at a fraction of the inference cost.
    • Privacy requirements prohibit cloud APIs. Fine-tuned models can run entirely on your infrastructure, keeping sensitive data off third-party servers.

    Consider alternatives when:

    • Your task is broad and changes frequently — prompt engineering or RAG may be more flexible.
    • You have fewer than 100 quality training examples — fine-tuning needs enough data to learn patterns without overfitting.
    • You need the model to access external knowledge that changes often — retrieval-augmented generation handles this better.

    For a deeper comparison, see our guide on fine-tuning vs RAG.

    Step 1: Prepare Your Training Data

    Data quality is the single biggest factor in fine-tuning success. A model trained on 500 excellent examples will outperform one trained on 10,000 mediocre ones.

    Format: JSONL

    The standard format for fine-tuning data is JSONL (JSON Lines) — one JSON object per line. Each line typically contains an instruction and the desired response:

    {"instruction": "Classify this support ticket as billing, technical, or general.", "input": "I can't log in to my account after resetting my password.", "output": "technical"}
    {"instruction": "Classify this support ticket as billing, technical, or general.", "input": "When will I be charged for the annual plan?", "output": "billing"}
    

    For conversational models, use a messages format:

    {"messages": [{"role": "system", "content": "You are a medical assistant."}, {"role": "user", "content": "What are common side effects of metformin?"}, {"role": "assistant", "content": "Common side effects include nausea, diarrhea, and stomach pain..."}]}
    

    Data Quality Checklist

    • Consistent formatting — every example should follow the same structure
    • Diverse examples — cover edge cases, not just the happy path
    • Accurate labels — garbage in, garbage out. Have domain experts review your data.
    • Balanced distribution — if you're training a classifier, roughly equal examples per class prevent the model from defaulting to the majority label
    • No data leakage — keep a validation set separate from training data to measure real performance
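
    The last checklist item is easy to enforce mechanically. A minimal sketch of a reproducible train/validation split (the 90/10 ratio and the seed are arbitrary choices):

```python
import random

def train_val_split(examples, val_fraction=0.1, seed=42):
    """Shuffle deterministically, then hold out a validation slice."""
    shuffled = examples[:]  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]  # (train, val)

train, val = train_val_split(list(range(100)))
assert not set(train) & set(val)  # no leakage: the two sets are disjoint
print(len(train), len(val))       # 90 10
```

    Splitting once with a fixed seed, before any training, guarantees the validation set never overlaps the training data across reruns.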

    How Much Data Do You Need?

    There's no universal answer, but here are practical starting points:

    | Task Type       | Minimum Examples  | Sweet Spot          |
    | --------------- | ----------------- | ------------------- |
    | Classification  | 100–200 per class | 500–1,000 per class |
    | Summarization   | 500               | 2,000–5,000         |
    | Conversational  | 1,000             | 5,000–10,000        |
    | Code generation | 500               | 3,000–8,000         |
    | Domain Q&A      | 300               | 1,000–3,000         |

    More data helps, but returns diminish. Focus on quality first, then scale up.
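
    For classification data, the per-class counts in the table above (and the "balanced distribution" item on the checklist) can be verified with a few lines of stdlib Python. Treating the "output" field as the class label follows the JSONL examples earlier; swap in your own label key if it differs:

```python
from collections import Counter
import json

def class_distribution(jsonl_lines, label_key="output"):
    """Count training examples per class label."""
    counts = Counter()
    for line in jsonl_lines:
        counts[json.loads(line)[label_key]] += 1
    return counts

lines = [
    '{"instruction": "Classify.", "input": "Login fails.", "output": "technical"}',
    '{"instruction": "Classify.", "input": "When am I billed?", "output": "billing"}',
    '{"instruction": "Classify.", "input": "App crashes.", "output": "technical"}',
]
print(class_distribution(lines))  # Counter({'technical': 2, 'billing': 1})
```

    If one class dominates, either collect more examples for the minority classes or downsample the majority before training.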

    Step 2: Choose a Base Model

    Your base model determines your starting point. The right choice depends on your task, hardware constraints, and licensing requirements.

    | Model    | Sizes        | Strengths                                              | License          |
    | -------- | ------------ | ------------------------------------------------------ | ---------------- |
    | Llama 3  | 8B, 70B      | General purpose, strong reasoning, large community     | Meta Community   |
    | Mistral  | 7B, 8x7B     | Fast inference, good at code and instruction following | Apache 2.0       |
    | Qwen 2.5 | 7B, 14B, 72B | Multilingual, strong on benchmarks                     | Apache 2.0       |
    | Gemma 2  | 2B, 9B, 27B  | Efficient, good for resource-constrained deployment    | Google           |
    | DeepSeek | 7B, 67B      | Strong at code and math                                | DeepSeek License |
    | Phi-3    | 3.8B, 14B    | Small but capable, good for edge deployment            | MIT              |

    Selection Criteria

    1. Task fit — models pre-trained on code (DeepSeek, CodeLlama) fine-tune better for code tasks
    2. Size vs. hardware — a 7B model fine-tunes on a single GPU; 70B needs multi-GPU setups
    3. License — check if commercial use is permitted for your deployment scenario
    4. Community support — more popular models have more fine-tuning guides, adapters, and quantized versions available

    For most tasks, start with a 7B–8B model. It's large enough to be capable but small enough to fine-tune quickly and deploy on modest hardware.

    Step 3: Configure Training

    Full Fine-Tuning vs. LoRA

    Full fine-tuning updates every weight in the model. It produces the best results but requires significant GPU memory — often multiple high-end GPUs for models above 7B parameters.

    LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices that modify the model's behavior. It uses a fraction of the memory and trains much faster, with results that are often within 5% of full fine-tuning.

    QLoRA goes further by quantizing the base model to 4-bit precision before applying LoRA adapters, allowing you to fine-tune a 7B model on a single consumer GPU with 8GB VRAM.

    For most teams, LoRA or QLoRA is the right choice. Full fine-tuning is reserved for cases where you have significant compute resources and need maximum performance.

    Key Hyperparameters

    | Parameter     | Typical Range | What It Does                                                                                  |
    | ------------- | ------------- | --------------------------------------------------------------------------------------------- |
    | Learning rate | 1e-5 to 5e-4  | How aggressively the model updates its weights. Too high = instability; too low = slow learning. |
    | Epochs        | 1–5           | How many times the model sees the full dataset. More epochs risk overfitting.                  |
    | Batch size    | 4–32          | Examples processed simultaneously. Larger = smoother gradients but more memory.                |
    | LoRA rank     | 8–64          | Adapter capacity. Higher rank = more expressive but more parameters.                           |
    | LoRA alpha    | 16–128        | Scaling factor for LoRA updates. Usually set to 2× the rank.                                   |

    Practical Starting Configuration

    For a 7B model with LoRA on a single GPU:

    Learning rate: 2e-4
    Epochs: 3
    Batch size: 8 (with gradient accumulation)
    LoRA rank: 16
    LoRA alpha: 32
    LoRA target modules: q_proj, v_proj, k_proj, o_proj
    Warmup steps: 100
    Weight decay: 0.01
    

    Start here and adjust based on validation loss. If loss plateaus early, increase learning rate or rank. If loss spikes, reduce learning rate.
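
    To make the warmup setting concrete, here is a sketch of a linear warmup followed by linear decay, a common default schedule; the exact schedule your training framework applies may differ, and the total step count here is an arbitrary illustration:

```python
def lr_at_step(step, max_lr=2e-4, warmup_steps=100, total_steps=2000):
    """Linear warmup to max_lr, then linear decay toward zero."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    remaining = total_steps - step
    return max_lr * max(0.0, remaining / (total_steps - warmup_steps))

print(lr_at_step(50))    # mid-warmup: about half of max_lr
print(lr_at_step(100))   # peak learning rate
print(lr_at_step(2000))  # end of training: 0.0
```

    Warmup matters most for LoRA runs on small datasets: starting at the full learning rate on step one is a common cause of early loss spikes.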

    Step 4: Train and Monitor

    During training, watch two key metrics:

    • Training loss — should decrease steadily. A sudden spike means the learning rate is too high.
    • Validation loss — should track training loss. When validation loss starts increasing while training loss continues to decrease, you're overfitting.
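
    The overfitting signal above, validation loss rising while training loss keeps falling, can be automated with simple early stopping. A minimal sketch (the patience value of 3 evaluations is an arbitrary choice):

```python
def should_stop(val_losses, patience=3):
    """Stop when validation loss has not improved for `patience` evals."""
    if len(val_losses) <= patience:
        return False
    best = min(val_losses[:-patience])  # best loss before the recent window
    return all(loss >= best for loss in val_losses[-patience:])

history = [2.1, 1.8, 1.6, 1.55, 1.6, 1.62, 1.7]
print(should_stop(history))  # True: no improvement in the last 3 evals
```

    Checkpointing at each evaluation and keeping the checkpoint with the lowest validation loss gives you the best model even when training runs past the stopping point.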

    Signs of Problems

    | Symptom                                 | Likely Cause                          | Fix                                           |
    | --------------------------------------- | ------------------------------------- | --------------------------------------------- |
    | Loss doesn't decrease                   | Learning rate too low or data issues  | Increase learning rate; check data formatting |
    | Loss spikes then recovers               | Learning rate too high                | Reduce learning rate by 2–5×                  |
    | Validation loss diverges from training  | Overfitting                           | Reduce epochs, add dropout, use more data     |
    | Output is repetitive or degenerate      | Catastrophic forgetting or bad data   | Lower learning rate, check data quality       |

    Training a 7B model on 5,000 examples with LoRA typically takes 30–90 minutes on a single A100 GPU. With Ertas Studio, this runs on managed cloud GPUs so you don't need to provision any hardware.

    Step 5: Evaluate Your Model

    Don't skip evaluation. A model that scores well on training loss can still produce poor real-world outputs.

    Evaluation Methods

    1. Held-out test set — run the model on examples it hasn't seen during training. Compare outputs against ground truth.
    2. A/B comparison — generate outputs from both the base model and fine-tuned model on the same prompts. Have domain experts rate which is better.
    3. Task-specific metrics — accuracy for classification, ROUGE for summarization, exact match for extraction tasks.
    4. Vibe check — sometimes the most important evaluation is just using the model and seeing if it feels right for your use case.
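
    For methods 1 and 3, the simple metrics are a few lines of stdlib Python (ROUGE needs a third-party library, so this sketch covers exact match, which doubles as accuracy for classification):

```python
def exact_match(predictions, references):
    """Fraction of predictions that exactly match the reference after trimming."""
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["technical", "billing", "general"]
refs  = ["technical", "billing", "technical"]
print(exact_match(preds, refs))  # 2 of 3 correct
```

    Run the same metric on the base model and the fine-tuned model over your held-out set; the delta between the two scores is the clearest evidence that fine-tuning helped.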

    What Good Looks Like

    • The model follows your output format consistently
    • Domain terminology is used correctly
    • Hallucinations are reduced compared to the base model
    • Outputs match the tone and style of your training examples

    If results aren't satisfactory, iterate: review training data quality, adjust hyperparameters, or add more examples for the failure cases.

    Step 6: Export and Deploy

    Once you're happy with your model, export it for deployment. The most common format for local inference is GGUF — an open standard supported by llama.cpp, Ollama, LM Studio, and many other tools.

    Why GGUF?

    • Quantization built in — reduce model size by 2–4× with minimal quality loss
    • CPU inference — runs on consumer hardware without a GPU
    • Universal compatibility — works with every major local inference tool
    • No vendor lock-in — it's an open format you control
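
    The 2–4× size reduction follows from back-of-the-envelope arithmetic. This sketch compares a 7B model at 16 bits per weight against an assumed effective rate of roughly 4.5 bits for a 4-bit quant with overhead; actual GGUF file sizes vary with the quantization variant and metadata:

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate model file size in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

fp16 = model_size_gb(7e9, 16)   # ~14 GB
q4   = model_size_gb(7e9, 4.5)  # ~3.9 GB
print(f"fp16: {fp16:.1f} GB, 4-bit: {q4:.1f} GB, ratio: {fp16 / q4:.1f}x")
```

    The same arithmetic explains the VRAM requirements in the FAQ below: a model file that fits in a few gigabytes can run on consumer hardware.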

    Deployment Options

    | Option     | Best For                                    | Setup Effort |
    | ---------- | ------------------------------------------- | ------------ |
    | Ollama     | Quick local testing, API-compatible serving | Minimal      |
    | LM Studio  | Desktop chat interface, non-technical users | Minimal      |
    | llama.cpp  | Maximum control, custom applications        | Moderate     |
    | vLLM       | Production serving with high throughput     | Moderate     |
    | Open WebUI | Team-facing ChatGPT-like interface          | Moderate     |

    Example: Deploy with Ollama

    After exporting your GGUF from Ertas Studio:

    # Create a Modelfile
    echo 'FROM ./my-fine-tuned-model.gguf' > Modelfile
    
    # Import into Ollama
    ollama create my-model -f Modelfile
    
    # Run inference
    ollama run my-model "Classify this ticket: I can't reset my password"
    

    Your fine-tuned model is now running entirely on your hardware. No API calls, no per-token costs, no data leaving your network.

    The Faster Way: Ertas Studio

    The workflow above involves setting up training environments, writing configuration files, and managing GPU instances. Ertas Studio handles all of that through a visual canvas interface:

    1. Upload your JSONL dataset — Studio validates your data and flags issues before training starts
    2. Select a base model — browse available models or import from Hugging Face
    3. Configure and launch — set hyperparameters visually and start training on managed cloud GPUs
    4. Compare results — run multiple fine-tuning jobs side by side and compare outputs on the same canvas
    5. Export as GGUF — download your model and deploy anywhere

    No training scripts. No infrastructure to manage. No terminal required.

    Lock in early bird pricing at $14.50/mo — this price is guaranteed for life and increases to $34.50/mo at launch. Join the waitlist →

    Frequently Asked Questions

    How long does it take to fine-tune an LLM?

    Fine-tuning time depends on your dataset size, base model, and hardware. Training a 7B model on 5,000 examples with LoRA typically takes 30-90 minutes on a single A100 GPU. Smaller datasets (500-1,000 examples) can finish in under 15 minutes. Using QLoRA on consumer GPUs (RTX 3090/4090) takes 2-4x longer but is still measured in hours, not days. The data preparation step often takes longer than the actual training.

    What hardware do I need for fine-tuning?

    For LoRA/QLoRA fine-tuning of a 7B model, you need a GPU with at least 8 GB VRAM (e.g., NVIDIA RTX 3060). QLoRA specifically was designed to fine-tune on consumer hardware — a single RTX 4090 with 24 GB VRAM can handle models up to 33B parameters. For full fine-tuning (not recommended for most teams), you need multiple high-end GPUs like A100s. Cloud GPU providers like Lambda Labs, RunPod, or managed services like Ertas Studio eliminate the hardware requirement entirely.

    How much training data do I need?

    It varies by task complexity. For classification tasks, 100-200 examples per class is the minimum, with 500-1,000 being the sweet spot. Conversational fine-tuning needs at least 1,000 examples, ideally 5,000-10,000. Code generation tasks start at around 500 examples. Quality matters far more than quantity — according to research from Meta AI, 500 high-quality, expert-curated examples often outperform 10,000 noisy ones. Start small, evaluate, and add more data targeting the failure cases.

    Can I fine-tune without coding?

    Yes. Tools like Ertas Studio, Hugging Face AutoTrain, and OpenAI's fine-tuning API provide visual or simplified interfaces that handle the training pipeline for you. You prepare your JSONL dataset, upload it, select a base model, configure basic parameters, and start training. No Python scripts, no GPU provisioning, and no infrastructure management required.

    Ship AI that runs on your users' devices.

    Early bird pricing starts at $14.50/mo — locked in for life. Plans for builders and agencies.
