    Fine-Tuning Llama 3.3 and Qwen 2.5 with QLoRA: Benchmark Comparison


    Head-to-head comparison of fine-tuning Llama 3.3 8B and Qwen 2.5 7B with QLoRA across common tasks — classification, extraction, generation — with benchmarks, VRAM usage, and practical recommendations.

    Ertas Team

    Llama 3.3 8B and Qwen 2.5 7B have emerged as the two dominant base models for production fine-tuning in early 2026. Both are permissively licensed, well-supported by the fine-tuning ecosystem, and small enough to train on a single consumer GPU. But which one should you actually use?

    The answer depends on your task, your data, and your deployment constraints. This article provides a controlled benchmark comparison across three common fine-tuning tasks, with identical training configurations, to give you data rather than opinions.

    Why These Two Models

    The sub-10B parameter class is the sweet spot for production fine-tuning. These models are large enough to capture complex task-specific patterns, small enough to fine-tune on a single 24GB GPU, and fast enough to serve at low latency in production.

    Llama 3.3 8B is Meta's latest iteration in the Llama family. It benefits from a massive pretraining corpus, a robust tokenizer with a 128K vocabulary, and strong English-language performance. The Llama ecosystem is the most mature in open-source AI, with extensive tooling support.

    Qwen 2.5 7B is Alibaba's flagship small model. It was pretrained on a highly multilingual corpus with strong representation of CJK languages and code. It uses a 152K vocabulary tokenizer and has shown particularly strong performance on structured tasks in community benchmarks.

    Both models support the same fine-tuning techniques and can be exported to the same inference formats. The choice between them is purely about task-level performance.

    Test Setup

    To ensure a fair comparison, we controlled every variable except the base model.

    Training configuration:

    • Method: QLoRA (4-bit quantisation, LoRA rank 16, alpha 32)
    • Learning rate: 2e-4 with cosine schedule
    • Batch size: 4 (with gradient accumulation to effective batch size 16)
    • Epochs: 3
    • Hardware: Single NVIDIA RTX 4090 (24GB VRAM)
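The configuration above maps onto the Hugging Face transformers/peft/bitsandbytes stack roughly as follows. This is a sketch, not the exact script behind these benchmarks; the NF4 quantisation type, the compute dtype, and the target_modules selection are assumptions the article does not specify.

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit quantisation for QLoRA (NF4 and bfloat16 compute are assumed here;
# the article only states "4-bit")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA rank 16, alpha 32, as in the article's setup
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
)

# 2e-4 with cosine schedule, batch 4 x grad-accum 4 = effective batch 16, 3 epochs
training_args = TrainingArguments(
    output_dir="qlora-out",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    bf16=True,
)
```

Swapping the base model between the two candidates is then a one-line change to the checkpoint name, which is what keeps this comparison controlled.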

    Datasets:

    • Classification: 5,000 labelled customer support tickets (12 categories)
    • Entity extraction: 3,000 annotated business documents (company names, dates, monetary values, product references)
    • Text generation: 2,000 instruction-response pairs for technical documentation

    Each dataset was split 80/10/10 into train/validation/test sets. Evaluation was performed on the held-out test set after training completed.
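An 80/10/10 split with a fixed seed can be sketched like this; the seed value and the shuffle-then-slice approach are our illustration, not necessarily the exact procedure used for these datasets.

```python
import random

def split_80_10_10(examples, seed=42):
    """Shuffle with a fixed seed, then slice into train/validation/test (80/10/10)."""
    rng = random.Random(seed)
    shuffled = examples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    return (
        shuffled[:n_train],                      # train
        shuffled[n_train:n_train + n_val],       # validation
        shuffled[n_train + n_val:],              # held-out test
    )

# e.g. the 5,000-ticket classification set -> 4,000 / 500 / 500
train, val, test = split_80_10_10(list(range(5000)))
```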

    Results

    Classification (Customer Support Tickets)

    Metric      | Llama 3.3 8B | Qwen 2.5 7B
    Accuracy    | 94.2%        | 93.8%
    Macro F1    | 0.921        | 0.917
    Weighted F1 | 0.941        | 0.937

    Both models performed comparably on classification. Llama had a marginal edge, likely due to its stronger English language pretraining. The difference is not statistically significant — either model is an excellent choice for classification tasks.

    Entity Extraction (Business Documents)

    Metric          | Llama 3.3 8B | Qwen 2.5 7B
    Entity-level F1 | 0.887        | 0.912
    Exact match     | 81.3%        | 85.7%
    Partial match   | 91.2%        | 93.1%

    Qwen showed a meaningful advantage on entity extraction. Its tokenizer handles mixed-format text (dates in various formats, currency symbols, alphanumeric product codes) more consistently than Llama's. The 4.4 percentage point gap in exact match accuracy is significant in production, where partial extraction failures cascade into downstream errors.

    Text Generation (Technical Documentation)

    Metric                   | Llama 3.3 8B | Qwen 2.5 7B
    ROUGE-L                  | 0.673        | 0.651
    BERTScore F1             | 0.894        | 0.882
    Human preference (blind) | 62%          | 38%

    Llama produced noticeably better English prose. Its outputs were more fluent, better structured, and more consistent in tone. Blind human evaluators preferred Llama's outputs 62% of the time. For English-language generation tasks, Llama 3.3 is the stronger base.

    VRAM Usage Comparison

    Memory efficiency matters for production deployment, especially on constrained hardware.

    Phase                   | Llama 3.3 8B | Qwen 2.5 7B
    Training (QLoRA)        | 14.2 GB      | 12.8 GB
    Peak training           | 18.1 GB      | 16.3 GB
    Inference (Q4_K_M GGUF) | 5.1 GB       | 4.6 GB
    Inference (Q8_0 GGUF)   | 8.5 GB       | 7.4 GB

    Qwen is consistently more memory-efficient, reflecting its smaller parameter count (7B vs 8B). The difference is modest but can matter on devices with tight memory budgets — a 16GB MacBook, for example, runs Qwen's Q8_0 quantisation comfortably while Llama's Q8_0 leaves less room for the operating system and other applications.

    Both models fit comfortably within a 24GB GPU for training with QLoRA. Neither requires multi-GPU setups or offloading strategies.
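As a sanity check on the inference figures, GGUF weight size is roughly parameters times bits per weight. The bits-per-weight values below are community approximations for llama.cpp quantisation formats, not numbers from the article, and the estimate covers weights only; KV cache and runtime buffers account for the remainder of the measured usage.

```python
def estimate_gguf_weights_gb(n_params_billion, bits_per_weight):
    """Rough GGUF weight-file size: params * bits / 8 (weights only)."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Approximate bits-per-weight (community figures, not exact)
Q4_K_M_BPW = 4.85
Q8_0_BPW = 8.5    # 32 int8 weights + one fp16 scale per block

llama_q4 = estimate_gguf_weights_gb(8, Q4_K_M_BPW)  # 4.85 GB of weights alone
qwen_q4 = estimate_gguf_weights_gb(7, Q4_K_M_BPW)   # 4.24 GB
```

The Q8_0 estimates (8.5 GB and 7.4 GB) line up almost exactly with the measured table, while the Q4_K_M measurements sit a few hundred megabytes above the weight-only estimate, consistent with runtime overhead.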

    Training Speed Comparison

    Metric                         | Llama 3.3 8B | Qwen 2.5 7B
    Tokens/second (training)       | 1,840        | 2,120
    Time per epoch (5K samples)    | 42 min       | 36 min
    Total training time (3 epochs) | 2h 06min     | 1h 48min

    Qwen trains approximately 15% faster than Llama on identical hardware, again reflecting the parameter count difference. Over three epochs, you save roughly 18 minutes — not dramatic for a single run, but meaningful when iterating over multiple experiments.
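The throughput and wall-clock figures are mutually consistent. Assuming roughly 4.6 million training tokens per epoch (a figure we infer from the reported numbers; the article does not state it), epoch time follows directly from token throughput:

```python
def epoch_minutes(tokens_per_epoch, tokens_per_second):
    """Wall-clock minutes per epoch from token throughput."""
    return tokens_per_epoch / tokens_per_second / 60

TOKENS_PER_EPOCH = 4.6e6  # inferred, not measured

llama_epoch = epoch_minutes(TOKENS_PER_EPOCH, 1840)  # ~42 min
qwen_epoch = epoch_minutes(TOKENS_PER_EPOCH, 2120)   # ~36 min
```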

    GGUF Inference Speed Comparison

    Production inference speed was measured using llama.cpp with the Q4_K_M quantisation on the same RTX 4090.

    Metric                    | Llama 3.3 8B | Qwen 2.5 7B
    Prompt processing (tok/s) | 3,240        | 3,680
    Generation (tok/s)        | 98.2         | 112.5
    Time to first token       | 28 ms        | 24 ms

    Qwen is faster at inference across the board, with a particularly notable advantage in generation speed. At 112.5 tokens per second, Qwen delivers responses perceptibly faster to end users.
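To translate these throughput numbers into user-facing latency, a simple model is prompt-processing time plus generation time. The 1,000-token prompt and 300-token answer below are an illustrative workload, not one from the benchmark.

```python
def response_seconds(prompt_tokens, output_tokens, prompt_tps, gen_tps):
    """Approximate end-to-end latency: prompt processing plus generation.
    Ignores sampling and scheduling overhead."""
    return prompt_tokens / prompt_tps + output_tokens / gen_tps

# Illustrative workload: 1,000-token prompt, 300-token answer
llama_t = response_seconds(1000, 300, 3240, 98.2)   # ~3.4 s
qwen_t = response_seconds(1000, 300, 3680, 112.5)   # ~2.9 s
```

Generation speed dominates for any non-trivial output length, which is why the gap in tokens per second matters more than the 4 ms difference in time to first token.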

    Key Findings

    Qwen 2.5 7B is the better choice for: entity extraction, structured output tasks, multilingual applications, memory-constrained deployments, and latency-sensitive production environments. It trains faster, runs faster, and uses less memory.

    Llama 3.3 8B is the better choice for: English-language text generation, creative or conversational tasks, and applications where prose quality is the primary metric. It produces more fluent, natural English output.

    Both are excellent choices for: classification, sentiment analysis, and other tasks where the quality difference is within noise.

    If you are starting a new fine-tuning project and are unsure which to choose, default to Qwen 2.5 7B unless your primary task is English text generation. The memory and speed advantages compound in production, and the extraction performance gap is meaningful.

    How to Choose: A Decision Framework

    Ask yourself three questions:

    1. Is your primary task text generation in English? If yes, start with Llama.
    2. Is your deployment memory-constrained? If yes, start with Qwen.
    3. Does your task involve structured extraction or multilingual data? If yes, start with Qwen.

    If none of these apply strongly, run a quick experiment with both. The training cost of a single QLoRA run on either model is minimal — under two hours on consumer hardware. Let your specific data and task decide.
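The three questions above can be encoded as a toy helper, evaluated in the order listed. This is a mnemonic, not a substitute for running the experiment on your own data.

```python
def choose_base_model(english_generation, memory_constrained, structured_or_multilingual):
    """Toy encoding of the three-question framework, checked in order."""
    if english_generation:
        return "Llama 3.3 8B"
    if memory_constrained or structured_or_multilingual:
        return "Qwen 2.5 7B"
    return "run both and compare"
```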

    Fine-Tune Both with Ertas

    Ertas Studio supports both Llama 3.3 and Qwen 2.5 as base models for fine-tuning. You can run parallel experiments with identical configurations and compare results directly in the evaluation dashboard — exactly the kind of controlled comparison we performed for this article, without the manual setup.

    Ready to benchmark models on your data? Join the Ertas waitlist and start experimenting.
