
Fine-Tuning Llama 3.3 and Qwen 2.5 with QLoRA: Benchmark Comparison
Head-to-head comparison of fine-tuning Llama 3.3 8B and Qwen 2.5 7B with QLoRA across common tasks — classification, extraction, generation — with benchmarks, VRAM usage, and practical recommendations.
Llama 3.3 8B and Qwen 2.5 7B have emerged as the two dominant base models for production fine-tuning in early 2026. Both are permissively licensed, well-supported by the fine-tuning ecosystem, and small enough to train on a single consumer GPU. But which one should you actually use?
The answer depends on your task, your data, and your deployment constraints. This article provides a controlled benchmark comparison across three common fine-tuning tasks, with identical training configurations, to give you data rather than opinions.
Why These Two Models
The sub-10B parameter class is the sweet spot for production fine-tuning. These models are large enough to capture complex task-specific patterns, small enough to fine-tune on a single 24GB GPU, and fast enough to serve at low latency in production.
Llama 3.3 8B is Meta's latest iteration in the Llama family. It benefits from a massive pretraining corpus, a robust tokenizer with a 128K vocabulary, and strong English-language performance. The Llama ecosystem is the most mature in open-source AI, with extensive tooling support.
Qwen 2.5 7B is Alibaba's flagship small model. It was pretrained on a highly multilingual corpus with strong representation of CJK languages and code. It uses a 152K vocabulary tokenizer and has shown particularly strong performance on structured tasks in community benchmarks.
Both models support the same fine-tuning techniques and can be exported to the same inference formats. The choice between them comes down to task-level performance and runtime efficiency, not ecosystem support.
Test Setup
To ensure a fair comparison, we controlled every variable except the base model.
Training configuration:
- Method: QLoRA (4-bit quantisation, LoRA rank 16, alpha 32)
- Learning rate: 2e-4 with cosine schedule
- Batch size: 4 (with gradient accumulation to effective batch size 16)
- Epochs: 3
- Hardware: Single NVIDIA RTX 4090 (24GB VRAM)
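For reference, the sketch below shows how this configuration maps onto the Hugging Face stack (transformers, peft, bitsandbytes, trl). It is a minimal sketch, not the exact benchmark code: the model ID, LoRA dropout, and target modules are illustrative assumptions that the configuration above does not specify.

```python
# Minimal QLoRA setup matching the configuration above. Assumes the
# Hugging Face stack; model ID, dropout, and target_modules are
# illustrative choices, not benchmark-specified values.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig

# 4-bit NF4 quantisation (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA rank 16, alpha 32, as in the benchmark
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,                                        # assumption: not stated above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: attention projections
    task_type="CAUSAL_LM",
)

# Learning rate, schedule, batch size, and epochs from the list above
training_args = SFTConfig(
    output_dir="qlora-run",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size 16
    num_train_epochs=3,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",               # swap in the Llama checkpoint to compare
    quantization_config=bnb_config,
    device_map="auto",
)
```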
Datasets:
- Classification: 5,000 labelled customer support tickets (12 categories)
- Entity extraction: 3,000 annotated business documents (company names, dates, monetary values, product references)
- Text generation: 2,000 instruction-response pairs for technical documentation
Each dataset was split 80/10/10 into train/validation/test sets. Evaluation was performed on the held-out test set after training completed.
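If you want to reproduce the split, here is one way to do it with the `datasets` library; the data file and seed are placeholders.

```python
# 80/10/10 train/validation/test split with the `datasets` library.
# The data file and seed are placeholders.
from datasets import load_dataset

ds = load_dataset("json", data_files="tickets.jsonl", split="train")
first = ds.train_test_split(test_size=0.2, seed=42)               # 80% train, 20% held out
second = first["test"].train_test_split(test_size=0.5, seed=42)   # split held-out into 10/10
train_ds, val_ds, test_ds = first["train"], second["train"], second["test"]
```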
Results
Classification (Customer Support Tickets)
| Metric | Llama 3.3 8B | Qwen 2.5 7B |
|---|---|---|
| Accuracy | 94.2% | 93.8% |
| Macro F1 | 0.921 | 0.917 |
| Weighted F1 | 0.941 | 0.937 |
Both models performed comparably on classification. Llama had a marginal edge, likely due to its stronger English-language pretraining. The difference is not statistically significant; either model is an excellent choice for classification tasks.
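These are standard scikit-learn metrics, computed over the held-out test set. The sketch below shows the computation; the label lists are tiny placeholders standing in for gold labels and model predictions.

```python
# Accuracy, macro F1, and weighted F1 as reported in the table above,
# computed with scikit-learn. The label lists are placeholders.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["billing", "refund", "billing", "shipping"]   # gold labels
y_pred = ["billing", "refund", "shipping", "shipping"]  # model predictions

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")        # unweighted mean across categories
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # weighted by class frequency
print(f"acc={accuracy:.3f} macro_f1={macro_f1:.3f} weighted_f1={weighted_f1:.3f}")
```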
Entity Extraction (Business Documents)
| Metric | Llama 3.3 8B | Qwen 2.5 7B |
|---|---|---|
| Entity-level F1 | 0.887 | 0.912 |
| Exact match | 81.3% | 85.7% |
| Partial match | 91.2% | 93.1% |
Qwen showed a meaningful advantage on entity extraction. Its tokenizer handles mixed-format text (dates in various formats, currency symbols, alphanumeric product codes) more consistently than Llama's. The 4.4 percentage point gap in exact match accuracy is significant in production, where partial extraction failures cascade into downstream errors.
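Exact and partial match are scored per extracted entity. The sketch below shows one plausible scoring scheme, an illustrative assumption rather than the benchmark's exact code: exact match requires normalised string equality, partial match accepts containment in either direction.

```python
# One plausible per-entity scoring scheme for the exact- and partial-match
# numbers above. An illustrative assumption, not the benchmark's exact code.
def normalise(s: str) -> str:
    return " ".join(s.lower().split())

def match_rates(gold: list[str], pred: list[str]) -> tuple[float, float]:
    exact = sum(normalise(g) == normalise(p) for g, p in zip(gold, pred))
    partial = sum(
        normalise(g) in normalise(p) or normalise(p) in normalise(g)
        for g, p in zip(gold, pred)
    )
    return exact / len(gold), partial / len(gold)

# "Acme Corp Ltd" contains "Acme Corp": a partial but not exact match
exact, partial = match_rates(["Acme Corp", "$1,200.00"], ["Acme Corp Ltd", "$1,200.00"])
print(f"exact={exact:.0%} partial={partial:.0%}")  # exact=50% partial=100%
```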
Text Generation (Technical Documentation)
| Metric | Llama 3.3 8B | Qwen 2.5 7B |
|---|---|---|
| ROUGE-L | 0.673 | 0.651 |
| BERTScore F1 | 0.894 | 0.882 |
| Human preference (blind) | 62% | 38% |
Llama produced noticeably better English prose. Its outputs were more fluent, better structured, and more consistent in tone. Human evaluators preferred Llama outputs 62% of the time in blind comparisons, a roughly three-to-two margin. For English-language generation tasks, Llama 3.3 is the stronger base.
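Both automatic metrics are available through the Hugging Face `evaluate` library; the prediction and reference strings below are placeholders.

```python
# ROUGE-L and BERTScore as reported above, via the `evaluate` library.
# Predictions and references are placeholder strings.
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

preds = ["Install the package, then run the setup command to configure it."]
refs = ["Install the package and run the setup command to configure it."]

rouge_l = rouge.compute(predictions=preds, references=refs)["rougeL"]
bs = bertscore.compute(predictions=preds, references=refs, lang="en")
bert_f1 = sum(bs["f1"]) / len(bs["f1"])   # mean F1 over the test set
print(f"ROUGE-L={rouge_l:.3f} BERTScore-F1={bert_f1:.3f}")
```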
VRAM Usage Comparison
Memory efficiency matters for production deployment, especially on constrained hardware.
| Phase | Llama 3.3 8B | Qwen 2.5 7B |
|---|---|---|
| Training (QLoRA) | 14.2 GB | 12.8 GB |
| Peak training | 18.1 GB | 16.3 GB |
| Inference (Q4_K_M GGUF) | 5.1 GB | 4.6 GB |
| Inference (Q8_0 GGUF) | 8.5 GB | 7.4 GB |
Qwen is consistently more memory-efficient, reflecting its smaller parameter count (7B vs 8B). The difference is modest but can matter on devices with tight memory budgets — a 16GB MacBook, for example, runs Qwen's Q8_0 quantisation comfortably while Llama's Q8_0 leaves less room for the operating system and other applications.
Both models fit comfortably within a 24GB GPU for training with QLoRA. Neither requires multi-GPU setups or offloading strategies.
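To reproduce the training-phase numbers on your own hardware, PyTorch exposes allocator statistics directly: reset the peak counter before training and read it afterwards. Note that nvidia-smi will report somewhat higher figures because of allocator caching.

```python
# Reading steady-state and peak VRAM from PyTorch's CUDA allocator.
import torch

torch.cuda.reset_peak_memory_stats()
# ... run a few training steps here ...
steady_gb = torch.cuda.memory_allocated() / 1024**3     # tensors currently allocated
peak_gb = torch.cuda.max_memory_allocated() / 1024**3   # high-water mark since reset
print(f"steady: {steady_gb:.1f} GB, peak: {peak_gb:.1f} GB")
```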
Training Speed Comparison
| Metric | Llama 3.3 8B | Qwen 2.5 7B |
|---|---|---|
| Tokens/second (training) | 1,840 | 2,120 |
| Time per epoch (5K-sample classification set) | 42 min | 36 min |
| Total training time (3 epochs) | 2h 06min | 1h 48min |
Qwen trains approximately 15% faster than Llama on identical hardware, again reflecting the parameter count difference. Over three epochs, you save roughly 18 minutes — not dramatic for a single run, but meaningful when iterating over multiple experiments.
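Training throughput is simple to measure yourself: tokens per optimiser step (sequence length × micro-batch × gradient accumulation steps) divided by wall-clock time. The sequence length below is an assumed placeholder; the batch settings match the setup above.

```python
# Tokens-per-second estimate over a window of training steps.
# seq_len is an assumed placeholder; the rest matches the setup above.
import time

seq_len, micro_batch, grad_accum = 1024, 4, 4
tokens_per_step = seq_len * micro_batch * grad_accum   # effective batch size 16

steps = 20
start = time.perf_counter()
# ... run `steps` optimiser steps here ...
elapsed = time.perf_counter() - start
print(f"{steps * tokens_per_step / max(elapsed, 1e-9):,.0f} tokens/s")
```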
GGUF Inference Speed Comparison
Production inference speed was measured using llama.cpp with the Q4_K_M quantisation on the same RTX 4090.
| Metric | Llama 3.3 8B | Qwen 2.5 7B |
|---|---|---|
| Prompt processing (tok/s) | 3,240 | 3,680 |
| Generation (tok/s) | 98.2 | 112.5 |
| Time to first token | 28ms | 24ms |
Qwen is faster at inference across the board, with a particularly notable advantage in generation speed. At 112.5 tokens per second, Qwen delivers responses perceptibly faster to end users.
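For a quick end-to-end check of generation speed on your own GGUF export, the llama-cpp-python bindings are one option (llama.cpp's bundled llama-bench tool gives more rigorous numbers). The model path below is a placeholder.

```python
# Rough generation-speed check with llama-cpp-python; the model path is
# a placeholder, and n_gpu_layers=-1 offloads all layers to the GPU.
import time
from llama_cpp import Llama

llm = Llama(model_path="finetuned.Q4_K_M.gguf", n_gpu_layers=-1)

start = time.perf_counter()
out = llm("Summarise the deployment steps:", max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated / elapsed:.1f} tokens/s generation")
```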
Key Findings
Qwen 2.5 7B is the better choice for: entity extraction, structured output tasks, multilingual applications, memory-constrained deployments, and latency-sensitive production environments. It trains faster, runs faster, and uses less memory.
Llama 3.3 8B is the better choice for: English-language text generation, creative or conversational tasks, and applications where prose quality is the primary metric. It produces more fluent, natural English output.
Both are excellent choices for: classification, sentiment analysis, and other tasks where the quality difference is within noise.
If you are starting a new fine-tuning project and are unsure which to choose, default to Qwen 2.5 7B unless your primary task is English text generation. The memory and speed advantages compound in production, and the extraction performance gap is meaningful.
How to Choose: A Decision Framework
Ask yourself three questions:
- Is your primary task text generation in English? If yes, start with Llama.
- Is your deployment memory-constrained? If yes, start with Qwen.
- Does your task involve structured extraction or multilingual data? If yes, start with Qwen.
If none of these apply strongly, run a quick experiment with both. The training cost of a single QLoRA run on either model is minimal — under two hours on consumer hardware. Let your specific data and task decide.
Fine-Tune Both with Ertas
Ertas Studio supports both Llama 3.3 and Qwen 2.5 as base models for fine-tuning. You can run parallel experiments with identical configurations and compare results directly in the evaluation dashboard — exactly the kind of controlled comparison we performed for this article, without the manual setup.
Ready to benchmark models on your data? Join the Ertas waitlist and start experimenting.