
Fine-Tuning Llama 3.3 and Qwen 2.5 with QLoRA: Benchmark Comparison
Head-to-head comparison of fine-tuning Llama 3.3 8B and Qwen 2.5 7B with QLoRA across common tasks — classification, extraction, generation — with benchmarks, VRAM usage, and practical recommendations.
Llama 3.3 8B and Qwen 2.5 7B have emerged as the two dominant base models for production fine-tuning in early 2026. Both are permissively licensed, well-supported by the fine-tuning ecosystem, and small enough to train on a single consumer GPU. But which one should you actually use?
The answer depends on your task, your data, and your deployment constraints. This article provides a controlled benchmark comparison across three common fine-tuning tasks, with identical training configurations, to give you data rather than opinions.
Why These Two Models
The sub-10B parameter class is the sweet spot for production fine-tuning. These models are large enough to capture complex task-specific patterns, small enough to fine-tune on a single 24GB GPU, and fast enough to serve at low latency in production.
Llama 3.3 8B is Meta's latest iteration in the Llama family. It benefits from a massive pretraining corpus, a robust tokenizer with a 128K vocabulary, and strong English-language performance. The Llama ecosystem is the most mature in open-source AI, with extensive tooling support.
Qwen 2.5 7B is Alibaba's flagship small model. It was pretrained on a highly multilingual corpus with strong representation of CJK languages and code. It uses a 152K vocabulary tokenizer and has shown particularly strong performance on structured tasks in community benchmarks.
Both models support the same fine-tuning techniques and can be exported to the same inference formats. The choice between them comes down to task-level performance and runtime efficiency, not ecosystem support.
Test Setup
To ensure a fair comparison, we controlled every variable except the base model.
Training configuration:
- Method: QLoRA (4-bit quantisation, LoRA rank 16, alpha 32)
- Learning rate: 2e-4 with cosine schedule
- Batch size: 4 (with gradient accumulation to effective batch size 16)
- Epochs: 3
- Hardware: Single NVIDIA RTX 4090 (24GB VRAM)
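For reference, the sketch below shows how this configuration maps onto the Hugging Face stack (transformers, peft, bitsandbytes, trl). It is a minimal sketch, not the exact benchmark code: the model ID, LoRA dropout, and target modules are illustrative assumptions that the configuration above does not specify.

```python
# Minimal QLoRA setup matching the configuration above. Assumes the
# Hugging Face stack; model ID, dropout, and target_modules are
# illustrative choices, not benchmark-specified values.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig

# 4-bit NF4 quantisation (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA rank 16, alpha 32, as in the benchmark
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,                                        # assumption: not stated above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: attention projections
    task_type="CAUSAL_LM",
)

# Learning rate, schedule, batch size, and epochs from the list above
training_args = SFTConfig(
    output_dir="qlora-run",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size 16
    num_train_epochs=3,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",               # swap in the Llama checkpoint to compare
    quantization_config=bnb_config,
    device_map="auto",
)
```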
Datasets:
- Classification: 5,000 labelled customer support tickets (12 categories)
- Entity extraction: 3,000 annotated business documents (company names, dates, monetary values, product references)
- Text generation: 2,000 instruction-response pairs for technical documentation
Each dataset was split 80/10/10 into train/validation/test sets. Evaluation was performed on the held-out test set after training completed.
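If you want to reproduce the split, here is one way to do it with the `datasets` library; the data file and seed are placeholders.

```python
# 80/10/10 train/validation/test split with the `datasets` library.
# The data file and seed are placeholders.
from datasets import load_dataset

ds = load_dataset("json", data_files="tickets.jsonl", split="train")
first = ds.train_test_split(test_size=0.2, seed=42)               # 80% train, 20% held out
second = first["test"].train_test_split(test_size=0.5, seed=42)   # split held-out into 10/10
train_ds, val_ds, test_ds = first["train"], second["train"], second["test"]
```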
Results
Classification (Customer Support Tickets)
| Metric | Llama 3.3 8B | Qwen 2.5 7B |
|---|---|---|
| Accuracy | 94.2% | 93.8% |
| Macro F1 | 0.921 | 0.917 |
| Weighted F1 | 0.941 | 0.937 |
Both models performed comparably on classification. Llama had a marginal edge, likely due to its stronger English-language pretraining. The difference is not statistically significant; either model is an excellent choice for classification tasks.
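These are standard scikit-learn metrics, computed over the held-out test set. The sketch below shows the computation; the label lists are tiny placeholders standing in for gold labels and model predictions.

```python
# Accuracy, macro F1, and weighted F1 as reported in the table above,
# computed with scikit-learn. The label lists are placeholders.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["billing", "refund", "billing", "shipping"]   # gold labels
y_pred = ["billing", "refund", "shipping", "shipping"]  # model predictions

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")        # unweighted mean across categories
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # weighted by class frequency
print(f"acc={accuracy:.3f} macro_f1={macro_f1:.3f} weighted_f1={weighted_f1:.3f}")
```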
Entity Extraction (Business Documents)
| Metric | Llama 3.3 8B | Qwen 2.5 7B |
|---|---|---|
| Entity-level F1 | 0.887 | 0.912 |
| Exact match | 81.3% | 85.7% |
| Partial match | 91.2% | 93.1% |
Qwen showed a meaningful advantage on entity extraction. Its tokenizer handles mixed-format text (dates in various formats, currency symbols, alphanumeric product codes) more consistently than Llama's. The 4.4 percentage point gap in exact match accuracy is significant in production, where partial extraction failures cascade into downstream errors.
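Exact and partial match are scored per extracted entity. The sketch below shows one plausible scoring scheme, an illustrative assumption rather than the benchmark's exact code: exact match requires normalised string equality, partial match accepts containment in either direction.

```python
# One plausible per-entity scoring scheme for the exact- and partial-match
# numbers above. An illustrative assumption, not the benchmark's exact code.
def normalise(s: str) -> str:
    return " ".join(s.lower().split())

def match_rates(gold: list[str], pred: list[str]) -> tuple[float, float]:
    exact = sum(normalise(g) == normalise(p) for g, p in zip(gold, pred))
    partial = sum(
        normalise(g) in normalise(p) or normalise(p) in normalise(g)
        for g, p in zip(gold, pred)
    )
    return exact / len(gold), partial / len(gold)

# "Acme Corp Ltd" contains "Acme Corp": a partial but not exact match
exact, partial = match_rates(["Acme Corp", "$1,200.00"], ["Acme Corp Ltd", "$1,200.00"])
print(f"exact={exact:.0%} partial={partial:.0%}")  # exact=50% partial=100%
```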
Text Generation (Technical Documentation)
| Metric | Llama 3.3 8B | Qwen 2.5 7B |
|---|---|---|
| ROUGE-L | 0.673 | 0.651 |
| BERTScore F1 | 0.894 | 0.882 |
| Human preference (blind) | 62% | 38% |
Llama produced noticeably better English prose. Its outputs were more fluent, better structured, and more consistent in tone. Human evaluators preferred Llama outputs 62% of the time in blind comparisons, a roughly three-to-two margin. For English-language generation tasks, Llama 3.3 is the stronger base.
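Both automatic metrics are available through the Hugging Face `evaluate` library; the prediction and reference strings below are placeholders.

```python
# ROUGE-L and BERTScore as reported above, via the `evaluate` library.
# Predictions and references are placeholder strings.
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

preds = ["Install the package, then run the setup command to configure it."]
refs = ["Install the package and run the setup command to configure it."]

rouge_l = rouge.compute(predictions=preds, references=refs)["rougeL"]
bs = bertscore.compute(predictions=preds, references=refs, lang="en")
bert_f1 = sum(bs["f1"]) / len(bs["f1"])   # mean F1 over the test set
print(f"ROUGE-L={rouge_l:.3f} BERTScore-F1={bert_f1:.3f}")
```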
VRAM Usage Comparison
Memory efficiency matters for production deployment, especially on constrained hardware.
| Phase | Llama 3.3 8B | Qwen 2.5 7B |
|---|---|---|
| Training (QLoRA) | 14.2 GB | 12.8 GB |
| Peak training | 18.1 GB | 16.3 GB |
| Inference (Q4_K_M GGUF) | 5.1 GB | 4.6 GB |
| Inference (Q8_0 GGUF) | 8.5 GB | 7.4 GB |
Qwen is consistently more memory-efficient, reflecting its smaller parameter count (7B vs 8B). The difference is modest but can matter on devices with tight memory budgets — a 16GB MacBook, for example, runs Qwen's Q8_0 quantisation comfortably while Llama's Q8_0 leaves less room for the operating system and other applications.
Both models fit comfortably within a 24GB GPU for training with QLoRA. Neither requires multi-GPU setups or offloading strategies.
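To reproduce the training-phase numbers on your own hardware, PyTorch exposes allocator statistics directly: reset the peak counter before training and read it afterwards. Note that nvidia-smi will report somewhat higher figures because of allocator caching.

```python
# Reading steady-state and peak VRAM from PyTorch's CUDA allocator.
import torch

torch.cuda.reset_peak_memory_stats()
# ... run a few training steps here ...
steady_gb = torch.cuda.memory_allocated() / 1024**3     # tensors currently allocated
peak_gb = torch.cuda.max_memory_allocated() / 1024**3   # high-water mark since reset
print(f"steady: {steady_gb:.1f} GB, peak: {peak_gb:.1f} GB")
```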
Training Speed Comparison
| Metric | Llama 3.3 8B | Qwen 2.5 7B |
|---|---|---|
| Tokens/second (training) | 1,840 | 2,120 |
| Time per epoch (5K-sample classification set) | 42 min | 36 min |
| Total training time (3 epochs) | 2h 06min | 1h 48min |
Qwen trains approximately 15% faster than Llama on identical hardware, again reflecting the parameter count difference. Over three epochs, you save roughly 18 minutes — not dramatic for a single run, but meaningful when iterating over multiple experiments.
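Training throughput is simple to measure yourself: tokens per optimiser step (sequence length × micro-batch × gradient accumulation steps) divided by wall-clock time. The sequence length below is an assumed placeholder; the batch settings match the setup above.

```python
# Tokens-per-second estimate over a window of training steps.
# seq_len is an assumed placeholder; the rest matches the setup above.
import time

seq_len, micro_batch, grad_accum = 1024, 4, 4
tokens_per_step = seq_len * micro_batch * grad_accum   # effective batch size 16

steps = 20
start = time.perf_counter()
# ... run `steps` optimiser steps here ...
elapsed = time.perf_counter() - start
print(f"{steps * tokens_per_step / max(elapsed, 1e-9):,.0f} tokens/s")
```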
GGUF Inference Speed Comparison
Production inference speed was measured using llama.cpp with the Q4_K_M quantisation on the same RTX 4090.
| Metric | Llama 3.3 8B | Qwen 2.5 7B |
|---|---|---|
| Prompt processing (tok/s) | 3,240 | 3,680 |
| Generation (tok/s) | 98.2 | 112.5 |
| Time to first token | 28ms | 24ms |
Qwen is faster at inference across the board, with a particularly notable advantage in generation speed. At 112.5 tokens per second, Qwen delivers responses perceptibly faster to end users.
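For a quick end-to-end check of generation speed on your own GGUF export, the llama-cpp-python bindings are one option (llama.cpp's bundled llama-bench tool gives more rigorous numbers). The model path below is a placeholder.

```python
# Rough generation-speed check with llama-cpp-python; the model path is
# a placeholder, and n_gpu_layers=-1 offloads all layers to the GPU.
import time
from llama_cpp import Llama

llm = Llama(model_path="finetuned.Q4_K_M.gguf", n_gpu_layers=-1)

start = time.perf_counter()
out = llm("Summarise the deployment steps:", max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated / elapsed:.1f} tokens/s generation")
```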
Key Findings
Qwen 2.5 7B is the better choice for: entity extraction, structured output tasks, multilingual applications, memory-constrained deployments, and latency-sensitive production environments. It trains faster, runs faster, and uses less memory.
Llama 3.3 8B is the better choice for: English-language text generation, creative or conversational tasks, and applications where prose quality is the primary metric. It produces more fluent, natural English output.
Both are excellent choices for: classification, sentiment analysis, and other tasks where the quality difference is within noise.
If you are starting a new fine-tuning project and are unsure which to choose, default to Qwen 2.5 7B unless your primary task is English text generation. The memory and speed advantages compound in production, and the extraction performance gap is meaningful.
How to Choose: A Decision Framework
Ask yourself three questions:
- Is your primary task text generation in English? If yes, start with Llama.
- Is your deployment memory-constrained? If yes, start with Qwen.
- Does your task involve structured extraction or multilingual data? If yes, start with Qwen.
If none of these apply strongly, run a quick experiment with both. The training cost of a single QLoRA run on either model is minimal — under two hours on consumer hardware. Let your specific data and task decide.
Fine-Tune Both with Ertas
Ertas Studio supports both Llama 3.3 and Qwen 2.5 as base models for fine-tuning. You can run parallel experiments with identical configurations and compare results directly in the evaluation dashboard — exactly the kind of controlled comparison we performed for this article, without the manual setup.
Ready to benchmark models on your data? Join the Ertas waitlist and start experimenting.