Fine-Tune Llama 4 with Ertas
Meta's fourth-generation open-weight model family featuring a mixture-of-experts architecture, with Scout (109B total, 17B active) for efficient deployment and Maverick (400B total, 17B active) for high-capability tasks.
Overview
Llama 4 marks Meta's transition to a mixture-of-experts (MoE) architecture for its flagship open-weight model family. Released in early 2025, the family includes two models: Llama 4 Scout, with 109B total parameters and 17B active per forward pass across 16 experts, and Llama 4 Maverick, with 400B total parameters and 17B active per forward pass across 128 experts. Both models use a routing mechanism that activates only a small subset of experts for each token, alongside a shared expert that every token passes through, dramatically improving inference efficiency.
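To make the routing concrete, here is a minimal PyTorch sketch of top-k expert routing in the spirit described above. It is illustrative only, not Meta's implementation: the class name and dimensions are invented, and it omits details such as the shared expert and optimized batched kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy MoE layer: each token runs through only its top-k experts."""

    def __init__(self, d_model: int, n_experts: int, k: int = 1):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.SiLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). The router scores every expert, but only
        # the top-k experts actually run for each token.
        scores = self.router(x)                     # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # choose k experts/token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e            # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

With k=1, each token's MLP compute touches a single expert, which is how a model with a large total parameter count keeps its active parameter count (and per-token compute) small.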
The MoE architecture means Llama 4 delivers performance far beyond what its active parameter count would suggest. Scout, with only 17B active parameters, competes with dense models in the 70B+ range on many benchmarks, while Maverick approaches frontier-model performance at a fraction of the computational cost of dense 400B+ models.
Llama 4 Scout was trained with a 256K-token context length and supports an extended context window of up to 10 million tokens through interleaved attention layers without positional embeddings (the iRoPE architecture), while Maverick supports a context window of up to 1 million tokens. The models are natively multimodal, supporting text and image inputs and enabling vision-language tasks out of the box.
Both models were trained on a significantly larger and more diverse dataset than Llama 3, incorporating multilingual data spanning roughly 200 languages. The instruction-tuned variants demonstrate strong performance on agentic workflows, tool use, structured output generation, and complex multi-turn reasoning.
Key Features
The mixture-of-experts architecture is the defining innovation of Llama 4. By routing each token through only a small number of experts rather than the full expert pool, the models achieve high quality while keeping inference compute comparable to much smaller dense models. This makes Llama 4 Scout particularly attractive for production deployments: you get 70B-class quality at roughly 17B-class inference speed, although all expert weights must still be resident in memory (see Hardware Requirements below).
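A back-of-the-envelope comparison makes the compute savings concrete. The two-FLOPs-per-parameter-per-token rule used below is a standard approximation, and the dense 70B baseline is included only for scale:

```python
# Rough per-token compute, using ~2 FLOPs per active parameter per token.
ACTIVE = 17e9        # parameters touched per token (Scout and Maverick)
SCOUT_TOTAL = 109e9  # parameters that must stay resident in memory
DENSE_70B = 70e9     # dense baseline for comparison

flops_scout = 2 * ACTIVE     # ~3.4e10 FLOPs per token
flops_dense = 2 * DENSE_70B  # ~1.4e11 FLOPs per token

print(f"Scout per-token compute:  {flops_scout:.1e} FLOPs")
print(f"Dense 70B per-token:      {flops_dense:.1e} FLOPs")
print(f"Compute ratio:            {flops_dense / flops_scout:.1f}x")
print(f"Active fraction of Scout: {ACTIVE / SCOUT_TOTAL:.1%}")
```

Scout performs roughly a quarter of the per-token work of a dense 70B model while keeping only about 16% of its own parameters active on any given token.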
Native multimodality is another significant advancement. Llama 4 can process interleaved text and image inputs without requiring a separate vision encoder pipeline. This enables use cases like visual question answering, chart and diagram understanding, document OCR with reasoning, and image-guided code generation.
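As a sketch of what this looks like in practice, the snippet below sends an image and a question to a locally served model through Ollama's REST API. The model tag llama4-scout is a placeholder for whatever tag your deployment uses, and this assumes your Ollama build supports Llama 4's vision inputs:

```python
import base64
import requests

# Encode the image; Ollama's chat API accepts base64-encoded images.
with open("quarterly_chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama4-scout",  # placeholder tag
        "messages": [{
            "role": "user",
            "content": "What trend does this chart show? Answer in one sentence.",
            "images": [image_b64],
        }],
        "stream": False,
    },
)
print(resp.json()["message"]["content"])
```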
The extended context capabilities of Scout (up to 10M tokens) open entirely new application categories, including full-codebase analysis, book-length document processing, and extremely long conversation histories. Maverick's 1M-token context is sufficient for most production applications while delivering higher quality on complex reasoning tasks.
Fine-Tuning with Ertas
Fine-tuning Llama 4 Scout in Ertas Studio is remarkably efficient thanks to the MoE architecture. Since only 17B parameters are active per forward pass, QLoRA fine-tuning targets the active expert pathways and shared layers, requiring approximately 24-32GB VRAM — achievable on a single A100 40GB or dual RTX 4090 setup. Upload your dataset, select Llama 4 Scout as the base model, and Ertas Studio handles the MoE-aware LoRA configuration automatically.
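For readers curious what such a configuration looks like, here is an illustrative peft-style LoRA config. The hyperparameters and target module names follow common Llama conventions and are assumptions for illustration, not Ertas Studio's exact internals:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,              # adapter rank
    lora_alpha=32,     # adapter scaling factor
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Attention projections are shared by every token; the MLP
    # projections cover the expert pathways discussed above.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```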
For Maverick, fine-tuning requires more resources due to the larger total parameter count (400B), but QLoRA with 4-bit quantization brings requirements down to approximately 80-96GB VRAM, achievable on dual A100 80GB GPUs. Ertas Studio manages the expert routing and ensures LoRA adapters are applied correctly across the MoE layers.
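The 4-bit quantization behind QLoRA looks roughly like the following transformers/bitsandbytes sketch; the exact settings Ertas Studio applies may differ:

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, standard for QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quant constants
)
```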
After training, Ertas Studio exports your fine-tuned model to GGUF format. The MoE architecture also pays off after quantization: expert weights that are not selected for a given token consume no compute during inference, so quantized Llama 4 Scout models run surprisingly fast on consumer hardware. Deploy through Ollama or llama.cpp for immediate local inference.
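A minimal smoke test of the exported file with llama-cpp-python might look like the following; the file name is a placeholder for your actual export:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama4-scout-finetuned.Q4_K_M.gguf",  # placeholder path
    n_ctx=8192,       # context window for this session
    n_gpu_layers=-1,  # offload all layers to GPU if one is available
)

out = llm("Summarize the key risks in the following contract clause:",
          max_tokens=128)
print(out["choices"][0]["text"])
```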
Use Cases
Llama 4 Scout is ideal for production deployments where you need high-quality responses with efficient resource usage. Its 17B active parameter footprint makes it suitable for API serving at scale, customer-facing chatbots, RAG pipelines, and real-time applications. The extended context window makes it particularly strong for document processing, legal analysis, and codebase understanding tasks.
Llama 4 Maverick targets high-capability applications: complex multi-step reasoning, advanced code generation and debugging, research synthesis, and agentic workflows that require planning and tool orchestration. Its quality approaches frontier models while remaining deployable on-premise.
The native multimodal capabilities make both models excellent for vision-language applications: analyzing charts and graphs in business reports, extracting structured data from images of documents, visual QA for accessibility applications, and multimodal content generation pipelines.
Hardware Requirements
Llama 4 Scout at Q4_K_M quantization requires approximately 60-65GB of RAM for the full model weights (all experts must be loaded even though only a subset is active per token). This is runnable on systems with 64-128GB RAM for CPU inference, or on GPUs like the A100 80GB. At Q8_0, expect approximately 115GB. Despite the larger memory footprint compared to a 17B dense model, inference speed is comparable to dense 17B models since only active experts are computed.
Llama 4 Maverick at Q4_K_M requires approximately 220-240GB of RAM, necessitating multi-GPU configurations (e.g., 4x A100 80GB) or high-memory CPU inference nodes. The model's quality-to-compute ratio makes this investment worthwhile for organizations needing frontier-class performance without relying on cloud APIs.
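These figures follow from simple bits-per-weight arithmetic. The script below reproduces them using commonly cited approximate bits-per-weight values for each quantization; real GGUF files add metadata and keep some tensors at higher precision, so treat the results as estimates:

```python
# Approximate bits per weight for common GGUF quantizations.
BPW = {"Q4_K_M": 4.8, "Q8_0": 8.5}

def gguf_gb(params: float, quant: str) -> float:
    """Estimated file size in GB for a model with `params` weights."""
    return params * BPW[quant] / 8 / 1e9

for name, params in [("Scout", 109e9), ("Maverick", 400e9)]:
    for quant in BPW:
        print(f"{name:8s} {quant}: ~{gguf_gb(params, quant):.0f} GB")
```

For Scout this yields roughly 65GB at Q4_K_M and 116GB at Q8_0, and for Maverick roughly 240GB at Q4_K_M, in line with the ranges above.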
For fine-tuning with Ertas Studio, Scout requires 24-32GB VRAM with QLoRA (single A100 40GB), while Maverick requires 80-96GB VRAM (dual A100 80GB). These are significantly lower than what dense models of equivalent quality would demand.
Supported Quantizations
Ertas Studio exports Llama 4 fine-tunes in the standard GGUF quantization formats. The figures in this guide reference Q4_K_M, a balanced quality-to-size option, and Q8_0, a near-lossless option; see Hardware Requirements above for the corresponding memory footprints.