Fine-Tune Mixtral with Ertas

    Mistral AI's mixture-of-experts models route each token through 2 of 8 expert networks, delivering 70B-class performance at roughly the inference cost of a 13B dense model in the 8x7B variant.


    Overview

    Mixtral, released by Mistral AI in December 2023 (8x7B) and April 2024 (8x22B), brought mixture-of-experts (MoE) architecture to the mainstream open-weight model ecosystem. The 8x7B variant contains 46.7B total parameters but activates only 12.9B per forward pass by routing each token through 2 of 8 expert feed-forward networks. The result is a model that matches or exceeds Llama 2 70B on most benchmarks while running at roughly the speed of a 13B dense model.

    The 8x22B variant scales this approach dramatically, with 141B total parameters and approximately 39B active per token. This model competes with the best open-weight models available, delivering strong performance on reasoning, code, mathematics, and multilingual tasks. Unlike Mistral 7B, both Mixtral variants apply full attention across the entire context window rather than the sliding window attention introduced in Mistral 7B.

    The MoE architecture uses a learned router network that assigns each token to its two most relevant experts. Different experts tend to specialize in different types of content — some may focus on code, others on mathematical reasoning, and others on natural language — though this specialization emerges naturally during training rather than being explicitly programmed.
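    To make the routing concrete, here is a minimal sketch of a top-2 MoE layer in PyTorch. The layer sizes and the plain feed-forward experts are illustrative placeholders rather than Mixtral's actual implementation (Mixtral's experts are gated SwiGLU blocks), but the router-plus-top-k pattern is the same idea.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class Top2MoELayer(nn.Module):
            """Illustrative top-2 mixture-of-experts layer; dimensions are placeholders."""

            def __init__(self, hidden_size=4096, ffn_size=14336, num_experts=8, top_k=2):
                super().__init__()
                self.top_k = top_k
                # Learned router: one logit per expert for every token.
                self.router = nn.Linear(hidden_size, num_experts, bias=False)
                # Each expert is an independent feed-forward network.
                self.experts = nn.ModuleList(
                    nn.Sequential(
                        nn.Linear(hidden_size, ffn_size, bias=False),
                        nn.SiLU(),
                        nn.Linear(ffn_size, hidden_size, bias=False),
                    )
                    for _ in range(num_experts)
                )

            def forward(self, x):  # x: (num_tokens, hidden_size)
                logits = self.router(x)                           # (tokens, experts)
                weights, chosen = torch.topk(logits, self.top_k)  # top-2 experts per token
                weights = F.softmax(weights, dim=-1)              # normalize the two gate values
                out = torch.zeros_like(x)
                for slot in range(self.top_k):
                    for e, expert in enumerate(self.experts):
                        mask = chosen[:, slot] == e               # tokens routed to expert e
                        if mask.any():
                            out[mask] += weights[mask, slot, None] * expert(x[mask])
                return out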

    Both models are released under the Apache 2.0 license and have become popular choices for production deployments where quality needs to be high but computational budgets are constrained.

    Key Features

    The sparse mixture-of-experts architecture is Mixtral's core innovation for the open-weight ecosystem. The router network adds negligible overhead, while the expert selection mechanism ensures that computational cost scales with the number of active parameters rather than total parameters. This means Mixtral 8x7B processes tokens at nearly the same speed as a 13B dense model despite having the knowledge capacity of a much larger model.
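    A rough parameter count, using the publicly documented 8x7B configuration (32 layers, hidden size 4096, expert FFN size 14336, 8 experts, grouped-query attention with 1024-dimensional K/V projections, ~32K vocabulary), shows where the total and active figures come from. The arithmetic below is approximate and ignores small terms such as the router and layer norms.

        # Back-of-the-envelope parameter accounting for Mixtral 8x7B.
        layers, hidden, ffn, experts, vocab = 32, 4096, 14336, 8, 32000

        expert_params = 3 * hidden * ffn                       # w1, w2, w3 per expert
        attn_params = 2 * hidden * hidden + 2 * hidden * 1024  # q/o plus smaller k/v (GQA)
        embed_params = 2 * vocab * hidden                      # input embeddings + LM head

        total = layers * (experts * expert_params + attn_params) + embed_params
        active = layers * (2 * expert_params + attn_params) + embed_params

        print(f"total:  ~{total / 1e9:.1f}B parameters")   # ~46.7B
        print(f"active: ~{active / 1e9:.1f}B per token")   # ~12.9B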

    Mixtral 8x7B supports a 32K token context window, making it suitable for processing longer documents, extended conversations, and multi-file code analysis. The 8x22B variant extends this to a 64K token context window. Both use grouped-query attention for efficient KV-cache management during inference.
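    To see why grouped-query attention matters at these context lengths, here is a rough KV-cache estimate assuming the published 8x7B attention shape (32 layers, 8 KV heads, head dimension 128) and fp16 cache entries; the numbers are approximate.

        # Approximate KV-cache footprint for Mixtral 8x7B at the full 32K context.
        layers, kv_heads, head_dim, context, bytes_per_value = 32, 8, 128, 32768, 2

        kv_cache = 2 * layers * kv_heads * head_dim * context * bytes_per_value    # K and V
        print(f"KV cache with GQA (8 KV heads): {kv_cache / 2**30:.1f} GiB")       # ~4 GiB
        print(f"Without GQA (32 KV heads):      {kv_cache * 4 / 2**30:.1f} GiB")   # ~16 GiB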

    The instruct variants of both models demonstrate strong instruction-following capabilities, tool use, and structured output generation. Mixtral 8x7B Instruct was one of the first open-weight models to achieve GPT-3.5-level performance on the Chatbot Arena leaderboard, validating the MoE approach for practical assistant applications.

    Fine-Tuning with Ertas

    Fine-tuning Mixtral 8x7B in Ertas Studio requires careful consideration of the MoE architecture. While the model activates only 12.9B parameters per token, all 46.7B parameters must be loaded into memory. With QLoRA at 4-bit quantization, fine-tuning requires approximately 28-32GB VRAM — achievable on a single A100 40GB GPU or dual RTX 4090 GPUs. Ertas Studio handles the MoE-aware LoRA adapter placement automatically, targeting the active expert layers and shared attention components.
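    For reference, an equivalent hand-rolled QLoRA setup with the Hugging Face transformers and peft libraries looks roughly like the sketch below. The target module names follow the Hugging Face Mixtral implementation; the rank and other hyperparameters are illustrative placeholders rather than Ertas Studio's defaults.

        # Rough QLoRA setup for Mixtral 8x7B with transformers + peft + bitsandbytes.
        import torch
        from transformers import AutoModelForCausalLM, BitsAndBytesConfig
        from peft import LoraConfig, get_peft_model

        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )

        model = AutoModelForCausalLM.from_pretrained(
            "mistralai/Mixtral-8x7B-Instruct-v0.1",
            quantization_config=bnb_config,
            device_map="auto",
        )

        lora_config = LoraConfig(
            r=16,                 # illustrative rank, not an Ertas recommendation
            lora_alpha=32,
            lora_dropout=0.05,
            task_type="CAUSAL_LM",
            # Shared attention projections plus the per-expert feed-forward weights.
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "w1", "w2", "w3"],
        )

        model = get_peft_model(model, lora_config)
        model.print_trainable_parameters()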

    For Mixtral 8x22B, QLoRA fine-tuning requires approximately 80-90GB VRAM, necessitating an A100 80GB or a multi-GPU setup. Despite the higher memory requirements, training throughput remains reasonable because gradients flow only through the experts that are active for each token.

    Ertas Studio's visual interface makes configuring MoE fine-tuning straightforward. Select Mixtral as your base model, upload your dataset, and the platform recommends appropriate LoRA rank and target modules. After training, export to GGUF format and deploy through Ollama or llama.cpp, which both support MoE inference natively.
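    After export, a quick smoke test of the GGUF file with the llama-cpp-python bindings can look like this; the model path is a placeholder for whatever Ertas Studio produces.

        # Load an exported GGUF checkpoint and run one chat completion.
        from llama_cpp import Llama

        llm = Llama(
            model_path="./mixtral-8x7b-finetuned.Q4_K_M.gguf",  # hypothetical export path
            n_ctx=4096,        # context window for this test run
            n_gpu_layers=-1,   # offload all layers to GPU when one is available
        )

        result = llm.create_chat_completion(
            messages=[{"role": "user", "content": "Explain what a mixture-of-experts model is."}],
            max_tokens=128,
        )
        print(result["choices"][0]["message"]["content"])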

    Use Cases

    Mixtral 8x7B is an excellent choice for production deployments where you need significantly better quality than 7B models but cannot afford the inference cost of dense 70B models. It excels at complex instruction following, multi-step reasoning, and code generation while maintaining fast inference speed. Common deployments include API-serving scenarios, enterprise chatbots, and RAG-augmented knowledge systems.

    The 8x22B variant targets high-capability applications: advanced code generation and review, technical writing, research analysis, and complex multi-turn problem-solving. Organizations that need near-frontier-model quality while keeping data on-premise often choose Mixtral 8x22B as their primary model.

    Both variants perform well on multilingual tasks, supporting fluent generation in English, French, Italian, German, and Spanish. This makes Mixtral a strong choice for international organizations that need a single model serving multiple language markets.

    Hardware Requirements

    Mixtral 8x7B at Q4_K_M quantization requires approximately 26GB of RAM. Although only about 13B parameters are active per token, all 47B parameters must reside in memory since different tokens may route to different experts. This makes it runnable on systems with 32GB+ RAM for CPU inference, or on GPUs such as the A6000 48GB; an RTX 4090 24GB works only with partial CPU offload. At Q8_0, expect approximately 50GB.

    Mixtral 8x22B at Q4_K_M requires approximately 80GB, suitable for A100 80GB or multi-GPU setups. At Q8_0, the requirement grows to approximately 150GB, typically requiring 2-4 high-VRAM GPUs or large-memory CPU inference.
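    These figures follow from the effective bits per weight of each quantization format. A back-of-the-envelope estimator, consistent with the numbers above, treats Q4_K_M as roughly 4.5 effective bits per weight for these MoE models and Q8_0 as roughly 8.5; actual file sizes vary slightly.

        # Rough model-size estimates from parameter count and effective bits per weight.
        def quantized_size_gb(total_params_billions, bits_per_weight):
            return total_params_billions * 1e9 * bits_per_weight / 8 / 1e9

        for name, params in [("Mixtral 8x7B", 46.7), ("Mixtral 8x22B", 141.0)]:
            for quant, bpw in [("Q4_K_M", 4.5), ("Q8_0", 8.5)]:
                print(f"{name} {quant}: ~{quantized_size_gb(params, bpw):.0f} GB")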

    Inference speed for Mixtral is excellent relative to model quality because only active expert weights are computed per token. On an A100 80GB, Mixtral 8x7B typically achieves 40-60 tokens per second for generation, comparable to running a 13B dense model. CPU inference on modern hardware (e.g., M2 Ultra or Threadripper) with Q4_K_M typically yields 15-25 tokens per second.

    Supported Quantizations

    Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16
