
    Mixture of Experts in 2026: From Mixtral to DeepSeek V4

MoE has become the default architecture for flagship open-weight models in 2026 — DeepSeek V4, Kimi K2.6, MiMo V2.5 Pro, GPT-OSS, and Mistral Small 4 all use it. Here's why, how the design choices have evolved, and what it means for production deployments.

Ertas Team

    Two years ago, mixture-of-experts (MoE) was an experimental architectural choice that a handful of frontier labs were exploring tentatively. Mixtral 8x7B was newsworthy precisely because it was unusual. By April 2026, MoE has become the default architecture for flagship open-weight models. Every model in the current open-weight top tier — DeepSeek V4, Kimi K2.6, MiMo V2.5 Pro, GPT-OSS-120B, Mistral Small 4, Qwen 3.5-397B-A17B — uses an MoE architecture. Pure dense models above 70B are increasingly the exception rather than the norm.

    This article covers what changed, how the architectural choices have evolved, and what the shift means for teams making production deployment decisions in 2026.

    The Basic Idea (For Readers New to MoE)

    A standard transformer layer applies the same feedforward computation to every token. A 70B-parameter dense model uses all 70B parameters for every token it processes — most of them are irrelevant for any given token, but the architecture activates them all anyway.

    A mixture-of-experts layer replaces the single feedforward block with multiple parallel "experts" plus a small routing network. For each token, the router decides which experts (typically 1-8 out of dozens or hundreds) should process it, and only those experts are activated. The total parameter count of the layer is the sum of all experts, but the active parameter count for any single token is much smaller.
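To make the mechanism concrete, here is a minimal top-K MoE layer in PyTorch. This is an illustrative sketch, not any model's actual implementation: the class name, dimensions, and the per-expert Python loop are simplifications (production stacks use fused, batched expert kernels).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Minimal top-K mixture-of-experts feedforward layer (illustrative)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        # Small routing network: one score per expert, per token.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        logits = self.router(x)                          # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # keep only the top-K experts
        weights = F.softmax(weights, dim=-1)             # renormalize over the chosen K
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                 # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out
```

A layer like `TopKMoELayer(4096, 14336, 8, 2)` mirrors the Mixtral-style shape discussed below: 8 experts, 2 active per token.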

The practical effect: a 1T-parameter MoE model with 32B active parameters has the inference compute cost of a 32B dense model — token generation throughput, GPU utilization, and latency are all approximately what you'd expect from a 32B dense model. But the model has 1T parameters of capacity available, and the router learns to route different types of tokens to different specialized experts. The result, when training works well, is a model that delivers quality comparable to a much larger dense model at substantially better inference economics.

    The trade-off: total memory footprint scales with total parameter count, not active parameter count. You still need to load all expert weights into memory even though only a subset is active per token. This typically means MoE models require more VRAM than dense models of equivalent inference cost.

    The Mixtral Era (Late 2023 – Early 2025)

    Mixtral 8x7B (December 2023) and Mixtral 8x22B (April 2024) from Mistral established the MoE pattern in the open-weight ecosystem. Both used a top-2 expert routing strategy across 8 experts, with active parameter counts of approximately 12.9B and 39B respectively, against total counts of 46.7B and 141B.

    The Mixtral models established several important conventions:

    Top-K routing. Each token is routed to a fixed K experts (top-2 in Mixtral's case). This balances parallelism (you can compute multiple experts in parallel) against efficiency (more experts means more compute per token).

Load balancing. The router learns to distribute tokens roughly evenly across experts. Without explicit load-balancing pressure, MoE training tends to collapse onto a few "popular" experts that handle most tokens — defeating the purpose of having many experts. Mixtral, following earlier MoE work, used an auxiliary load-balancing loss during training to prevent collapse.
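One common formulation is the Switch-Transformer-style auxiliary loss, sketched below; the exact form varies between models, and this function is illustrative rather than any specific model's training code:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor,
                        expert_idx: torch.Tensor,
                        n_experts: int) -> torch.Tensor:
    """Auxiliary loss ~ n_experts * sum_e(f_e * P_e), where f_e is the fraction
    of tokens dispatched to expert e and P_e is the mean router probability
    assigned to expert e. It is minimized when both are uniform (1/n_experts),
    so it pushes the router away from collapsing onto a few experts."""
    probs = F.softmax(router_logits, dim=-1)              # (n_tokens, n_experts)
    # f_e: observed dispatch fraction per expert (top-1 assignments here)
    counts = torch.bincount(expert_idx.flatten(), minlength=n_experts).float()
    f = counts / expert_idx.numel()
    # P_e: average router probability mass per expert
    p = probs.mean(dim=0)
    return n_experts * torch.sum(f * p)
```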

    Expert dimensionality matches dense layers. Mixtral's experts had the same hidden dimensions as the equivalent dense feedforward block. This made the architecture conceptually simple: an MoE layer is "just a dense layer with multiple parallel copies and a router."

    The Mixtral models showed that MoE could deliver competitive quality at favorable inference economics, but the design space they explored was relatively narrow. Subsequent work substantially expanded that space.

    The Fine-Grained MoE Era (Mid 2025 – 2026)

    DeepSeek V3 (December 2024) and the Qwen 3 family (early 2025) ushered in a meaningfully different MoE design pattern: fine-grained MoE. The key shift was using many more, much smaller experts and routing to more of them per token.

The DeepSeek V3 architecture uses 256 routed experts per layer plus 1 shared expert, with top-8 routing. Compared to Mixtral's 8 experts with top-2 routing, this is a fundamentally different design space (a back-of-envelope comparison follows the list below):

    • More experts means each expert can specialize more narrowly
    • Smaller experts means each one is computationally cheaper
    • Higher top-K means each token sees more diverse expert contributions
    • Shared experts capture common patterns that don't need to be replicated across all routed experts
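The following sketch uses hypothetical dimensions to show the trade these designs are making; real configurations differ in attention, embeddings, and exact expert widths:

```python
# Hypothetical dimensions, for intuition only -- not any model's real config.
d_model = 4096

# Mixtral-style: 8 large experts, top-2 routing, expert FFN width 14336.
# Each SwiGLU-style expert has ~3 * d_model * d_ff weights (gate/up/down).
coarse_total  = 8 * 3 * d_model * 14336
coarse_active = 2 * 3 * d_model * 14336

# Fine-grained: 256 small routed experts + 1 shared, top-8, FFN width 3584.
fine_total  = 257 * 3 * d_model * 3584
fine_active = (8 + 1) * 3 * d_model * 3584   # 8 routed + 1 shared per token

print(f"coarse: {coarse_total/1e9:.2f}B total, {coarse_active/1e9:.2f}B active per layer")
print(f"fine:   {fine_total/1e9:.2f}B total, {fine_active/1e9:.2f}B active per layer")
# Active compute per layer is nearly the same (~0.35B vs ~0.40B parameters),
# but the fine-grained layout carries ~8x more total capacity, split into
# narrower experts that can specialize.
```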

    The result is a model that delivers better quality per active parameter than Mixtral-era designs. DeepSeek V3 with 671B total / 37B active substantially outperforms Mixtral 8x22B (141B total / 39B active) on benchmarks at similar inference cost — the architectural improvements yielded measurable quality gains independent of the parameter count differences.

Qwen 3 introduced its own variant with the 30B-A3B and 235B-A22B configurations. The 30B-A3B uses 128 experts with top-8 routing — the same fine-grained pattern with different specific design choices. The 3B active parameter count made this variant exceptionally efficient for production serving while delivering quality that matched or exceeded much larger dense models.

By 2026, fine-grained MoE has become the de facto standard. New flagship releases use total / active ratios in roughly the 20:1 to 35:1 range — DeepSeek V4 Pro at 1.6T / 49B (33:1), Kimi K2.6 at 1T / 32B (31:1), Mistral Small 4 at 119B / 6B (20:1), GPT-OSS-120B at 117B / 5.1B (23:1).

    DeepSeek Sparse Attention: MoE Beyond Feedforward

    The most significant 2026-era architectural innovation isn't strictly an MoE advancement — it's the application of expert-style sparse routing to attention layers. DeepSeek Sparse Attention (DSA), introduced in DeepSeek V3.2 and continued in V4, applies a learned sparse routing pattern to attention: each query token learns to attend to a subset of key tokens rather than the full sequence.

    Conceptually, DSA extends the MoE philosophy from feedforward layers to attention. Standard transformer attention computes pairwise interactions across all token pairs — quadratic compute and memory cost. DSA computes only the interactions deemed relevant by a learned routing mechanism, which substantially reduces both compute and memory cost for long-context inference while maintaining usable retrieval quality.
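The following is a conceptual sketch of the top-k key selection idea, not DeepSeek's actual DSA implementation; the function name and the precomputed `index_scores` input are our simplifying assumptions (in practice the scorer is a small learned indexer run alongside attention):

```python
import math
import torch
import torch.nn.functional as F

def topk_sparse_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                          index_scores: torch.Tensor, k_keep: int) -> torch.Tensor:
    """Conceptual top-k sparse attention. A cheap learned scorer produces
    index_scores (n_q, n_kv); each query attends only to its k_keep
    highest-scoring keys, so cost scales with n_q * k_keep, not n_q * n_kv.
    q: (n_q, d); k, v: (n_kv, d)."""
    d = q.shape[-1]
    idx = index_scores.topk(k_keep, dim=-1).indices   # (n_q, k_keep) selected keys
    k_sel, v_sel = k[idx], v[idx]                     # (n_q, k_keep, d)
    attn = torch.einsum("qd,qkd->qk", q, k_sel) / math.sqrt(d)
    return torch.einsum("qk,qkd->qd", F.softmax(attn, dim=-1), v_sel)
```

At a 1M-token context with `k_keep` in the low thousands, this turns a million-way attention per query into a few-thousand-way one, which is where the compute and KV-access savings come from.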

    The practical implication: DSA is a key reason DeepSeek V4 can support a 1M-token context window in production. Naive dense attention at 1M tokens would be prohibitively expensive in both compute and KV-cache memory. DSA makes long-context inference economically tractable, and the architectural pattern is likely to spread to other model families as 1M+ context becomes a baseline expectation.

    What Drove the Shift

    Several factors drove MoE from experimental to default in this two-year window:

    Better inference economics at frontier scale. As frontier-quality models grew past 70B dense parameters, the inference costs of pure dense architectures became prohibitive. A 405B dense model needs to activate 405B parameters per token, requiring server-class infrastructure and producing high inference cost per request. A 1T MoE model with 32B active offers similar quality at the inference economics of a 32B dense model. For production deployments where token cost matters, this is a fundamental advantage.

    Improved load balancing techniques. Early MoE training was notoriously unstable — the router would collapse into a few popular experts, training would diverge, and the resulting model would be worse than a dense model of equivalent compute. Improvements in auxiliary load-balancing losses, expert capacity factors, and router temperature schedules have made MoE training substantially more reliable. Modern MoE training is now closer to "set sensible defaults and let it run" rather than requiring constant intervention.
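As one example of these techniques, here is a sketch of an expert capacity check; the function name and the top-1 simplification are ours, and real implementations batch this per device rather than looping in Python:

```python
import torch

def capacity_mask(expert_idx: torch.Tensor, n_experts: int,
                  capacity_factor: float = 1.25) -> torch.Tensor:
    """Illustrative expert-capacity enforcement (top-1 routing for simplicity).
    Each expert accepts at most capacity_factor * n_tokens / n_experts tokens;
    assignments beyond that overflow and are typically dropped or re-routed,
    which bounds the worst-case load on any single expert during training."""
    n_tokens = expert_idx.numel()
    capacity = int(capacity_factor * n_tokens / n_experts)
    keep = torch.zeros(n_tokens, dtype=torch.bool)
    for e in range(n_experts):
        slots = (expert_idx == e).nonzero(as_tuple=True)[0]
        keep[slots[:capacity]] = True   # first `capacity` arrivals fit
    return keep
```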

    Hardware improvements. Frontier hardware (H100, H200, MI300X, Ascend variants) has substantially better support for the kind of sparse compute patterns that MoE produces. Earlier hardware generations made MoE less efficient than the theoretical analysis suggested; current hardware closes much of that gap.

    Quantization compatibility. MoE models quantize reasonably well — Q4_K_M quantization preserves usable quality on MoE flagships, similar to dense models. Earlier concerns that MoE expert specialization would interact badly with aggressive quantization haven't panned out in practice.

    Practical Implications for Deployments

    For teams making production deployment decisions, the MoE shift has several implications:

    Memory and inference cost decouple. With dense models, a 70B model is "70B-class" both in memory cost and inference cost. With MoE, a 1T-A32B model is 1T-class in memory cost but 32B-class in inference throughput. Capacity planning needs to track both axes — memory determines how many GPUs you need to host the model, while active parameter count determines how fast it serves requests.
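A rough sketch of that capacity-planning arithmetic (the helper and its overhead factor are illustrative assumptions, not a sizing tool):

```python
import math

def deployment_estimate(total_params_b: float, bytes_per_param: float,
                        gpu_vram_gb: float = 80, overhead: float = 1.1):
    """Back-of-envelope: GPUs needed just to hold the weights, with a rough
    allowance for KV cache and activations. Numbers are illustrative."""
    weights_gb = total_params_b * bytes_per_param
    return weights_gb, math.ceil(weights_gb * overhead / gpu_vram_gb)

# A 1T-A32B model: per-token compute is 32B-class, but memory is 1T-class.
for label, bpp in [("FP16", 2.0), ("FP8", 1.0), ("~4.5-bit", 0.5625)]:
    gb, gpus = deployment_estimate(1000, bpp)
    print(f"{label}: ~{gb:,.0f} GB of weights, ~{gpus} x 80GB GPUs")
```

Running this prints roughly 2,000 GB / 28 GPUs at FP16, 1,000 GB / 14 GPUs at FP8, and about 560 GB / 8 GPUs at ~4.5 bits — which is why the trillion-parameter tier lands on 8-GPU nodes, as discussed next.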

Multi-GPU server deployment is the norm at the frontier. The trillion-parameter MoE tier (DeepSeek V4, Kimi K2.6, MiMo V2.5 Pro) requires 8-GPU server configurations (8x A100 80GB or 8x H100 80GB) for production deployment — and even then the weights are typically served quantized, since a trillion parameters exceeds a 640GB node at FP8 and above. Single-GPU deployment is unrealistic at this tier. Smaller MoE flagships (100-200B total parameters with 5-30B active) fit on a single 80GB GPU at 4-bit quantization.

    Fine-tuning economics improve. The lower active parameter count translates to better fine-tuning economics for QLoRA training. A 35B-A3B MoE fine-tunes faster per training step than a 14B dense model because the active parameter count drives training-time compute. Mistral Small 4's 6B active parameter count makes it exceptionally efficient to fine-tune relative to its 119B total — QLoRA fits on a 24GB GPU at full sequence lengths.
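For illustration, a minimal QLoRA setup with Hugging Face transformers and peft might look like the following; the model ID is a placeholder, and the target module names are assumptions that depend on the specific architecture:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Placeholder model ID -- substitute the MoE checkpoint you actually use.
model_id = "your-org/moe-flagship"

# Load the base model with 4-bit NF4 quantization (QLoRA-style).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

# Attach LoRA adapters to the attention projections; the 4-bit expert
# weights stay frozen. Module names vary by architecture -- inspect
# model.named_modules() before choosing target_modules.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```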

    Architecture-aware tooling matters. Inference frameworks (vLLM, TensorRT-LLM, llama.cpp) have varying levels of MoE optimization. The frontier frameworks support MoE architectures as first-class options with optimized kernels for expert routing and load balancing; older deployment patterns may not extract full performance from MoE models. For production deployment, choose tools that have first-class MoE support.

    Quantization sweet spots differ. Some MoE architectures quantize particularly well; others have specific layers that don't quantize cleanly below Q4_K_M. The interaction between fine-grained MoE routing and aggressive quantization is genuinely model-specific. Test the quantization tier you actually plan to deploy before committing — assumptions from dense-model experience don't always transfer.

    Looking Forward

    MoE is now a mature architectural pattern, not an experiment. The base case for the next 24 months is that MoE remains the dominant flagship architecture, with continued refinement in routing strategies, expert sizing, and integration with sparse attention mechanisms. Several specific developments seem likely:

    Lower active parameter ratios. The trend through 2025-2026 has been toward lower active parameter counts at equivalent quality. Mistral Small 4's 6B active and GPT-OSS's 5.1B active push the boundary of how efficient MoE inference can be. Expect this to continue — the industry will keep pushing toward MoE designs that deliver more quality per active parameter.

    Tighter integration with sparse attention. DSA in DeepSeek V4 demonstrates that the MoE philosophy extends beyond feedforward layers. Other model families are likely to adopt similar approaches, particularly as 1M+ context becomes a baseline expectation. The combination of sparse attention plus sparse feedforward could substantially reduce inference cost at frontier scale.

    Specialized expert pre-training. Current MoE models train experts jointly with the rest of the architecture. There's research interest in pre-training experts with explicit specialization (math experts, code experts, language experts), then composing them into a final model. Whether this approach delivers quality competitive with joint training is still an open question, but it could enable interesting deployment patterns where teams swap in specialized experts for specific use cases.

    Better quantization for MoE. Current quantization techniques treat all experts uniformly. There's likely substantial room for improvement in quantization that's aware of expert routing patterns — quantizing rarely-used experts more aggressively while preserving precision on heavily-used ones. Whether this materializes as standard tooling remains to be seen.

    For teams building production AI infrastructure in 2026, the practical takeaway is that MoE is no longer an unusual architectural choice — it's the mainstream pattern, and infrastructure decisions should treat it as the default. Deployment tooling, monitoring, capacity planning, fine-tuning workflows, and quantization strategies should all assume MoE-flagship as the typical case. The teams that have made this shift are deploying frontier-quality models at substantially better economics than teams still operating in the pure-dense paradigm.
