Mixture-of-Experts (MoE)

Definition

What it is

A neural-network architecture pattern in which each input token is dynamically routed to a small subset of specialized "expert" sub-networks instead of activating every parameter in the model. An MoE model has a large *total* parameter count (knowledge capacity), but only a fraction of those parameters is *activated* per token in a forward pass (compute cost). DeepSeek-V3 (671B total / 37B activated; each MoE layer has 1 shared expert plus 256 routed experts, of which 8 are selected per token) is the reference implementation as of 2026; other MoE model families include Mixtral, Llama-4 MoE variants, and Qwen-MoE.
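
The routing described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any model's actual implementation: the names `route_top_k`, `moe_layer`, and `make_expert` are hypothetical, the experts are toy scaling functions, the sizes (16 routed experts, top 2) are far smaller than DeepSeek-V3's, and the gate (a softmax renormalized over the selected experts) is one common choice rather than a specific model's gating function. In a real model the router logits come from a learned projection of the token; here they are passed in directly.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return (expert_index, gate_weight) pairs for the k highest-scoring experts.

    Gate weights are renormalized so they sum to 1 over the selected experts
    (an assumption of this sketch).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, router_logits, routed_experts, shared_expert, k):
    """The shared expert sees every token; only k routed experts are evaluated."""
    out = shared_expert(x)
    for idx, gate in route_top_k(router_logits, k):
        y = routed_experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out


def make_expert(scale):
    """Toy stand-in for an expert feed-forward network: scales its input."""
    return lambda x: [scale * v for v in x]


random.seed(0)
NUM_ROUTED, TOP_K = 16, 2  # toy sizes, chosen for readability
routed_experts = [make_expert(0.1 * (i + 1)) for i in range(NUM_ROUTED)]
shared_expert = make_expert(1.0)

token = [1.0, -2.0, 0.5]
router_logits = [random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]

chosen = route_top_k(router_logits, TOP_K)
output = moe_layer(token, router_logits, routed_experts, shared_expert, TOP_K)
print("selected experts and gates:", chosen)
print("layer output:", output)
```

Only `TOP_K` of the `NUM_ROUTED` expert functions are ever called for a given token, which is where the gap between total and activated parameters comes from.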

Primary use cases

Frontier reasoning models with constrained inference budgets; training on export-controlled hardware (Chinese AI labs restricted to H800/H20 stock rather than H100/H200); serving large-context LLMs at lower per-token compute cost than dense models of the same total size; and multi-expert architectures in which different experts specialize in different domains.
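
A back-of-the-envelope calculation makes the per-token cost argument concrete. It uses the DeepSeek-V3 figures quoted above (671B total, 37B activated) and assumes the common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass, ignoring attention cost. The helper `forward_flops_per_token` is hypothetical, and the resulting ratio does not depend on the exact constant.

```python
TOTAL_PARAMS = 671e9   # DeepSeek-V3 total parameters (from the definition above)
ACTIVE_PARAMS = 37e9   # parameters activated per token


def forward_flops_per_token(active_params):
    """Rule-of-thumb estimate: about 2 FLOPs per active parameter per token."""
    return 2.0 * active_params


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
dense_vs_moe = forward_flops_per_token(TOTAL_PARAMS) / forward_flops_per_token(ACTIVE_PARAMS)

print(f"activated fraction per token: {active_fraction:.1%}")                   # 5.5%
print(f"dense model of equal size costs about {dense_vs_moe:.1f}x per token")  # 18.1x
```

The saving applies to compute only: all 671B parameters still have to be stored and reachable at inference time, so memory and interconnect requirements are set by the total count, not the activated one.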

Recent developments

Latest signals

MoE dominates the 2026 LLM landscape. It powers models from Google (Gemini), Mistral (Mixtral), DeepSeek (V3/R1), and reportedly OpenAI and Meta. Per CallSphere (Mixture of Experts Architecture: Why MoE Dominates 2026 LLMs).
Trend: more experts with smaller individual capacity, plus shared experts. DeepSeek's design of 256 routed experts with 8 active per token typifies the 2026 trend: high-fanout routing combined with 1-2 shared experts that process every token alongside the routed experts.
Expert Choice routing inverts the token→expert assignment. In Google's Expert Choice routing, experts choose their top tokens instead of tokens choosing their top experts, which improves load balance relative to token-driven top-K routing. It is an architectural alternative to DeepSeek's auxiliary-loss-free dynamic-bias approach. Per Cameron R. Wolfe (Mixture-of-Experts LLMs).
Path-Constrained MoE (PathMoE) for routing concentration. A 2026 arXiv paper introduces Path-Constrained MoE, which reports more concentrated path clusters, better cross-layer consistency, and greater robustness to routing perturbations, with consistent improvements in perplexity and on downstream tasks. Per arXiv 2603.18297 (Path-Constrained MoE).
Sparse MoE survey, from algorithmic foundations to decentralized architectures. A comprehensive 2026 arXiv survey of sparse MoE covers the algorithmic foundations, decentralized-architecture extensions, and vertical-domain applications. A reference for the state of the art in 2026. Per arXiv 2602.08019 (Rise of Sparse MoE Survey).
MoxE: entropy-aware routing for efficient language modeling. A research direction, Mixture of xLSTM Experts with Entropy-Aware Routing, that uses routing-decision entropy as a load-balancing signal. Per arXiv 2505.01459 (MoxE).
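
The load-balance difference between token-driven top-K routing and Expert Choice routing, mentioned in the items above, can be seen in a small simulation. This is a sketch under stated assumptions rather than Google's implementation: the function names `token_choice` and `expert_choice` are hypothetical, the affinity scores are random numbers with an artificial bonus for expert 0 to mimic a "popular" expert, and per-expert capacity is set so that both schemes make the same total number of assignments.

```python
import random


def token_choice(scores, k):
    """scores[t][e] is the affinity of token t for expert e.

    Each token picks its k best experts; returns the resulting load per expert.
    """
    num_experts = len(scores[0])
    load = [0] * num_experts
    for row in scores:
        best = sorted(range(num_experts), key=lambda e: row[e], reverse=True)[:k]
        for e in best:
            load[e] += 1
    return load


def expert_choice(scores, capacity):
    """Each expert picks its `capacity` best tokens, so load is equal by construction."""
    num_tokens, num_experts = len(scores), len(scores[0])
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = ranked[:capacity]
    return assignment


random.seed(0)
NUM_TOKENS, NUM_EXPERTS, TOP_K = 32, 8, 2

# Synthetic affinities; expert 0 gets a bonus so token-choice routing overloads it.
scores = [
    [random.gauss(0.0, 1.0) + (1.5 if e == 0 else 0.0) for e in range(NUM_EXPERTS)]
    for _ in range(NUM_TOKENS)
]

tc_load = token_choice(scores, TOP_K)

capacity = NUM_TOKENS * TOP_K // NUM_EXPERTS  # same total assignments as top-K
ec_assignment = expert_choice(scores, capacity)
ec_load = [len(ec_assignment[e]) for e in range(NUM_EXPERTS)]

# Under Expert Choice a token can be picked by zero, one, or several experts.
picks_per_token = [0] * NUM_TOKENS
for tokens in ec_assignment.values():
    for t in tokens:
        picks_per_token[t] += 1

print("token-choice load per expert: ", tc_load)
print("expert-choice load per expert:", ec_load)
print("tokens picked by no expert:   ", picks_per_token.count(0))
```

Expert Choice makes per-expert load uniform by construction, at the price of a variable number of experts per token: some tokens may be processed by several experts and others by none.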

Connections

Outbound

scoped_to: Object Storage for AI Data Pipelines

accelerates: DeepSeek 3FS

enables: Sovereign Storage

constrained_by: GPU Starvation

Inbound

optimizes_for: Rollout Routing Replay (R3)
