Mixture-of-Experts (MoE)
Definition
A neural-network architecture pattern in which each input token is dynamically routed to a small subset of specialized "expert" sub-networks instead of activating every parameter in the model. An MoE model has a large *total* parameter count (knowledge capacity), but only a fraction of it is *activated* for each token (compute cost). DeepSeek-V3 (671B total / 37B activated; each MoE layer has 1 shared expert plus 256 routed experts, 8 of which are selected per token) is the reference 2026 implementation; other 2026 MoE shapes include Mixtral, Llama-4 MoE variants, and Qwen-MoE.
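The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the softmax gate, the scalar toy "experts", and the names `route_token`, `moe_layer`, and `make_expert` are all assumptions made for the example, and production models (DeepSeek-V3 included) differ in their gating function and load-balancing details. In a real model the router logits come from a learned projection of the token; here they are passed in directly.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k):
    """Pick the top_k routed experts and renormalize their gate weights."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(x, router_logits, routed_experts, shared_expert, top_k):
    """Shared expert always runs; only top_k routed experts are evaluated."""
    out = shared_expert(x)
    for idx, weight in route_token(router_logits, top_k):
        out += weight * routed_experts[idx](x)
    return out


def make_expert(scale):
    """Hypothetical toy expert: multiplies its scalar input by a fixed scale."""
    def expert(x):
        return x * scale
    return expert


def shared_expert(x):
    """Hypothetical toy shared expert: identity."""
    return x


# Toy configuration mirroring the 256-routed / 8-selected shape (assumed here).
NUM_ROUTED, TOP_K = 256, 8
random.seed(0)
routed_experts = [make_expert((i + 1) / NUM_ROUTED) for i in range(NUM_ROUTED)]
router_logits = [random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]

selection = route_token(router_logits, TOP_K)
y = moe_layer(1.0, router_logits, routed_experts, shared_expert, TOP_K)
print(f"experts evaluated: {len(selection) + 1} of {NUM_ROUTED + 1}")
```

The point of the sketch is that only `TOP_K` routed experts plus the shared expert are ever called for a given token; the remaining experts contribute parameters to the model but no compute to that token.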
Typical uses: frontier reasoning models with constrained inference budgets; training under export-controlled hardware (Chinese AI labs forced onto H800/H20 reserves rather than H100/H200); serving large-context LLMs at lower per-token cost than dense equivalents; and multi-expert architectures in which different experts specialize in different domains.
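As a rough back-of-envelope check on the per-token cost claim, the snippet below compares the DeepSeek-V3 figures quoted in the definition against a hypothetical dense model of the same total size. It assumes the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass and ignores attention over the context, routing overhead, and memory bandwidth, so treat the result as an order-of-magnitude estimate only. The function name `forward_flops_per_token` is made up for this example.

```python
TOTAL_PARAMS = 671e9   # total parameters, from the definition above
ACTIVE_PARAMS = 37e9   # parameters activated per token, from the definition above


def forward_flops_per_token(params):
    """Rule-of-thumb estimate (an assumption): ~2 FLOPs per parameter per token."""
    return 2 * params


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
dense_flops = forward_flops_per_token(TOTAL_PARAMS)  # hypothetical dense 671B model
moe_flops = forward_flops_per_token(ACTIVE_PARAMS)   # MoE with 37B activated

print(f"activated fraction: {active_fraction:.1%}")                 # 5.5%
print(f"dense / MoE compute ratio: {dense_flops / moe_flops:.1f}x")  # 18.1x
```

The saving applies to compute only: all 671B parameters still have to be stored and kept reachable by the serving hardware, since any expert may be selected for the next token.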