DeepSeekMoE
Definition
The Mixture-of-Experts (MoE) routing architecture used in DeepSeek V3 and derivative models. Each MoE layer has a two-tier expert structure: **shared experts** (1 per layer in V3, 2 in earlier DeepSeekMoE models) that are activated for every token and handle generic capabilities, plus **256 routed experts** per layer with **8 activated per token** that handle specialization. Critically, it uses **auxiliary-loss-free load balancing**: instead of adding a load-balance loss term that degrades training quality, DeepSeekMoE adds a per-expert bias to the routing scores and adjusts it dynamically, lowering the bias by a fixed step γ for overloaded experts and raising it by γ for underloaded ones. The bias influences only which experts are selected, not the gate weights applied to their outputs, so expert utilization stays balanced without the auxiliary-loss penalty.
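To make the two-tier structure concrete, here is a minimal pure-Python sketch of one token passing through such a layer. It assumes sigmoid affinity scores, a selection step that ranks experts by score plus bias, and gate weights renormalised over the selected experts, which is how the DeepSeek-V3 report describes the router. The function names (`route_token`, `moe_layer`, `make_expert`) and the toy scalar "experts" are hypothetical and chosen only for illustration.

```python
import math
import random

def route_token(affinity_logits, bias, top_k):
    """Pick top_k routed experts for one token.

    Selection ranks experts by sigmoid(logit) + bias; the gate weights use the
    unbiased sigmoid scores, renormalised over the selected experts.
    """
    scores = [1.0 / (1.0 + math.exp(-z)) for z in affinity_logits]
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}

def moe_layer(x, affinity_logits, bias, routed_experts, shared_experts, top_k):
    """Residual input + every shared expert + gate-weighted routed experts."""
    gates = route_token(affinity_logits, bias, top_k)
    out = list(x)
    for expert in shared_experts:          # always active
        out = [o + v for o, v in zip(out, expert(x))]
    for i, g in gates.items():             # only the selected routed experts run
        out = [o + g * v for o, v in zip(out, routed_experts[i](x))]
    return out, gates

def make_expert(scale):
    """Toy stand-in for an expert FFN: scales its input (illustrative only)."""
    return lambda x: [scale * v for v in x]

random.seed(0)
NUM_ROUTED, TOP_K, DIM = 256, 8, 4
routed = [make_expert(0.01 * (i + 1)) for i in range(NUM_ROUTED)]
shared = [make_expert(0.5)]
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]
bias = [0.0] * NUM_ROUTED
x = [1.0, -2.0, 0.5, 3.0]

y, gates = moe_layer(x, logits, bias, routed, shared, TOP_K)
print(len(gates), "routed experts active; gate weights sum to", round(sum(gates.values()), 6))
```

Only 8 of the 256 routed experts are evaluated for the token, while the shared expert always runs. Raising one expert's bias can pull it into the selected set without changing how its output is weighted once selected.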
Pre-V3 MoE training had a structural problem: enforcing load balancing via an auxiliary loss term traded raw model quality for training stability. Without load balancing, a few experts receive nearly all the traffic ("expert collapse"); with load balancing via an auxiliary loss, quality regresses because the balance gradient interferes with the language-modeling objective. DeepSeekMoE's bet, that dynamic bias-based balancing can avoid both failure modes, turned out to be a key enabler of cheap frontier-scale MoE training. The architecture is now the canonical sparse-expert design for open-weight 2026 frontier models.
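For reference, the conventional auxiliary loss that this design avoids is commonly written as follows. The notation is introduced here for illustration: N routed experts, K experts selected per token, T tokens in the batch, s_{i,t} the routing score of expert i for token t, and α the loss weight.

$$
\mathcal{L}_{\text{bal}} = \alpha \sum_{i=1}^{N} f_i \, P_i,
\qquad
f_i = \frac{N}{K\,T} \sum_{t=1}^{T} \mathbb{1}\left[\text{token } t \text{ selects expert } i\right],
\qquad
P_i = \frac{1}{T} \sum_{t=1}^{T} s_{i,t}
$$

Under perfectly uniform routing every f_i equals 1. The trade-off described above lives in α: a large value pushes gradients that compete with the main training objective, while a small value may not prevent collapse.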
Typical uses: frontier-scale MoE training (DeepSeek V3, R1, and derivative work); models that need strong specialization without expert-collapse failures; any sparse-expert architecture where the auxiliary-loss term has historically degraded downstream quality; and as a reference architecture for non-DeepSeek MoE designs (Kimi K2 and GLM-5 borrow elements of the routing approach).
Recent developments
- 256 routed + 1 shared expert per layer with 8-active routing. The canonical DeepSeekMoE topology used in V3: 1 shared expert (always active; earlier DeepSeekMoE models used 2) plus 256 routed experts, of which 8 are activated per token. Per Medium: Understanding DeepSeek-V3 Architecture.
- Auxiliary-loss-free load balancing via dynamic bias adjustment. Per-expert bias terms are adjusted dynamically (decreased by a step γ when the expert is overloaded, increased by γ when it is underloaded), eliminating the performance degradation that traditional load-balance-loss MoE training introduces. Per arXiv 2408.15664: Auxiliary-Loss-Free Load Balancing.
- Theoretical framework published December 2025. An arXiv paper formalizes the auxiliary-loss-free approach for large-scale sparse MoE models. Per arXiv 2512.03915: Theoretical Framework.
- Now the canonical MoE design for the 2026 open-weight frontier. Kimi K2 (384 experts with 8 selected per token, plus 1 shared expert per layer) and GLM-5 both follow DeepSeekMoE-style topologies. The baseline V3 design is described in the DeepSeek-V3 Technical Report (arXiv 2412.19437).
- Cameron R. Wolfe MoE LLMs deep dive. A comprehensive 2025 Substack writeup positions DeepSeekMoE as the reference architecture for sparse-expert LLM training. Per Cameron R. Wolfe: MoE LLMs.
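The dynamic bias adjustment described in the list above can be sketched as a small simulation. This is a toy under stated assumptions, not the training code: the helper names (`update_bias`, `count_load`, `sample_batch`), the skewed score distribution, the 8-expert/top-2 sizes, and the step size `GAMMA = 0.01` are all hypothetical values picked so the effect is visible in a few iterations.

```python
import random

def update_bias(bias, load, gamma):
    """One balancing step: move each expert's routing bias by +/- gamma.

    Experts above the mean load get a lower bias, experts below a higher one.
    """
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, c in zip(bias, load):
        if c > mean_load:
            new_bias.append(b - gamma)   # overloaded: make selection less likely
        elif c < mean_load:
            new_bias.append(b + gamma)   # underloaded: make selection more likely
        else:
            new_bias.append(b)
    return new_bias

def count_load(batch_scores, bias, top_k):
    """Count how many tokens select each expert under biased top-k routing."""
    load = [0] * len(bias)
    for scores in batch_scores:
        biased = [s + b for s, b in zip(scores, bias)]
        top = sorted(range(len(bias)), key=lambda i: biased[i], reverse=True)[:top_k]
        for i in top:
            load[i] += 1
    return load

random.seed(0)
NUM_EXPERTS, TOP_K, TOKENS, GAMMA = 8, 2, 256, 0.01
# Hypothetical skew: experts 0 and 1 start out strongly preferred by the router.
BASE = [0.5, 0.5] + [0.2] * (NUM_EXPERTS - 2)

def sample_batch():
    return [[b + random.uniform(0.0, 0.3) for b in BASE] for _ in range(TOKENS)]

bias = [0.0] * NUM_EXPERTS
load_before = count_load(sample_batch(), bias, TOP_K)
for _ in range(100):
    load = count_load(sample_batch(), bias, TOP_K)
    bias = update_bias(bias, load, GAMMA)
load_after = count_load(sample_batch(), bias, TOP_K)

print("max load before:", max(load_before), "after:", max(load_after))
```

No gradient flows through the bias in this scheme: it is updated from observed load counts after each batch, which is why it does not compete with the training loss the way an auxiliary term does.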