DeepSeekMoE | LLMS3

Definition

What it is

The Mixture-of-Experts routing architecture used in DeepSeek V3 and derivative models. It has a two-tier expert structure: **1–2 shared experts** per layer that are activated for every token (handling generic capabilities) plus **256 routed experts** per layer, of which **8 are activated per token** (handling specialization). Critically, it uses **auxiliary-loss-free load balancing**: instead of adding a load-balance loss term that degrades training quality, DeepSeekMoE adds a per-expert bias to the routing scores and adjusts it dynamically, decreasing the bias by a fixed step γ for overloaded experts and increasing it by γ for underloaded experts. The bias affects only which experts are selected, not the gating weights, so expert utilization stays balanced without the auxiliary-loss penalty.
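The routing described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not DeepSeek's implementation: the function names `route_token` and `moe_layer` are hypothetical, experts are stand-in scalar functions rather than feed-forward networks, and the affinities are assumed to be already-computed per-expert scores in (0, 1).

```python
def route_token(affinities, biases, top_k=8):
    """Pick top_k routed experts for one token (illustrative sketch).

    affinities: per-expert scores for this token, assumed to lie in (0, 1)
    biases:     per-expert balancing biases, used for selection only
    Returns (chosen expert indices, gate weights that sum to 1).
    """
    # Selection ranks experts by the biased score: affinity + bias.
    biased = [s + b for s, b in zip(affinities, biases)]
    order = sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)
    chosen = order[:top_k]
    # Gate weights are computed from the unbiased affinities of the chosen
    # experts, normalized so they sum to 1.
    total = sum(affinities[i] for i in chosen)
    gates = {i: affinities[i] / total for i in chosen}
    return chosen, gates


def moe_layer(x, shared_experts, routed_experts, affinities, biases, top_k=8):
    """Combine the residual input, all shared experts, and the gated routed experts."""
    chosen, gates = route_token(affinities, biases, top_k)
    shared_out = sum(expert(x) for expert in shared_experts)  # always active
    routed_out = sum(gates[i] * routed_experts[i](x) for i in chosen)
    return x + shared_out + routed_out
```

With four experts and `top_k=2`, affinities `[0.9, 0.8, 0.1, 0.2]` select experts 0 and 1 when all biases are zero. Biases of `[-0.5, 0.0, 0.0, 0.5]` shift the selection to experts 1 and 3, while the gate weights are still derived from the original affinities.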

Why it exists

Pre-V3 MoE training had a structural problem: enforcing load balancing via an auxiliary loss term traded raw model quality for training stability. Without load balancing, a few experts receive most of the traffic ("expert collapse"); with load balancing via an auxiliary loss, quality regresses. DeepSeekMoE's bet was that dynamic bias-based balancing can avoid both failure modes, and it turned out to be the key unlock for cheap frontier-scale MoE training. The architecture is now the canonical sparse-expert design for open-weight 2026 frontier models.

Primary use cases

Frontier-scale MoE training (DeepSeek V3, R1, and derivative work); models that require strong specialization without expert-collapse failures; any sparse-expert architecture where the auxiliary-loss term has historically degraded downstream quality; and use as a reference architecture for non-DeepSeek MoE designs (Kimi K2 and GLM-5 borrow elements of the routing approach).

Recent developments

Latest signals

DeepSeek V4 ships the architecture at 1.6T open weights — within 0.2 points of Claude Opus 4.6 on SWE-bench. The DeepSeek V4 preview (April 24, 2026) scales the DeepSeekMoE topology to V4-Pro at 1.6 trillion total parameters (49B active) plus a 284B V4-Flash — both open weights, making V4-Pro the largest open-weights model available. It posts 80.6% on SWE-bench Verified (within 0.2 pts of Claude Opus 4.6), 93.5% on LiveCodeBench, a Codeforces rating of 3206, and a 1M-token context (up from V3's 128K). DeepSeek claims it trails state-of-the-art closed models by only 3–6 months at a fraction of the access cost — the clearest evidence yet that the auxiliary-loss-free sparse-MoE bet scales to frontier without a closed moat. Per the DeepSeek V4-Pro complete guide.
256 routed + 1-2 shared experts per layer with 8-active routing. The canonical DeepSeekMoE topology used in V3: 1-2 shared experts (always active) plus 256 routed experts of which 8 are activated per token. Per Medium — Understanding DeepSeek-V3 Architecture.
Auxiliary-loss-free load balancing via dynamic bias adjustment. Per-expert bias terms are adjusted dynamically (decremented by γ when the expert is overloaded, incremented by γ when it is underloaded), eliminating the performance degradation that traditional load-balance-loss MoE training introduces. Per arXiv 2408.15664 — Auxiliary-Loss-Free Load Balancing.
Theoretical framework published 2025-26. A December 2025 arXiv paper formalizes the auxiliary-loss-free approach for large-scale sparse MoE models. Per arXiv 2512.03915 — Theoretical Framework.
Now the canonical MoE design for 2026 open-weight frontier. Kimi K2 (384 routed experts, 8 selected per token, plus 1 shared expert per layer) and GLM-5 both follow DeepSeekMoE-style topologies. Per DeepSeek-V3 Technical Report (arXiv 2412.19437).
Cameron R. Wolfe MoE LLMs deep dive. Comprehensive 2025 substack writeup positions DeepSeekMoE as the reference architecture for sparse-expert LLM training. Per Cameron R. Wolfe — MoE LLMs.
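The bias adjustment described in the auxiliary-loss-free balancing item above can be sketched as a small simulation. This is an illustrative toy under stated assumptions, not the published training code: the helper names (`update_biases`, `batch_loads`, `imbalance`, `simulate`), the 8-expert / top-2 setup, the synthetic sigmoid affinities with a fixed skew, and the step size `gamma=0.01` are all hypothetical choices made for the example.

```python
import math
import random


def update_biases(biases, loads, gamma):
    """One balancing step: move each bias by a fixed step gamma.

    An expert whose load is above the batch mean is treated as overloaded
    (bias decreases); below the mean is underloaded (bias increases).
    """
    mean_load = sum(loads) / len(loads)
    updated = []
    for bias, load in zip(biases, loads):
        if load > mean_load:
            updated.append(bias - gamma)
        elif load < mean_load:
            updated.append(bias + gamma)
        else:
            updated.append(bias)
    return updated


def batch_loads(biases, skew, rng, tokens=256, top_k=2):
    """Route one batch of synthetic tokens and count tokens per expert."""
    loads = [0] * len(biases)
    for _ in range(tokens):
        # Synthetic affinity: sigmoid of a per-expert skew plus per-token noise.
        scores = [1 / (1 + math.exp(-(s + rng.gauss(0, 1)))) for s in skew]
        biased = [s + b for s, b in zip(scores, biases)]
        order = sorted(range(len(biased)), key=biased.__getitem__, reverse=True)
        for i in order[:top_k]:
            loads[i] += 1
    return loads


def imbalance(loads):
    """Ratio of the busiest expert's load to the mean load (1.0 is perfect)."""
    return max(loads) / (sum(loads) / len(loads))


def simulate(steps=150, gamma=0.01, seed=0):
    rng = random.Random(seed)
    skew = [2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # experts 0 and 1 are favoured
    biases = [0.0] * len(skew)
    before = imbalance(batch_loads(biases, skew, rng))
    for _ in range(steps):
        biases = update_biases(biases, batch_loads(biases, skew, rng), gamma)
    after = imbalance(batch_loads(biases, skew, rng))
    return before, after, biases
```

Calling `simulate()` returns the load imbalance before and after the bias updates, together with the final biases. In this toy setup the favoured experts end up with negative biases and the imbalance ratio moves toward 1, with no loss term involved.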

Connections (8)

Outbound (4)

scoped_to (1): AI Memory Infrastructure

enables (3): DeepSeek V3, Kimi K2, GLM-5

Inbound (4)

related_to (1): Rollout Routing Replay (R3)

enables (1): Auxiliary-Loss-Free Load Balancing

implements (2): DeepSeek V3, DeepSeek V4
