Model Class

Mixture-of-Experts (MoE)

Definition

What it is

A neural-network architecture pattern in which a learned router sends each input token to a small subset of specialized "expert" sub-networks instead of activating every parameter in the model. An MoE model therefore has a large *total* parameter count (knowledge capacity), but only a fraction of it is *activated* per token in a forward pass (compute cost). DeepSeek-V3 (671B total / 37B activated; each MoE layer has 1 shared expert plus 256 routed experts, of which 8 are selected per token) is the reference 2026 implementation; other MoE models in use in 2026 include Mixtral, Llama-4 MoE variants, and Qwen-MoE.
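
The routing step described above can be illustrated with a minimal sketch. It assumes generic softmax top-k gating over scalar toy "experts"; it is not DeepSeek-V3's actual router, and every name in it (`route_token`, `moe_layer`, `NUM_ROUTED`, `TOP_K`) is hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k):
    """Pick the top_k routed experts for one token and renormalise their gate weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_layer(x, routed_experts, shared_expert, router_logits, top_k):
    """Shared expert (always active) plus a weighted sum of the selected routed experts."""
    gates = route_token(router_logits, top_k)
    out = shared_expert(x)
    for idx, weight in gates.items():
        out += weight * routed_experts[idx](x)  # unselected experts are never called
    return out, gates


# Toy setup (assumed sizes): a scalar "token", 16 routed experts, 2 active per token.
random.seed(0)
NUM_ROUTED, TOP_K = 16, 2
routed_experts = [lambda x, s=i + 1: s * x for i in range(NUM_ROUTED)]
shared_expert = lambda x: 0.5 * x
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]  # stand-in for a learned router

y, gates = moe_layer(1.0, routed_experts, shared_expert, logits, TOP_K)
print(f"active routed experts: {sorted(gates)} of {NUM_ROUTED}")
```

Only the selected experts are evaluated, which is why compute per token tracks the activated parameter count rather than the total.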

Primary use cases

Frontier reasoning models with constrained inference budgets; training under export-controlled hardware (Chinese AI labs forced onto H800/H20 reserves rather than H100/H200); serving large-context LLMs at a lower per-token cost than dense equivalents; multi-expert architectures in which different experts specialize in different domains.
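
The per-token cost argument can be made concrete with a rough calculation using the DeepSeek-V3 figures quoted in the definition. This is a back-of-the-envelope sketch: it assumes the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass and ignores attention cost over the context. The function name `forward_flops_per_token` is hypothetical.

```python
TOTAL_PARAMS = 671e9   # total parameters, from the definition
ACTIVE_PARAMS = 37e9   # parameters activated per token, from the definition


def forward_flops_per_token(active_params):
    """Assumed rule of thumb: ~2 FLOPs (multiply + add) per active parameter per token."""
    return 2 * active_params


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)  # hypothetical dense model of equal size

print(f"activated fraction: {active_fraction:.1%}")                      # 5.5%
print(f"dense / MoE FLOPs per token: {dense_flops / moe_flops:.1f}x")    # 18.1x
```

The saving applies to compute only: all experts still have to be held in memory (or offloaded), so the memory footprint follows the total parameter count.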
