DeepSeek V3
Definition
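Open-weight 671B-parameter Mixture-of-Experts language model from DeepSeek AI. **37B parameters are activated per token** (about 5.5% of the total): each MoE layer has 256 routed experts plus one shared expert, and 8 routed experts are selected per token. It adopts Multi-head Latent Attention (MLA) for KV-cache compression and the DeepSeekMoE architecture for routing. Its technical report presents it as the first validation of FP8 mixed-precision training on an extremely large-scale model, which cuts memory use and speeds up matrix multiplications relative to BF16 (FP8 GEMMs are nominally up to twice as fast; end-to-end gains depend on the workload). Pre-trained on 14.8T tokens, followed by SFT and RL stages. Released in December 2024 with MIT-licensed code and model weights under a license that permits commercial use. DeepSeek-R1 was trained from the V3 base model, and Kimi K2 and several other 2026 frontier open-weight models adopted V3-style architectures.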
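To make the KV-cache compression concrete, here is a back-of-the-envelope sketch comparing the cache a standard multi-head attention layer would need against an MLA-style latent cache. The dimensions (61 layers, 128 heads of size 128, a 512-dimensional KV latent, a 64-dimensional decoupled RoPE key) are assumptions based on the publicly released V3 configuration, and the function names are illustrative only. It ignores batching, paging overhead, and implementation details of any particular inference engine.

```python
# Rough KV-cache sizing: standard multi-head attention vs. an MLA-style latent cache.
# All dimensions below are ASSUMPTIONS based on the public DeepSeek-V3 config.

NUM_LAYERS = 61        # assumed transformer layer count
NUM_HEADS = 128        # assumed attention heads per layer
HEAD_DIM = 128         # assumed per-head key/value dimension
KV_LATENT_DIM = 512    # assumed compressed KV latent size
ROPE_KEY_DIM = 64      # assumed decoupled RoPE key size (shared across heads)
BYTES_PER_ELEMENT = 2  # BF16 cache entries


def mha_cache_elements_per_token() -> int:
    """Standard attention caches one key and one value vector per head."""
    return 2 * NUM_HEADS * HEAD_DIM


def mla_cache_elements_per_token() -> int:
    """MLA caches one shared latent vector plus a small RoPE key."""
    return KV_LATENT_DIM + ROPE_KEY_DIM


def cache_gib(elements_per_token: int, context_tokens: int) -> float:
    """Total cache size in GiB for one sequence across all layers."""
    total_bytes = (
        elements_per_token * NUM_LAYERS * context_tokens * BYTES_PER_ELEMENT
    )
    return total_bytes / 2**30


if __name__ == "__main__":
    context = 128 * 1024  # 128K-token context
    mha = cache_gib(mha_cache_elements_per_token(), context)
    mla = cache_gib(mla_cache_elements_per_token(), context)
    print(f"MHA cache: {mha:.1f} GiB, MLA cache: {mla:.2f} GiB")
    print(f"Reduction: {mha / mla:.1f}x")
```

Under these assumptions the per-token cache shrinks from 32,768 elements per layer to 576, which is what makes a single 128K-token sequence fit in a few GiB rather than hundreds.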
Before V3, training a frontier-class open-weight model was widely assumed to require $50M or more in compute. DeepSeek V3 overturned that assumption: its technical report puts the training cost at about $5.6M (2.788M H800 GPU-hours at an assumed rental price of $2 per GPU-hour, excluding prior research and ablation runs), achieved with FP8 mixed-precision training, auxiliary-loss-free MoE load balancing, and a multi-token-prediction training objective. The release reset the cost curve for the open-weight ecosystem and made it economically feasible for non-hyperscalers (Moonshot, Zhipu, Alibaba) to train comparable models. The 2026 framing is that V3 is the architectural template: most open-weight frontier models that followed adopted some combination of its MLA, MoE routing, and FP8 training contributions.
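The headline cost figure is simple arithmetic over GPU-hours. The sketch below reproduces it, assuming the per-stage H800 GPU-hour breakdown and the $2 per GPU-hour rental rate given in the V3 technical report; the variable and function names are illustrative, and the figures should be checked against the report before being quoted.

```python
# Reproduce the published training-cost estimate from GPU-hours.
# Stage figures and rental rate are ASSUMED from the V3 technical report.

H800_GPU_HOURS = {
    "pre_training": 2_664_000,
    "context_extension": 119_000,
    "post_training": 5_000,
}
RENTAL_USD_PER_GPU_HOUR = 2.0  # assumed rental price, not a measured cost


def total_gpu_hours(stages: dict[str, int]) -> int:
    """Sum GPU-hours over all training stages."""
    return sum(stages.values())


def training_cost_usd(stages: dict[str, int], usd_per_gpu_hour: float) -> float:
    """Cost of the final training run only (no research or ablation runs)."""
    return total_gpu_hours(stages) * usd_per_gpu_hour


if __name__ == "__main__":
    hours = total_gpu_hours(H800_GPU_HOURS)
    cost = training_cost_usd(H800_GPU_HOURS, RENTAL_USD_PER_GPU_HOUR)
    print(f"{hours:,} GPU-hours -> ${cost:,.0f}")
```

Note that the result covers only the final training run at a notional rental rate; hardware purchase, staff, and exploratory experiments are outside this estimate.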
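Typical uses: self-hosted frontier inference on multi-GPU nodes with enough aggregate memory for the 671B weights (roughly 671 GB in FP8, which exceeds the 320 GB of a 4×H100 80 GB node, so larger or multi-node configurations are needed unless the model is aggressively quantized); derivative model training (R1, Kimi K2 substrate work); agentic coding pipelines served via vLLM or SGLang; large-context RAG over enterprise corpora (128K context, made affordable by MLA compression); and FP8-training research where V3's open weights serve as the reference.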
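Whether the model fits on a given node depends mostly on weight precision. The following weights-only estimate is a sketch under stated assumptions: 671B total parameters, nominal bytes per parameter for each precision (ignoring quantization scales and metadata), and nominal GPU memory sizes of 80 GB for H100 and 141 GB for H200. It leaves out KV cache, activations, and framework overhead, so real deployments need additional headroom. All names are hypothetical.

```python
# Weights-only memory check for a 671B-parameter model on common GPU nodes.
# Precision sizes and node capacities are nominal ASSUMPTIONS; KV cache,
# activations, and runtime overhead are deliberately ignored.

TOTAL_PARAMS = 671_000_000_000

BYTES_PER_PARAM = {
    "bf16": 2.0,
    "fp8": 1.0,
    "int4": 0.5,  # ignores quantization scales and zero-points
}

NODE_MEMORY_GB = {
    "4xH100-80GB": 4 * 80,
    "8xH100-80GB": 8 * 80,
    "8xH200-141GB": 8 * 141,
}


def weight_footprint_gb(precision: str) -> float:
    """Size of the weights alone, in decimal gigabytes."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[precision] / 1e9


def fits(precision: str, node: str) -> bool:
    """True if the weights alone fit in the node's aggregate GPU memory."""
    return weight_footprint_gb(precision) <= NODE_MEMORY_GB[node]


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        for node in NODE_MEMORY_GB:
            verdict = "fits" if fits(precision, node) else "does not fit"
            print(f"{precision:>5} on {node}: {verdict} "
                  f"({weight_footprint_gb(precision):.1f} GB of weights)")
```

Only 37B parameters are active per token, but all 671B must be resident because any expert can be selected, so the activation ratio reduces compute rather than weight memory.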
Recent developments
- MLPerf Training v6.0 — DeepSeek V3 used as MoE pretraining benchmark. MLCommons added a DeepSeek-V3 large-scale MoE pretraining benchmark to the official MLPerf Training v6.0 suite (May 2026), formalizing V3 as the industry reference for sparse-expert training performance. Per MLCommons announcement.
- Locally runnable on commodity hardware (DEV community guide). A May 2026 walkthrough demonstrates running the 671B V3 model on consumer-grade multi-GPU rigs via aggressive quantization. Per DEV: DeepSeek V3 671B locally.
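- Architectural innovation: auxiliary-loss-free load balancing. V3 pioneers an auxiliary-loss-free strategy for MoE load balancing: instead of relying on the auxiliary load-balance loss used in traditional MoE training, which can degrade model quality, it adds a per-expert bias to the routing scores and adjusts that bias during training to steer tokens away from overloaded experts. A multi-token-prediction (MTP) training objective is added for stronger downstream performance. Per DeepSeek V3 Technical Report (arXiv).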
- MLA mechanism enables 128K context at reasonable inference cost. Multi-head Latent Attention compresses keys and values into a low-dimensional latent vector and caches that vector instead of full per-head keys and values, making long-context inference practical without the KV-cache memory blowup of standard multi-head attention. Per Medium: DeepSeek V3 architecture writeup.
- Available on Hugging Face and Azure AI Foundry. Weights are distributed via Hugging Face; Azure ships V3 in the AI Foundry catalog for enterprise procurement. Per Hugging Face (deepseek-ai/DeepSeek-V3) and Azure AI Foundry.
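The auxiliary-loss-free load balancing listed above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's implementation: it assumes each token already has a positive affinity score per expert, selects the top-k experts by score plus a per-expert bias, computes gate weights from the unbiased scores, and then nudges each bias down for overloaded experts and up for underloaded ones. The function names and the `update_speed` parameter are hypothetical.

```python
# Minimal sketch of bias-based (auxiliary-loss-free) MoE load balancing.
# Assumes positive per-expert affinity scores; names are hypothetical.


def route_tokens(scores, bias, top_k):
    """Select top_k experts per token using score + bias.

    The bias only influences WHICH experts are chosen; the gate weights
    are computed from the original, unbiased scores.
    """
    assignments = []
    for token_scores in scores:
        ranked = sorted(
            range(len(token_scores)),
            key=lambda e: token_scores[e] + bias[e],
            reverse=True,
        )
        chosen = ranked[:top_k]
        norm = sum(token_scores[e] for e in chosen)
        assignments.append({e: token_scores[e] / norm for e in chosen})
    return assignments


def update_bias(bias, assignments, update_speed):
    """Lower the bias of overloaded experts and raise underloaded ones."""
    load = [0] * len(bias)
    for gates in assignments:
        for expert in gates:
            load[expert] += 1
    mean_load = sum(load) / len(bias)

    new_bias = []
    for expert, current in enumerate(bias):
        if load[expert] > mean_load:
            new_bias.append(current - update_speed)
        elif load[expert] < mean_load:
            new_bias.append(current + update_speed)
        else:
            new_bias.append(current)
    return new_bias, load


if __name__ == "__main__":
    # Four identical tokens that all prefer experts 0 and 1.
    scores = [[0.9, 0.5, 0.4, 0.1]] * 4
    bias = [0.0] * 4
    for step in range(5):
        assignments = route_tokens(scores, bias, top_k=2)
        bias, load = update_bias(bias, assignments, update_speed=0.1)
        print(f"step {step}: load={load} bias={bias}")
```

Because the correction acts on expert selection rather than on the training loss, no gradient from a balancing term competes with the language-modeling objective, which is the design motivation for dropping the auxiliary loss.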