Architecture

Multi-Head Latent Attention (MLA)

Definition

What it is

A KV-cache compression technique for transformer attention, introduced in the DeepSeek-V2 paper and now used across DeepSeek V3, R1, Kimi K2, and much of the broader 2026 frontier-MoE ecosystem. Instead of sharing K/V tensors across groups of query heads (the GQA approach), MLA performs **low-rank joint compression** of keys and values into a smaller latent space and stores only that compressed latent in the KV cache. At inference time, per-head keys and values are recovered from the latent through learned up-projections before attention is computed.
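The compress-then-project-up flow can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not DeepSeek's implementation: the sizes are toy values, the weight names (`W_dkv`, `W_uk`, `W_uv`, `W_q`) and the helpers `compress` and `attend` are hypothetical, the weights are random rather than trained, and positional encoding is left out entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only (not any real model's configuration).
d_model, n_heads, d_head, d_latent = 64, 4, 16, 8

# Hypothetical weight names with random values; a trained model learns these.
W_dkv = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)           # joint K/V down-projection
W_uk = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)  # latent -> per-head keys
W_uv = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)  # latent -> per-head values
W_q = rng.standard_normal((d_model, n_heads * d_head)) / np.sqrt(d_model)     # query projection


def compress(h):
    """Map hidden states (T, d_model) to the latent that gets cached (T, d_latent)."""
    return h @ W_dkv


def attend(h_query, latent_cache):
    """Attend from one query token (d_model,) over all cached latents (T, d_latent)."""
    T = latent_cache.shape[0]
    # Project the cached latents back up to per-head keys and values.
    k = (latent_cache @ W_uk).reshape(T, n_heads, d_head)
    v = (latent_cache @ W_uv).reshape(T, n_heads, d_head)
    q = (h_query @ W_q).reshape(n_heads, d_head)
    scores = np.einsum("hd,thd->ht", q, k) / np.sqrt(d_head)
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    out = np.einsum("ht,thd->hd", weights, v)
    return out.reshape(n_heads * d_head)


hidden = rng.standard_normal((10, d_model))  # 10 tokens of context
cache = compress(hidden)                     # only this is stored per token
output = attend(hidden[-1], cache)

full_kv_floats = 2 * n_heads * d_head        # what plain MHA would cache per token
print(cache.shape, output.shape, full_kv_floats // d_latent)
```

In this toy setup the cache holds 8 values per token where plain MHA would hold 128. The up-projection is written out explicitly here for readability; optimized implementations may arrange the matrix products differently.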

Why it exists

Long-context inference is bottlenecked by KV-cache size. For a 128K-context model with standard Multi-Head Attention at DeepSeek-V3's dimensions, the KV cache for a single full-length sequence would consume **~488 GB**, which is impractical for production serving. MLA instead stores a single 512-dimensional latent per token in each layer, yielding a **64× smaller KV-cache footprint**. Compression of this kind has historically been expected to cost modeling quality, but DeepSeek reported that MLA achieves *better* modeling quality than standard MHA. That result is why the DeepSeek team picked MLA over GQA, and why several major open-weight frontier models have since adopted it.
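The two headline numbers can be reproduced with a short back-of-the-envelope calculation. The layer count, head count, and head dimension below are assumptions taken from DeepSeek-V3's published configuration rather than from this page, and the cache is assumed to use a 16-bit dtype; the helper `kv_cache_bytes` is hypothetical.

```python
# Assumed DeepSeek-V3 attention dimensions (from the public model config, not from this page).
n_layers = 61
n_heads = 128
d_head = 128
latent_dim = 512          # size of the cached MLA latent per token per layer
bytes_per_value = 2       # assumes a 16-bit cache dtype (BF16 or FP16)
context_len = 128 * 1024  # 128K tokens, one sequence


def kv_cache_bytes(values_per_token_per_layer):
    """Total cache size in bytes for one full-length sequence."""
    return values_per_token_per_layer * n_layers * context_len * bytes_per_value


mha_values = 2 * n_heads * d_head  # keys and values for every head
mha_gib = kv_cache_bytes(mha_values) / 2**30
mla_gib = kv_cache_bytes(latent_dim) / 2**30
ratio = mha_values / latent_dim

print(f"MHA: {mha_gib:.0f} GiB, MLA latent: {mla_gib:.3f} GiB, ratio: {ratio:.0f}x")
```

Under these assumptions the MHA cache comes to 488 GiB (so the ~488 GB figure reads "GB" as 2^30 bytes) against roughly 7.6 GiB for the latent alone. The 64× ratio counts only the 512-dimensional latent; an implementation that also caches a small positional key per token would land somewhat below 64×.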

Primary use cases

  • Long-context inference where KV-cache size dominates the memory budget (128K+ contexts).
  • Agentic workloads that need to retain conversation state across many turns.
  • Production serving of MoE models where per-request memory matters more than tokens-per-second.
  • Retrofitting older transformer-based LLMs to support long context without full retraining (per TransMLA).
  • Any deployment where the memory savings justify the architectural complexity.

Recent developments

Latest signals
  • 64× KV-cache footprint reduction at 128K context. At DeepSeek-V3's dimensions and a supported sequence length of 128K, standard MHA needs a ~488 GB KV cache; MLA stores a 512-dimensional latent per token, which is 64× smaller. Per PyImageSearch: Build DeepSeek-V3 MLA.
  • TransMLA: retrofit MLA onto GQA-based LLMs. A February 2025 arXiv paper shows that existing GQA-based transformer LLMs can be converted to MLA without full retraining, opening the technique to the existing corpus of pre-2025 frontier models. Per arXiv 2502.07864: TransMLA.
  • Towards Economical Inference: MLA enablement across transformer LLMs. A second February 2025 arXiv paper formalizes how to enable MLA in any transformer-based LLM, generalizing the technique beyond DeepSeek-specific designs. Per arXiv 2502.14837.
  • Adopted across major 2026 open-weight frontier models. DeepSeek V2 → V3 → R1, Kimi K2 / K2.5 / K2.6, and the GLM-5 family are all listed as using MLA as their attention mechanism for long-context inference. Per Sebastian Raschka: LLMs-from-scratch MLA chapter.
  • Educational deep-dives published in 2025-2026. Independent technical writeups by Chris McCormick, PlanetBanatt, and Lior Sinai walk through the MLA forward and backward passes step by step. Per McCormick: Inner Workings of MLA.
