Multi-Head Latent Attention (MLA)
Definition
A KV-cache compression technique for transformer attention, introduced in the DeepSeek-V2 paper and now used across DeepSeek V3, R1, Kimi K2, and much of the broader 2026 frontier-MoE ecosystem. Instead of sharing K/V tensors across groups of query heads (the GQA approach), MLA performs **low-rank joint compression** of keys and values into a smaller latent space and stores only the compressed latent (plus a small decoupled RoPE key) in the KV cache. At inference time the latent is either projected back up to per-head keys and values before attention is computed, or the up-projection matrices are absorbed into the query and output projections so that full keys and values are never materialized.
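The compression and the two inference-time options described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not DeepSeek's implementation: the sizes are toy values, the weight names (`W_dkv`, `W_uk`, `W_uv`, `W_q`) are hypothetical, random matrices stand in for trained parameters, and the decoupled RoPE key, query compression, and output projection are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration only (not DeepSeek's real dimensions).
d_model, n_heads, d_head, d_latent = 64, 4, 16, 8
seq_len = 5

# Hypothetical weight names; random values stand in for trained parameters.
W_dkv = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)          # joint K/V down-projection
W_uk = rng.standard_normal((n_heads, d_latent, d_head)) / np.sqrt(d_latent)  # per-head key up-projection
W_uv = rng.standard_normal((n_heads, d_latent, d_head)) / np.sqrt(d_latent)  # per-head value up-projection
W_q = rng.standard_normal((n_heads, d_model, d_head)) / np.sqrt(d_model)     # per-head query projection


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


h = rng.standard_normal((seq_len, d_model))  # hidden states of all tokens so far
latent_cache = h @ W_dkv                     # (seq_len, d_latent): the only K/V state cached


def attend_explicit(h_t):
    """Up-project the cached latents to full keys and values, then attend."""
    q = np.einsum("d,hde->he", h_t, W_q)                  # (n_heads, d_head)
    k = np.einsum("tc,hce->hte", latent_cache, W_uk)      # (n_heads, seq_len, d_head)
    v = np.einsum("tc,hce->hte", latent_cache, W_uv)      # (n_heads, seq_len, d_head)
    scores = np.einsum("he,hte->ht", q, k) / np.sqrt(d_head)
    return np.einsum("ht,hte->he", softmax(scores), v)


def attend_absorbed(h_t):
    """Same result, with W_uk folded into the query and W_uv applied after attention."""
    q = np.einsum("d,hde->he", h_t, W_q)
    q_latent = np.einsum("he,hce->hc", q, W_uk)           # query expressed in latent space
    scores = np.einsum("hc,tc->ht", q_latent, latent_cache) / np.sqrt(d_head)
    ctx_latent = np.einsum("ht,tc->hc", softmax(scores), latent_cache)
    return np.einsum("hc,hce->he", ctx_latent, W_uv)


# The newest token (last row of h) attends over every cached position, itself included.
out_explicit = attend_explicit(h[-1])
out_absorbed = attend_absorbed(h[-1])
print(np.allclose(out_explicit, out_absorbed))  # True
```

The cache holds `d_latent` numbers per token instead of `2 * n_heads * d_head`, and the absorbed variant shows why the up-projection does not have to be run over the whole cache at every decoding step.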
Why it matters
Long-context inference is bottlenecked by KV-cache size. For a 128K-context model with standard Multi-Head Attention at DeepSeek-V3's dimensions (61 layers, 128 heads of dimension 128, 16-bit cache entries), the KV cache alone would consume **~488 GB** per sequence, which is impractical for production serving. MLA stores a single length-512 latent per token per layer, yielding a **64× smaller KV-cache footprint** (about 57× once the 64-dimensional decoupled RoPE key that MLA also caches is counted). Compression has historically been expected to cost modeling quality, but the DeepSeek-V2 paper reports that MLA outperforms standard MHA in its ablations, which is why the DeepSeek team picked MLA over GQA and why several open-weight frontier models have since adopted it.
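The figures above can be reproduced with simple arithmetic. The sketch below assumes the attention dimensions given in the DeepSeek-V3 technical report (61 layers, 128 heads, head dimension 128, latent dimension 512, decoupled RoPE key dimension 64), a 16-bit cache, and a single sequence of 128 × 1024 tokens; the variable and function names are illustrative.

```python
# Assumed DeepSeek-V3 attention dimensions; adjust for other models.
n_layers = 61
n_heads = 128
d_head = 128
d_latent = 512            # compressed KV latent per token per layer
d_rope = 64               # decoupled RoPE key cached alongside the latent
bytes_per_elem = 2        # BF16/FP16 cache entries (assumption)
context_len = 128 * 1024  # 131,072 tokens, one sequence

GIB = 1024 ** 3


def cache_bytes(elems_per_token_per_layer):
    """Total KV-cache bytes for one sequence at full context length."""
    return n_layers * elems_per_token_per_layer * bytes_per_elem * context_len


mha = cache_bytes(2 * n_heads * d_head)       # full keys and values for every head
mla_latent_only = cache_bytes(d_latent)       # the 512-dim latent alone
mla_with_rope = cache_bytes(d_latent + d_rope)

print(mha / GIB)                       # 488.0
print(mla_latent_only / GIB)           # 7.625
print(mha / mla_latent_only)           # 64.0
print(round(mha / mla_with_rope, 1))   # 56.9
```

The 488 figure is in binary gigabytes (GiB); in decimal units the same cache is roughly 524 GB.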
Use cases
- Long-context inference where KV-cache size dominates the memory budget (128K+ contexts).
- Agentic workloads that need to retain conversation state across many turns.
- Production serving of MoE models where per-request memory matters more than tokens per second.
- Retrofitting older transformer-based LLMs to support long context without full retraining (per TransMLA).
- Any deployment where the memory savings justify the architectural complexity.
Recent developments
- 64× KV-cache footprint reduction at 128K context. With DeepSeek-V3's dimensions and a 128K supported sequence length, standard MHA needs a ~488 GB KV cache; MLA stores a length-512 latent per token, which is 64× smaller. Per PyImageSearch: Build DeepSeek-V3 MLA.
- TransMLA: retrofit MLA onto GQA-based LLMs. A February 2025 arXiv paper shows that existing GQA-based transformer LLMs can be converted to MLA without full retraining, opening the technique to the existing corpus of pre-2025 frontier models. Per arXiv 2502.07864: TransMLA.
- Towards Economical Inference: MLA enablement across transformer LLMs. A second February 2025 arXiv paper describes how to enable MLA in any transformer-based LLM, generalizing the technique beyond DeepSeek-specific designs. Per arXiv 2502.14837.
- Adopted across major 2026 open-weight frontier models. DeepSeek V2 → V3 → R1, Kimi K2 / K2.5 / K2.6, and the GLM-5 family all use MLA as their attention mechanism for long-context inference. Per Sebastian Raschka: LLMs-from-scratch MLA chapter.
- Educational deep-dives published in 2025-2026. Independent technical writeups by Chris McCormick, PlanetBanatt, and Lior Sinai walk through the MLA forward and backward passes. Per McCormick: Inner Workings of MLA.
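The TransMLA item above starts from the observation that GQA can be rewritten in MLA's "compress, then up-project" form with the same cache size. The sketch below illustrates only that representational point for keys: it is not TransMLA's conversion procedure, and the sizes, names, and random weights are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only.
d_model, n_q_heads, n_kv_heads, d_head = 32, 8, 2, 4
group = n_q_heads // n_kv_heads   # query heads sharing each KV head
n_tokens = 6

W_k_gqa = rng.standard_normal((d_model, n_kv_heads * d_head))  # stand-in for a trained GQA key projection
h = rng.standard_normal((n_tokens, d_model))

# GQA: cache n_kv_heads keys per token, then repeat each one for its group of query heads.
k_cached = h @ W_k_gqa                                          # (n_tokens, n_kv_heads * d_head)
k_gqa = np.repeat(
    k_cached.reshape(n_tokens, n_kv_heads, d_head), group, axis=1
).reshape(n_tokens, n_q_heads * d_head)

# The same keys written as a down-projection followed by a fixed up-projection:
# R is a 0/1 replication matrix that copies KV head j // group into query-head slot j.
R = np.kron(np.repeat(np.eye(n_kv_heads), group, axis=1), np.eye(d_head))
k_factored = (h @ W_k_gqa) @ R

print(np.allclose(k_gqa, k_factored))  # True
```

In this view `h @ W_k_gqa` plays the role of the cached latent and `R` the role of an up-projection constrained to pure replication; MLA allows that up-projection to be a learned dense matrix instead.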