Memory Efficient Attention
Definition
An umbrella family of attention methods that reduce the memory cost of the attention operation. The members target two different bottlenecks. **FlashAttention** is a kernel-level technique that tiles the computation, and recomputes intermediates in the backward pass, so the N×N score matrix is never written to GPU main memory; this brings attention's working memory from O(N²) down to O(N). The other members shrink or better manage the KV cache, which already grows as O(N) but with a large constant: **PagedAttention** (block-allocated KV-cache management), **Multi-Query Attention (MQA)** and **Grouped-Query Attention (GQA)** (fewer K and V heads), and **Multi-head Latent Attention (MLA)** (K and V compressed into a shared latent representation). Each represents a different point on the memory-quality-throughput Pareto frontier; modern LLM architectures combine several.
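To make the K/V compression idea concrete, here is a minimal sketch of how many values each variant caches per token per layer. The function name, its parameters, and the example configuration (32 query heads, head dimension 128, 8 KV groups, latent width 576) are illustrative assumptions, not figures from any specific model. For MLA, `latent_dim` is assumed to cover everything cached for that token, including any separate positional component.

```python
def kv_cache_elems_per_token(variant, n_heads, head_dim, n_kv_groups=1, latent_dim=0):
    """Values cached per token, per layer, under a simplified model of each variant."""
    if variant == "MHA":  # every query head keeps its own K head and V head
        return 2 * n_heads * head_dim
    if variant == "GQA":  # n_kv_groups K/V heads, each shared by a group of query heads
        return 2 * n_kv_groups * head_dim
    if variant == "MQA":  # a single K/V head shared by all query heads
        return 2 * head_dim
    if variant == "MLA":  # one compressed latent vector stands in for K and V
        return latent_dim
    raise ValueError(f"unknown variant: {variant}")


# Hypothetical configuration, chosen only to illustrate the ratios.
for name in ("MHA", "GQA", "MQA", "MLA"):
    size = kv_cache_elems_per_token(name, n_heads=32, head_dim=128,
                                    n_kv_groups=8, latent_dim=576)
    print(name, size)
```

Under these assumed numbers the cache shrinks from 8192 values per token (MHA) to 2048 (GQA), 576 (MLA), and 256 (MQA). The sketch counts cache size only; it says nothing about the quality cost of each choice.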
Naive scaled-dot-product attention materializes an N×N score matrix and an N×N matrix of softmax weights for every head, consuming O(N²) memory and bandwidth. For N = 128k (131,072 tokens), a single fp16 score matrix is 32 GiB per head, which adds up to hundreds of GiB or more per layer. That is infeasible. Memory Efficient Attention is the foundational family of techniques that lets transformers scale to 100k+ context windows on practical hardware, and it is a *prerequisite* for every long-context model on the market.
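The way tiling avoids the N×N matrix can be sketched with the online-softmax recurrence: keys and values are visited in blocks, and only a running maximum, a running normalizer, and a running output are kept per query. The code below is a pure-Python illustration of that idea for a single head, with hypothetical function names. It is not the FlashAttention kernel, which also tiles over queries, keeps tiles in on-chip memory, and recomputes scores in the backward pass.

```python
import math


def naive_attention(Q, K, V):
    """Reference: computes each query's full set of N scores before normalizing."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block=2):
    """Same result, but only `block` scores exist at any time for a given query."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    d_v = len(V[0])
    out = []
    for q in Q:
        run_max = float("-inf")   # largest score seen so far
        run_sum = 0.0             # softmax denominator, relative to run_max
        acc = [0.0] * d_v         # unnormalized output, relative to run_max
        for start in range(0, len(K), block):
            k_blk = K[start:start + block]
            v_blk = V[start:start + block]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            new_max = max(run_max, max(scores))
            # Rescale earlier partial results so they are relative to new_max.
            correction = math.exp(run_max - new_max)
            run_sum *= correction
            acc = [a * correction for a in acc]
            for s, v in zip(scores, v_blk):
                w = math.exp(s - new_max)
                run_sum += w
                for j in range(d_v):
                    acc[j] += w * v[j]
            run_max = new_max
        out.append([a / run_sum for a in acc])
    return out
```

Both functions return the same values up to floating-point rounding, which is the sense in which tiling is exact: it changes what is stored, not what is computed.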
Virtually every modern transformer language model uses at least one Memory Efficient Attention variant: Llama 3/4 use GQA + FlashAttention, DeepSeek-V3/V4 use MLA + FlashAttention, and Gemma 4 uses shared-KV layers + FlashAttention. These choices are invisible to model users, but they determine which context lengths are feasible and at what cost.
Recent developments
- The taxonomy itself is now teaching material. Comprehensive 2026 survey papers and tutorials (HuggingFace, NeurIPS workshops, Sebastian Raschka's "Big LLM Architecture Comparison") position MQA / GQA / MLA / Shared-KV as four points on a single design axis: how much do you compress K and V? Per Sebastian Raschka — Big LLM architecture comparison 2026.
- MLA + FP8 stack (SnapMLA) emerging as the new default for long-context. The combination of MLA's structural compression and FP8's numerical compression (lower-precision storage) dominates new model releases. Per arXiv 2602.10718 — SnapMLA.
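- FlashAttention 3 and 4 are built around specific GPU generations. FlashAttention-3 exploits Hopper features such as the Tensor Memory Accelerator (TMA) and asynchronous, warp-specialized execution; FlashAttention-4 targets Blackwell. FlashAttention is now hardware-coupled: each generation tracks the underlying GPU architecture. Per arXiv 2407.08608 — FlashAttention-3.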
- PagedAttention won as the cache-management standard. Every serving runtime (vLLM, TensorRT-LLM, SGLang, MLC-LLM) now uses block-based KV-cache management; pre-PagedAttention contiguous allocation is legacy. Per vLLM project.
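Block-based KV-cache management, as described in the last item above, can be sketched as a block table that maps each sequence's logical token positions to fixed-size physical blocks drawn from a shared pool. The class below is a toy illustration with hypothetical names (`PagedKVCache`, `append_token`, `free`); it is not the API of vLLM or of any other runtime, and it tracks only slot bookkeeping, not the K/V tensors themselves.

```python
class PagedKVCache:
    """Toy block-table allocator for KV-cache slots."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # sequence id -> physical block ids
        self.lengths = {}                           # sequence id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one new token and return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:
            # First token of the sequence, or the last block is full:
            # take one more block from the pool instead of growing a contiguous buffer.
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
slots = [cache.append_token("request-a") for _ in range(3)]
print(slots)                # three slots spread over two physical blocks
cache.free("request-a")
print(len(cache.free_blocks))
```

Because a sequence only ever holds whole blocks, at most one partially filled block per sequence is wasted. A contiguous allocator, by contrast, has to reserve space for the maximum possible length up front.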