Memory Efficient Attention

Definition

What it is

An umbrella family of attention methods that reduce the memory footprint of attention. It includes **FlashAttention** (kernel-level tiling with recomputation, which cuts the attention operation's working memory from O(N²) to O(N) in sequence length N), **PagedAttention** (block-allocated KV-cache management), **Multi-Query Attention (MQA)** and **Grouped-Query Attention (GQA)** (head-count reduction on K and V), and **Multi-head Latent Attention (MLA)** (a compressed shared latent representation of K and V). The last four shrink or better utilize the KV cache, which already grows as O(N), rather than change its asymptotic order. Each represents a different point on the memory-quality-throughput Pareto frontier; modern LLM architectures typically combine several.
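To make the head-count and latent-compression variants concrete, the sketch below estimates the KV-cache bytes stored per token under each scheme. The model shape (32 layers, 32 query heads, head dimension 128), the 8-KV-head GQA setting, and the MLA latent and positional-key widths (512 and 64) are illustrative assumptions, not the configuration of any particular model, and both function names are hypothetical.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of K and V cached per token, summed over all layers.

    The leading factor 2 accounts for storing both K and V.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def mla_cache_bytes_per_token(n_layers, latent_dim, rope_dim, bytes_per_elem=2):
    """MLA-style cache: one shared latent vector plus a small positional key
    per layer, instead of per-head K and V (assumed layout, for illustration)."""
    return n_layers * (latent_dim + rope_dim) * bytes_per_elem


# Hypothetical model shape, FP16 cache entries (2 bytes each).
N_LAYERS, N_HEADS, HEAD_DIM = 32, 32, 128

sizes = {
    "MHA (32 KV heads)": kv_cache_bytes_per_token(N_LAYERS, N_HEADS, HEAD_DIM),
    "GQA (8 KV heads)": kv_cache_bytes_per_token(N_LAYERS, 8, HEAD_DIM),
    "MQA (1 KV head)": kv_cache_bytes_per_token(N_LAYERS, 1, HEAD_DIM),
    "MLA (latent 512 + 64)": mla_cache_bytes_per_token(N_LAYERS, 512, 64),
}

for name, n_bytes in sizes.items():
    print(f"{name:24s} {n_bytes / 1024:7.1f} KiB per token")
```

Under these assumptions the cache per token falls from 512 KiB (full multi-head) to 128 KiB (GQA), 36 KiB (MLA), and 16 KiB (MQA). Multiplying by the context length gives the total cache per sequence.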

Why it exists

Naive scaled-dot-product attention materializes an N×N attention score matrix and an N×N softmax probability matrix per head, consuming O(N²) memory and bandwidth. For N=128k, a single such matrix is roughly 32 GB per head in FP16, which adds up to hundreds of GB per layer across heads and is infeasible on practical hardware. Memory Efficient Attention is the foundational architectural family that lets transformers scale to 100k+ context windows, and it is the *prerequisite* architecture for every long-context model on the market.
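The N×N matrix can be avoided because softmax can be computed incrementally. The sketch below is a minimal pure-Python illustration of that online-softmax idea for a single query row: it visits keys and values block by block and keeps only a running maximum, a running normalizer, and a running weighted sum. It is an assumption-laden teaching example, not FlashAttention's actual kernel (which runs on GPU tiles in on-chip memory). The function names, the block size, and the sample data are all hypothetical.

```python
import math


def naive_attention_row(q, keys, values):
    """Reference: builds the full length-N score row before the softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def streaming_attention_row(q, keys, values, block_size=2):
    """Online softmax: processes K/V in blocks with O(d) running state."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


# Hypothetical toy data: one query, five keys/values of dimension 2.
q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.0, -1.0], [0.3, 0.8], [-0.5, 0.4], [0.9, 0.9]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-1.0, 3.0]]

ref = naive_attention_row(q, keys, values)
out = streaming_attention_row(q, keys, values, block_size=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(ref, out)))
```

Both functions return the same output up to floating-point error, but the streaming version never holds more than one block of scores at a time.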

Primary use cases

Virtually every modern transformer language model uses at least one Memory Efficient Attention variant: Llama 3/4 use GQA + FlashAttention, DeepSeek-V3/V4 use MLA + FlashAttention, and Gemma 4 uses shared-KV layers + FlashAttention. The architecture is invisible to model users, but it determines which context lengths are feasible and at what cost.

Recent developments

Latest signals

The taxonomy itself is now teaching material. Comprehensive 2026 survey papers and tutorials (HuggingFace, NeurIPS workshops, Sebastian Raschka's "Big LLM Architecture Comparison") position MQA, GQA, MLA, and shared-KV as four points on a single design axis: how much K and V are compressed. Per Sebastian Raschka, Big LLM architecture comparison 2026.
An MLA + FP8 stack (SnapMLA) is emerging as the new default for long-context models. The combination of MLA's structural compression and FP8's reduced-precision compression dominates new model releases. Per arXiv 2602.10718, SnapMLA.
FlashAttention-3 and FlashAttention-4 ship with native support for Hopper TMA and Blackwell async-warp execution. FlashAttention is now hardware-coupled: each generation tracks the underlying GPU architecture, such as Hopper's Tensor Memory Accelerator (TMA). Per arXiv 2407.08608, FlashAttention-3.
PagedAttention has become the cache-management standard. Every major serving runtime (vLLM, TensorRT-LLM, SGLang, MLC-LLM) now uses block-based KV-cache management; pre-PagedAttention contiguous allocation is legacy. Per vLLM project.
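Block-based KV-cache management can be sketched as a small allocator: physical memory is split into fixed-size blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks. The class below is a toy illustration under that assumption. The class name, method names, and sizes are hypothetical and do not reflect the API of vLLM or any other runtime.

```python
class PagedKVCache:
    """Toy block allocator: each sequence owns a list of fixed-size blocks
    (its block table) drawn from a shared free pool."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:
            # The last block is full (or the sequence has no block yet).
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
slots_a = [cache.append_token("a") for _ in range(3)]  # spans two blocks
slots_b = [cache.append_token("b") for _ in range(2)]  # fits in one block
cache.free("a")                                        # blocks are reusable at once
print(slots_a, slots_b, sorted(cache.free_blocks))
```

Because blocks are allocated on demand and need not be contiguous, a sequence wastes at most one partially filled block, and memory released by a finished sequence can be handed to any other sequence immediately.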

Connections

Outbound

scoped_to: AI Memory Infrastructure, Object Storage

solves: Memory Wall

Inbound

alternative_to: Multi-Head Latent Attention (MLA)

depends_on: GLM-5
