Memory Efficient Attention

Definition

What it is

An umbrella family of attention techniques that reduce the memory footprint of the attention operation. It covers **FlashAttention** (kernel-level tiling with recomputation, which avoids materializing the N×N score matrix and brings attention's working memory from O(N²) down to O(N)), **PagedAttention** (block-allocated KV-cache management, which cuts fragmentation), **Multi-Query Attention (MQA)** and **Grouped-Query Attention (GQA)** (fewer K and V heads, which shrinks the KV cache by a constant factor), and **Multi-head Latent Attention (MLA)** (K and V compressed into a shared latent representation). Each sits at a different point on the memory-quality-throughput Pareto frontier; modern LLM architectures combine several.
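The tiling idea behind FlashAttention can be illustrated with an online softmax. The following is a minimal pure-Python sketch of the forward pass only, not the actual FlashAttention kernel: the function names, the `tile` parameter, and the list-of-lists layout are assumptions made for illustration. It streams K and V in tiles and keeps only a running maximum, a running normalizer, and an output accumulator per query, so no full row of N scores is ever held at once.

```python
import math

def naive_attention(q, k, v):
    """Reference: full softmax over all N keys for each query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out

def tiled_attention(q, k, v, tile=2):
    """Streams K/V in tiles; holds at most `tile` scores at a time per query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    d_v = len(v[0])
    out = []
    for qi in q:
        running_max = float("-inf")  # max score seen so far
        running_sum = 0.0            # softmax normalizer relative to running_max
        acc = [0.0] * d_v            # unnormalized weighted sum of V rows
        for start in range(0, len(k), tile):
            k_tile = k[start:start + tile]
            v_tile = v[start:start + tile]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_tile]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results so they are relative to the new max.
            correction = math.exp(running_max - new_max)
            running_sum *= correction
            acc = [a * correction for a in acc]
            for s, vj in zip(scores, v_tile):
                w = math.exp(s - new_max)
                running_sum += w
                for c in range(d_v):
                    acc[c] += w * vj[c]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

Both functions should agree up to floating-point rounding. The difference is the peak working memory per query: N scores for the reference versus `tile` scores plus a fixed-size accumulator for the tiled version.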

Why it exists

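Naive scaled dot-product attention materializes an N×N attention score matrix and an N×N matrix of softmax weights before producing the N×d output, so memory and bandwidth grow as O(N²). For N=128k in FP16, a single score matrix is about 32 GiB per head, which adds up to hundreds of GB or more per layer depending on head count. That is infeasible on practical hardware. At inference time the KV cache adds a second pressure: it grows linearly with context length, layer count, and the number of K/V heads. Memory Efficient Attention is the family of techniques that lets transformers scale to 100k+ context windows on practical hardware, and it is effectively a *prerequisite* for long-context models.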
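The quadratic cost is easy to check with back-of-the-envelope arithmetic. The sketch below assumes FP16 scores (2 bytes per element) and a hypothetical 32-head layer; `score_matrix_bytes` is an illustrative helper, not part of any library.

```python
def score_matrix_bytes(seq_len, num_heads=1, bytes_per_element=2):
    """Bytes needed to materialize one N x N score matrix per head."""
    return seq_len * seq_len * num_heads * bytes_per_element

GIB = 2**30
n = 128 * 1024  # 128k tokens

per_head = score_matrix_bytes(n)                 # one head, FP16
per_layer = score_matrix_bytes(n, num_heads=32)  # hypothetical 32-head layer

print(f"per head:  {per_head / GIB:.0f} GiB")    # 32 GiB
print(f"per layer: {per_layer / GIB:.0f} GiB")   # 1024 GiB
```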

Primary use cases

Nearly every modern transformer language model uses at least one Memory Efficient Attention variant: Llama 3/4 use GQA + FlashAttention, DeepSeek-V3/V4 use MLA + FlashAttention, and Gemma 4 uses shared-KV layers + FlashAttention. These choices are invisible to model users, but they determine which context lengths are feasible and at what cost.
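To see how the choice of variant affects cost, the following sketch estimates KV-cache bytes per token for standard multi-head attention (MHA), GQA, MQA, and an MLA-style latent cache. Every configuration value here (32 layers, 32 query heads, head dimension 128, 8 KV heads for GQA, a 576-element latent) is a hypothetical chosen for illustration and does not describe any specific model; both helper functions are likewise illustrative.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """One K vector and one V vector per KV head, per layer (hence the factor 2)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

def latent_cache_bytes_per_token(num_layers, latent_dim, bytes_per_element=2):
    """MLA-style cache: one shared compressed latent vector per layer."""
    return num_layers * latent_dim * bytes_per_element

# Hypothetical configuration, FP16 cache.
LAYERS, Q_HEADS, HEAD_DIM = 32, 32, 128

mha = kv_cache_bytes_per_token(LAYERS, Q_HEADS, HEAD_DIM)  # KV heads == query heads
gqa = kv_cache_bytes_per_token(LAYERS, 8, HEAD_DIM)        # 8 KV heads shared by 32 query heads
mqa = kv_cache_bytes_per_token(LAYERS, 1, HEAD_DIM)        # a single KV head
mla = latent_cache_bytes_per_token(LAYERS, 576)            # assumed latent width

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa), ("MLA", mla)]:
    print(f"{name}: {size / 1024:.0f} KiB per token")
```

Under these assumptions GQA cuts the cache by 4x and MQA by 32x relative to MHA. Multiplying the per-token figure by the context length gives the per-sequence cache size that a serving system has to budget for.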

Recent developments

Latest signals
  • The taxonomy itself is now teaching material. 2026 survey papers and tutorials (HuggingFace, NeurIPS workshops, Sebastian Raschka's "Big LLM Architecture Comparison") position MQA, GQA, MLA, and shared-KV as four points on a single design axis: how much K and V are compressed. Per Sebastian Raschka: Big LLM architecture comparison 2026.
  • MLA + FP8 stack (SnapMLA) emerging as the new default for long-context. The combination of MLA's structural compression (a smaller latent KV representation) and FP8's numerical compression (fewer bits per element) dominates new model releases. Per arXiv 2602.10718: SnapMLA.
  • FlashAttention 3 and 4 ship with native Hopper TMA and Blackwell async-warp support. FlashAttention is now hardware-coupled: each generation is tuned to the underlying GPU architecture, such as Hopper's Tensor Memory Accelerator (TMA). Per arXiv 2407.08608: FlashAttention-3.
  • PagedAttention won as the cache-management standard. Major serving runtimes (vLLM, TensorRT-LLM, SGLang, MLC-LLM) now use block-based KV-cache management; pre-PagedAttention contiguous allocation is legacy. Per vLLM project.
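The block-based KV-cache management mentioned in the last item can be sketched as a small allocator. This is a toy illustration of the idea, not vLLM's actual API: the class name `PagedKVCache`, its methods, and the block sizes are all hypothetical. Each sequence owns a block table that maps its logical token positions to fixed-size physical blocks, which are allocated on demand and returned to a shared pool when the sequence finishes.

```python
class PagedKVCache:
    """Toy block allocator: maps each sequence to a list of physical block ids."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # shared pool of physical blocks
        self.block_tables = {}                      # seq_id -> [physical block ids]
        self.seq_lens = {}                          # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one new token; returns (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:
            # The last block is full (or the sequence is new): take a fresh block.
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("a")   # 3 tokens occupy 2 blocks
cache.append_token("b")       # a second sequence takes 1 block
cache.free("a")               # both of "a"'s blocks become reusable
```

Because blocks are fixed-size and need not be contiguous, a sequence wastes at most one partially filled block. Contiguous allocation, by contrast, has to reserve space for the maximum possible length up front.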
