Prefill-Decode Disaggregation
Definition
An LLM-serving architecture pattern that splits the two compute phases of transformer inference — **prefill** (compute-bound, processes the entire prompt in one forward pass to fill the KV-cache) and **decode** (memory-bandwidth-bound, generates one token per pass over the existing KV-cache) — into separate worker pools, each optimized for its phase. The completed KV-cache is shipped from prefill workers to decode workers via RDMA, NVLink, or (with CacheGen-style compression) commodity Ethernet.
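The handoff described above can be sketched as a toy simulation. This is a minimal illustration, not how any real runtime is implemented: the `Request` class, the integer "KV-cache", the toy next-token rule, and the function names are all hypothetical stand-ins, and the `transfer` step only moves a Python object between queues where a real system would move tensors over RDMA, NVLink, or Ethernet.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: int
    prompt_tokens: list
    max_new_tokens: int
    kv_cache: list = field(default_factory=list)       # stand-in for real K/V tensors
    output_tokens: list = field(default_factory=list)


def prefill(req):
    """Prefill worker: one pass over the whole prompt fills the KV-cache."""
    req.kv_cache = list(req.prompt_tokens)  # toy: one cache entry per prompt token
    return req


def transfer(req, decode_queue):
    """Hypothetical stand-in for shipping the completed cache to the decode pool."""
    decode_queue.append(req)


def decode_step(req):
    """Decode worker: one new token per pass over the existing cache.

    Returns True once the request has produced max_new_tokens tokens.
    """
    next_token = sum(req.kv_cache) % 1000  # toy rule that reads the whole cache
    req.output_tokens.append(next_token)
    req.kv_cache.append(next_token)
    return len(req.output_tokens) >= req.max_new_tokens


def serve(requests):
    prefill_queue = deque(requests)
    decode_queue = deque()
    finished = {}
    # Prefill pool: drain prompts and hand each completed cache to the decode pool.
    while prefill_queue:
        transfer(prefill(prefill_queue.popleft()), decode_queue)
    # Decode pool: round-robin, one token per request per iteration.
    while decode_queue:
        req = decode_queue.popleft()
        if decode_step(req):
            finished[req.request_id] = req.output_tokens
        else:
            decode_queue.append(req)
    return finished
```

In this sketch the two loops run one after the other for simplicity; in a disaggregated deployment the pools run concurrently on separate workers, so prefill work never sits in the same batch as decode steps.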
A single GPU pool serving both prefill and decode is poorly balanced. Prefill saturates compute while leaving most HBM bandwidth idle; decode saturates HBM bandwidth while leaving compute underutilized. Under shared scheduling the two phases also interfere: long prefill batches stall in-flight decodes and inflate inter-token tail latency, while decode iterations delay queued prefills and cut prefill throughput. Disaggregation lets each pool be sized and tuned for its own bottleneck, and lets the operator buy *different* hardware for each: compute-dense accelerators for prefill, and accelerators with more HBM bandwidth and capacity per dollar for decode.
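A rough roofline estimate illustrates why the two phases hit different limits. The sketch below assumes a single square linear layer in 16-bit precision and counts only weight and activation traffic; the function names are illustrative, and the hardware figures in the usage note are assumed round numbers rather than the specification of any particular GPU.

```python
def linear_layer_intensity(num_tokens, d_model, bytes_per_value=2):
    """Arithmetic intensity (FLOP per byte) of one d_model x d_model matmul.

    num_tokens is the number of tokens processed in the pass:
    the prompt length for prefill, 1 for a single-request decode step.
    """
    flops = 2 * num_tokens * d_model * d_model                 # multiply + add
    weight_bytes = bytes_per_value * d_model * d_model         # weights read once
    activation_bytes = bytes_per_value * 2 * num_tokens * d_model  # input + output
    return flops / (weight_bytes + activation_bytes)


def bound(intensity, peak_flops, mem_bandwidth):
    """Classify a kernel against the roofline ridge point (FLOP per byte)."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bandwidth-bound"
```

With `d_model = 4096` and an assumed accelerator offering 1e15 FLOP/s and 3e12 B/s (a ridge point of roughly 333 FLOP/byte), a 2048-token prefill pass has an intensity of 1024 FLOP/byte and is classified as compute-bound, while a single-token decode step has an intensity of about 1 FLOP/byte and is classified as memory-bandwidth-bound. Batching many decode requests raises the intensity of the weight matmuls, but attention still has to read each request's own KV-cache, which this simplified model does not capture.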
Typical use cases are high-throughput production LLM serving at scale (as a rough rule of thumb, deployments of 4 or more GPUs), multi-tenant platforms with mixed prompt-length distributions, long-context serving, where prefill cost is largest, and agentic workflows, where prefix-cache hits let much of the prefill be skipped and so add to the benefit of disaggregation.
Recent developments
- Mooncake formalized the pattern in 2024; production adoption accelerated through 2025-2026. The Mooncake paper from Moonshot AI is the canonical architectural reference, and the major serving runtimes now ship disaggregated-serving support. Per arXiv 2407.00079, "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving".
- vLLM and TensorRT-LLM both ship disaggregated serving. Both runtimes added native disaggregated executor pools with NIXL-based RDMA cache transport between them. Per the vLLM repo and TensorRT-LLM repo.
- DeepSeek-V3 uses disaggregation as its reference serving topology. The technical report describes a deployment strategy that separates the prefilling and decoding stages into their own deployment units, with MLA shrinking the KV-cache so cross-node transport stays tractable. Per the DeepSeek-V3 repo and DeepSeek-V3 technical report (arXiv 2412.19437).
- CacheGen-compressed transport can enable disaggregation over commodity Ethernet. MLA (which shrinks the cache itself) combined with CacheGen (which compresses it further on the wire) can make prefill-decode disaggregation viable over standard 100GbE without RDMA, lowering the deployment-hardware bar. Per arXiv 2310.07240 (CacheGen).
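To see why cache size and wire compression decide whether Ethernet transport is tractable, a back-of-the-envelope estimate helps. The sketch below assumes a standard multi-head or grouped-query attention layout (one K and one V tensor per layer); it does not model MLA, which stores a compressed latent instead. The function names, the model shape in the usage note, and the compression ratio are all illustrative assumptions, not figures from the CacheGen or DeepSeek-V3 papers.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Size of the KV-cache for one request under a standard attention layout."""
    # Factor of 2: one K tensor and one V tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


def transfer_seconds(num_bytes, link_gbit_per_s, compression_ratio=1.0, link_efficiency=1.0):
    """Idealized wire time: ignores latency, protocol overhead, and codec time.

    compression_ratio is original size divided by compressed size;
    link_efficiency is the achievable fraction of nominal link bandwidth.
    """
    bytes_per_second = link_gbit_per_s * 1e9 / 8 * link_efficiency
    return (num_bytes / compression_ratio) / bytes_per_second
```

For a hypothetical model with 32 layers, 8 KV heads, and head dimension 128, an 8192-token prompt in 16-bit precision yields `kv_cache_bytes(32, 8, 128, 8192)` = 1,073,741,824 bytes (1 GiB). At a nominal 100 Gbit/s that is roughly 0.086 s of wire time uncompressed, and an assumed 4x compression ratio would cut it to about 0.021 s. Real transfers would add latency and compression cost on top of these idealized figures.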