Architecture

Prefill-Decode Disaggregation

Definition

What it is

An LLM-serving architecture pattern that splits the two phases of transformer inference into separate worker pools, each tuned for its phase. **Prefill** is compute-bound: it processes the entire prompt in one forward pass and fills the KV-cache. **Decode** is memory-bandwidth-bound: it generates one token per forward pass, reading the existing KV-cache each time. Once prefill completes, the KV-cache is transferred from the prefill worker to a decode worker over RDMA, NVLink, or (with CacheGen-style compression) commodity Ethernet.
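The handoff described above can be sketched with a toy pipeline. This is a minimal illustration, not any runtime's actual API: `KVCache`, `PrefillWorker`, `DecodeWorker`, `toy_next_token`, and `serve` are hypothetical names, the "model" is a deterministic stand-in, and an in-process queue stands in for the RDMA / NVLink / Ethernet transport.

```python
from dataclasses import dataclass
from queue import Queue


@dataclass
class KVCache:
    """Stand-in for per-request attention state; real caches hold tensors."""
    request_id: str
    entries: list  # one entry per token processed so far


def toy_next_token(entries: list) -> int:
    # Hypothetical deterministic "model": the next token depends only on the cache.
    return (sum(entries) + len(entries)) % 50


class PrefillWorker:
    """Processes the whole prompt in one pass and emits the filled cache."""

    def run(self, request_id: str, prompt: list) -> tuple:
        cache = KVCache(request_id, list(prompt))  # single pass over all prompt tokens
        first_token = toy_next_token(cache.entries)
        return cache, first_token


class DecodeWorker:
    """Extends a received cache by one token per step."""

    def run(self, cache: KVCache, first_token: int, max_new_tokens: int) -> list:
        output = [first_token]
        cache.entries.append(first_token)
        while len(output) < max_new_tokens:
            token = toy_next_token(cache.entries)  # one pass per generated token
            cache.entries.append(token)
            output.append(token)
        return output


def serve(prompt: list, max_new_tokens: int, request_id: str = "req-0") -> list:
    transfer = Queue()  # assumption: stands in for the real cache transport
    transfer.put(PrefillWorker().run(request_id, prompt))
    cache, first_token = transfer.get()
    return DecodeWorker().run(cache, first_token, max_new_tokens)


if __name__ == "__main__":
    print(serve([1, 2, 3], max_new_tokens=4))
```

The point of the sketch is the interface: the only state crossing the pool boundary is the cache plus the first sampled token, which is why the transport link and the cache size dominate the cost of disaggregation.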

Why it exists

A single GPU pool serving both prefill and decode is poorly balanced. Prefill is limited by compute and leaves HBM bandwidth largely idle; decode is limited by HBM bandwidth and leaves compute largely idle. Under shared scheduling the two phases also interfere: long prefill batches stall in-flight decode steps and inflate decode tail latency, while decode batches hold GPU time that queued prefills need and cut prefill throughput. Disaggregation lets each pool be batched, parallelized, and scaled for its own bottleneck, and lets the operator buy *different* hardware for each: compute-dense accelerators for prefill, and accelerators with more memory bandwidth and capacity per FLOP for decode.
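A rough roofline estimate shows where the imbalance comes from. The sketch below assumes a simplified cost model (about 2 FLOPs per parameter per token, weights read from HBM once per forward pass, KV-cache and activation traffic ignored) and a hypothetical accelerator with round-number specs; the function names are illustrative.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weights read in one forward pass.

    Simplified model: each token costs ~2 FLOPs per parameter, and every
    parameter is read from HBM once per pass no matter how many tokens share it.
    """
    flops_per_param = 2.0 * tokens_per_pass
    return flops_per_param / bytes_per_param


def ridge_point(peak_flops: float, hbm_bytes_per_s: float) -> float:
    """Intensity above which the accelerator is limited by compute, not bandwidth."""
    return peak_flops / hbm_bytes_per_s


def bound(tokens_per_pass: int, peak_flops: float, hbm_bytes_per_s: float) -> str:
    intensity = arithmetic_intensity(tokens_per_pass)
    if intensity >= ridge_point(peak_flops, hbm_bytes_per_s):
        return "compute-bound"
    return "memory-bound"


if __name__ == "__main__":
    # Hypothetical accelerator: 1e15 FLOP/s peak, 3e12 bytes/s HBM bandwidth.
    peak, bandwidth = 1e15, 3e12
    print("prefill, 4096-token prompt:", bound(4096, peak, bandwidth))
    print("decode, batch of 16:       ", bound(16, peak, bandwidth))
```

Under these assumptions a prefill pass over thousands of tokens sits far above the ridge point, while a decode step sits far below it unless the batch is very large. In practice decode also re-reads the KV-cache every step, which pushes it further toward the memory-bound side.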

Primary use cases

High-throughput production LLM serving at scale (as a rough guide, deployments of 4 or more GPUs); multi-tenant platforms with mixed prompt-length distributions; long-context serving, where prefill cost is largest; and agentic workflows, where prefix-cache hits let much of prefill be skipped, which disaggregation can exploit by sizing the prefill pool independently of the decode pool.

Recent developments

Latest signals
  • Mooncake formalized the pattern in 2024; production adoption accelerated through 2025-2026. The Mooncake paper from Moonshot AI is a widely cited architectural reference, and several major serving runtimes now ship disaggregated-serving support. Per arXiv 2407.00079, "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving".
  • vLLM and TensorRT-LLM both ship disaggregated serving. Each runtime can run separate prefill and decode worker pools, with NIXL-based KV-cache transport (including over RDMA) between them. Per the vLLM repo and TensorRT-LLM repo.
  • DeepSeek-V3 uses disaggregation as its reference serving topology. The technical report describes separately deployed prefill and decode stages, with MLA (multi-head latent attention) shrinking the KV-cache so cross-node transport stays tractable. Per the DeepSeek-V3 repo and the DeepSeek-V3 technical report (arXiv 2412.19437).
  • CacheGen-style compression can make disaggregation workable over commodity Ethernet. MLA shrinks the cache itself and CacheGen compresses it further on the wire; together they can make prefill-decode disaggregation viable over standard 100GbE without RDMA, lowering the hardware bar for deployment. Per arXiv 2310.07240 (CacheGen).
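The transport claims in the signals above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is a rough estimate under stated assumptions: a standard (non-MLA) attention layout, fp16 cache values, a hypothetical model shape (32 layers, 8 KV heads, head dimension 128), and a hypothetical compression ratio. It ignores protocol overhead, latency, and any overlap of transfer with compute.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Size of the KV-cache for one request; the leading 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def transfer_seconds(num_bytes: int, link_gbit_per_s: float,
                     compression_ratio: float = 1.0) -> float:
    """Idealized wire time: compressed bytes divided by link bandwidth."""
    link_bytes_per_s = link_gbit_per_s * 1e9 / 8
    return num_bytes / compression_ratio / link_bytes_per_s


if __name__ == "__main__":
    # Hypothetical model shape and an 8192-token prompt.
    size = kv_cache_bytes(tokens=8192, layers=32, kv_heads=8, head_dim=128)
    print(f"cache size: {size / 2**30:.2f} GiB")
    print(f"100GbE, uncompressed: {transfer_seconds(size, 100) * 1e3:.1f} ms")
    # Assumed 4x compression ratio, chosen only for illustration.
    print(f"100GbE, 4x compressed: {transfer_seconds(size, 100, 4.0) * 1e3:.1f} ms")
```

With these numbers the cache is 1 GiB and takes roughly 86 ms to cross an ideal 100 Gbit/s link uncompressed. Any reduction in cache size, whether from the attention design or from wire compression, cuts that time proportionally, which is the lever the MLA and CacheGen signals point at.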
