Prefill Tax | LLMS3

Definition

What it is

The compute cost of processing the input sequence before an LLM can generate its first output token. As prompts grow to hundreds of thousands or millions of tokens, the prefill phase dominates inference latency and cost: producing even the first output token requires a full forward pass, including attention, over the entire input. The "tax" framing reflects that this work is non-optional and grows superlinearly with prompt length (attention cost is quadratic in sequence length), no matter how short the response is.
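To make the superlinear growth concrete, here is a rough back-of-the-envelope sketch. It assumes a dense decoder-only transformer with causal attention and uses the common approximation of about 2 FLOPs per parameter per token for the weight matmuls, plus an attention term that is quadratic in prompt length. The function name and the model dimensions are illustrative assumptions, not figures from any source cited on this page.

```python
def prefill_flops(n_tokens: int, n_params: float, n_layers: int, d_model: int) -> float:
    """Approximate forward-pass FLOPs to prefill a prompt of n_tokens.

    Assumes a dense decoder-only transformer with causal attention.
    """
    # Weight matmuls: roughly 2 FLOPs per parameter per token (linear in n_tokens).
    linear = 2.0 * n_params * n_tokens
    # QK^T and attention-weighted V: roughly 2 * d_model * n_tokens^2 per layer
    # once causal masking halves the work (quadratic in n_tokens).
    attention = 2.0 * n_layers * d_model * n_tokens ** 2
    return linear + attention


# Hypothetical 70B-class dense model; these dimensions are assumptions for illustration.
N_PARAMS, N_LAYERS, D_MODEL = 70e9, 80, 8192

for n in (8_000, 128_000, 1_000_000):
    total = prefill_flops(n, N_PARAMS, N_LAYERS, D_MODEL)
    attn_share = 2.0 * N_LAYERS * D_MODEL * n ** 2 / total
    print(f"{n:>9,} tokens: {total:.2e} FLOPs, attention share {attn_share:.0%}")
```

Under these assumptions the linear term dominates for short prompts, while the quadratic attention term takes over as prompts approach the hundreds-of-thousands range, which is where the "tax" becomes most visible.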

Recent developments

Latest signals

Prefill/decode disaggregation: 70% higher throughput and 88% faster TTFT. Reported production numbers: running prefill (compute-bound) on H100 SXM5 / B200 / B300 and decode (memory-bound) on H200 SXM5 yields 70% higher throughput and 88% faster time-to-first-token (TTFT), with a sustained performance floor where monolithic deployments would have collapsed. Per Medium: Disaggregated LLM Inference (Thillai Chithambaram).
Disaggregation introduces a KV-transfer bandwidth bottleneck: 2.1 Tbps for 32K-token Qwen3-235B requests. The trade-off is that the KV cache must be transferred from prefill nodes to decode nodes. Serving 32K-token requests with Qwen3-235B on a 64-node prefill cluster requires 2.1 Tbps of KV egress bandwidth, and KV communication can take up to 60% of total job completion time. Per Spheron: Prefill-Decode Disaggregation 2026 Guide.
NIXL is the standard mechanism for KV transfer in vLLM and NVIDIA Dynamo. NIXL (NVIDIA Inference Xfer Library) transfers KV tensors between nodes over RDMA or TCP and is the canonical 2026 mechanism for the prefill-to-decode handoff in both vLLM and NVIDIA Dynamo. Per Spheron: Prefill-Decode Disaggregation.
Cache-aware prefill-decode disaggregation (CPD): up to 40% faster long-context serving. Together AI's approach adds cache-locality awareness to the routing decision of basic disaggregation, colocating cached prefixes with their decode nodes, and yields up to 40% faster long-context LLM serving. Per Together AI: Cache-Aware Disaggregated Inference.
KVServe: service-aware KV-cache compression for disaggregated serving. The paper formalizes KV-cache compression as a dimension of disaggregated LLM serving, extending the prefill/decode split with compression policies that adapt to each service's quality/latency trade-off. Per arXiv 2605.13734: KVServe.
Stored KV-cache reuse as the economic basis for context-augmented generation. Research direction: reusing the KV cache across requests that share prefixes to make context-augmented LLM generation cheaper, taking RadixAttention-style prefix sharing to its production-economic conclusion. Per arXiv 2503.14647: Reusing Stored KV Cache for Economical Context-Augmented Generation.
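The KV-transfer cost that disaggregation introduces can be sized with simple arithmetic. The sketch below estimates the KV-cache footprint of one request and the time to move it over a link, assuming a grouped-query-attention model storing FP16 keys and values. The helper names, model dimensions, and link speed are hypothetical placeholders, not the Qwen3-235B configuration or any measured figure from the sources above.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache for one sequence: keys and values for every layer and token."""
    # Factor of 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def transfer_seconds(n_bytes: float, link_gbps: float) -> float:
    """Ideal time to move n_bytes over a link of link_gbps gigabits per second."""
    # Ignores protocol overhead, congestion, and any overlap with compute.
    return n_bytes * 8 / (link_gbps * 1e9)


# Hypothetical model shape and link speed, chosen only for illustration.
size = kv_cache_bytes(n_tokens=32_768, n_layers=80, n_kv_heads=8, head_dim=128)
print(f"KV cache per 32K-token request: {size / 2**30:.1f} GiB")
print(f"Ideal transfer time at 400 Gbps: {transfer_seconds(size, 400):.3f} s")
```

Multiplying the per-request size by the request rate gives the aggregate egress bandwidth a prefill cluster must sustain, which is why prefix reuse and KV compression both reduce the cost of the handoff.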

Connections 6

Outbound 2

scoped_to 2

AI Memory Infrastructure, Inference Locality

Inbound 4

optimizes_for 4

LMCache, SGLang, Mooncake, Inference Context Memory Storage (ICMS)
