Definition

What it is

CacheGen is a streaming KV-cache compression and transmission system from researchers at the University of Chicago that treats the KV-cache as a **bitstream** rather than a tensor. It applies layer-wise quantization (with per-layer bit budgets) followed by a custom entropy coder (arithmetic coding with cross-channel context modeling) to shrink the KV-cache before it crosses the network, for example between prefill and decode workers in disaggregated serving. The paper's abstract reports a 3.5-4.3x reduction in KV-cache size with negligible impact on response quality.
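The two stages described above can be sketched in a few lines. This is a minimal illustration, not CacheGen's actual encoder: it uses plain uniform quantization, stands in for the arithmetic coder with a Shannon-entropy estimate (the size an ideal entropy coder would approach), and ignores context modeling entirely. The function names, the synthetic Gaussian data, and the per-layer bit budgets are all assumptions made for the example.

```python
import math
import random
from collections import Counter

def quantize_layer(values, bits):
    """Uniformly quantize floats to signed integer symbols using `bits` bits."""
    levels = 2 ** (bits - 1) - 1                  # symmetric range, e.g. 127 for 8 bits
    scale = max(abs(v) for v in values) / levels or 1.0
    symbols = [round(v / scale) for v in values]
    return symbols, scale

def dequantize_layer(symbols, scale):
    """Map integer symbols back to approximate float values."""
    return [s * scale for s in symbols]

def entropy_bits(symbols):
    """Total Shannon entropy in bits: a proxy for ideal entropy-coded size."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c * math.log2(c / n) for c in counts.values())

def compress_kv(layers, bit_budgets):
    """Quantize each layer with its own bit budget and report per-layer stats."""
    report = []
    for values, bits in zip(layers, bit_budgets):
        symbols, scale = quantize_layer(values, bits)
        restored = dequantize_layer(symbols, scale)
        report.append({
            "bits": bits,
            "scale": scale,
            "raw_bits": 16 * len(values),          # assumes fp16 storage
            "coded_bits": entropy_bits(symbols),
            "max_err": max(abs(a - b) for a, b in zip(values, restored)),
        })
    return report

# Synthetic stand-in for per-layer KV values (not real model activations).
random.seed(0)
layers = [[random.gauss(0.0, 1.0) for _ in range(4096)] for _ in range(4)]
bit_budgets = [8, 6, 4, 3]                         # hypothetical per-layer budgets

report = compress_kv(layers, bit_budgets)
for i, r in enumerate(report):
    ratio = r["raw_bits"] / r["coded_bits"]
    print(f"layer {i}: {r['bits']} bits, ~{ratio:.1f}x smaller, max error {r['max_err']:.3f}")
```

Lower bit budgets give a smaller coded size at the cost of a larger reconstruction error, which is the trade-off a per-layer budget is meant to tune.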

Why it exists

Prefill-decode disaggregation requires shipping the full KV-cache from the prefill worker to the decode worker once prefill completes. For long contexts on large models, that can reach tens of GB per request: fast on NVLink (within a node), painful on InfiniBand (across nodes), and a non-starter on standard 100GbE. CacheGen's compression aims to make cross-node disaggregated serving practical on commodity Ethernet.
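A back-of-envelope calculation shows where the "tens of GB" figure comes from. The sketch below assumes a hypothetical 80-layer model with 8 KV heads of dimension 128, fp16 values, and a 100,000-token context. The link rates are nominal line rates with no protocol overhead, and the 4x compression ratio is an assumed round number rather than a measured result.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Size of the full KV-cache; the leading 2 covers separate key and value tensors."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value

def transfer_seconds(num_bytes, link_gbit_per_s):
    """Ideal transfer time at the nominal line rate, ignoring protocol overhead."""
    return num_bytes * 8 / (link_gbit_per_s * 1e9)

# Hypothetical model shape and context length, chosen only for illustration.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, num_tokens=100_000)
print(f"KV-cache size: {size / 1e9:.1f} GB")

# Nominal line rates in Gbit/s (assumed values).
links = {"10GbE": 10, "100GbE": 100, "400 Gbit/s fabric": 400}
compression_ratio = 4.0  # assumed ratio, not a measured result

for name, rate in links.items():
    raw = transfer_seconds(size, rate)
    compressed = transfer_seconds(size / compression_ratio, rate)
    print(f"{name}: {raw:.2f} s uncompressed, {compressed:.2f} s at {compression_ratio:.0f}x")
```

Under these assumptions the transfer time scales down linearly with the compression ratio, so the benefit is largest on the slowest links.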

Primary use cases

Disaggregated LLM serving over commodity networks; edge-to-cloud agentic workflows where the KV-cache is precomputed at the edge and streamed to a cloud decoder; and hierarchical KV-cache storage in object stores (S3, MinIO), where transmission cost dominates.

Recent developments

Latest signals

Foundational CacheGen paper. "CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving" introduced the streaming-oriented KV-cache compression scheme: layer-wise quantization with per-layer bit budgets, followed by arithmetic coding with cross-channel context modeling. Per arXiv 2310.07240 (CacheGen).
Integrated as LMCache compression backend. LMCache ships CacheGen as an optional compression layer between its remote store and its transport. Per LMCache + CacheGen blog.
CacheGen reference implementation. Open-source repo (UChi-JCL/CacheGen) containing the encoder, the decoder, and the LMCache integration glue. Per GitHub (UChi-JCL/CacheGen).