Definition

What it is

Mooncake is the open-source LLM serving platform for **Kimi**, Moonshot AI's leading LLM product. Repository: [github.com/kvcache-ai/Mooncake](https://github.com/kvcache-ai/Mooncake). Its distinguishing architectural feature is **disaggregated prefill**: the prefill compute pool is separated from the decode compute pool, and KV-cache state is transferred between them through a dedicated storage layer (DRAM, NVMe, or S3-compatible object storage). This pattern is the structural answer to a basic tension: prefill is compute-intensive, decode is memory-bound, and the two phases therefore have different optimal hardware.
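The prefill/decode split described above can be sketched in a few lines. This is a toy, single-process illustration and not Mooncake's API: `KVTransferStore`, `prefill_worker`, and `decode_worker` are hypothetical names, the "KV cache" is a plain list, and the model forward pass is replaced by a placeholder.

```python
class KVTransferStore:
    """Hypothetical stand-in for the KV-cache storage layer (an in-process dict)."""

    def __init__(self):
        self._entries = {}

    def put(self, request_id, kv_cache):
        self._entries[request_id] = kv_cache

    def take(self, request_id):
        # Remove the entry on read: the decode worker now owns the cache.
        return self._entries.pop(request_id)

    def __len__(self):
        return len(self._entries)


def prefill_worker(request_id, prompt_tokens, store):
    """Prefill pool: process the whole prompt once and publish its KV cache."""
    # Toy KV cache: one entry per prompt token (a real cache holds tensors).
    kv_cache = [("kv", token) for token in prompt_tokens]
    store.put(request_id, kv_cache)
    return len(kv_cache)


def decode_worker(request_id, store, max_new_tokens):
    """Decode pool: fetch the transferred KV cache and generate token by token."""
    kv_cache = store.take(request_id)
    generated = []
    for _ in range(max_new_tokens):
        next_token = len(kv_cache)  # placeholder for a model forward pass
        generated.append(next_token)
        kv_cache.append(("kv", next_token))  # decode extends the cache
    return generated


if __name__ == "__main__":
    store = KVTransferStore()
    prefill_worker("req-1", [11, 12, 13], store)
    print(decode_worker("req-1", store, max_new_tokens=2))
```

The point of the sketch is the ownership hand-off: the prefill side never generates tokens, and the decode side never reprocesses the prompt, so each pool can be sized and scheduled independently.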

Why it exists

Production LLM serving at scale (Moonshot AI claims hundreds of thousands of concurrent users on the Kimi service) requires an architecture that does not waste prefill compute on decode-heavy workloads, or decode capacity on prefill-heavy ones. Mooncake formalized the disaggregated-prefill pattern in open-source form, making the architecture reproducible outside Moonshot AI's internal infrastructure.

Primary use cases

High-scale LLM serving with disaggregated prefill/decode, KV-cache transfer over object-storage substrates, Kimi-style long-context model serving, multi-tenant LLM platforms targeting cost-per-token optimization.

Recent developments

Latest signals

Mooncake paper published in ACM Transactions on Storage (TOS). "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving" appears in a top-tier storage-systems journal, which formalizes the disaggregated-KVCache architecture pattern. Per ACM TOS (Mooncake and Tsinghua; Mooncake TOS 2025 PDF).
USENIX FAST '25 paper: "Trading More Storage for Less Computation." Mooncake's FAST paper makes the economic case explicit: KVCache disaggregation trades storage spend for compute spend, with measured gains of 59%–498% in effective request capacity under SLO constraints. Per USENIX FAST '25 (Mooncake: Trading More Storage for Less Computation).
Powers Kimi at thousands of nodes, 100B+ tokens/day. Mooncake is operational across thousands of nodes serving the Kimi chatbot and processes more than 100 billion tokens per day, a real-world scale-out that supports the architectural claims. Per Mooncake docs.
vLLM officially features Mooncake Store. vLLM published a deep dive on how Mooncake's distributed KVCache engine speeds up vLLM inference through high-throughput, memory-efficient, cross-instance KV cache sharing. The two projects are now interoperating rather than competing. Per vLLM + Mooncake Store deep dive.
Pools CPU + DRAM + SSD + NIC of the GPU cluster. "Mooncake Store" reuses underexploited resources of the GPU cluster: it pools idle DRAM and SSD across nodes via RDMA so that the global KV cache can exceed any single node's memory. The architectural insight is to reason about cluster-wide memory rather than per-node memory. Per arXiv 2407.00079 (Mooncake KVCache-centric Disaggregated Architecture).
Long-context workloads are the killer use case. Mooncake increases effective request capacity by 59%–498% specifically on long-context-input workloads, which is the regime where Kimi's product positioning (1M-token contexts) lives. Per arXiv 2407.00079.
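The cluster-wide pooling and the storage-for-compute trade in the signals above can be illustrated with a toy prefix cache. This is a generic sketch of chained block hashing, not Mooncake Store's interface: `PooledPrefixCache`, `block_hashes`, `tokens_to_prefill`, the node names, and the block size of 4 are all assumptions chosen for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block; illustrative value only


def block_hashes(tokens):
    """Chain-hash full blocks so each hash identifies the entire prefix so far."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore a trailing partial block
    for start in range(0, full, BLOCK_SIZE):
        block = tokens[start:start + BLOCK_SIZE]
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes


class PooledPrefixCache:
    """Hypothetical cluster-wide index: block hash -> node holding that KV block."""

    def __init__(self):
        self.index = {}

    def publish(self, tokens, node):
        # Record which node stores the KV block for each cached prefix.
        for h in block_hashes(tokens):
            self.index.setdefault(h, node)

    def reusable_prefix(self, tokens):
        """Return how many leading tokens already have KV blocks in the pool."""
        reused = 0
        for h in block_hashes(tokens):
            if h not in self.index:
                break
            reused += BLOCK_SIZE
        return reused


def tokens_to_prefill(cache, tokens):
    """Tokens that still need prefill compute after reusing pooled KV blocks."""
    return len(tokens) - cache.reusable_prefix(tokens)


if __name__ == "__main__":
    cache = PooledPrefixCache()
    cache.publish(list(range(10)), "node-a")  # caches two full blocks
    prompt = [0, 1, 2, 3, 4, 5, 6, 7, 99, 100, 101, 102]
    print(tokens_to_prefill(cache, prompt))
```

Because each hash covers the whole prefix up to its block, a lookup stops at the first miss, and every stored block that matches removes `BLOCK_SIZE` tokens of prefill work. Long shared prefixes therefore benefit most, which is consistent with the long-context emphasis above.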