ObjectCache
Definition
A research-prototype architecture for **layerwise persistence of LLM KV-cache to S3-compatible object storage**. It exploits the observation that, in a decoder-only transformer, each layer's KV-cache can be retrieved on demand during decode, provided the retrieval is pipelined with the prior layer's attention compute. ObjectCache stores each layer's KV slice as an independent object in S3; the inference runtime fetches the slice for layer *i+1* while attention on layer *i* is in flight, hiding object-store latency behind GPU compute.
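The overlap described above can be sketched in a few lines of Python. This is an illustrative simulation only, not the paper's implementation: the in-memory `STORE` stands in for an S3 bucket, the key layout in `layer_key` is hypothetical (the source does not specify one), and `time.sleep` calls stand in for the object-store round trip and for GPU attention.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical in-memory stand-in for an S3 bucket: one object per (sequence, layer).
STORE = {}

def layer_key(seq_id, layer):
    # Hypothetical key layout, chosen for illustration only.
    return f"kv/{seq_id}/layer-{layer:03d}"

def put_layer(seq_id, layer, kv_bytes):
    STORE[layer_key(seq_id, layer)] = kv_bytes

def fetch_layer(seq_id, layer, latency_s=0.01):
    time.sleep(latency_s)  # simulated object-store round trip
    return STORE[layer_key(seq_id, layer)]

def attention(layer, kv_bytes, compute_s=0.02):
    time.sleep(compute_s)  # simulated attention compute for this layer
    return (layer, len(kv_bytes))

def decode_step(seq_id, num_layers):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # The fetch for layer 0 has no earlier compute to hide behind.
        pending = pool.submit(fetch_layer, seq_id, 0)
        for i in range(num_layers):
            kv = pending.result()  # blocks only if the fetch outlasted the compute
            if i + 1 < num_layers:
                # Start fetching layer i+1 before computing layer i.
                pending = pool.submit(fetch_layer, seq_id, i + 1)
            results.append(attention(i, kv))
    return results

for layer in range(4):
    put_layer("req-1", layer, bytes(16 * (layer + 1)))

out = decode_step("req-1", 4)
print(out)
```

In this sketch only the first fetch sits on the critical path; every later fetch runs in the background thread while the previous layer's simulated attention is sleeping. A real runtime would replace `fetch_layer` with an S3 GET and `attention` with the GPU kernel launch.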
Existing KV-cache tiering schemes (vLLM + CPU swap, LMCache + NVMe) assume a single fast path GPU→DRAM→NVMe→S3. They cannot tolerate the full S3 round-trip latency (10-100 ms) in the decode hot path, so S3 is used only for cold-cache rewarming, not for online serving. ObjectCache reframes the problem: if decode is layer-sequential and S3 fetches can run *concurrently* with compute, then the S3 round-trip latency is hidden as long as the per-layer fetch time is less than the per-layer compute time. For long contexts on a single GPU, this turns out to be true.
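The hiding condition can be made concrete with a small timing model. The sketch below assumes a simple model that is not taken from the paper: fetch time is first-byte latency plus transfer time at a fixed bandwidth, every layer has the same fetch and compute time, and only one prefetch is in flight at a time. All function names and the example numbers are hypothetical.

```python
def fetch_time_s(layer_bytes, bandwidth_bytes_per_s, first_byte_latency_s):
    # Assumed model: fixed first-byte latency plus transfer at constant bandwidth.
    return first_byte_latency_s + layer_bytes / bandwidth_bytes_per_s

def latency_hidden(layer_bytes, bandwidth_bytes_per_s, first_byte_latency_s, compute_s):
    # The condition from the text: per-layer fetch time < per-layer compute time.
    fetch_s = fetch_time_s(layer_bytes, bandwidth_bytes_per_s, first_byte_latency_s)
    return fetch_s < compute_s

def step_time_s(num_layers, fetch_s, compute_s):
    # The first fetch is exposed; each of the next (num_layers - 1) stages costs
    # max(fetch, compute); the last layer's compute has no fetch to overlap with.
    return fetch_s + (num_layers - 1) * max(fetch_s, compute_s) + compute_s

# Hypothetical inputs: 64 MiB per-layer object, 10 GB/s, 20 ms first-byte latency.
layer_bytes = 64 * 1024 * 1024
hidden_slow_compute = latency_hidden(layer_bytes, 10e9, 0.020, compute_s=0.030)
hidden_fast_compute = latency_hidden(layer_bytes, 10e9, 0.020, compute_s=0.010)
print(hidden_slow_compute, hidden_fast_compute)
```

When the condition holds, `step_time_s` reduces to one exposed fetch plus the sum of the per-layer compute times; when it fails, the fetch time dominates every stage and the object store becomes the bottleneck.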
Target use cases include serving 1M+ token contexts on a single commodity GPU with the KV-cache stored in S3 instead of DRAM; multi-tenant inference platforms with very large prefix-cache populations that would otherwise need 100+ TB of CPU DRAM; and cost-optimized inference where S3 storage (~$0.02/GB/month) replaces local NVMe (~$10/GB upfront).
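To give a feel for the sizes involved, the following sketch estimates the KV-cache footprint of a long context and its monthly S3 storage cost at the ~$0.02/GB/month figure quoted above. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical example and is not taken from the paper; the function names are illustrative.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # The factor 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

def kv_cache_gb(tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem)
    return tokens * per_token / 1e9  # decimal GB

def s3_monthly_cost_usd(gb, usd_per_gb_month=0.02):
    # Storage cost only; request and transfer charges are ignored here.
    return gb * usd_per_gb_month

# Hypothetical model shape, for illustration only.
model = dict(num_layers=32, num_kv_heads=8, head_dim=128)

total_gb = kv_cache_gb(1_000_000, **model)       # whole cache for 1M tokens
per_layer_gb = total_gb / model["num_layers"]    # size of one per-layer object
monthly_usd = s3_monthly_cost_usd(total_gb)

print(f"{total_gb:.3f} GB total, {per_layer_gb:.3f} GB per layer, ${monthly_usd:.2f}/month")
```

Under these assumptions a single 1M-token context occupies roughly 131 GB, which is why holding many such contexts in DRAM becomes expensive and why the per-layer object size matters for the fetch-time budget.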
Recent developments
- ObjectCache paper released. Full design + benchmarks vs LMCache + vLLM baselines published. Per arXiv 2605.22850 — ObjectCache: Layerwise KV-cache persistence to object storage.
- Reference design: layerwise prefetch over a MinIO/S3 endpoint and vLLM runtime. The paper specifies a custom KVConnector that overlaps the next layer's KV fetch with the current layer's attention compute. Per arXiv 2605.22850 — ObjectCache.
- Mooncake provides the connective layer. Moonshot's Mooncake project is the canonical KV-cache-centric serving architecture that ObjectCache-style retrieval extends. Per the Mooncake repo.