ObjectCache | LLMS3

Definition

What it is

A research-prototype architecture for **layerwise persistence of LLM KV-cache to S3-compatible object storage**. It exploits the observation that, in a decoder-only transformer, each layer's KV-cache can be retrieved on demand during decode if the retrieval is pipelined with the preceding layer's attention compute. ObjectCache stores each layer's KV slice as an independent object in S3, and the inference runtime fetches the KV slice for layer *i+1* while attention on layer *i* is in flight, hiding object-store latency behind GPU compute.
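The overlap described above can be illustrated with a minimal sketch. It is not the paper's KVConnector: the in-memory dictionary stands in for an S3-compatible endpoint, and the key layout (`<request>/layer-NNN.kv`), the function names, and the sleep-based latency and compute placeholders are all assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def make_store(request_id, num_layers):
    """Hypothetical object store: one object per layer, keyed by layer index."""
    return {
        f"{request_id}/layer-{i:03d}.kv": f"kv-bytes-{i}".encode()
        for i in range(num_layers)
    }


def fetch_layer(store, request_id, layer, latency_s=0.01):
    """Simulate an object GET with a fixed round-trip latency (assumed value)."""
    time.sleep(latency_s)
    return store[f"{request_id}/layer-{layer:03d}.kv"]


def attend(layer, kv, compute_s=0.01):
    """Placeholder for one layer's attention compute over the fetched KV slice."""
    time.sleep(compute_s)
    return (layer, len(kv))


def decode_step(store, request_id, num_layers):
    """Run one decode step, prefetching layer i+1 while layer i computes."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Only the first layer's fetch is fully exposed; later ones overlap.
        pending = pool.submit(fetch_layer, store, request_id, 0)
        for layer in range(num_layers):
            kv = pending.result()  # waits only if the fetch outlasts compute
            if layer + 1 < num_layers:
                pending = pool.submit(fetch_layer, store, request_id, layer + 1)
            results.append(attend(layer, kv))
    return results


if __name__ == "__main__":
    store = make_store("req-1", 4)
    print(decode_step(store, "req-1", 4))
```

The single worker thread keeps exactly one fetch in flight, mirroring the one-layer lookahead in the text. A real runtime would more likely issue asynchronous GETs into pinned buffers, which this sketch does not attempt.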

Why it exists

Existing KV-cache tiering schemes (vLLM with CPU swap, LMCache with NVMe) assume a single tiered path, GPU→DRAM→NVMe→S3. They cannot tolerate the full S3 round-trip latency (10-100 ms) in the decode hot path, so S3 is used only for cold-cache rewarming, not online serving. ObjectCache reframes the problem: because decode is layer-sequential and S3 fetches can run *concurrently* with compute, S3 round-trip latency is hidden as long as the per-layer fetch time is less than the per-layer compute time. For long contexts on a single GPU, this condition holds.
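The latency-hiding condition can be checked with back-of-the-envelope arithmetic. The sketch below is a simplified model, not a result from the paper: it assumes the per-layer KV slice is fetched as one transfer at a fixed bandwidth plus one round trip, and the model shape (8 KV heads, head dimension 128, 2-byte elements) is a hypothetical example.

```python
def kv_bytes_per_layer(tokens, kv_heads, head_dim, dtype_bytes=2):
    """Size of one layer's KV slice: keys plus values for every cached token."""
    return 2 * tokens * kv_heads * head_dim * dtype_bytes


def fetch_time_s(nbytes, bandwidth_gbps, rtt_s):
    """Assumed fetch model: one round trip plus transfer at line rate."""
    return rtt_s + nbytes * 8 / (bandwidth_gbps * 1e9)


def stall_per_layer_s(fetch_s, compute_s):
    """Time the GPU waits per layer; zero means the fetch is fully hidden."""
    return max(0.0, fetch_s - compute_s)


if __name__ == "__main__":
    # Hypothetical configuration: 64K tokens, 8 KV heads, head_dim 128, fp16.
    nbytes = kv_bytes_per_layer(tokens=65536, kv_heads=8, head_dim=128)
    fetch = fetch_time_s(nbytes, bandwidth_gbps=100, rtt_s=0.010)
    compute = 0.040  # assumed per-layer attention time, in seconds
    print(f"slice: {nbytes} bytes, fetch: {fetch:.4f} s, "
          f"stall: {stall_per_layer_s(fetch, compute):.4f} s")
```

Under this model the stall is zero whenever `fetch_s <= compute_s`, which is the condition stated above. Both sides grow with context length, so whether the condition holds depends on the ratio of network bandwidth to attention throughput rather than on context length alone.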

Primary use cases

Serving 1M+ token contexts on a single commodity GPU, with the KV-cache stored in S3 instead of DRAM.
Multi-tenant inference platforms with very large prefix-cache populations that would otherwise need 100+ TB of CPU DRAM.
Cost-optimized inference, where S3 storage (~$0.02/GB/month) replaces local NVMe (~$10/GB upfront).

Recent developments

Latest signals

ObjectCache paper released. Full design and benchmarks against LMCache and vLLM baselines published. Per arXiv 2605.22850, "ObjectCache: Layerwise KV-cache persistence to object storage".
Reference design: layerwise prefetch over a MinIO/S3 endpoint and the vLLM runtime. The paper specifies a custom KVConnector that overlaps the next layer's KV fetch with the current layer's attention compute. Per arXiv 2605.22850, ObjectCache.
Mooncake provides the connective layer. Moonshot's Mooncake project is the canonical KV-cache-centric serving architecture that ObjectCache-style retrieval extends. Per the Mooncake repo.
Layerwise delivery validated S3 as a hot KV-cache tier (June 2026 wave). A compact S3 protocol descriptor enables multi-object batching plus layerwise delivery, in which KV blocks are streamed in layer-execution order. On 100 Gbps RoCE with DAOS/Ceph RGW, this adds only ~5.6% latency relative to local DRAM for 64K-token contexts. Per arXiv 2605.22850, ObjectCache.

Connections (6)

Outbound (6)

scoped_to (2): AI Memory Infrastructure, S3

stores_in (1): S3

integrates_with (2): vLLM, MinIO

solves (1): Memory Wall
