Architecture

ObjectCache



Definition

What it is

ObjectCache is a research-prototype architecture for **layerwise persistence of LLM KV-cache to S3-compatible object storage**. It exploits the observation that, in a decoder-only transformer, each layer's KV-cache can be retrieved on demand during decode if the retrieval is pipelined with the prior layer's attention compute. ObjectCache stores each layer's KV slice as an independent object in S3, and the inference runtime fetches the slice for layer *i+1* while attention on layer *i* is in flight, hiding object-store latency behind GPU compute.
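The fetch-ahead pattern described here can be sketched in a few lines. This is a minimal illustration, not ObjectCache's actual runtime: `fetch_layer_kv`, `attend`, `layer_key`, the key naming scheme, and the in-memory `FAKE_STORE` are all hypothetical stand-ins. In a real deployment the fetch would be a GET against an S3-compatible endpoint and `attend` would be a GPU kernel launch.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor

NUM_LAYERS = 4

# Stand-in for the object store: one object per layer (hypothetical key scheme).
FAKE_STORE = {
    f"session-0/layer-{i}.kv": f"kv-bytes-{i}".encode() for i in range(NUM_LAYERS)
}


def layer_key(layer: int) -> str:
    """Hypothetical object key for one layer's KV slice."""
    return f"session-0/layer-{layer}.kv"


def fetch_layer_kv(layer: int) -> bytes:
    """Stand-in for an object-store GET of one layer's KV slice."""
    return FAKE_STORE[layer_key(layer)]


def attend(layer: int, kv: bytes) -> str:
    """Stand-in for attention compute on one layer."""
    return f"layer {layer} attended over {len(kv)} bytes"


def decode_step(num_layers: int = NUM_LAYERS) -> list[str]:
    """Run one decode step, fetching layer i+1 while layer i computes."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_layer_kv, 0)
        for layer in range(num_layers):
            # Blocks only if the fetch took longer than the previous compute.
            kv = pending.result()
            if layer + 1 < num_layers:
                # Start the next fetch before computing on the current layer.
                pending = pool.submit(fetch_layer_kv, layer + 1)
            results.append(attend(layer, kv))
    return results


if __name__ == "__main__":
    for line in decode_step():
        print(line)
```

The single worker thread keeps exactly one fetch in flight, which matches the one-layer lookahead in the description. A deeper prefetch window would be a natural variation, but nothing in the text says whether ObjectCache uses one.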

Why it exists

Existing KV-cache tiering schemes (vLLM + CPU swap, LMCache + NVMe) assume a single fast path, GPU→DRAM→NVMe→S3. They cannot tolerate the full S3 round-trip latency (10-100 ms) in the decode hot path, so S3 is used only for cold-cache rewarming, not online serving. ObjectCache reframes the problem: if decode is layer-sequential and S3 fetches can run *concurrently* with compute, then the S3 round trip is hidden as long as the per-layer fetch time is less than the per-layer compute time. For long contexts on a single GPU, this turns out to be true.
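The hiding condition can be checked with back-of-the-envelope arithmetic. The sketch below assumes a simple cost model that is not taken from ObjectCache itself: fetch time is a fixed round-trip latency plus size divided by bandwidth, and any fetch time beyond the per-layer compute time shows up as a stall. The function names and all numbers in the example are illustrative assumptions, not measurements.

```python
def layer_kv_bytes(tokens: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of K plus V for one layer of one sequence."""
    return 2 * tokens * kv_heads * head_dim * dtype_bytes


def fetch_time_s(num_bytes: int, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Assumed model: one round trip plus a bandwidth-limited transfer."""
    return latency_s + num_bytes / bandwidth_bytes_per_s


def stall_per_layer_s(fetch_s: float, compute_s: float) -> float:
    """Time the GPU waits per layer; zero when the fetch is fully hidden."""
    return max(0.0, fetch_s - compute_s)


if __name__ == "__main__":
    # Illustrative inputs only: 1,000 tokens, 8 KV heads, head_dim 128, fp16.
    size = layer_kv_bytes(tokens=1_000, kv_heads=8, head_dim=128)
    fetch = fetch_time_s(size, latency_s=0.05, bandwidth_bytes_per_s=1e9)
    compute = 0.06  # assumed per-layer compute time in seconds
    print(f"layer KV size: {size} bytes")
    print(f"fetch time:    {fetch:.6f} s")
    print(f"stall:         {stall_per_layer_s(fetch, compute):.6f} s")
```

Under this model the fixed latency term matters most when slices are small, and the bandwidth term takes over as context length grows, so whether the condition holds depends on how per-layer compute scales against per-layer transfer for a given model and link.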

Primary use cases

Serving 1M+ token contexts on a single commodity GPU with the KV-cache stored in S3 instead of DRAM; multi-tenant inference platforms whose very large prefix-cache populations would otherwise need 100+ TB of CPU DRAM; and cost-optimized inference where S3 storage (~$0.02/GB/month) replaces local NVMe (~$10/GB upfront).
