Gemma 4 Shared KV Cache
Definition
A Gemma-4-specific architectural feature, exposed in HuggingFace `transformers` as the `num_kv_shared_layers` config field, in which the last *k* layers of an *L*-layer transformer **share** the KV-cache of layer *L-k* rather than maintaining independent caches. This is a **structural** (architecture-level) KV-cache size reduction, distinct from algorithmic compression (quantization, eviction, MLA).
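The layer-to-cache mapping described above can be sketched in a few lines. This is a minimal illustration that takes the description literally (layers numbered from 1, every shared layer reading from layer *L-k*); the function name `kv_owner` and the 12-layer example are hypothetical, and a real implementation may choose the source layer differently, for example per attention type.

```python
def kv_owner(layer: int, num_layers: int, num_kv_shared_layers: int) -> int:
    """Return the 1-based index of the layer whose KV-cache `layer` reads.

    Assumption: all of the last k layers reuse the cache of layer L-k.
    """
    if not 1 <= layer <= num_layers:
        raise ValueError("layer must be in 1..num_layers")
    if not 0 <= num_kv_shared_layers < num_layers:
        raise ValueError("num_kv_shared_layers must be in 0..num_layers-1")
    last_owner = num_layers - num_kv_shared_layers  # layer L-k
    # Layers up to L-k own their cache; later layers point back at L-k.
    return min(layer, last_owner)


# Hypothetical 12-layer model in which the last 4 layers share.
owners = [kv_owner(i, 12, 4) for i in range(1, 13)]
print(owners)  # [1, 2, 3, 4, 5, 6, 7, 8, 8, 8, 8, 8]
```

Only 8 distinct caches exist in this example, even though the model has 12 layers.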
Each transformer layer normally holds its own KV-cache, so cache size grows linearly with layer count. Work in the Gemma research line observes that the upper layers of a deep transformer carry highly redundant attention patterns over recent tokens. Letting those layers reuse the cache of a single earlier layer trades a small quality loss for a cache-size reduction roughly proportional to the size of the shared band (typically 4-8 layers). Longer contexts then fit in the same memory budget on the same hardware.
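A back-of-the-envelope calculation makes the proportionality concrete. The sketch below assumes a plain full-attention cache (one key and one value tensor per owning layer, no sliding window), and the model shape used (32 layers, 8 KV heads, head dimension 128, 16-bit elements, 8192 tokens) is purely hypothetical rather than a published Gemma configuration.

```python
def kv_cache_bytes(num_layers, num_kv_shared_layers, seq_len,
                   num_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size in bytes for a single sequence."""
    # Only layers that own a cache contribute to memory.
    owning_layers = num_layers - num_kv_shared_layers
    # The factor 2 covers the separate key and value tensors.
    return 2 * owning_layers * seq_len * num_kv_heads * head_dim * bytes_per_elem


# Hypothetical shape, chosen only to give round numbers.
shape = dict(seq_len=8192, num_kv_heads=8, head_dim=128)
baseline = kv_cache_bytes(32, 0, **shape)
shared = kv_cache_bytes(32, 8, **shape)

print(baseline // 2**20, "MiB without sharing")      # 1024 MiB without sharing
print(shared // 2**20, "MiB with 8 shared layers")   # 768 MiB with 8 shared layers
print(1 - shared / baseline)                         # 0.25
```

Under these assumptions, sharing 8 of 32 layers saves 8/32 = 25% of the cache, which is the proportional reduction the paragraph describes.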
Typical use cases are on-device inference (Gemma 4 1B/4B/9B on mobile and laptop NPUs, where every MB of KV-cache matters), long-context Gemma-4-27B serving within a single-GPU footprint, and edge inference where structural reduction stacks with FP8 quantization for compounded gains.
Recent developments
- `Gemma3nConfig` exposes `num_kv_shared_layers`. The HuggingFace `transformers` library surfaces the shared-KV-layer configuration through the Gemma 3n config class, with the parameter documented in the transformers source. Sebastian Raschka's 2026 architecture walkthrough analyses the design in detail.
- Sebastian Raschka, Big LLM Architecture Comparison. The 2026 architecture-comparison post unpacks how shared-KV layers stack with FP8 quantization for compounded memory reduction, and where structural sharing wins over algorithmic compression at low layer counts. Per Sebastian Raschka, Big LLM Architecture Comparison (2026).
- Adopted into vLLM + TensorRT-LLM serving paths. Both runtimes ship Gemma serving layouts that respect `num_kv_shared_layers`; naive serving (treating each layer as independent) leaves measurable memory on the table. Per the vLLM repo.
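To illustrate what respecting `num_kv_shared_layers` means for a runtime, here is a toy allocator. It is a sketch only and does not reflect vLLM or TensorRT-LLM internals: `allocate_kv_buffers` is a hypothetical helper, the config is a `SimpleNamespace` stand-in rather than a real `transformers` object, and the buffer size is arbitrary.

```python
from types import SimpleNamespace


def allocate_kv_buffers(config, bytes_per_layer: int) -> list:
    """Return one buffer reference per layer; shared layers alias one buffer."""
    num_layers = config.num_hidden_layers
    # Configs without the field are treated as having no shared layers.
    shared = getattr(config, "num_kv_shared_layers", 0) or 0
    owners = num_layers - shared
    buffers = [bytearray(bytes_per_layer) for _ in range(owners)]
    # Assumption: every shared layer reuses the last owning layer's buffer.
    return buffers + [buffers[-1]] * shared


# Stand-in config with hypothetical values.
cfg = SimpleNamespace(num_hidden_layers=10, num_kv_shared_layers=3)
per_layer = allocate_kv_buffers(cfg, bytes_per_layer=1024)

unique = {id(buf): buf for buf in per_layer}
aware_bytes = sum(len(buf) for buf in unique.values())
naive_bytes = cfg.num_hidden_layers * 1024  # one buffer per layer

print(len(per_layer), len(unique))  # 10 7
print(naive_bytes, aware_bytes)     # 10240 7168
```

Every layer still gets a buffer handle, so the attention code path is unchanged; only the number of distinct allocations drops.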