Technology

Gemma 4 Shared KV Cache

Definition

What it is

An architectural feature of Gemma 4, exposed in HuggingFace `transformers` as the `num_kv_shared_layers` config field, in which the last *k* of the model's *L* transformer layers **share** the KV-cache of layer *L-k* (counting layers from 1, the last layer that still keeps its own cache) rather than maintaining independent caches. This is a **structural** (architecture-level) KV-cache size reduction, distinct from algorithmic compression of the cache contents (quantization, eviction, MLA).
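
The sharing rule described above can be sketched as a mapping from each layer to the layer whose cache it reads. This is a minimal illustration of the single-donor rule as stated in this entry, not the `transformers` implementation; the function name `kv_cache_owner` is hypothetical, and a real model that mixes attention types may choose a donor layer per attention type instead.

```python
def kv_cache_owner(layer_idx: int, num_layers: int, num_kv_shared_layers: int) -> int:
    """Return the 0-based index of the layer whose KV cache `layer_idx` reads.

    Assumption: the last `num_kv_shared_layers` layers all reuse the cache of
    the last layer that still owns one (layer L-k when counting from 1, which
    is index L-k-1 when counting from 0).
    """
    if not 0 <= layer_idx < num_layers:
        raise IndexError("layer_idx out of range")
    if not 0 <= num_kv_shared_layers < num_layers:
        raise ValueError("need 0 <= num_kv_shared_layers < num_layers")

    first_shared = num_layers - num_kv_shared_layers
    if layer_idx < first_shared:
        return layer_idx          # this layer keeps its own cache
    return first_shared - 1       # shared band: read the donor layer's cache


# Toy model: 8 layers, of which the last 3 share a cache.
owners = [kv_cache_owner(i, num_layers=8, num_kv_shared_layers=3) for i in range(8)]
print(owners)  # [0, 1, 2, 3, 4, 4, 4, 4]
```

Only five distinct caches exist in this toy model even though eight layers attend over them.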

Why it exists

Each transformer layer normally holds its own KV-cache, so cache memory grows linearly with layer count. The empirical observation in the Gemma research line is that the upper layers of a deep transformer carry highly redundant attention patterns over recent tokens. Having those layers reuse the cache of a single earlier layer trades a small quality loss for a cache-size reduction roughly proportional to the size of the shared band (4-8 layers typical): with *k* of *L* layers shared, only *L-k* layers store keys and values, a saving of about *k/L*. This makes longer contexts feasible within the same memory budget on the same hardware.
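
To make the linear-growth argument concrete, the following sketch estimates KV-cache size with and without a shared band. The model dimensions (32 layers, 8 KV heads, head dimension 128) are hypothetical round numbers chosen for easy arithmetic, not the shape of any released Gemma checkpoint, and the function name is illustrative.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_elem: int = 2,        # 2 bytes = FP16/BF16
    num_kv_shared_layers: int = 0,  # layers that reuse another layer's cache
    batch_size: int = 1,
) -> int:
    """Estimate KV-cache size: one K and one V tensor per caching layer."""
    caching_layers = num_layers - num_kv_shared_layers
    return (
        2 * caching_layers * num_kv_heads * head_dim
        * seq_len * bytes_per_elem * batch_size
    )


# Hypothetical model shape at an 8192-token context.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

baseline = kv_cache_bytes(**shape)
shared = kv_cache_bytes(**shape, num_kv_shared_layers=8)

print(baseline)           # 1073741824 bytes (1 GiB)
print(shared)             # 805306368 bytes
print(shared / baseline)  # 0.75
```

With 8 of 32 layers shared, the cache shrinks by 8/32 = 25%, which is the proportional saving the paragraph above describes.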

Primary use cases

On-device inference (Gemma 4 1B/4B/9B on mobile and laptop NPUs, where every MB of KV-cache matters); long-context Gemma-4-27B serving on a single-GPU footprint; and edge inference scenarios where the structural reduction stacks with FP8 quantization for compounded gains.
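
The "compounded gains" from stacking layer sharing with FP8 multiply, because one factor removes whole layers from the cache and the other shrinks every remaining element. A rough sketch, assuming an FP16 baseline (2 bytes per element), FP8 at 1 byte per element, and ignoring the small overhead of quantization scale factors; the function name and layer counts are hypothetical:

```python
def compounded_cache_fraction(
    num_layers: int,
    num_kv_shared_layers: int,
    baseline_bytes_per_elem: int = 2,   # FP16/BF16 baseline
    quantized_bytes_per_elem: int = 1,  # FP8
) -> float:
    """Fraction of the baseline KV-cache size left after both reductions."""
    structural = (num_layers - num_kv_shared_layers) / num_layers
    precision = quantized_bytes_per_elem / baseline_bytes_per_elem
    return structural * precision


# 32 layers with 8 shared, then FP8: 0.75 * 0.5 of the FP16 baseline remains.
print(compounded_cache_fraction(32, 8))  # 0.375
```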

Recent developments

Latest signals
  • `Gemma3nConfig` exposes `num_kv_shared_layers`. The HuggingFace `transformers` library surfaces the shared-KV-layer configuration through the Gemma 3n config class, and the parameter is documented in the `transformers` source. Sebastian Raschka's 2026 architecture walkthrough analyses the design in detail.
  • Sebastian Raschka, Big LLM Architecture Comparison. The 2026 architecture-comparison post unpacks how shared-KV layers stack with FP8 quantization for compounded memory reduction, and where structural sharing wins over algorithmic compression at low layer counts. Per Sebastian Raschka, Big LLM architecture comparison 2026.
  • Adopted into vLLM and TensorRT-LLM serving paths. Both runtimes ship Gemma serving layouts that respect `num_kv_shared_layers`; naive serving, which treats every layer as owning an independent cache, leaves measurable memory on the table. Per the vLLM repo.
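
The difference between naive and sharing-aware serving can be illustrated with a toy allocator. This is not vLLM or TensorRT-LLM code; `LayerCache`, `build_layer_caches`, and `count_distinct` are hypothetical names, and the sketch assumes the simple rule that every shared layer aliases the last non-shared layer's buffer.

```python
from dataclasses import dataclass, field


@dataclass
class LayerCache:
    """Stand-in for one layer's key/value buffers."""
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)


def build_layer_caches(num_layers: int, num_kv_shared_layers: int) -> list:
    """Return one cache handle per layer, aliasing buffers in the shared band."""
    if not 0 <= num_kv_shared_layers < num_layers:
        raise ValueError("need 0 <= num_kv_shared_layers < num_layers")
    owned = [LayerCache() for _ in range(num_layers - num_kv_shared_layers)]
    donor = owned[-1]  # last layer that owns a buffer
    # Shared layers receive the same object, so no extra memory is allocated.
    return owned + [donor] * num_kv_shared_layers


def count_distinct(caches: list) -> int:
    """Number of physically separate buffers behind the per-layer handles."""
    return len({id(cache) for cache in caches})


naive = build_layer_caches(num_layers=8, num_kv_shared_layers=0)
aware = build_layer_caches(num_layers=8, num_kv_shared_layers=3)

print(count_distinct(naive))  # 8
print(count_distinct(aware))  # 5
```

A runtime that ignores the config field would allocate all eight buffers in this toy case, three of which the model never needs.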
