TurboQuant
Near-optimal KV-cache quantization technique from Google Research, published as the ICLR 2026 paper *arXiv 2504.19874*. Combines **PolarQuant** (random-rotation-matrix-then-optimal-scalar-quantization) with **QJL** (Quantized Johnson-Lindenstrauss) into a pipeline that compresses KV cache vectors with mathematically provable near-optimal distortion. **Data-oblivious** — requires no calibration data, no fine-tuning, no per-model setup. Empirical numbers: **6× KV cache memory reduction, up to 8× attention compute speedup** vs 32-bit keys, with **100% recall on Needle-In-A-Haystack up to 104,000 tokens**.
Definition
Near-optimal KV-cache quantization technique from Google Research, published as the ICLR 2026 paper *arXiv 2504.19874*. Combines **PolarQuant** (random-rotation-matrix-then-optimal-scalar-quantization) with **QJL** (Quantized Johnson-Lindenstrauss) into a pipeline that compresses KV cache vectors with mathematically provable near-optimal distortion. **Data-oblivious** — requires no calibration data, no fine-tuning, no per-model setup. Empirical numbers: **6× KV cache memory reduction, up to 8× attention compute speedup** vs 32-bit keys, with **100% recall on Needle-In-A-Haystack up to 104,000 tokens**.
Even after MLA's 64× KV-cache footprint reduction, serving long-context LLMs at scale remains memory-bound. Per-token KV vectors dominate VRAM in production inference, and the existing quantization options (INT8 / INT4 per-channel, AWQ, GPTQ-derived methods) all require some form of calibration data and tend to degrade quality on out-of-distribution prompts. TurboQuant's bet: a data-oblivious technique with provable distortion bounds (3-bit keys + 2-bit values) can be safer for production deployment than calibration-dependent quantization, because there's no "this calibration set drifted from prod" failure mode.
Production LLM inference serving where KV cache dominates VRAM, multi-tenant serving where calibration-per-tenant is operationally impractical, extending context windows by 6× on existing hardware without re-deploying larger GPUs, vLLM / SGLang / TensorRT-LLM forks experimenting with low-bit KV before official upstream support, and as a building block for the next generation of long-context serving stacks.
Recent developments
- Google Research paper at ICLR 2026 (arXiv 2504.19874). Formal publication of the PolarQuant + QJL combination with provable distortion bounds. Per Google's TurboQuant — Nerd Level Tech.
- First open-source implementation by OnlyTerp. First public open-source implementation of TurboQuant with 5× compression, near-zero quality loss. Per GitHub (OnlyTerp/turboquant).
- vLLM fork with TurboQuant support (varjoranta/turboquant-vllm). Fork of vLLM 0.18.1rc1 adding TurboQuant+ KV cache compression: 3.8× smaller KV cache, same conversation quality, fused CUDA kernels with automatic PyTorch fallback. Per GitHub (varjoranta/turboquant-vllm).
- 0xSero/turboquant: 3-bit keys / 2-bit values with Triton kernels + vLLM integration. Alternate implementation pushing the aggressive quantization profile (3-bit keys + 2-bit values) with Triton-based kernels and direct vLLM integration. Per GitHub (0xSero/turboquant).
- Absent from major inference frameworks as of April 2026 — Google official Q2 2026. TurboQuant is absent from official vLLM, TensorRT-LLM, and SGLang as of April 2026; Google's official implementation expected around Q2 2026. An official vLLM feature request is open. Per Kaitchup — TurboQuant finally fast and widely available.
- Empirical 100% NIAH recall up to 104K tokens. Empirical evaluations show flawless 100% recall on "Needle-In-A-Haystack" tests extending up to 104,000 tokens — vastly outperforming traditional eviction methods. Per Medium — KV Cache Revolution.
Connections 4
Outbound 4
scoped_to1solves1enables1alternative_to1