Definition

What it is

SnapMLA is an FP8-native quantization scheme for the MLA latent KV cache, introduced in 2026. It quantizes the *latent* tensor (the compressed shared representation that MLA stores instead of full per-head K and V) rather than quantizing per-head K and V independently. Because the latent tensor is distributed differently from raw K/V, naive FP8 quantization of it loses 2-5% accuracy; SnapMLA's calibration recipe reduces the loss to under 0.5%.

Why it exists

MLA already compresses the KV cache by ~6-10x versus standard MHA via the latent projection. Storing that latent in FP8 instead of a 16-bit format would yield another 2x. But naive FP8 hurts MLA quality more than it hurts MHA quality, because the latent space packs more information into each channel, so outliers matter more. SnapMLA adds per-channel scaling and a SmoothQuant-style equalization pass *before* the FP8 cast, recovering most of the lost quality.
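The two steps named above can be illustrated with a toy sketch. This is not the SnapMLA recipe, which the text here does not spell out: the FP8 cast is simulated in pure Python for the E4M3 format, every function name is hypothetical, `w_up` stands in for an assumed up-projection that consumes the latent, and the equalization uses the standard SmoothQuant smoothing factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha). The outlier channel is deliberately exaggerated so the effect is visible on a 3x3 input.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def round_to_e4m3(x):
    """Simulate an FP8 E4M3 cast: saturate, then round to the nearest grid point."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), FP8_E4M3_MAX)
    _, ex = math.frexp(a)      # a = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, -6)        # -6 is the smallest normal exponent
    step = 2.0 ** (e - 3)      # 3 mantissa bits give 8 steps per power of two
    return math.copysign(round(a / step) * step, x)


def channel_maxima(matrix):
    """Largest absolute value in each column (channel)."""
    return [max(abs(row[c]) for row in matrix) for c in range(len(matrix[0]))]


def per_tensor_scales(latent):
    """Naive baseline: one scale shared by every channel."""
    scale = (max(channel_maxima(latent)) / FP8_E4M3_MAX) or 1.0
    return [scale] * len(latent[0])


def per_channel_scales(latent):
    """One scale per latent channel, so each channel fills the FP8 range."""
    return [(m / FP8_E4M3_MAX) or 1.0 for m in channel_maxima(latent)]


def fake_quant(latent, scales):
    """Scale, cast to simulated FP8, and scale back so errors can be measured."""
    return [[round_to_e4m3(v / s) * s for v, s in zip(row, scales)] for row in latent]


def equalize(latent, w_up, alpha=0.5):
    """SmoothQuant-style pass: shift channel magnitude from the latent into w_up."""
    a_max = channel_maxima(latent)
    w_max = [max(abs(v) for v in row) for row in w_up]
    s = [(a ** alpha) / (w ** (1 - alpha)) if a and w else 1.0
         for a, w in zip(a_max, w_max)]
    latent_eq = [[v / sc for v, sc in zip(row, s)] for row in latent]
    w_eq = [[v * sc for v in row] for row, sc in zip(w_up, s)]
    return latent_eq, w_eq


def mean_rel_error(ref, approx):
    """Mean of |ref - approx| / |ref|; assumes ref has no zero entries."""
    pairs = [(x, y) for r, a in zip(ref, approx) for x, y in zip(r, a)]
    return sum(abs(x - y) / abs(x) for x, y in pairs) / len(pairs)


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Toy latent: 3 tokens x 3 channels; channel 0 is an exaggerated outlier.
latent = [[1.0e6, 0.50, -0.25],
          [-8.0e5, -1.00, 0.75],
          [6.0e5, 0.30, -1.00]]
# Hypothetical up-projection: 3 latent channels -> 2 outputs.
w_up = [[0.001, -0.002],
        [0.5, 1.0],
        [-1.0, 0.25]]

err_tensor = mean_rel_error(latent, fake_quant(latent, per_tensor_scales(latent)))
err_channel = mean_rel_error(latent, fake_quant(latent, per_channel_scales(latent)))
latent_eq, w_eq = equalize(latent, w_up)

print(f"per-tensor  mean relative error: {err_tensor:.3f}")
print(f"per-channel mean relative error: {err_channel:.3f}")
print("channel maxima before equalization:", channel_maxima(latent))
print("channel maxima after equalization: ", channel_maxima(latent_eq))
```

On this input the shared per-tensor scale flushes both small channels to zero, while per-channel scales keep every entry within a few percent of its original value. The equalization pass leaves the product of latent and up-projection unchanged up to floating-point rounding, but narrows the spread between channel maxima before the cast.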

Primary use cases

Long-context serving of DeepSeek-V3/V4 and other MLA models on memory-constrained GPUs (single H100, A100, or B100 deployments serving 1M+ token context windows), and edge inference of FP8-quantized MLA models on Jetson-class hardware.
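To give a sense of scale for the long-context case, here is a back-of-the-envelope sketch of the latent cache size. The dimensions are assumptions taken from the published DeepSeek-V3 configuration (61 layers, a 512-wide latent plus a 64-wide RoPE key per token per layer); it also assumes the whole cached vector is stored at the stated precision and ignores scale metadata, weights, and activations.

```python
# Assumed dimensions from the published DeepSeek-V3 config; adjust for other models.
NUM_LAYERS = 61
KV_LORA_RANK = 512    # compressed latent width per token, per layer
ROPE_HEAD_DIM = 64    # decoupled RoPE key cached alongside the latent


def latent_cache_bytes(num_tokens, bytes_per_element):
    """Size of the MLA latent cache, ignoring per-channel scale metadata."""
    per_token = NUM_LAYERS * (KV_LORA_RANK + ROPE_HEAD_DIM)
    return num_tokens * per_token * bytes_per_element


tokens = 1_000_000
bf16_bytes = latent_cache_bytes(tokens, 2)  # 16-bit storage
fp8_bytes = latent_cache_bytes(tokens, 1)   # 8-bit storage

print(f"BF16 latent cache: {bf16_bytes / 1e9:.1f} GB")
print(f"FP8 latent cache:  {fp8_bytes / 1e9:.1f} GB")
```

Under these assumptions a 1M-token cache shrinks from about 70 GB to about 35 GB, which is the 2x that FP8 storage adds on top of the latent compression.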

Recent developments

Latest signals

SnapMLA paper released. The full calibration recipe, ablation studies against naive FP8, and reference CUDA kernels have been published. Per arXiv 2602.10718: "SnapMLA: FP8 quantization for Multi-head Latent Attention".
Stacked with TyphoonMLA in production. TensorRT-LLM 0.18 ships a fused TyphoonMLA + SnapMLA kernel (hybrid path selection plus FP8 latent storage in a single kernel) for the DeepSeek-V4 serving template. Per NVIDIA NGC: DeepSeek-V4 serving template.
Adopted by SGLang for DeepSeek serving. SGLang's deepseek-mla-fp8 backend is a SnapMLA implementation. Per SGLang docs: DeepSeek backends.

Connections

Outbound

scoped_to: AI Memory Infrastructure

is_a: Multi-Head Latent Attention (MLA)

compresses: Multi-Head Latent Attention (MLA)

solves: Memory Wall
