TyphoonMLA
A hybrid kernel formulation for DeepSeek-style Multi-head Latent Attention (MLA), introduced in 2025, that combines the *naive* (decompressed-head) and *absorbed* (matrix-fused) MLA computation paths within a single attention call, using whichever path is cheaper for each portion of the context. It is a pure inference-kernel optimization: model weights are unchanged.
Definition
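MLA can be computed in two mathematically equivalent ways, a possibility already noted in the DeepSeek-V2 paper that introduced MLA. The **naive** path decompresses the latent KV cache into per-head keys and values and then runs standard multi-head attention; it needs fewer FLOPs per attended token but reads much more data from memory. The **absorbed** path fuses the key and value up-projection matrices into the query and output projections so that attention runs directly on the compressed latent cache; it minimizes memory traffic but does more arithmetic per attended token, because every head then works in the wider latent dimension. As the paper's title indicates, TyphoonMLA targets batches that share a common prefix (for example, a system prompt): it applies the naive path to the shared prefix, where the same keys and values are reused by every query in the batch and attention is compute-bound, and the absorbed path to the non-shared remainder of each sequence, where attention is bandwidth-bound. The paper reports attention-throughput improvements of up to about 3x on both NPUs and GPUs.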
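The equivalence of the two paths, and the way a hybrid can split one attention call between them, can be illustrated with a small pure-Python sketch. It is not the paper's kernel: it assumes a single head and a single query, omits the decoupled RoPE component and all batching, and uses hypothetical names throughout (`W_uk`, `W_uv`, `naive_attention`, `absorbed_attention`, `hybrid_attention`). The split into a `shared_prefix` and a `suffix`, and the log-sum-exp merge of the two partial results, are assumptions made for illustration; the merge is the standard way to combine softmax attention computed over disjoint key ranges.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by column vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def vecmat(v, M):
    """Multiply row vector v by matrix M (a list of rows)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def softmax_weighted_sum(scores, values):
    """Return (softmax(scores) @ values, logsumexp(scores))."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return vecmat([e / z for e in exps], values), m + math.log(z)


def naive_attention(q, latent, W_uk, W_uv):
    """Naive path: decompress the latent cache into per-head K and V first."""
    K = [vecmat(c, W_uk) for c in latent]  # (n, d_head)
    V = [vecmat(c, W_uv) for c in latent]  # (n, d_head)
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(k, q)) for k in K]
    return softmax_weighted_sum(scores, V)


def absorbed_attention(q, latent, W_uk, W_uv):
    """Absorbed path: fold W_uk into the query, apply W_uv after attention."""
    q_lat = matvec(W_uk, q)  # query moved into the latent space, (d_latent,)
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(c, q_lat)) for c in latent]
    out_lat, lse = softmax_weighted_sum(scores, latent)  # attend over latents
    return vecmat(out_lat, W_uv), lse


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention outputs using their log-sum-exp values."""
    m = max(lse_a, lse_b)
    w_a, w_b = math.exp(lse_a - m), math.exp(lse_b - m)
    return [(w_a * a + w_b * b) / (w_a + w_b) for a, b in zip(out_a, out_b)]


def hybrid_attention(q, shared_prefix, suffix, W_uk, W_uv):
    """Assumed hybrid: naive on the shared prefix, absorbed on the suffix."""
    out_p, lse_p = naive_attention(q, shared_prefix, W_uk, W_uv)
    out_s, lse_s = absorbed_attention(q, suffix, W_uk, W_uv)
    return merge(out_p, lse_p, out_s, lse_s)


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_latent, d_head = 8, 4  # toy sizes, far smaller than any real model
W_uk, W_uv = rand_matrix(d_latent, d_head), rand_matrix(d_latent, d_head)
prefix, suffix = rand_matrix(6, d_latent), rand_matrix(3, d_latent)
q = [random.gauss(0, 1) for _ in range(d_head)]

ref, _ = naive_attention(q, prefix + suffix, W_uk, W_uv)
alt, _ = absorbed_attention(q, prefix + suffix, W_uk, W_uv)
hyb = hybrid_attention(q, prefix, suffix, W_uk, W_uv)
print(max(abs(a - b) for a, b in zip(ref, alt)))
print(max(abs(a - b) for a, b in zip(ref, hyb)))
```

Both printed differences should be at floating-point rounding level: all three functions compute the same attention output and differ only in where the up-projections are applied, which is what lets a kernel choose the cheaper path for each part of the context without changing the result.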
Typical applications: serving DeepSeek-V3/V3.1/V4 (and any derivative MLA architecture) in production; mixed-workload inference where prefill and decode coexist on the same GPU; and kernel libraries and runtimes (TensorRT-LLM, vLLM, SGLang) that implement MLA support.
Recent developments
- TyphoonMLA paper published. The hybrid formulation, with full pseudocode and benchmark numbers, appeared on arXiv in September 2025. Per arXiv 2509.21081 ("TyphoonMLA: A Mixed Naive-Absorb MLA Kernel For Shared Prefix").
- Merged into vLLM and TensorRT-LLM master. Both runtimes adopted the hybrid path as their default MLA kernel within ~60 days of paper release; the prior single-path implementations are now legacy fallbacks. Per vLLM PR ("TyphoonMLA kernel").
- Generalized to non-DeepSeek MLA variants. The paper's hybrid-selection logic is architecture-agnostic; it has been adapted for Qwen 4's MLA-lite formulation and the ByteDance Doubao architecture. Per SGLang docs ("MLA backends").