Definition

What it is

Migration of trained Grouped-Query Attention (GQA) models to Multi-Head Latent Attention (MLA) without retraining from scratch.

Recent developments

Latest signals

TransMLA is a NeurIPS 2025 Spotlight. The method migrates existing GQA models to Multi-Head Latent Attention (MLA) without retraining from scratch. On LLaMA-2-7B it reports a 10× speedup at 8K context and ~93% KV-cache compression, with roughly 6B tokens of fine-tuning needed to recover benchmark accuracy. Converted models run on vLLM and SGLang. Source: the TransMLA paper and the NeurIPS 2025 poster.
Production adoption: Ant Group's Ling-2.5-1T (Feb 2026). The reference implementation lists Ant's trillion-parameter Ling-2.5-1T as a TransMLA adopter, a sign that GQA→MLA conversion is moving from paper to frontier-scale deployment. The same repo tracks a follow-up line of work: TPLA (ASPLOS 2026), HISA, MISA, and GQLA (2026).
Attention Editing generalizes the pattern. Attention Editing (Apr 2026) frames TransMLA and MHA2MLA as instances of a broader post-hoc, cross-architecture attention-conversion framework: converting already-trained LLMs to new attention mechanisms with <10B tokens of recovery training. Its Qwen3-8B-MLA reaches 95.00% on GSM8K vs 95.98% for the original model.
Community converter available. A third-party converter provides SVD initialization, fine-tuning scripts, and MLA-detection tooling for LLaMA-architecture models, with a <2% parameter increase.
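
The ~93% KV-cache compression figure quoted above for LLaMA-2-7B can be sanity-checked with simple arithmetic. The sketch below is illustrative only and is not taken from the TransMLA code. It uses the published LLaMA-2-7B attention shape (32 layers, 32 KV heads, head dimension 128) and ASSUMES a DeepSeek-style MLA cache of 512 latent dimensions plus 64 decoupled RoPE dimensions per token per layer; that latent width is an assumption of this example, not something the entry states. The names AttentionConfig, kv_values_per_token, and cache_bytes are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    num_layers: int
    num_kv_heads: int
    head_dim: int


def kv_values_per_token(cfg: AttentionConfig) -> int:
    """Values cached per token per layer for MHA/GQA: one K and one V vector per KV head."""
    return 2 * cfg.num_kv_heads * cfg.head_dim


def cache_bytes(values_per_token: int, num_layers: int, context_len: int,
                bytes_per_value: int = 2) -> int:
    """Total cache size for one sequence; bytes_per_value=2 corresponds to fp16/bf16."""
    return values_per_token * num_layers * context_len * bytes_per_value


# Published LLaMA-2-7B shape. It uses one KV head per query head (MHA),
# which is the group-size-1 special case of GQA.
LLAMA2_7B = AttentionConfig(num_layers=32, num_kv_heads=32, head_dim=128)

# ASSUMPTION: DeepSeek-style MLA cache layout, not stated in the entry.
LATENT_DIM = 512  # compressed KV latent per token per layer
ROPE_DIM = 64     # decoupled RoPE key per token per layer

baseline = kv_values_per_token(LLAMA2_7B)
mla = LATENT_DIM + ROPE_DIM
compression = 1 - mla / baseline

context_len = 8192  # the 8K context mentioned above
baseline_bytes = cache_bytes(baseline, LLAMA2_7B.num_layers, context_len)
mla_bytes = cache_bytes(mla, LLAMA2_7B.num_layers, context_len)

print(f"cached values per token per layer: {baseline} -> {mla}")
print(f"compression: {compression:.1%}")
print(f"8K-context cache: {baseline_bytes / 2**30:.2f} GiB -> {mla_bytes / 2**20:.0f} MiB")
```

Under these assumptions the per-token cache shrinks from 8192 to 576 values per layer, which is consistent with the roughly 93% figure. A different latent width would give a different ratio.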
