Definition

What it is

TensorRT-LLM is NVIDIA's optimized LLM inference framework built on TensorRT. It provides hand-tuned CUDA kernels, in-flight batching, a paged KV-cache, FP8 / FP4 / INT4 quantization, speculative decoding, and structured-output decoding. It is the highest-performance commercial path for serving LLMs on NVIDIA GPUs, particularly on Hopper (H100/H200) and Blackwell (B100/B200/GB200) hardware.
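
To make the paged KV-cache idea concrete, the following is a minimal Python sketch of block-table allocation. It is a toy model and not TensorRT-LLM's implementation: the class name PagedKVAllocator, its method names, and the 16-token block size are all assumptions chosen for illustration.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence owns a list of fixed-size blocks.

    Hypothetical illustration only; not a TensorRT-LLM API.
    """

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size                   # tokens per block (assumed)
        self.free_blocks = list(range(num_blocks))     # physical block ids
        self.block_tables: dict[str, list[int]] = {}   # sequence id -> block ids
        self.seq_lens: dict[str, int] = {}             # sequence id -> token count

    def append_tokens(self, seq_id: str, n_tokens: int) -> None:
        """Reserve enough blocks for the sequence to hold n_tokens more tokens."""
        table = self.block_tables.setdefault(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + n_tokens
        # Ceiling division gives the total blocks needed for new_len tokens.
        needed = -(-new_len // self.block_size) - len(table)
        if needed > len(self.free_blocks):
            raise MemoryError("out of KV-cache blocks")
        for _ in range(needed):
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = new_len

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are allocated only as a sequence grows and are returned as soon as it finishes, memory is not reserved up front for the maximum sequence length, which is what lets in-flight batching admit new requests as others complete.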

Why it exists

vLLM optimizes for *generality* across hardware; TensorRT-LLM optimizes for *peak throughput on NVIDIA silicon* by collapsing the kernel stack into hardware-specific paths (TMA descriptors, async warp specialization, FP8 tensor cores, NVLink-aware all-reduces). The cost is a longer model-compile cycle (engine builds) and reduced portability; the benefit is roughly 1.5-3x the throughput of vLLM on equivalent hardware for popular model families.

Primary use cases

High-throughput production inference for Llama/Mixtral/DeepSeek/Qwen model families on NVIDIA GPUs; batched serving with strict TPOT (time-per-output-token) targets; structured-output workloads (JSON generation, function calling) where the speculative-decoding stack is decisive; and multi-GPU serving via NVLink with tensor-parallel and pipeline-parallel sharding.
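
Two quantities from the use cases above lend themselves to a short worked example. The sketch below assumes a commonly used definition of TPOT (the time between the first and last output token, divided by the number of token intervals) and assumes that a combined tensor-parallel and pipeline-parallel deployment needs one GPU per (tensor rank, pipeline stage) pair. The function names are hypothetical and are not part of any TensorRT-LLM API.

```python
def tpot_seconds(first_token_time: float,
                 last_token_time: float,
                 n_output_tokens: int) -> float:
    """Mean time per output token, excluding the time to first token.

    Assumed definition: (t_last - t_first) / (n_output_tokens - 1).
    """
    if n_output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (last_token_time - first_token_time) / (n_output_tokens - 1)


def gpus_required(tensor_parallel: int, pipeline_parallel: int) -> int:
    """GPUs needed when every pipeline stage is sharded across TP ranks."""
    if tensor_parallel < 1 or pipeline_parallel < 1:
        raise ValueError("parallel degrees must be positive")
    return tensor_parallel * pipeline_parallel


# Example: first token at 0.5 s, 101st token at 2.5 s -> 100 intervals in 2.0 s.
example_tpot = tpot_seconds(0.5, 2.5, 101)
# Example: 4-way tensor parallelism inside each of 2 pipeline stages.
example_gpus = gpus_required(tensor_parallel=4, pipeline_parallel=2)
```

Measuring TPOT separately from time to first token matters because prefill and decode stress the hardware differently, so a single end-to-end latency number can hide a decode-phase regression.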

Recent developments

Latest signals

Latest release: v1.2.1 (April 2026). TensorRT-LLM has left the 0.x series for a stabilized 1.x release line; the current tag, v1.2.1, is a major-version shift from the prior 0.20.0 line, bringing API stabilization and the end of experimental versioning. Teams pinned to 0.x should plan the migration. Per NVIDIA/TensorRT-LLM releases.
Native KV-cache offload to host CPU + NVMe. Tiered KV-cache management mirrors LMCache's HBM → host DRAM → NVMe spill pattern; prefix caches are re-warmed on demand. See the TensorRT-LLM repo for the kernel-level implementation.
Disaggregated serving via Triton. Prefill-decode disaggregation is supported through Triton Inference Server's executor pools: prefill engines run on one accelerator class and decode engines on another, with NIXL shipping KV caches between them. Per the Triton Inference Server docs.
Decoupled RoPE + MLA kernels merged. TensorRT-LLM added native kernels for DeepSeek's MLA + Decoupled RoPE pattern, removing the prior performance penalty for serving DeepSeek family models. Per the TensorRT-LLM repo.
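
The tiered KV-cache spill pattern noted above (HBM → host DRAM → NVMe) can be illustrated with a small Python sketch. This is a toy least-recently-used model under stated assumptions, not the kernel-level implementation in TensorRT-LLM or LMCache: the class name TieredKVCache, the tier names, and the per-tier entry capacities are hypothetical, and real systems track bytes and move data asynchronously.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy three-tier LRU: entries evicted from one tier spill to the next.

    Hypothetical illustration only; not a TensorRT-LLM or LMCache API.
    """

    def __init__(self, capacities: dict[str, int]):
        self.order = ["hbm", "dram", "nvme"]   # fastest to slowest (assumed names)
        self.capacities = capacities           # max entries per tier
        self.tiers = {name: OrderedDict() for name in self.order}

    def put(self, key: str, value: bytes) -> None:
        """Insert into the fastest tier, spilling LRU entries downward."""
        for tier in self.tiers.values():
            tier.pop(key, None)                # drop any stale copy first
        self._insert(0, key, value)

    def _insert(self, level: int, key: str, value: bytes) -> None:
        if level >= len(self.order):
            return                             # fell off the slowest tier: dropped
        name = self.order[level]
        tier = self.tiers[name]
        tier[key] = value
        tier.move_to_end(key)                  # mark as most recently used
        while len(tier) > self.capacities[name]:
            old_key, old_value = tier.popitem(last=False)   # evict the LRU entry
            self._insert(level + 1, old_key, old_value)

    def get(self, key: str):
        """Look up a key; a hit in any tier is promoted back to the fastest tier."""
        for name in self.order:
            if key in self.tiers[name]:
                value = self.tiers[name].pop(key)
                self._insert(0, key, value)
                return value
        return None

    def tier_of(self, key: str):
        """Return the name of the tier holding key, or None if it is absent."""
        for name in self.order:
            if key in self.tiers[name]:
                return name
        return None
```

Promoting an entry on a hit is the toy analogue of re-warming a prefix cache on demand: a prefix that has spilled to a slower tier is brought back to the fastest tier the next time a request reuses it.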