TensorRT-LLM
Definition
NVIDIA's optimized LLM inference framework built on TensorRT, providing hand-tuned CUDA kernels, in-flight batching, paged KV-cache, FP8 / FP4 / INT4 quantization, speculative decoding, and structured-output decoding. It is the highest-performance commercial path for serving LLMs on NVIDIA GPUs, particularly for Hopper (H100/H200) and Blackwell (B100/B200/GB200) hardware.
vLLM optimizes for *generality* across hardware; TensorRT-LLM optimizes for *peak throughput on NVIDIA silicon* by specializing the kernel stack into hardware-specific paths (TMA descriptors, asynchronous warp specialization, FP8 tensor cores, NVLink-aware all-reduces). The cost is a longer model-compile cycle (engine builds) and reduced portability; the benefit is a reported 1.5-3x throughput gain over vLLM on equivalent hardware for widely used model families.
Typical use cases:
- High-throughput production inference for Llama/Mixtral/DeepSeek/Qwen model families on NVIDIA GPUs.
- Batched serving with strict TPOT (time-per-output-token) targets.
- Structured-output workloads (JSON generation, function calling) where the speculative-decoding stack is decisive.
- Multi-GPU serving over NVLink with tensor-parallel + pipeline-parallel sharding.
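TPOT targets are easier to reason about with the metric written out. The sketch below is a minimal illustration, not TensorRT-LLM code: it assumes the common definition in which TPOT is the mean gap between consecutive output tokens, so the first token (dominated by prefill) is reported separately as time-to-first-token (TTFT). The names `LatencyStats`, `latency_stats`, and `meets_tpot_target` are hypothetical helpers introduced here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LatencyStats:
    ttft_s: float  # time to first token, in seconds
    tpot_s: float  # mean time per output token after the first, in seconds


def latency_stats(request_start_s: float, token_times_s: Sequence[float]) -> LatencyStats:
    """Compute TTFT and TPOT from a request start time and per-token arrival times."""
    if not token_times_s:
        raise ValueError("need at least one output token")
    ttft = token_times_s[0] - request_start_s
    n = len(token_times_s)
    # TPOT averages the gaps between consecutive tokens, so the first token
    # is excluded; a single-token reply has no gaps and gets TPOT 0.0.
    tpot = (token_times_s[-1] - token_times_s[0]) / (n - 1) if n > 1 else 0.0
    return LatencyStats(ttft_s=ttft, tpot_s=tpot)


def meets_tpot_target(stats: LatencyStats, target_s: float) -> bool:
    """Return True when the measured TPOT is within the target."""
    return stats.tpot_s <= target_s


if __name__ == "__main__":
    # Hypothetical trace: request sent at t=0, four tokens arrive afterwards.
    stats = latency_stats(0.0, [0.5, 0.75, 1.0, 1.25])
    print(stats, meets_tpot_target(stats, target_s=0.3))
```

In practice a serving target is usually stated over a percentile of requests rather than a single trace, so a helper like this would be applied per request and then aggregated.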
Recent developments
- Native KV-cache offload to host CPU + NVMe. Tiered KV-cache management mirrors LMCache's HBM → host DRAM → NVMe spill pattern; prefix caches are re-warmed on demand. See the TensorRT-LLM repo for the kernel-level implementation.
- Disaggregated serving via Triton. Prefill-decode disaggregation is supported through Triton Inference Server's executor pools: prefill engines run on one accelerator class and decode engines on another, with NIXL shipping KV caches between them. Per the Triton Inference Server docs.
- Decoupled RoPE + MLA kernels merged. TensorRT-LLM added native kernels for DeepSeek's MLA + Decoupled RoPE pattern, removing the prior performance penalty for serving DeepSeek family models. Per the TensorRT-LLM repo.
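The HBM → host DRAM → NVMe spill pattern mentioned in the first item can be pictured as a chain of LRU tiers. The following is a toy sketch under stated assumptions, not the TensorRT-LLM or LMCache implementation: the class `TieredKVCache`, its methods, and the block-count capacities are all hypothetical, and each tier is modelled as an in-memory map rather than real device or disk storage.

```python
from collections import OrderedDict
from typing import Dict, Optional


class TieredKVCache:
    """Toy three-tier cache: each tier is an LRU map with a block capacity."""

    TIERS = ("hbm", "dram", "nvme")  # fastest to slowest

    def __init__(self, hbm_blocks: int, dram_blocks: int, nvme_blocks: int) -> None:
        self.capacity: Dict[str, int] = {
            "hbm": hbm_blocks, "dram": dram_blocks, "nvme": nvme_blocks,
        }
        self.tiers: Dict[str, "OrderedDict[str, bytes]"] = {
            name: OrderedDict() for name in self.TIERS
        }

    def _insert(self, tier_idx: int, key: str, block: bytes) -> None:
        if tier_idx >= len(self.TIERS):
            return  # spilled past the last tier: the block is dropped
        name = self.TIERS[tier_idx]
        tier = self.tiers[name]
        tier[key] = block
        tier.move_to_end(key)  # mark as most recently used
        while len(tier) > self.capacity[name]:
            # Evict the least recently used block and spill it one tier down.
            old_key, old_block = tier.popitem(last=False)
            self._insert(tier_idx + 1, old_key, old_block)

    def put(self, key: str, block: bytes) -> None:
        for name in self.TIERS:  # drop any stale copy before re-inserting
            self.tiers[name].pop(key, None)
        self._insert(0, key, block)

    def get(self, key: str) -> Optional[bytes]:
        for name in self.TIERS:
            if key in self.tiers[name]:
                block = self.tiers[name].pop(key)
                self._insert(0, key, block)  # re-warm: promote back to HBM
                return block
        return None

    def location(self, key: str) -> Optional[str]:
        for name in self.TIERS:
            if key in self.tiers[name]:
                return name
        return None


if __name__ == "__main__":
    cache = TieredKVCache(hbm_blocks=1, dram_blocks=1, nvme_blocks=1)
    for prefix in ("a", "b", "c"):
        cache.put(prefix, prefix.encode())
    print({p: cache.location(p) for p in ("a", "b", "c")})
```

A real implementation would move blocks asynchronously and key them by prefix hash; the sketch only shows the eviction and promotion order that a tiered design implies.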