TensorRT-LLM

Definition

What it is

NVIDIA's optimized LLM inference framework built on TensorRT. It provides hand-tuned CUDA kernels, in-flight batching, a paged KV-cache, FP8 / FP4 / INT4 quantization, speculative decoding, and structured-output decoding. It is positioned as the highest-performance vendor-supported path for serving LLMs on NVIDIA GPUs, particularly on Hopper (H100/H200) and Blackwell (B100/B200/GB200) hardware.
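To make the paged KV-cache idea concrete, here is a minimal sketch of a block allocator in plain Python. It is an illustration of the general technique, not TensorRT-LLM's implementation: the class name `PagedKVCache`, its methods, and the block size are all hypothetical. The point it shows is that a sequence claims fixed-size blocks only as it grows, and that a finished sequence returns its blocks to the pool at once, which is what lets in-flight batching admit new requests mid-batch.

```python
from dataclasses import dataclass, field


@dataclass
class PagedKVCache:
    """Toy block allocator: each sequence owns a list of fixed-size blocks."""

    num_blocks: int
    block_size: int  # tokens per block (hypothetical value chosen by the caller)
    free_blocks: list = field(init=False)
    block_tables: dict = field(default_factory=dict)  # seq_id -> list of block ids
    seq_lens: dict = field(default_factory=dict)      # seq_id -> tokens stored

    def __post_init__(self):
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # existing blocks are full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real scheduler would queue or preempt")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

With `block_size=16`, a 17-token sequence holds two blocks rather than a buffer sized for the maximum context length, so memory waste is bounded by one partially filled block per sequence.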

Why it exists

vLLM optimizes for *generality* across hardware; TensorRT-LLM optimizes for *peak throughput on NVIDIA silicon* by specializing its kernel stack into hardware-specific paths (TMA descriptors, asynchronous warp specialization, FP8 tensor cores, NVLink-aware all-reduces). The cost is a longer model-compile cycle (engine builds) and reduced portability. The claimed benefit is 1.5-3x throughput over vLLM on equivalent hardware for the most heavily optimized model families; the actual gain depends on the model, batch size, and quantization settings.

Primary use cases

High-throughput production inference for the Llama, Mixtral, DeepSeek, and Qwen model families on NVIDIA GPUs; batched serving with strict TPOT (time-per-output-token) targets; structured-output workloads (JSON generation, function calling) that lean on the structured-output and speculative-decoding stack; and multi-GPU serving over NVLink with tensor-parallel and pipeline-parallel sharding.
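Since TPOT targets come up above, a short sketch of how the metric is commonly computed may help. It assumes per-token arrival timestamps are available from the client or server; the function names `time_per_output_token` and `meets_tpot_target` are hypothetical helpers, not part of any TensorRT-LLM or Triton API. The first token is excluded because its latency reflects prefill (time to first token), not decode speed.

```python
def time_per_output_token(token_timestamps: list[float]) -> float:
    """Mean gap between consecutive output tokens, in seconds.

    token_timestamps[i] is the arrival time of the i-th generated token.
    The first token only serves as the starting point of the decode phase.
    """
    if len(token_timestamps) < 2:
        raise ValueError("need at least two output tokens to measure TPOT")
    decode_time = token_timestamps[-1] - token_timestamps[0]
    return decode_time / (len(token_timestamps) - 1)


def meets_tpot_target(token_timestamps: list[float], target_s: float) -> bool:
    """True when the request's mean TPOT is within the target."""
    return time_per_output_token(token_timestamps) <= target_s
```

For example, tokens arriving at 0.50 s, 0.52 s, 0.54 s, and 0.56 s give a TPOT of about 20 ms, which meets a 25 ms target and misses a 15 ms one.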

Recent developments

Latest signals
  • Native KV-cache offload to host CPU memory and NVMe. Tiered KV-cache management mirrors LMCache's HBM → host DRAM → NVMe spill pattern; prefix caches are re-warmed on demand. See the TensorRT-LLM repo for the kernel-level implementation.
  • Disaggregated serving via Triton. Prefill-decode disaggregation is supported through Triton Inference Server's executor pools: prefill engines run on one accelerator class and decode engines on another, with NIXL transferring KV caches between them. Per the Triton Inference Server docs.
  • Decoupled RoPE + MLA kernels merged. TensorRT-LLM added native kernels for DeepSeek's MLA + Decoupled RoPE pattern, removing the prior performance penalty for serving DeepSeek-family models. Per the TensorRT-LLM repo.
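
The HBM → host DRAM → NVMe spill pattern from the first item above can be sketched as a small LRU hierarchy. This is a toy model under stated assumptions, not the TensorRT-LLM or LMCache implementation: the class `TieredKVCache`, the tier names, and the block-count capacities are hypothetical, and real systems move tensors asynchronously rather than Python objects. It shows the two behaviors the item describes: eviction pushes the least recently used block one tier down, and a hit in a slower tier re-warms the block into the fastest tier.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy three-tier cache: blocks spill downward on eviction, promote on access."""

    def __init__(self, capacities: dict[str, int]):
        # Tier order runs fastest to slowest; capacities count blocks, not bytes.
        self.order = ["hbm", "host", "nvme"]
        self.capacities = capacities
        self.tiers = {name: OrderedDict() for name in self.order}

    def _insert(self, level: int, key, value) -> None:
        if level >= len(self.order):
            return  # fell off the slowest tier, so the block is dropped
        name = self.order[level]
        tier = self.tiers[name]
        tier[key] = value
        tier.move_to_end(key)  # mark as most recently used
        if len(tier) > self.capacities[name]:
            old_key, old_value = tier.popitem(last=False)  # least recently used
            self._insert(level + 1, old_key, old_value)    # spill one tier down

    def put(self, key, value) -> None:
        for tier in self.tiers.values():
            tier.pop(key, None)  # avoid holding the same block in two tiers
        self._insert(0, key, value)

    def get(self, key):
        for name in self.order:
            if key in self.tiers[name]:
                value = self.tiers[name].pop(key)
                self._insert(0, key, value)  # re-warm into the fastest tier
                return value
        return None  # miss: the caller would recompute the prefix

    def location(self, key):
        for name in self.order:
            if key in self.tiers[name]:
                return name
        return None
```

A miss returns `None` here; in a serving system that case corresponds to recomputing the prefix through prefill.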
