TensorRT-LLM

NVIDIA | Apache-2.0

Inference | 13.3K Stars | 2.2K Forks | 137 views

TensorRT-LLM is NVIDIA's open-source library for optimizing large language model (LLM) inference on NVIDIA GPUs. It provides a Python API for defining LLMs and applies optimizations including custom attention kernels, in-flight batching, paged KV caching, and quantization (FP8, FP4, INT4 AWQ, and INT8 SmoothQuant). Built on top of TensorRT and PyTorch, it targets high throughput and low latency for production LLM serving, and supports Mixture-of-Experts (MoE) models and the Blackwell GPU architecture.

Key Features

State-of-the-art quantization support: FP8, FP4, INT4 AWQ, INT8 SmoothQuant
In-flight batching and paged KV caching for high-throughput serving
Custom CUDA attention kernels for maximum GPU utilization
Mixture-of-Experts (MoE) model support with tensor parallelism
Blackwell GPU architecture support with speculative decoding
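
To illustrate why in-flight batching helps throughput, here is a minimal, library-free sketch. It is not TensorRT-LLM's scheduler: the function names are hypothetical, each request is reduced to a positive number of tokens to generate, every decoding step is assumed to produce exactly one token per active request, and prefill cost is ignored.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decoding steps when each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # Short requests sit idle until the longest one in the batch is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def inflight_batching_steps(lengths, batch_size):
    """Decoding steps when a freed slot is refilled before the next step."""
    queue = deque(lengths)   # waiting requests (tokens still to generate)
    active = []              # remaining tokens for each running request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free batch slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decoding step emits one token for every active request.
        steps += 1
        # Drop requests that just emitted their last token.
        active = [r - 1 for r in active if r > 1]
    return steps


if __name__ == "__main__":
    lengths = [8, 2, 2, 2]  # toy workload: one long request, three short ones
    print(static_batching_steps(lengths, batch_size=2))    # 10
    print(inflight_batching_steps(lengths, batch_size=2))  # 8
```

In this toy workload the static schedule needs 10 steps because the second batch cannot start until the 8-token request finishes, while the in-flight schedule needs 8 because short requests cycle through the second slot as it frees up. A real serving engine also has to manage KV-cache memory for each admitted request, which is the role paged KV caching plays.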

Related Projects

Ollama

llama.cpp

vLLM

Unsloth