NVIDIA Model Optimizer

NVIDIA · Apache-2.0

Inference · 2.4K Stars · 332 Forks · 92 views

NVIDIA Model Optimizer (formerly TensorRT Model Optimizer) is a unified library of state-of-the-art model optimization techniques, including quantization, pruning, distillation, speculative decoding, and sparsity. It compresses deep learning models for downstream deployment frameworks such as TensorRT-LLM, TensorRT, and vLLM in order to speed up inference. Supported quantization formats include NVFP4, FP8, INT8, and INT4, paired with advanced algorithms such as SmoothQuant, AWQ, and SVDQuant.

Key Features

Supports NVFP4, FP8, INT8, and INT4 quantization formats for faster inference
Advanced algorithms including SmoothQuant, AWQ, SVDQuant, and Double Quantization
Support for both Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT)
Deployment to the TensorRT-LLM, TensorRT, and vLLM frameworks
Pruning, distillation, speculative decoding, and sparsity techniques in one library
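
To make the post-training quantization idea above concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is not Model Optimizer's API: the function names (`calibrate_scale`, `quantize`, `dequantize`) and the sample values are hypothetical, and a max-absolute-value calibration is assumed purely for illustration. Algorithms such as SmoothQuant or AWQ are considerably more involved.

```python
from typing import List


def calibrate_scale(samples: List[float], num_bits: int = 8) -> float:
    """Choose a symmetric per-tensor scale from the largest absolute sample."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    amax = max(abs(x) for x in samples)
    return amax / qmax if amax > 0 else 1.0


def quantize(values: List[float], scale: float, num_bits: int = 8) -> List[int]:
    """Map floats to signed integers, clipping to the symmetric range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values: List[int], scale: float) -> List[float]:
    """Map integers back to approximate float values."""
    return [q * scale for q in q_values]


# Hypothetical "weights" standing in for a calibrated tensor.
weights = [0.02, -1.27, 0.63, 0.5, -0.01]
scale = calibrate_scale(weights)
q = quantize(weights, scale)
restored = dequantize(q, scale)

# Worst-case round-trip error; for in-range values it is at most scale / 2.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

The round trip shows the basic trade-off: each value is stored in 8 bits instead of 32, at the cost of a rounding error bounded by half the scale. Values outside the calibrated range are clipped, which is why the choice of calibration data matters in PTQ.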

Related Projects

Ollama

llama.cpp

Unsloth

SGLang