LongLive 2.0

NVIDIA Labs | Apache-2.0

Multimodal | 1.8K Stars | 169 Forks | 60 views

LongLive 2.0 is NVIDIA Labs' open-source infrastructure for training and serving autoregressive long-video generation models. Released under Apache 2.0, it reached 1,832 GitHub stars and 169 forks within days of launch. It tackles the part of video generation that almost every prior open-source release punted on: making long, multi-shot clips feasible to train and run on commodity hardware. The flagship LongLive-2.0-5B model reaches a generation throughput of 24.8 frames per second at an 85.06 VBench score, and the heavily quantized variants push past 45 FPS, putting open weights within striking distance of the closed video models that defined 2025.

## NVFP4 Quantization End-to-End

The headline technical bet is NVFP4, NVIDIA's 4-bit floating-point format introduced with Blackwell hardware. LongLive applies it not only to weights and activations (W4A4) but also to the KV cache, which is where most prior 4-bit video pipelines fell apart, because cache precision determines temporal coherence across hundreds of frames. The repository ships kernels that achieve 29.7 to 45.7 FPS on quantized 5B models with 2-step distillation, roughly 2x faster than the equivalent FP16 baseline while staying within a single percentage point of the original VBench score.

## Balanced Sequence Parallelism

For training, LongLive introduces a balanced sequence parallelism scheme designed specifically for autoregressive video. Naive sequence parallelism creates load imbalance: later frames attend to more context than earlier ones, so the ranks holding the end of a sequence do the most attention work while the others sit idle. The balanced variant redistributes work so that each rank performs a comparable amount of attention computation, recovering most of the throughput gap. Combined with TriAttention integration, attention sinks, KV-cache compression, and relative position embeddings, the system targets theoretically infinite video lengths by streaming generation rather than holding the full latent volume in memory.

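To make the load-imbalance argument concrete, here is a minimal sketch that compares a contiguous split of a causal sequence with a "zigzag" pairing of early and late chunks. The zigzag pairing is one common way to balance causal attention across ranks and is used here only as an illustration; the source does not describe LongLive's actual scheme in detail, and all function names below are hypothetical.

```python
from typing import List


def causal_cost(chunk_index: int) -> int:
    """Relative attention cost of a chunk that attends to itself and every earlier chunk."""
    return chunk_index + 1


def contiguous_assignment(num_chunks: int, num_ranks: int) -> List[List[int]]:
    """Naive split: each rank receives a consecutive run of chunks."""
    per_rank = num_chunks // num_ranks
    return [list(range(r * per_rank, (r + 1) * per_rank)) for r in range(num_ranks)]


def zigzag_assignment(num_chunks: int, num_ranks: int) -> List[List[int]]:
    """Illustrative balanced split: rank r gets chunk r and its mirror from the tail.

    Assumes the sequence was cut into exactly 2 * num_ranks chunks.
    """
    if num_chunks != 2 * num_ranks:
        raise ValueError("zigzag pairing expects num_chunks == 2 * num_ranks")
    return [[r, num_chunks - 1 - r] for r in range(num_ranks)]


def rank_loads(assignment: List[List[int]]) -> List[int]:
    """Total relative attention cost handled by each rank."""
    return [sum(causal_cost(c) for c in chunks) for chunks in assignment]


if __name__ == "__main__":
    print(rank_loads(contiguous_assignment(8, 4)))  # [3, 7, 11, 15]
    print(rank_loads(zigzag_assignment(8, 4)))      # [9, 9, 9, 9]
```

With 8 chunks on 4 ranks, the contiguous split leaves the last rank with five times the work of the first, while the paired split gives every rank the same load. A real implementation would also have to route keys and values between ranks, which this toy cost model ignores.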
## Multi-Shot Generation

Most prior open video models generate a single continuous clip and break when asked to cut between shots. LongLive trains on continuous multi-shot sequences with explicit shot-change conditioning, then uses multi-shot attention masking at inference so each new shot can attend to relevant past context without bleeding visual artifacts across cuts. The result is generations that more closely resemble actual short-form video content with deliberate edits, rather than the dreamy single-take aesthetic of earlier diffusion-based systems.

## Model Range and Hardware Requirements

The repository releases checkpoints from 1.3B to 5B parameters, all under Apache 2.0. The 1.3B model targets single-GPU consumer Blackwell cards, while the 5B model expects an H100 or B200 for full FP16 inference and a smaller card for quantized inference. Training recipes assume 8x H100 or equivalent for the smaller models and scale linearly to larger node counts using the included parallelism configurations. Async decoding and streaming VAE support keep peak memory bounded so long generations do not OOM the GPU regardless of clip duration.

## Honest Limits

LongLive is infrastructure first and foremost. The released checkpoints are research-grade rather than production tuned, the resolution ceiling is more modest than commercial offerings, and prompt adherence still trails the best closed video models on complex multi-character scenes. There is no built-in safety filter or watermarking, so deployers need to add those. But for anyone training their own video model or trying to integrate open video generation into a real product, LongLive is now the obvious open baseline for serious long-form work.

Key Features

NVFP4 W4A4 quantization including KV cache for 29.7-45.7 FPS on 5B models
Balanced sequence parallelism for autoregressive video training without GPU load imbalance
Multi-shot training with explicit shot-change conditioning and multi-shot attention masking
Model checkpoints from 1.3B to 5B parameters under Apache 2.0
TriAttention integration plus attention sinks and KV-cache compression for theoretically infinite video lengths
Streaming VAE and async decoding keep peak memory bounded for long generations
LongLive-2.0-5B reaches 24.8 FPS with 85.06 VBench score at full precision
2-step distillation recipes for fast inference without retraining
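
The attention-sink and bounded-memory features listed above can be illustrated with a small cache that keeps the first few entries forever and only a sliding window of recent ones. This is a sketch of the general sink-plus-window pattern, not LongLive's implementation: the class name `SinkKVCache` and the parameters `num_sink` and `window` are hypothetical, and frame ids stand in for real key/value tensors.

```python
from collections import deque
from typing import Deque, List


class SinkKVCache:
    """Toy cache: a fixed set of 'sink' entries plus a sliding window of recent entries."""

    def __init__(self, num_sink: int, window: int) -> None:
        self.num_sink = num_sink
        self.sink: List[int] = []
        # deque(maxlen=...) silently evicts the oldest entry once the window is full
        self.recent: Deque[int] = deque(maxlen=window)

    def append(self, frame_id: int) -> None:
        """Store a new entry; the earliest entries are pinned as sinks."""
        if len(self.sink) < self.num_sink:
            self.sink.append(frame_id)
        else:
            self.recent.append(frame_id)

    def visible(self) -> List[int]:
        """Entries the next generation step would be allowed to attend to."""
        return self.sink + list(self.recent)


if __name__ == "__main__":
    cache = SinkKVCache(num_sink=2, window=3)
    for frame_id in range(10):
        cache.append(frame_id)
    print(cache.visible())  # [0, 1, 7, 8, 9]
```

However many frames are appended, the cache never holds more than `num_sink + window` entries, which is the property that lets streaming generation run for arbitrary durations with bounded memory.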