NVIDIA Eagle

NVlabs · Apache-2.0

Vision · 1.5K Stars · 104 Forks · 108 views

Eagle is NVIDIA's open vision-language model family built around data-centric strategies rather than raw parameter scaling. The project has accumulated about 1,500 GitHub stars and now spans the original Eagle mixture-of-encoders work, Eagle 2 with its post-training recipes, Eagle 2.5 for long-context multimodal understanding, and the newest May 2026 release, LocateAnything, a generalist grounding and detection model. Beyond research, Eagle quietly powers several flagship NVIDIA stacks: Nemotron VLM, NeMo Retriever, Cosmos generative models, and the Isaac GR00T humanoid robotics platform.

## A Data-Centric Bet

Most modern VLM papers chase larger backbones and larger datasets. Eagle asks a different question: what happens when you keep model size moderate and obsess over data composition, instruction diversity, and post-training pipelines? The original Eagle answered with a mixture-of-encoders architecture that combines CLIP, ConvNeXt, EVA, Pix2Struct, and SAM features into a single VLM, letting each encoder contribute where it is strong. The work landed as an ICLR 2025 Spotlight, and Eagle 2.5 followed at NeurIPS 2025.

## Eagle 2.5 and Long Context

Eagle 2.5 extends the family to 128K-token context windows, enough to ingest book-length documents or full videos with rich captioning. The 8B variant pairs a Qwen2.5-7B-Instruct backbone with the SigLIP2 400M vision encoder, while smaller and larger Eagle 2 variants span 1B to 34B parameters with progressive capability scaling. The result is a tier of models that can handle dense document QA, video timestamping, and chart reasoning in one stack.

## LocateAnything

The May 2026 LocateAnything-3B release is the most concrete product of Eagle's strategy. Built on Qwen2.5-3B-Instruct with a MoonViT vision encoder and a 25K-token context window, it unifies dense object detection, document understanding, GUI grounding, and OCR through a vision-language interface. A parallel box decoding mechanism accelerates bounding box prediction, addressing the long-standing throughput bottleneck of generative detection. Demos show zero-shot ultra-dense pedestrian detection, GUI element grounding for agent automation, and document layout extraction without specialist heads.

## Why It Matters

Eagle is interesting not just as a research artifact but as a deployment substrate. NVIDIA explicitly positions it as a platform that supports enterprise intelligence (Nemotron, NeMo Retriever) and Physical AI (Isaac GR00T, Cosmos). That means the data strategies, encoder mixtures, and post-training pipelines used in Eagle are the same ones shipping inside production NVIDIA stacks, which gives external researchers a rare window into how a vertically integrated AI vendor builds its multimodal foundation.

## Open Models and Pragmatic Scaling

The range from 1B to 34B parameters is unusual in 2026, when most labs publish only the headline mega-model. It makes Eagle practical for groups that need to fit a capable VLM on a single GPU or run it on the edge for robotics. The code is Apache 2.0, while model weights ship under CC BY-NC 4.0 or a custom NVIDIA license, so teams should read the model card carefully before commercial use.

## Limitations

The heterogeneous licensing across code and weights means Eagle is straightforward to use for research and prototyping but requires more careful review for commercial production. Documentation quality varies across sub-projects, and some pretrained checkpoints depend on toolchains (TensorRT, NeMo) that favor NVIDIA hardware. Long-context inference at 128K tokens is memory-hungry, and operators should expect to either quantize or shard for serious deployments. For pure language tasks, dedicated LLMs still outperform Eagle's language head; the project's value is concentrated where vision and language must be jointly reasoned over.
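To give a rough sense of why 128K-token inference is memory-hungry, the sketch below estimates the size of the attention key/value cache alone. The helper name `kv_cache_bytes` is hypothetical, and the layer count, KV-head count, and head dimension are assumed values typical of a 7B-class model with grouped-query attention; they are not taken from the Eagle repository, so check the model's own config before relying on the numbers. Model weights, vision-encoder activations, and framework overhead are not included.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Estimate KV-cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# ASSUMED architecture values for illustration (7B-class, grouped-query attention).
ASSUMED_CONFIG = {"num_layers": 28, "num_kv_heads": 4, "head_dim": 128}

CONTEXT_TOKENS = 128 * 1024  # treating "128K" as 131,072 tokens (assumption)

fp16_bytes = kv_cache_bytes(CONTEXT_TOKENS, bytes_per_value=2, **ASSUMED_CONFIG)
int8_bytes = kv_cache_bytes(CONTEXT_TOKENS, bytes_per_value=1, **ASSUMED_CONFIG)

GIB = 2 ** 30
print(f"16-bit KV cache: {fp16_bytes / GIB:.1f} GiB")  # 7.0 GiB under these assumptions
print(f" 8-bit KV cache: {int8_bytes / GIB:.1f} GiB")  # 3.5 GiB under these assumptions
```

Under these assumptions a single full-length sequence needs about 7 GiB of cache on top of the weights, and the figure grows linearly with batch size. That is why quantizing the cache or sharding it across devices becomes attractive for serious deployments.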

## Key Features

- Mixture-of-encoders VLM architecture combining CLIP, ConvNeXt, EVA, Pix2Struct, and SAM
- Eagle 2.5 with 128K-token context for long video and document understanding
- LocateAnything-3B unifies dense detection, document understanding, GUI grounding, and OCR in a single VLM
- Parallel box decoding accelerates bounding box prediction over sequential generation
- Models from 1B to 34B parameters covering edge and data center deployment
- Backbone for NVIDIA Nemotron VLM, NeMo Retriever, Cosmos, and Isaac GR00T
- Apache 2.0 code license; model weights under CC BY-NC 4.0 or a custom NVIDIA license
- Research accepted at ICLR 2025 (Spotlight) and NeurIPS 2025
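As a toy illustration of the mixture-of-encoders idea listed above, the sketch below fuses per-token features from several vision encoders by concatenating them along the channel axis and then projecting them to a language-model width. Channel-wise concatenation is one simple fusion strategy assumed here for illustration; the function names `fuse_channelwise` and `project`, the tiny feature sizes, and the weights are all hypothetical and do not reflect Eagle's actual code.

```python
from typing import Dict, List

Features = List[List[float]]  # shape: [num_tokens][channels]


def fuse_channelwise(encoder_outputs: Dict[str, Features]) -> Features:
    """Concatenate each token's features from every encoder along the channel axis."""
    outputs = list(encoder_outputs.values())
    if not outputs:
        raise ValueError("need at least one encoder output")
    num_tokens = len(outputs[0])
    if any(len(feats) != num_tokens for feats in outputs):
        raise ValueError("all encoders must yield the same number of tokens")
    fused: Features = []
    for t in range(num_tokens):
        token: List[float] = []
        for feats in outputs:
            token.extend(feats[t])  # append this encoder's channels for token t
        fused.append(token)
    return fused


def project(fused: Features, weight: List[List[float]]) -> Features:
    """Apply a linear projection; weight has shape [out_channels][in_channels]."""
    return [[sum(x * w for x, w in zip(token, row)) for row in weight]
            for token in fused]


# Two visual tokens; the encoder names are labels only, the numbers are made up.
encoder_outputs = {
    "clip": [[1.0, 2.0], [3.0, 4.0]],  # 2 channels per token
    "sam": [[5.0], [6.0]],             # 1 channel per token
}
fused = fuse_channelwise(encoder_outputs)    # 3 channels per token
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # projects 3 channels down to 2
visual_tokens = project(fused, weight)
print(fused)          # [[1.0, 2.0, 5.0], [3.0, 4.0, 6.0]]
print(visual_tokens)  # [[1.0, 7.0], [3.0, 10.0]]
```

In a real system the encoders would produce tensors on different spatial grids that must first be resampled to a common token count, and the projection would be learned. The sketch only shows how concatenation keeps the token count fixed while each encoder contributes its own channels.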

## Related Projects

- ComfyUI (Comfy-Org, GPL-3.0)
- PaddleOCR
- Ultralytics YOLO
- Roboflow Supervision