Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
Eagle is NVIDIA's open vision-language model family built around data-centric strategies rather than raw parameter scaling. The project has accumulated about 1,500 GitHub stars and now spans the original Eagle mixture-of-encoders work, Eagle 2 with its post-training recipes, Eagle 2.5 for long-context multimodal understanding, and the newest release, LocateAnything (May 2026), a generalist grounding and detection model. Beyond research, Eagle quietly powers several flagship NVIDIA stacks: Nemotron VLM, NeMo Retriever, Cosmos generative models, and the Isaac GR00T humanoid robotics platform.

## A Data-Centric Bet

Most modern VLM papers chase larger backbones and larger datasets. Eagle asks a different question: what happens when you keep model size moderate and obsess over data composition, instruction diversity, and post-training pipelines? The original Eagle answered with a mixture-of-encoders architecture that combines CLIP, ConvNeXt, EVA, Pix2Struct, and SAM features into a single VLM, letting each encoder contribute where it is strongest. The work landed as an ICLR 2025 Spotlight, and Eagle 2.5 followed at NeurIPS 2025.

## Eagle 2.5 and Long Context

Eagle 2.5 extends the family to 128K-token context windows, enough to ingest book-length documents or full videos with rich captioning. The 8B variant pairs a Qwen2.5-7B-Instruct backbone with the SigLIP2 400M vision encoder, while smaller and larger Eagle 2 variants span 1B to 34B parameters with progressive capability scaling. The result is a tier of models that can handle dense document QA, video timestamping, and chart reasoning in one stack.

## LocateAnything

The May 2026 LocateAnything-3B release is the most concrete product of Eagle's strategy. Built on Qwen2.5-3B-Instruct with a MoonViT vision encoder and a 25K-token context window, it unifies dense object detection, document understanding, GUI grounding, and OCR through a vision-language interface.
A parallel box decoding mechanism accelerates bounding box prediction, addressing the long-standing throughput bottleneck of generative detection. Demos show zero-shot ultra-dense pedestrian detection, GUI element grounding for agent automation, and document layout extraction without specialist heads.

## Why It Matters

Eagle is interesting not just as a research artifact but as a deployment substrate. NVIDIA explicitly positions it as a platform that supports enterprise intelligence (Nemotron, NeMo Retriever) and Physical AI (Isaac GR00T, Cosmos). That means the data strategies, encoder mixtures, and post-training pipelines used in Eagle are the same ones shipping inside production NVIDIA stacks, which gives external researchers a rare window into how a vertically integrated AI vendor builds its multimodal foundation.

## Open Models and Pragmatic Scaling

The range from 1B to 34B parameters is unusual in 2026, when most labs publish only the headline mega-model. It makes Eagle practical for groups that need to fit a capable VLM on a single GPU, or run it on the edge for robotics. The code is Apache 2.0, while model weights ship under CC BY-NC 4.0 or a custom NVIDIA license, so teams should read the model card carefully before commercial use.

## Limitations

The heterogeneous licensing across code and weights means Eagle is straightforward to use for research and prototyping but requires more careful review for commercial production. Documentation quality varies across sub-projects, and some pretrained checkpoints depend on toolchains (TensorRT, NeMo) that favor NVIDIA hardware. Long-context inference at 128K tokens is memory-hungry, and operators should expect to either quantize or shard for serious deployments. For pure language tasks, dedicated LLMs still outperform Eagle's language head; the project's value is concentrated where vision and language must be jointly reasoned over.
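
To put a rough number on the 128K-token memory pressure mentioned above, the sketch below estimates the size of the attention key/value cache for a single long sequence. It is a back-of-the-envelope calculation, not a measurement of Eagle itself: the layer count, KV-head count, and head dimension are assumed values typical of a Qwen2.5-7B-style decoder and should be checked against the actual model config. The function name `kv_cache_bytes` is illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Approximate KV-cache size in bytes.

    Each token stores one key and one value vector (hence the factor 2)
    per layer and per KV head.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# ASSUMPTION: decoder shape typical of a Qwen2.5-7B-style model.
# These numbers are not taken from the Eagle model card; verify them
# against the checkpoint's config before relying on the result.
ASSUMED_LAYERS = 28
ASSUMED_KV_HEADS = 4
ASSUMED_HEAD_DIM = 128

if __name__ == "__main__":
    context = 128 * 1024  # 128K tokens
    for label, width in [("16-bit cache", 2), ("8-bit cache", 1)]:
        size = kv_cache_bytes(ASSUMED_LAYERS, ASSUMED_KV_HEADS,
                              ASSUMED_HEAD_DIM, context,
                              bytes_per_elem=width)
        print(f"{label}: {size / 2**30:.1f} GiB per sequence")
```

Under these assumptions, a 16-bit cache for one 128K-token sequence comes to about 7 GiB on top of the model weights, and an 8-bit cache halves that. The estimate leaves out activations, the vision encoder, and framework overhead, so real usage will be higher.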