Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
Grounding DINO is an open-set object detector from IDEA-Research that finds objects described by arbitrary natural-language text instead of a fixed list of categories. Released as the official implementation of the ECCV 2024 paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection," it has become one of the most widely referenced open-vocabulary detection projects on GitHub, with more than 10,000 stars. The repository ships PyTorch code and pretrained weights, and the model is also integrated into Hugging Face Transformers for convenient use.

## How It Works

The core idea is in the title: it marries the Transformer-based DINO detector with grounded pre-training that aligns image regions with language. A user supplies an image and a text prompt (a list of class names or a free-form phrase), and the model returns bounding boxes for the matching objects. Because detection is driven by text rather than a closed label set, the same model can locate categories it was never explicitly trained to detect, which is the defining property of open-set (open-vocabulary) detection.

## Capabilities

Grounding DINO reports strong zero-shot results on standard benchmarks, including COCO, LVIS, and ODinW (Object Detection in the Wild), and supports referring expression comprehension, where objects are selected by descriptive phrases rather than single nouns. This flexibility makes it useful well beyond classic detection: a common pattern is automated data labeling, where the model proposes boxes from text prompts that humans then verify, dramatically reducing annotation cost.

## Ecosystem

The project sits at the center of a broader toolchain. It is the detection front end for Grounded SAM and Grounded SAM 2, which pair it with Meta's Segment Anything models to turn text prompts into segmentation masks and open-world object tracking.
The team also released Grounding DINO 1.5 as a more capable successor, and the original model is available through Hugging Face, Colab demos, and Roboflow tutorials, lowering the barrier to experimentation.

## Considerations

The public repository reflects a research codebase: setup involves building CUDA extensions, and the main branch has not seen frequent updates since the 1.5 line and Hugging Face integration arrived, so many users now access the model through Transformers instead. As an open-vocabulary detector, accuracy varies with how prompts are phrased, and very fine-grained or ambiguous descriptions can be hit or miss. Even so, for teams that need flexible, prompt-driven detection, or a foundation for segmentation and tracking pipelines, Grounding DINO remains a landmark, Apache-2.0 licensed reference implementation.
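
Since results depend on how the prompt is phrased, it can help to build prompts consistently. The sketch below turns a list of class names into a single text prompt. It assumes the convention commonly seen in the project's published examples: lowercase phrases separated by periods, with a trailing period. The helper name `build_prompt` is hypothetical and not part of any library, so check the documentation for the version you use before relying on this format.

```python
def build_prompt(class_names):
    """Join class names into one lowercase, period-separated text prompt.

    Assumed convention (not verified against every release): phrases are
    lowercased, separated by " . ", and the prompt ends with a period.
    """
    cleaned = []
    for name in class_names:
        # Lowercase, drop stray periods, and collapse repeated whitespace so
        # that the period separator stays unambiguous.
        phrase = " ".join(name.lower().replace(".", " ").split())
        # Skip empty entries and duplicates while preserving input order.
        if phrase and phrase not in cleaned:
            cleaned.append(phrase)
    if not cleaned:
        raise ValueError("at least one non-empty class name is required")
    return " . ".join(cleaned) + " ."


if __name__ == "__main__":
    # Hypothetical class list for illustration only.
    print(build_prompt(["Cat", "remote control ", "cat"]))
```

The resulting string would be passed as the text input alongside the image. Keeping prompt construction in one place makes it easier to compare phrasings when a description turns out to be hit or miss.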