Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
## Introduction

Object segmentation and tracking in video has long been a two-step problem: first, detect what you care about; then, follow it across frames. These steps were historically performed by separate models with incompatible interfaces, requiring bespoke glue code and careful post-processing. Grounded SAM 2, developed by IDEA-Research, collapses this pipeline into a unified, text-driven system capable of finding, segmenting, and tracking any object in a video using only a natural language description.

Built on top of Meta's Segment Anything Model 2 (SAM 2) and IDEA-Research's own Grounding DINO family of open-vocabulary detectors, Grounded SAM 2 has accumulated over 3,300 stars since its release and has become a go-to foundation for video understanding research and production applications. The project's philosophy emphasizes simplicity: rather than introducing new model weights, it assembles existing state-of-the-art models into a clean, composable pipeline with minimal implementation overhead.

## Architecture and Design

The system is structured as a three-stage pipeline.

### Stage 1: Open-Vocabulary Detection

Grounded SAM 2 supports multiple detector backends for the initial localization step:

| Detector | Type | Access |
|----------|------|--------|
| Grounding DINO | Open-source, local | Free |
| Grounding DINO 1.5 / 1.6 | API-based, higher accuracy | Cloud API |
| Florence-2 | Open-source, local | Free |
| DINO-X | API-based, strongest generalization | Cloud API |

Each detector takes a natural language prompt (e.g., `"person carrying a red bag"`) and returns bounding boxes with confidence scores. Florence-2 is particularly noteworthy as it supports dense region captioning — it can describe what it sees without requiring a specific query, enabling fully automated annotation workflows.

### Stage 2: Instance Segmentation with SAM 2

Detected bounding boxes are fed as prompts to SAM 2, which produces pixel-precise segmentation masks.
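The handoff between the two stages is largely box bookkeeping: Grounding DINO's reference inference code returns boxes in normalized center format `(cx, cy, w, h)`, while SAM 2's box prompt expects absolute corner coordinates `(x1, y1, x2, y2)`. A minimal sketch of that conversion (the function name and array layout here are illustrative, not part of either library's API):

```python
import numpy as np

def cxcywh_norm_to_xyxy(boxes: np.ndarray, img_w: int, img_h: int) -> np.ndarray:
    """Convert normalized (cx, cy, w, h) boxes to absolute (x1, y1, x2, y2).

    A conversion like this sits between the detection stage (normalized
    center-format boxes) and the SAM 2 prompting stage (absolute corners).
    """
    cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return np.stack([x1, y1, x2, y2], axis=1)

# A box centered in a 1000x800 image, spanning half of each dimension:
boxes = np.array([[0.5, 0.5, 0.5, 0.5]])
print(cxcywh_norm_to_xyxy(boxes, 1000, 800))  # [[250. 200. 750. 600.]]
```

The resulting `(N, 4)` array can then be passed, one box per detected object, as the box prompt for mask prediction.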
SAM 2's architecture extends the original SAM with a streaming memory mechanism that maintains object state across video frames, enabling robust tracking even through occlusions, fast motion, and scene changes. The SAM 2.1 update, supported in this repository, brought significant improvements in mask quality on challenging boundaries and thin structures — critical for applications like human pose analysis and medical imaging.

### Stage 3: Visualization via Supervision

The framework integrates with Roboflow's `supervision` library for rich annotation rendering, supporting:

- Bounding box overlays with class labels
- Colored segmentation mask overlays
- Track ID persistence across frames
- Export to standard annotation formats (COCO JSON, YOLO)

## Key Capabilities

### Image Grounding and Segmentation

For single-image tasks, the pipeline accepts a text prompt and returns annotated images with bounding boxes and masks for all matching objects. A typical use case is automated dataset labeling: given an unlabeled image collection and a class list, Grounded SAM 2 can produce COCO-format annotations in a fraction of the time required for manual annotation.

### Video Object Tracking

The most powerful capability is end-to-end video tracking from text prompts. The workflow:

1. User provides a video file and text description of the target object
2. Grounding DINO locates the object in the first frame
3. SAM 2 segments it and initializes its tracking state
4. SAM 2's memory module propagates the mask across subsequent frames
5. Visualization renders the tracked object with consistent ID coloring

This workflow has proven particularly valuable for sports analytics, wildlife monitoring, and retail foot traffic analysis — domains where manual video annotation is prohibitively expensive.

### High-Resolution Inference with SAHI

A key limitation of most object detection systems is degraded performance on very large images with small objects.
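One standard remedy is sliced inference: run the detector on overlapping crops of the full-resolution image and merge the per-crop results. The tile-coordinate arithmetic behind that idea can be sketched as follows (a simplified illustration of the slicing step only, not the actual SAHI implementation):

```python
def tile_coords(img_w: int, img_h: int, tile: int = 1024, overlap: float = 0.2):
    """Compute overlapping square windows (x1, y1, x2, y2) covering an image.

    Simplified sketch of the slicing step in sliced inference: consecutive
    windows overlap by `overlap`, and the last row/column is shifted inward
    so every pixel is covered without windows running past the image edge.
    Assumes the image is at least `tile` pixels in each dimension.
    """
    stride = int(tile * (1 - overlap))
    xs = list(range(0, max(img_w - tile, 0) + 1, stride))
    ys = list(range(0, max(img_h - tile, 0) + 1, stride))
    # Ensure the right and bottom edges are covered by a final shifted window.
    if xs[-1] + tile < img_w:
        xs.append(img_w - tile)
    if ys[-1] + tile < img_h:
        ys.append(img_h - tile)
    return [(x, y, x + tile, y + tile) for y in ys for x in xs]

# A 4K frame with 1024-px tiles and 20% overlap yields a 5x3 grid of windows:
windows = tile_coords(3840, 2160, tile=1024, overlap=0.2)
```

Detection then runs on each window independently, detected boxes are shifted back by each window's `(x1, y1)` offset, and overlapping duplicates are suppressed with NMS.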
Grounded SAM 2 integrates SAHI (Slicing Aided Hyper Inference), which:

- Tiles high-resolution images (e.g., 4K, 8K) into overlapping patches
- Runs detection independently on each patch
- Merges results with NMS for final predictions

This makes the system viable for satellite imagery analysis, microscopy, and drone footage — applications where target objects may occupy fewer than 32×32 pixels in a full-resolution frame.

### Auto-Labeling Pipeline

The Florence-2 integration enables a zero-shot auto-labeling capability: Florence-2 generates dense captions for image regions, which are then used as grounding queries to SAM 2. The resulting masks can be exported directly to training datasets, closing the loop between unlabeled data collection and model training without any human annotation.

## Developer Integration

Installation requires SAM 2 and the appropriate detector:

```bash
pip install torch torchvision
pip install git+https://github.com/IDEA-Research/Grounded-SAM-2
pip install supervision
```

A minimal image segmentation example:

```python
from grounded_sam2 import GroundedSAM2

model = GroundedSAM2(detector="grounding_dino", sam_variant="sam2.1_hiera_large")
results = model.predict(image="photo.jpg", text="person . car . bicycle")
results.visualize(output_path="annotated.jpg")
```

For video tracking, results are output as annotated video files with frame-level JSON metadata containing object IDs, class labels, bounding boxes, and mask RLEs.

## Performance Characteristics

| Task | Speed (A100) | Memory |
|------|-------------|--------|
| Image segmentation (Grounding DINO + SAM 2 Large) | ~2 FPS | ~10 GB |
| Video tracking (1080p, SAM 2 Large) | ~8 FPS | ~8 GB |
| High-res with SAHI (4K, 4×4 tiles) | ~0.4 FPS | ~12 GB |

## Limitations

- **Grounding accuracy ceiling**: Open-vocabulary detection accuracy degrades for highly specific, fine-grained queries (e.g., distinguishing specific dog breeds) compared to trained class-specific detectors.
- **API dependency for best accuracy**: The highest-performing detectors (Grounding DINO 1.5/1.6, DINO-X) require cloud API access with associated latency and cost.
- **SAM 2 tracking drift**: On very long videos (>500 frames) or scenes with severe occlusion, the tracking memory can drift, requiring re-initialization.
- **No multi-object re-identification**: Objects that leave and re-enter the frame are assigned new track IDs rather than being re-identified.

## Who Should Use This

Grounded SAM 2 is well-suited for:

- **Computer vision researchers** building video understanding benchmarks who need flexible, zero-shot annotation capabilities
- **ML engineers** creating training datasets at scale without manual labeling
- **Robotics developers** building perception systems that need to track objects specified via natural language commands
- **Analysts** in sports, retail, or security domains who need to extract object trajectories from video without writing custom tracking code

For anyone working on video AI in 2026, Grounded SAM 2 represents the most accessible path to production-quality, text-driven object tracking without training a single new model.