Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
## Introduction

SAM 2 (Segment Anything Model 2) is Meta AI Research's foundation model for promptable visual segmentation, extending the original Segment Anything Model from still images to video. Where the first SAM made "segment anything you can point at" a reality for photos, SAM 2 generalizes that capability across time: a single click, box, or mask prompt on one frame propagates to track and segment an object through an entire video clip. Initially released in July 2024 and refined as SAM 2.1 in September 2024, the project has grown to over 19,000 GitHub stars and become a default building block for video annotation tools, object-tracking pipelines, and downstream vision research.

## Architecture

SAM 2 is built on a simple transformer architecture augmented with a streaming memory module. Rather than treating each video frame independently, the model maintains a memory bank of past frames and prompts, allowing it to carry an object's identity forward and recover from occlusions. The streaming design processes frames one at a time, which keeps the model real-time capable and removes the need to load an entire clip into memory before inference. For static images, SAM 2 collapses to a single-frame mode that retains full feature parity with the original SAM, including click prompts, box prompts, and automatic mask generation.

## Model Variants

The SAM 2.1 release ships four checkpoints that trade accuracy for speed:

| Model | Parameters | Speed (FPS) |
|-------|------------|-------------|
| Tiny  | 38.9M      | 91.2        |
| Small | 46M        | 84.8        |
| Base+ | 80.8M      | 64.1        |
| Large | 224.4M     | 39.5        |

The Tiny and Small variants are fast enough for interactive, near-real-time use on a single GPU, while the Large checkpoint targets maximum segmentation quality for offline or batch workloads.
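The speed and size trade-off in the table can be turned into a small selection helper. The sketch below is illustrative only: `Variant`, `SAM21_VARIANTS`, and `pick_variant` are hypothetical names, the figures are copied from the table, and parameter count is assumed to be a rough proxy for segmentation quality. Actual throughput depends on your hardware and input resolution.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    name: str
    params_millions: float
    fps: float  # throughput figure from the table above


# Figures copied from the table above, ordered smallest to largest.
SAM21_VARIANTS = [
    Variant("Tiny", 38.9, 91.2),
    Variant("Small", 46.0, 84.8),
    Variant("Base+", 80.8, 64.1),
    Variant("Large", 224.4, 39.5),
]


def pick_variant(min_fps: float) -> Optional[Variant]:
    """Return the largest variant whose listed FPS meets min_fps, or None."""
    eligible = [v for v in SAM21_VARIANTS if v.fps >= min_fps]
    # Assumption: more parameters means better masks, so prefer the largest
    # model that still satisfies the throughput requirement.
    return max(eligible, key=lambda v: v.params_millions, default=None)


if __name__ == "__main__":
    for target in (30, 60, 90, 120):
        choice = pick_variant(target)
        print(target, choice.name if choice else "no variant fast enough")
```

A target above the Tiny checkpoint's listed speed returns `None`, which signals that the workload needs frame skipping, a lower resolution, or faster hardware rather than a different checkpoint.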
## Key Capabilities

### Promptable Video Segmentation

A user provides a prompt (a click, bounding box, or mask) on any frame, and SAM 2 propagates the segmentation forward and backward across the clip. Refinements can be added interactively on later frames, and the memory module incorporates them to correct drift.

### Multi-Object Tracking

SAM 2 can track multiple objects simultaneously, maintaining a separate memory state per object, which makes it practical for scene-level annotation rather than single-subject cutouts.

### Occlusion Handling

Because the memory bank stores appearance information from earlier frames, the model can re-acquire an object after it is briefly hidden or leaves and re-enters the frame.

### Image Parity with SAM

In single-frame mode the model retains the full original SAM toolkit, so existing image-segmentation workflows can adopt SAM 2 without losing functionality.

## Training Data

SAM 2 was trained with a model-in-the-loop data engine that produced the SA-V dataset, described by the authors as the largest video segmentation dataset to date. The data engine used the model itself to accelerate human annotation, then fed the results back into training in an iterative loop. Meta also released the training code, allowing teams to fine-tune SAM 2 on their own video datasets.

## Benchmarks

The SAM 2.1 large model reports strong results across standard video object segmentation benchmarks, measured as J&F (the mean of region similarity and contour accuracy): 79.5 on the SA-V test set, 74.6 on MOSE validation, and 80.6 on LVOS v2. These numbers reflect the model's ability to maintain accurate masks over long, challenging sequences.

## Deployment and Integration

SAM 2 requires Python 3.10+ and PyTorch 2.5.1 or newer with TorchVision. The repository exposes two primary APIs, `SAM2ImagePredictor` for images and `SAM2VideoPredictor` for video, and integrates with the Hugging Face model hub for one-line checkpoint loading. Video inference supports `torch.compile` for a meaningful speedup.
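The video workflow described above (prompt one frame, then propagate) can be sketched as follows. This is a minimal sketch, assuming the `sam2` package is installed and that the method names match the repository's README (`from_pretrained`, `init_state`, `add_new_points_or_box`, `propagate_in_video`). The helper names `load_predictor` and `track_from_click` and the default model id are illustrative choices, not part of the library.

```python
from typing import Any, Dict, Sequence, Tuple


def load_predictor(model_id: str = "facebook/sam2-hiera-large") -> Any:
    """Load a video predictor from the Hugging Face hub.

    The import is deferred so this module can be imported without the
    `sam2` package. The default model id is an assumption; substitute the
    checkpoint you actually want.
    """
    from sam2.sam2_video_predictor import SAM2VideoPredictor

    return SAM2VideoPredictor.from_pretrained(model_id)


def track_from_click(
    predictor: Any,
    video_path: str,
    point_xy: Tuple[float, float],
    frame_idx: int = 0,
    obj_id: int = 1,
) -> Dict[int, Tuple[Sequence[int], Any]]:
    """Prompt one object with a single positive click and propagate it.

    Returns {frame index: (object ids, mask logits)} for every frame the
    predictor yields. Thresholding the logits is left to the caller.
    """
    state = predictor.init_state(video_path)

    # One foreground click: label 1 marks a positive point, 0 a negative one.
    # The official notebooks pass NumPy arrays here; plain lists are assumed
    # to be converted by the library.
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=frame_idx,
        obj_id=obj_id,
        points=[[point_xy[0], point_xy[1]]],
        labels=[1],
    )

    results = {}
    for out_frame_idx, out_obj_ids, out_mask_logits in predictor.propagate_in_video(state):
        results[out_frame_idx] = (out_obj_ids, out_mask_logits)
    return results
```

On a CUDA GPU, the README wraps these calls in `torch.inference_mode()` and `torch.autocast("cuda", dtype=torch.bfloat16)`. A call would then look like `track_from_click(load_predictor(), "path/to/clip", (210.0, 350.0))`, where the path and click coordinates are placeholders. Passing the predictor in as an argument keeps the helper easy to exercise with a stub when no GPU is available.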
The model checkpoints, demo, and training code are released under the permissive Apache-2.0 license, with optional post-processing components under BSD-3-Clause.

## Why It Matters

Video segmentation has historically required task-specific models and heavy per-dataset tuning. SAM 2 collapses that into a single promptable foundation model that works zero-shot across domains, turning a research-grade capability into an off-the-shelf tool. For annotation teams, it dramatically reduces the manual effort of labeling video; for product builders, it provides a reliable tracking backbone that can be dropped into editing, robotics, and AR pipelines.

## Limitations

SAM 2 segments and tracks objects but does not classify or name them: it answers "where is this object" rather than "what is it," so downstream recognition still needs a separate model. Very fast motion, heavy crowding of visually similar objects, and long-duration occlusions can still cause identity switches or mask drift that require manual correction. The Large checkpoint's accuracy comes at a throughput cost that may not suit strict real-time budgets, pushing latency-sensitive deployments toward the Tiny or Small variants.
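One way to find the frames that may need the manual correction mentioned above is to watch for sudden changes in mask size. The sketch below is a hypothetical review heuristic, not part of SAM 2: `flag_area_jumps` and its `max_ratio` threshold are assumptions for illustration, and the input is taken to be a per-frame count of foreground pixels for one tracked object.

```python
from typing import List, Sequence


def flag_area_jumps(areas: Sequence[int], max_ratio: float = 2.0) -> List[int]:
    """Return frame indices whose mask area jumps relative to the previous frame.

    areas[i] is the number of foreground pixels in frame i. A frame is flagged
    when its area grows or shrinks by more than max_ratio, or when the mask
    appears from, or disappears to, zero pixels.
    """
    flagged = []
    for i in range(1, len(areas)):
        prev, curr = areas[i - 1], areas[i]
        if prev == 0 and curr == 0:
            continue  # object absent in both frames: nothing changed
        if prev == 0 or curr == 0:
            flagged.append(i)  # mask appeared or vanished
            continue
        ratio = max(prev, curr) / min(prev, curr)
        if ratio > max_ratio:
            flagged.append(i)
    return flagged
```

A flagged frame is only a candidate for review. A genuine occlusion or a fast zoom changes the area just as abruptly as drift or an identity switch does, so a person still decides whether a corrective prompt is needed.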