Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
## Magma: Microsoft's Foundation Model for Multimodal AI Agents

### Introduction

The gap between AI systems that can understand visual content and AI systems that can take actions in the world has been one of the most persistent challenges in multimodal AI research. Vision-language models excel at describing what they see but struggle to translate that understanding into concrete actions like clicking a UI button, navigating a webpage, or guiding a robotic arm. Magma, developed by Microsoft Research and presented at CVPR 2025, bridges this gap by providing a single foundation model that perceives the world through images and video while generating goal-driven actions across both digital interfaces and physical environments.

With 1,900 GitHub stars and the full weight of Microsoft Research behind it, Magma represents a significant step toward truly agentic multimodal AI. Built on Llama-3-8B-Instruct as its language backbone, the model introduces novel pretraining techniques that enable it to learn action capabilities from unlabeled videos at scale, a breakthrough that dramatically reduces the cost and complexity of training agentic models.

### Feature Overview

**1. Unified Digital and Physical World Agents**

Magma's most distinctive capability is its ability to operate across both digital and physical domains within a single model. In the digital world, it can navigate web interfaces, interact with UI elements, and complete multi-step tasks on desktop and mobile screens. In the physical world, it generates visual plans for robotic manipulation tasks. This dual capability is unusual; most existing models are specialized for either UI interaction or robotics, but Magma handles both through a shared representation that connects visual understanding with action prediction.

**2. Set-of-Mark (SoM) and Trace-of-Mark (ToM) Pretraining**

The key technical innovation in Magma is its use of two auxiliary pretraining tasks.
Set-of-Mark (SoM) teaches the model to identify and reference specific visual elements (buttons, icons, objects) by overlaying numbered markers on images, creating a grounding mechanism between visual regions and textual references. Trace-of-Mark (ToM) extends this to videos by tracking how objects and points of interest move through temporal sequences, enabling the model to understand motion trajectories and predict action paths. These tasks serve as a bridge between the text modality (which the language backbone handles natively) and the action modality (which requires spatial-temporal reasoning).

**3. Scalable Training from Unlabeled Videos**

Traditional approaches to training agentic models require expensive labeled demonstration data showing correct actions for specific tasks. Magma circumvents this bottleneck by extracting action supervision signals from unlabeled videos in the wild. Using motion tracking (Co-Tracker), the system identifies object trajectories in ordinary videos, filters out camera motion, and converts the remaining object movements into action training data. This approach makes Magma's training pipeline orders of magnitude more scalable than demonstration-dependent methods, as unlabeled video data is abundantly available.

**4. State-of-the-Art Agentic Performance**

Magma achieves state-of-the-art performance on UI navigation benchmarks (Mind2Web for web navigation, AITW for Android tasks) and robotics manipulation tasks (using Open-X-Embodiment data), along with competitive results on generic vision-language benchmarks. Importantly, the model demonstrates strong spatial understanding, a capability that emerges from the SoM/ToM pretraining rather than being explicitly supervised. This spatial reasoning ability transfers across domains, benefiting both UI grounding and robotic planning.

**5. Open Weights and Comprehensive Release**

Microsoft has released Magma with full model weights on Hugging Face and Azure AI Foundry, complete training and inference code on GitHub, and annotated datasets (Mind2Web and AITW with SoM prompting annotations). The MIT license enables unrestricted commercial and research use. The release also includes demo applications for UI agents, gaming agents, and robot visual planning, making it straightforward for researchers and developers to build on the model.

### Usability Analysis

Magma can be loaded using standard HuggingFace Transformers patterns, with support for bitsandbytes quantization to reduce memory requirements. The model accepts chat-formatted inputs with special image tokens, making the API familiar to anyone who has worked with vision-language models. The included demo scripts cover common use cases: UI element grounding from screenshots, visual question answering, and multi-image reasoning.

The setup requirements are moderate: a custom branch of Transformers (version 4.49.0+), PyTorch with CUDA support, and Co-Tracker for motion extraction during data preprocessing. The model runs on a single A100 GPU for inference, with quantized variants accessible on consumer hardware.

The main usability challenge is that Magma is a foundation model, not a plug-and-play agent. Deploying it for a specific UI automation task requires additional engineering to handle environment interaction, action execution, and task-specific prompting. The demos provide a starting point, but production deployment requires meaningful integration work.
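As a rough sketch of that loading pattern, the helper below follows the standard Transformers idiom described above. The model id `microsoft/Magma-8B`, the need for `trust_remote_code=True`, and the bitsandbytes 4-bit option are assumptions based on the public model card; check the repository for the exact, current usage.

```python
def load_magma(model_id: str = "microsoft/Magma-8B", quantize: bool = False):
    """Load Magma via the standard Transformers pattern (illustrative sketch).

    Imports are deferred so this snippet can be defined without a GPU
    stack installed. trust_remote_code=True is assumed to be required
    because Magma ships custom model/processor classes.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    kwargs = {"trust_remote_code": True, "torch_dtype": torch.bfloat16}
    if quantize:
        # Optional 4-bit loading via bitsandbytes for consumer GPUs.
        from transformers import BitsAndBytesConfig
        kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True)

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", **kwargs
    )
    return model, processor
```

From here, inputs would be built with the processor's chat template (image tokens plus text) and passed to `model.generate`, as in the official demo scripts.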
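To make the Set-of-Mark idea from the Feature Overview concrete, here is a minimal Pillow sketch that overlays numbered markers on candidate regions. The helper and the hand-picked boxes are hypothetical; Magma's actual SoM pipeline derives candidate regions automatically rather than taking them as input.

```python
from PIL import Image, ImageDraw

def overlay_marks(image, boxes):
    """Draw SoM-style numbered markers over an image.

    `boxes` is a list of (x0, y0, x1, y1) regions (e.g. detected UI
    elements). Each region gets an outline plus a numeric label that
    text can refer back to ("click mark 2"), grounding visual regions
    in textual references.
    """
    out = image.copy()
    draw = ImageDraw.Draw(out)
    for i, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle([x0, y0, x1, y1], outline="red", width=2)
        draw.text((x0 + 3, y0 + 3), str(i), fill="red")
    return out

# Toy example: two fake "buttons" on a blank 200x100 screenshot.
img = Image.new("RGB", (200, 100), "white")
marked = overlay_marks(img, [(10, 10, 80, 40), (110, 50, 190, 90)])
```

The point of the overlay is that the model can now emit "mark 1" or "mark 2" as an action target instead of raw pixel coordinates.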
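The camera-motion filtering step behind the unlabeled-video training can likewise be illustrated with a toy NumPy sketch. The array shapes, the median-based camera estimate, and the threshold are all assumptions for illustration; Magma's real pipeline extracts tracks with Co-Tracker and is considerably more involved.

```python
import numpy as np

def object_tracks(tracks, motion_thresh=1.0):
    """Separate object motion from camera motion in point tracks.

    tracks: array of shape (T, N, 2) -- N points tracked over T frames.
    Per-frame camera motion is estimated as the median displacement over
    all points (most points lie on the static background) and subtracted
    from every track; indices of tracks whose residual path length
    exceeds `motion_thresh` are returned as genuine object motion.
    """
    disp = np.diff(tracks, axis=0)                   # (T-1, N, 2) per-frame steps
    camera = np.median(disp, axis=1, keepdims=True)  # (T-1, 1, 2) global motion
    residual = disp - camera                         # motion relative to camera
    path_len = np.abs(residual).sum(axis=(0, 2))     # (N,) residual path length
    return np.where(path_len > motion_thresh)[0]

# Toy example: 6 points drifting with the camera; point 5 also moves right.
T, N = 10, 6
tracks = np.cumsum(np.ones((T, N, 2)) * 0.5, axis=0)  # uniform camera drift
tracks[:, 5, 0] += np.arange(T) * 2.0                 # extra object motion
moving = object_tracks(tracks)  # -> indices of genuinely moving points
```

Only the track that moves relative to the scene survives the filter, which is the signal converted into action training data.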
### Pros and Cons

**Pros**

- Unified model for both digital (UI navigation) and physical (robotics) agentic tasks
- Novel SoM/ToM pretraining enables action learning from unlabeled video at scale
- State-of-the-art performance on UI navigation and robotics benchmarks
- Full open release: MIT license, model weights, training code, and annotated datasets
- Strong spatial reasoning capability that transfers across domains
- CVPR 2025 acceptance validates the technical contribution

**Cons**

- Requires significant integration work to deploy as a production agent system
- 8B-parameter model requires GPU inference (no efficient CPU or mobile path yet)
- UI navigation performance depends heavily on screenshot quality and resolution
- Robotics capabilities are demonstrated in simulation; real-world deployment needs additional engineering

### Outlook

Magma points toward a future where a single foundation model can serve as the perception and planning backbone for diverse agentic applications. The SoM/ToM pretraining paradigm is particularly significant because it decouples agentic capability from expensive labeled demonstrations, suggesting a scalable path to more capable agents as more video data and compute become available.

Microsoft's investment in open-releasing the full training pipeline and datasets signals that Magma is intended as a platform for the research community rather than a one-off demonstration. We can expect derivative models, fine-tuned variants for specific domains (customer service automation, warehouse robotics, game testing), and integration into Microsoft's broader AI agent ecosystem.

### Conclusion

Magma is a landmark contribution to multimodal AI agents. By unifying visual understanding with action generation across digital and physical worlds, and by demonstrating that agentic capabilities can be learned from unlabeled video at scale, it lowers the barrier to building AI systems that do not just see and speak, but act.
For researchers and developers working on AI agents, UI automation, or embodied AI, Magma provides both a strong foundation model and an open blueprint for training the next generation of agentic systems.
**hacksider**: Real-time AI face swap and one-click video deepfake with only a single image.

**harry0703**: AI-powered short video generator that automates scripting, footage sourcing, subtitles, and composition, supporting 10+ LLM providers and batch production.