Meta Launches Llama 4 Scout and Maverick: First Open-Weight Multimodal MoE Models
Meta releases Llama 4 Scout (17B active, 16 experts, 10M context) and Maverick (17B active, 128 experts), its first natively multimodal mixture-of-experts models.
A New Architecture for the Llama Family
Meta released Llama 4 Scout and Llama 4 Maverick on April 5, 2025, marking the most significant architectural shift in the Llama model family to date. For the first time, Llama models are built on a mixture-of-experts (MoE) architecture and trained as natively multimodal systems from the ground up. Both models accept text, images, and video as input, and both are available as open-weight downloads on Hugging Face and llama.com.
The release follows months of anticipation after Meta announced it would invest as much as $65 billion in AI infrastructure in 2025. Llama 4 represents the technical outcome of that investment: a model family designed to compete with proprietary offerings from Google, OpenAI, and Anthropic while remaining freely available to the developer community under open-weight terms.
Llama 4 Scout: A 10-Million-Token Context Window on a Single GPU
Llama 4 Scout is the smaller of the two models, with 17 billion active parameters spread across 16 experts and 109 billion total parameters. Its defining feature is an industry-leading context window of 10 million tokens, the longest available in any open-weight model. The base model was pre-trained with a 256K context length using the iRoPE architecture, which interleaves attention layers without positional embeddings among standard rotary-embedding (RoPE) layers; the instruct-tuned version extends this to the full 10M tokens.
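To make the "17B active out of 109B total" distinction concrete, here is a minimal sketch of top-k expert routing in an MoE feed-forward layer. The dimensions are toy values, and Llama 4's actual design (which Meta describes as also using a shared expert alongside the routed ones) is more involved; this only illustrates why each token touches a small fraction of the total parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions -- illustrative only, far smaller than Llama 4's.
d_model, d_ff, n_experts, top_k = 8, 16, 16, 1

# One feed-forward "expert" per slot: W_in (d_model x d_ff), W_out (d_ff x d_model).
experts = [(rng.standard_normal((d_model, d_ff)) * 0.1,
            rng.standard_normal((d_ff, d_model)) * 0.1) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.1  # routing projection

def moe_forward(x):
    """Route each token to its top-k experts; only those experts run."""
    logits = x @ router                               # (tokens, n_experts)
    chosen = np.argsort(-logits, axis=-1)[:, :top_k]  # top-k expert indices
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, chosen[t]]
        gates = np.exp(sel - sel.max())               # softmax over selected
        gates /= gates.sum()
        for g, e in zip(gates, chosen[t]):
            w_in, w_out = experts[e]
            out[t] += g * (np.maximum(x[t] @ w_in, 0) @ w_out)  # ReLU FFN
    return out, chosen

tokens = rng.standard_normal((4, d_model))
y, chosen = moe_forward(tokens)
# Each token activates only top_k of n_experts experts, so the parameters
# exercised per token are a small fraction of the layer's total.
```

Scaling this idea up, the active-parameter count (17B) governs per-token compute while the total count (109B) governs memory, which is why MoE models can be cheap to run yet large to store.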
Despite its total parameter count, Scout is designed for efficient deployment. With Int4 quantization, the entire model fits on a single NVIDIA H100 GPU. This makes it accessible to research labs, startups, and individual developers who lack multi-node GPU clusters.
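A back-of-envelope check makes the single-GPU claim plausible. Int4 stores roughly half a byte per parameter, so weight memory alone (ignoring KV cache and activations, which add more) works out as:

```python
# Rough weight-only memory estimate; KV cache and activations are extra.
def weight_gb(total_params_billions, bytes_per_param):
    """Approximate weight footprint in GB for a given precision."""
    return total_params_billions * bytes_per_param  # 1e9 params * B/param / 1e9

scout_int4 = weight_gb(109, 0.5)      # ~54.5 GB -> inside an 80 GB H100
maverick_int4 = weight_gb(400, 0.5)   # ~200 GB -> needs a multi-GPU host
print(scout_int4, maverick_int4)
```

This also previews why Maverick, discussed next, requires a full H100 DGX host rather than a single card.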
Meta reports that Scout outperforms Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across a broad range of benchmarks, including coding, reasoning, long-context tasks, and image understanding. The 10M context window enables use cases that were previously exclusive to API-only services: full-codebase analysis, book-length document processing, and multi-session conversational memory.
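For the full-codebase use case, a quick feasibility check is easy to sketch. Using the common (tokenizer-dependent) heuristic of roughly four characters per token for English text and code:

```python
# Rough feasibility check: does a codebase fit in a 10M-token window?
# Assumes ~4 characters per token, a common heuristic; real counts
# depend on the actual tokenizer.
def fits_in_context(total_chars, context_tokens=10_000_000, chars_per_token=4):
    est_tokens = total_chars // chars_per_token
    return est_tokens <= context_tokens, est_tokens

# e.g. a 20 MB codebase (~20 million characters):
ok, est = fits_in_context(20_000_000)
# ~5M estimated tokens -> comfortably inside a 10M-token window
```

By this estimate, codebases up to roughly 40 MB of source text could fit in a single Scout prompt, which is what makes whole-repository analysis plausible without retrieval.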
Llama 4 Maverick: 128 Experts Competing With GPT-4o
Llama 4 Maverick scales up the MoE approach dramatically, maintaining the same 17 billion active parameters but distributing computation across 128 experts, with 400 billion total parameters. The model runs on a single H100 DGX host or can be distributed across multiple nodes for higher throughput.
Maverick's benchmark performance positions it as a direct competitor to GPT-4o and Gemini 2.0 Flash. Meta claims it beats both across a broad range of widely reported benchmarks while achieving comparable results to DeepSeek V3 on reasoning and coding tasks with less than half the active parameters. On LMArena, an experimental chat-tuned version of Maverick achieves an Elo score of 1417.
The instruct-tuned Maverick model supports a 1-million-token context window, one-tenth of Scout's but still among the longest available. Its strength lies in multimodal understanding: the model uses an early fusion architecture with an improved vision encoder based on MetaCLIP, enabling superior image grounding and visual reasoning.
Training at Scale: 30 Trillion Tokens Across 200 Languages
Both Scout and Maverick were pre-trained on over 30 trillion tokens, double the training data used for Llama 3. The dataset spans 200 languages, with over 100 languages represented by at least 1 billion tokens each, a tenfold increase in multilingual coverage compared to the previous generation.
Training was conducted at FP8 precision, achieving 390 TFLOPs per GPU. Meta's early fusion approach to multimodality, where text and visual information are processed jointly from the earliest layers rather than bolted on as separate modules, distinguishes Llama 4 from many competitors that add multimodal capabilities through adapters or late-fusion techniques.
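The early-fusion idea can be shown in a few lines: image patches and text tokens are projected into the same embedding space and concatenated into one sequence before any transformer layer runs, so every layer attends jointly over both modalities. The dimensions below are toy values, not Llama 4's.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 8

# Early fusion: vision features and text tokens become ONE sequence
# from the first layer, rather than a vision adapter bolted onto a
# finished text model (late fusion).
n_patches, n_text = 6, 5
patch_feats = rng.standard_normal((n_patches, 32))   # vision encoder output
text_ids = rng.integers(0, 100, size=n_text)         # token ids

W_vision = rng.standard_normal((32, d_model)) * 0.1  # patch projection
E_text = rng.standard_normal((100, d_model)) * 0.1   # token embedding table

fused = np.concatenate([patch_feats @ W_vision, E_text[text_ids]], axis=0)
# fused.shape == (n_patches + n_text, d_model): the transformer stack
# sees image and text positions in a single attention window.
```

In an adapter-based design, by contrast, the vision pathway is trained separately and injected into a frozen language model, which is the "bolted on" approach the article contrasts with.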
The models were trained on diverse text, image, and video datasets; they were pre-trained with up to 48 images per input and have been tested successfully with up to eight images in post-training scenarios.
Llama 4 Behemoth: The Teacher Model Still in Training
Meta also disclosed Llama 4 Behemoth, a massive model with 288 billion active parameters across 16 experts and approximately 2 trillion total parameters. Behemoth was still training at the time of the Scout and Maverick release, but Meta reports it already outperforms GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Pro on STEM benchmarks.
Behemoth serves as the teacher model for knowledge distillation into Scout and Maverick. This training approach, where smaller models learn from a much larger teacher, explains why Scout and Maverick punch well above their active parameter count. Meta has not announced a release date for Behemoth, but the upcoming LlamaCon event on April 29 may provide more details.
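The standard form of the distillation objective described here is a KL-divergence loss between temperature-softened teacher and student output distributions (Meta's exact recipe, including its codistillation details, is not fully public, so treat this as the generic technique rather than Meta's implementation):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax over a logit vector."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def distill_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on softened distributions; the student is
    trained to reproduce the teacher's full output distribution, not
    just its argmax label."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = [4.0, 1.0, 0.5]   # e.g., the large teacher's next-token logits
aligned = [3.5, 0.8, 0.4]   # student that mimics the teacher
off     = [0.2, 3.9, 1.0]   # student that disagrees
assert distill_loss(teacher, aligned) < distill_loss(teacher, off)
```

Because the soft targets carry the teacher's full ranking over the vocabulary, a 17B-active student can absorb far more signal per token than hard-label training provides, which is the mechanism behind "punching above the active parameter count."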
Integration Across Meta's Ecosystem
Llama 4 is not just an API or download. Meta is deploying these models across its consumer products, powering Meta AI on WhatsApp, Messenger, Instagram Direct, and the meta.ai website. This integration gives Llama 4 immediate access to Meta's billions of monthly active users, creating a deployment scale that no other open-weight model can match.
For developers, the models are available through Hugging Face with open weights that can be fine-tuned for specific applications. The combination of open availability and massive-scale consumer deployment is a dual strategy: Meta benefits from community improvements while using its own products as the largest testing ground.
Competitive Position
Llama 4 enters a market that has become dramatically more competitive since Llama 3's release. Google launched Gemma 4 under Apache 2.0 just days earlier on April 2, and DeepSeek V3 continues to offer strong performance at extremely low cost. OpenAI's proprietary models remain the benchmark for many enterprise customers.
Llama 4's advantages are architectural. The MoE design provides strong performance with relatively few active parameters, making inference cheaper. The 10M context window on Scout has no equal among open-weight models. And the natively multimodal training means these capabilities are not compromises or afterthoughts but core design features.
The main limitation is that Llama 4 uses a community license rather than Apache 2.0, which imposes some restrictions on commercial use and redistribution that Gemma 4 does not have. For organizations that prioritize licensing flexibility, this remains a consideration.
Conclusion
Meta's Llama 4 Scout and Maverick represent a generational leap for open-weight AI models. The combination of MoE architecture, native multimodality, 10M-token context windows, and 200-language support creates a model family that competes directly with the best proprietary offerings while remaining freely downloadable. For developers, researchers, and organizations building on open models, Llama 4 sets a new standard for what is available outside of API-only services.
Pros
- Industry-leading 10M token context window on Scout enables previously impossible long-context applications
- Open-weight availability allows full fine-tuning, inspection, and deployment without API dependencies
- MoE architecture delivers strong performance with efficient resource utilization on standard GPU hardware
- Native multimodal training produces more coherent text-image understanding than bolt-on approaches
- Massive multilingual coverage (200 languages) serves global developer communities
Cons
- Community license is more restrictive than Apache 2.0, limiting some commercial and redistribution use cases
- Maverick's 400B total parameters require a full H100 DGX host, putting it out of reach for smaller teams
- Behemoth is not yet released, meaning the full Llama 4 family is incomplete at launch
- Initial community reports suggest benchmark results may not fully reflect real-world conversational quality
Key Features
1. Scout: 17B active parameters with 16 experts (109B total); industry-leading 10M-token context window; fits on a single H100 GPU with Int4 quantization
2. Maverick: 17B active parameters with 128 experts (400B total); beats GPT-4o and Gemini 2.0 Flash on benchmarks; 1M-token context window
3. Natively multimodal MoE architecture with early fusion for text, image, and video processing from the earliest layers
4. Pre-trained on 30+ trillion tokens across 200 languages (10x the multilingual coverage of Llama 3)
5. Behemoth teacher model (288B active, ~2T total) outperforms GPT-4.5 and Claude 3.7 Sonnet on STEM benchmarks
Key Insights
- The 10M token context window on Scout is the longest available in any open-weight model, enabling full-codebase analysis and book-length document processing
- MoE architecture with 17B active parameters achieves performance comparable to models with 2-3x the active compute budget
- Native multimodality through early fusion gives Llama 4 structural advantages over competitors using adapter-based approaches
- Training on 30+ trillion tokens across 200 languages makes Llama 4 the most linguistically diverse open model family
- Deployment across WhatsApp, Messenger, and Instagram gives Llama 4 immediate access to billions of users for real-world testing
- The Behemoth teacher model with 2 trillion total parameters demonstrates Meta's investment in knowledge distillation at unprecedented scale
- Maverick achieving GPT-4o-level performance at less than half the active parameters signals a shift toward efficiency-first model design
