Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
## Introduction

MiMo-V2-Flash is Xiaomi's open-source Mixture-of-Experts (MoE) language model featuring 309 billion total parameters with only 15 billion active parameters per forward pass. Released under the XiaomiMiMo organization on GitHub, MiMo-V2-Flash represents a significant milestone in efficient large-scale language modeling, delivering frontier-level reasoning and coding performance while using a fraction of the compute required by dense models of comparable capability. The model ranks as the #1 open-source model on both SWE-bench Verified and SWE-bench Multilingual, making it the current state of the art for open-source software engineering agents.

What makes MiMo-V2-Flash particularly significant in 2026 is its position within Xiaomi's broader MiMo-V2 ecosystem. While MiMo-V2-Pro (the 1T+ parameter flagship) and MiMo-V2-Omni (the multimodal variant) remain proprietary, MiMo-V2-Flash was open-sourced to demonstrate that efficient MoE architectures can rival the performance of models many times their effective size. The model was pre-trained on 27 trillion tokens using FP8 mixed precision and further refined through large-scale agentic reinforcement learning on over 100,000 verifiable GitHub tasks.

## Architecture and Design

MiMo-V2-Flash introduces several architectural innovations that enable its balance of efficiency and performance.

| Component | Purpose | Key Characteristics |
|-----------|---------|---------------------|
| MoE Layer | Expert routing | 309B total params, 15B active per token, sparse activation |
| Hybrid Attention | Context processing | 5:1 ratio of Sliding Window to Global Attention, 128-token windows |
| Multi-Token Prediction | Fast inference | 0.33B params per MTP block, enables 3x generation speedup |
| Attention Sink Bias | Stability | Learnable bias for consistent attention distribution |

The **Hybrid Attention Architecture** is MiMo-V2-Flash's most distinctive design choice.
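As a back-of-the-envelope check on the KV-cache figures in the table above, the sketch below estimates the reduction from the 5:1 interleaving. This is a simplification using only numbers from this article; it assumes cache size is proportional to the number of cached positions per layer, with all other dimensions cancelling in the ratio.

```python
# Rough KV-cache estimate for a 5:1 SWA-to-global interleaving.
# Assumption (not from the model card): cache size scales with the number
# of cached positions per layer; head and hidden dims cancel in the ratio.

WINDOW = 128          # SWA window size (from the table)
SWA_PER_GLOBAL = 5    # 5 sliding-window layers per global layer

def kv_cache_reduction(seq_len: int) -> float:
    """Ratio of full-attention cache to hybrid-attention cache."""
    full = seq_len  # every layer caches all positions
    hybrid = (SWA_PER_GLOBAL * min(seq_len, WINDOW) + seq_len) / (SWA_PER_GLOBAL + 1)
    return full / hybrid

print(f"{kv_cache_reduction(256 * 1024):.2f}x")  # ~6x at a 256K context
```

At long contexts the SWA layers' 128-token caches become negligible, so the ratio approaches the 6x cited in this article.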
Instead of using full global attention across all layers (which scales quadratically with sequence length), MiMo-V2-Flash interleaves Sliding Window Attention (SWA) with Global Attention at a 5:1 ratio. Each SWA layer processes only a 128-token window, while every sixth layer performs full global attention. This reduces KV-cache memory by approximately 6x compared to full attention, enabling the 256K context window to run efficiently on commodity hardware.

The **Multi-Token Prediction (MTP)** module adds a lightweight prediction head (0.33B parameters per block) that predicts multiple future tokens simultaneously. During inference, this enables speculative decoding with 3x faster generation. The MTP module is trained alongside the main model, learning to predict token sequences rather than individual tokens.

**Multi-Teacher On-Policy Distillation (MOPD)** was used during post-training, where multiple expert teacher models provided dense token-level supervision. This approach transfers reasoning capabilities from larger, more capable models into MiMo-V2-Flash's efficient architecture without sacrificing quality.

## Key Features

**#1 Open-Source on SWE-bench**: MiMo-V2-Flash achieves the highest scores among all open-source models on SWE-bench Verified and SWE-bench Multilingual, the industry-standard benchmarks for evaluating AI software engineering capabilities. This makes it the current best open-source choice for coding agents.

**6x KV-Cache Reduction**: The hybrid attention architecture with its 5:1 SWA-to-Global ratio reduces KV-cache memory by approximately 6x compared to full-attention models. This enables serving the model with significantly less GPU memory, making 256K context windows practical on hardware that would typically support only 40K contexts.

**3x Faster Generation**: The Multi-Token Prediction module enables speculative decoding that generates tokens 3x faster than standard autoregressive decoding.
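The draft-and-verify loop behind that speedup can be sketched as a toy simulation. This is not MiMo's actual MTP code: `target_next`, `draft_next`, and the acceptance logic are illustrative stand-ins for the full model, the cheap MTP head, and speculative verification.

```python
import random

# Toy sketch of MTP-style speculative decoding: a cheap head drafts k tokens
# ahead, the full model checks them in one parallel pass, and the longest
# agreeing prefix (plus one corrected token) is accepted.

random.seed(0)

def target_next(ctx):
    # Stand-in for the full model's next-token choice (deterministic toy rule).
    return (sum(ctx) * 31 + 7) % 100

def draft_next(ctx):
    # Stand-in for the cheap MTP head: agrees with the target ~80% of the time.
    t = target_next(ctx)
    return t if random.random() < 0.8 else (t + 1) % 100

def speculative_step(ctx, k=4):
    """Draft k tokens, verify, return the accepted tokens (always >= 1)."""
    draft, c = [], list(ctx)
    for _ in range(k):
        tok = draft_next(c)
        draft.append(tok)
        c.append(tok)
    accepted, c = [], list(ctx)
    for tok in draft:
        true_tok = target_next(c)   # in a real engine these checks run as one batch
        if tok != true_tok:
            accepted.append(true_tok)  # take the corrected token and stop
            break
        accepted.append(tok)
        c.append(tok)
    return accepted

ctx = [1, 2, 3]
for _ in range(5):
    step = speculative_step(ctx)
    ctx += step
    print(f"accepted {len(step)} token(s)")
```

Because the target model corrects the first mismatch, every step advances by at least one token, and steps where the draft agrees advance by several; that is the source of the speedup.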
At 150 tokens per second on optimized infrastructure, MiMo-V2-Flash is among the fastest open-source models at its capability level.

**Large-Scale Agentic RL**: The model was post-trained using reinforcement learning on over 100,000 verifiable GitHub tasks with automated verification. The Rollout Routing Replay (R3) system and Request-Level Prefix Cache infrastructure enabled efficient RL training at this unprecedented scale.

**FP8 Mixed Precision**: Pre-training on 27 trillion tokens used FP8 mixed precision throughout, demonstrating that reduced-precision training can produce frontier-quality models when combined with careful numerical engineering.

## Code Example

Using MiMo-V2-Flash with SGLang (the recommended inference engine):

```bash
pip install "sglang[all]"
```

```python
import sglang as sgl

# Launch the model with speculative decoding
runtime = sgl.Runtime(
    model_path="XiaomiMiMo/MiMo-V2-Flash",
    speculative_algorithm="EAGLE",
    tp_size=4,  # 4-GPU tensor parallelism
)
sgl.set_default_backend(runtime)

@sgl.function
def code_agent(s, task):
    s += sgl.system("You are an expert software engineer.")
    s += sgl.user(task)
    s += sgl.assistant(sgl.gen("response", max_tokens=4096))

state = code_agent.run(task="Fix the race condition in this Python async handler...")
print(state["response"])
```

Using the Hugging Face Transformers library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "XiaomiMiMo/MiMo-V2-Flash",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("XiaomiMiMo/MiMo-V2-Flash")

inputs = tokenizer("Explain the MoE routing mechanism", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Limitations

Despite its impressive efficiency, MiMo-V2-Flash requires significant hardware for deployment: the 309B total parameters necessitate multi-GPU setups even with MoE
sparsity. The model's license terms should be carefully reviewed, as Xiaomi's open-source licensing for the MiMo series may include usage restrictions not present in standard Apache or MIT licenses. The hybrid attention architecture, while memory-efficient, introduces complexity in the attention pattern that may not be fully supported by all inference frameworks; SGLang is currently the recommended and best-optimized option. The 15B active parameters mean that while inference is efficient, the model still requires loading all 309B parameters into memory across GPUs. Finally, while the model excels at coding and reasoning benchmarks, its performance on creative writing and open-ended conversation may not match dense models specifically optimized for those use cases.

## Who Should Use This

MiMo-V2-Flash is ideal for teams building AI coding agents who need the best available open-source model for software engineering tasks. Organizations that require long-context processing (up to 256K tokens) but face GPU memory constraints will benefit from the 6x KV-cache reduction. Researchers studying MoE architectures, hybrid attention mechanisms, or multi-token prediction will find the model and its technical report invaluable reference implementations. Companies seeking an alternative to proprietary models like GPT-5 or Claude Opus for code generation and reasoning tasks should evaluate MiMo-V2-Flash as a self-hosted option. Teams already using SGLang for inference will find the integration particularly seamless, with speculative decoding support delivering the full 3x speed advantage.
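For teams weighing the self-hosted option, a back-of-the-envelope memory estimate helps size the cluster. The arithmetic below uses only figures from this article plus an assumed FP8 (1 byte per parameter) weight format; the 8-GPU split is a hypothetical example, and real deployments need additional headroom for KV cache, activations, and framework overhead.

```python
# Rough weight-memory estimate for self-hosting MiMo-V2-Flash.
# Assumptions: FP8 weights (1 byte/param); an 8-GPU split is illustrative.

TOTAL_PARAMS = 309e9   # all experts must be resident in memory,
ACTIVE_PARAMS = 15e9   # even though only ~15B are active per forward pass

weights_gb = TOTAL_PARAMS * 1 / 1e9   # 1 byte per parameter at FP8
per_gpu_gb = weights_gb / 8           # hypothetical 8-GPU tensor/expert split

print(f"weights: ~{weights_gb:.0f} GB total, ~{per_gpu_gb:.0f} GB per GPU on 8 GPUs")
print(f"active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

This illustrates the point made in the Limitations section: sparsity cuts per-token compute to under 5% of the parameters, but it does not reduce the memory needed to hold all 309B of them.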