Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
## Introduction

NEO is a series of native vision-language models built from first principles by the EvolvingLMMs Lab. Accepted at ICLR 2026, NEO challenges the dominant modular VLM paradigm by unifying pixel-word encoding, alignment, and reasoning within a single dense, monolithic architecture. Rather than bolting a vision encoder onto a language model, NEO processes visual and linguistic information through unified computational pathways, achieving competitive performance with substantially less training data.

The project addresses two fundamental questions in the VLM space: what constraints distinguish native VLMs from modular ones, and how native VLMs can be made more accessible to the research community. With model weights, evaluation code, and training frameworks all released under the Apache 2.0 license, NEO represents a significant step toward democratizing native multimodal AI research.

## Architecture and Design

NEO's most significant architectural innovation is its abandonment of the encoder-decoder paradigm that dominates contemporary VLMs. In conventional systems such as LLaVA or InternVL, a separate vision encoder (typically a ViT variant) processes images into feature vectors, which are then projected into the language model's embedding space through an adapter layer. This modular approach introduces architectural complexity, alignment challenges, and an information bottleneck at the adapter boundary. NEO instead implements a native VLM primitive in which visual and textual tokens are processed through the same transformer backbone from the start.
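The contrast with modular systems can be made concrete. Below is a minimal sketch (hypothetical, not NEO's actual implementation; dimensions and patch size are assumptions) of the native idea: patch tokens and word tokens are embedded into one shared space and enter a single transformer stack as one sequence, with no separate vision encoder or projection adapter in between.

```python
# Hypothetical sketch (not NEO's code): in a native VLM, patch and word
# tokens share one embedding space and one transformer stack from layer 0.
import numpy as np

rng = np.random.default_rng(0)
D_MODEL = 64  # assumed toy hidden size

def embed_patches(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Project image patches directly into the shared token space."""
    h, w, _ = image.shape
    n_tokens = (h // patch) * (w // patch)
    return rng.standard_normal((n_tokens, D_MODEL))

def embed_text(token_ids: list[int]) -> np.ndarray:
    """Embed word tokens into the same space as the patch tokens."""
    return rng.standard_normal((len(token_ids), D_MODEL))

image = np.zeros((224, 224, 3))
prompt_ids = [101, 2023, 2003, 102]

# One unified sequence: every layer sees pixels and words together, so
# there is no adapter boundary where information can be lost.
tokens = np.concatenate([embed_patches(image), embed_text(prompt_ids)], axis=0)
print(tokens.shape)  # (200, 64): 196 patch tokens + 4 text tokens
```

In a modular VLM, the first half of this sequence would instead be produced by a frozen or separately trained ViT and squeezed through a projection layer; here both halves are first-class tokens from the start.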
Key architectural decisions include:

| Component | Design Choice | Advantage |
|-----------|---------------|-----------|
| Input Processing | Any-resolution support | No tile-based workarounds needed |
| Position Encoding | Native RoPE variant | Optimized for integrated visual-linguistic tokens |
| Base LLM | Qwen3 series | Strong language foundation (1.7B and 8B) |
| Training Data | ~390M image-text pairs | ~10x less than comparable modular models |
| Architecture | Dense monolithic transformer | Eliminates the encoder-adapter bottleneck |

This unified design eliminates the information loss that occurs at module boundaries in traditional VLMs, enabling more natural visual-linguistic reasoning.

## Model Variants and Performance

NEO releases models at two scales, each with three training checkpoints:

| Model | Base LLM | Parameters | MMMU | MMB | DocVQA | OCRBench |
|-------|----------|------------|------|-----|--------|----------|
| NEO-2B-SFT | Qwen3-1.7B | ~2B | 48.6 | 76.0 | 89.9 | 79.2 |
| NEO-8B-SFT | Qwen3-8B | ~9B | 54.6 | 82.1 | 88.6 | 82.4 |

Training follows a three-stage methodology:

1. **Pre-Training (PT)**: Foundation building with 345M image-text pairs, establishing basic visual-linguistic associations across the unified architecture.
2. **Mid-Training (MT)**: Alignment refinement using 40M curated examples, sharpening the model's ability to connect visual concepts with language understanding.
3. **Supervised Fine-Tuning (SFT)**: Task-specific optimization with 4M high-quality instruction-following examples, enabling the model to follow complex visual reasoning instructions.

Across 17 evaluation benchmarks, NEO demonstrates remarkable data efficiency. The NEO-2B variant achieves performance comparable to models trained on orders of magnitude more data: InternVL3 uses over 6 billion pre-training examples, while NEO achieves competitive results with just 345 million.
## Key Capabilities

**Document Understanding**: NEO excels at document analysis tasks, scoring 89.9 on DocVQA with the 2B model. Any-resolution input support means documents are processed at their native resolution rather than being resized or tiled, preserving fine-grained text and layout information.

**Visual Reasoning**: On academic reasoning benchmarks such as MMMU, NEO-8B achieves 54.6, demonstrating strong visual comprehension and multi-step reasoning across diverse visual domains including charts, diagrams, and scientific figures.

**Hallucination Resistance**: NEO shows strong performance on hallucination benchmarks (POPE, HallusionBench), suggesting that the native architecture's unified processing reduces the tendency to generate text that contradicts visual evidence, a common problem in modular VLMs where the adapter can introduce misalignment.

**Text Recognition**: With 82.4 on OCRBench for the 8B model, NEO demonstrates strong OCR capabilities without requiring a specialized text recognition module, further validating the native architecture's ability to handle diverse visual tasks through unified processing.

## Code and Usage

The project releases evaluation and training frameworks. The snippet below is illustrative; check the repository for the exact VLMEvalKit class and method names:

```python
# Using VLMEvalKit for evaluation (illustrative; exact API may differ)
from vlmeval import NEOModel

model = NEOModel.from_pretrained("Paranioar/NEO-8B-SFT")
result = model.evaluate(benchmark="mmmu")
```

Model weights are hosted on Hugging Face under the Paranioar organization, with variants for each training stage (PT, MT, SFT) at both 2B and 9B scales.

## Limitations

NEO's native architecture, while elegant, comes with trade-offs. The monolithic design means the model cannot easily leverage improvements in standalone vision encoders, as modular systems can by simply swapping in a better ViT. Training from scratch requires significant compute, and the current models are limited to 2B and 9B scales, while competing modular VLMs are available at 70B+ parameters.
Any-resolution input support, while beneficial for accuracy, increases computational cost for high-resolution images compared to fixed-resolution approaches. The research community around native VLMs is also smaller than the modular VLM ecosystem, meaning fewer pre-built tools and integration options are available.

## Who Should Use This

- **Researchers** exploring native multimodal architectures who want a well-documented, open-source foundation to build upon.
- **Teams** working on document understanding and OCR applications, who will benefit from the strong DocVQA and OCRBench performance.
- **Developers** seeking data-efficient VLM training, who will find NEO's three-stage methodology and modest data requirements valuable for custom model training.
- **Organizations** deploying compact VLMs (2B parameters) in edge or resource-constrained environments, for which NEO-2B is a capable yet lightweight option.
- **Contributors** to the ICLR 2026 discourse on native vs. modular VLM architectures, for whom NEO's codebase and results are essential reference material.