Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
Qwen3-ASR is an open-source series of automatic speech recognition (ASR) models developed by the Qwen team at Alibaba Cloud and released in early 2026. The series includes two primary models, Qwen3-ASR-1.7B and Qwen3-ASR-0.6B. Both build on the audio understanding capabilities of the Qwen3-Omni foundation model and are trained on large-scale speech data covering 52 languages and dialects: 30 languages plus 22 Chinese dialects. That coverage makes Qwen3-ASR one of the most comprehensive open-source options for Mandarin and regional Chinese transcription.

A single unified model serves both streaming and offline inference, so there is no need to maintain separate pipelines for real-time and batch use cases. This makes Qwen3-ASR practical for production deployments where latency requirements vary. Long-form audio is handled natively, supporting continuous audio streams without chunking workarounds. Beyond standard speech, Qwen3-ASR handles singing voice and songs with background music (BGM), a challenging domain for most ASR models. The series also includes a dedicated forced alignment model that produces word-level or character-level timestamps for 11 languages, which is critical for subtitle synchronization and content indexing workflows.

Multiple inference backends are supported, including HuggingFace Transformers and vLLM for high-throughput serving. A Gradio-based web UI provides a demo interface, and the repository includes Docker deployment and fine-tuning guides. The project reached 2,100 GitHub stars and 207 forks within weeks of release, indicating significant community adoption. The 1.7B-parameter model achieves state-of-the-art performance among open-source ASR models while remaining competitive with commercial APIs, offering a balance of accuracy, multilingual coverage, and deployment flexibility.
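To illustrate why word-level timestamps matter for subtitle synchronization, here is a minimal Python sketch that groups aligned words into SRT subtitle cues. It does not call Qwen3-ASR or any of its APIs. The `Word` record, the `words_to_srt` helper, and the `max_words` and `max_gap` thresholds are all hypothetical names chosen for illustration, and the input format (text plus start and end times in seconds) is an assumption about what a forced aligner might return, not the model's documented output.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One aligned token (hypothetical format, not Qwen3-ASR's actual output)."""
    text: str
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio


def srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words, max_words=7, max_gap=0.6, joiner=" "):
    """Group aligned words into cues and render them as SRT text.

    A new cue starts when the current one already holds max_words words,
    or when the pause before the next word exceeds max_gap seconds.
    """
    cues, current = [], []
    for w in words:
        too_long = len(current) >= max_words
        long_pause = bool(current) and (w.start - current[-1].end > max_gap)
        if current and (too_long or long_pause):
            cues.append(current)
            current = []
        current.append(w)
    if current:
        cues.append(current)

    blocks = []
    for i, cue in enumerate(cues, start=1):
        text = joiner.join(w.text for w in cue)
        span = f"{srt_time(cue[0].start)} --> {srt_time(cue[-1].end)}"
        blocks.append(f"{i}\n{span}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [
        Word("hello", 0.0, 0.4),
        Word("world", 0.5, 0.9),
        Word("again", 2.0, 2.5),  # long pause before this word starts a new cue
    ]
    print(words_to_srt(demo))
```

For character-level alignment, as is typical for Chinese, passing `joiner=""` avoids inserting spaces between characters. The thresholds are arbitrary starting points and would likely need tuning per language and content type.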