NVIDIA Canary-Qwen-2.5B - Open Source | Evermx

NVIDIA Canary-Qwen-2.5B

nvidia | CC-BY-4.0

STT | 437 Stars | 55.1K Downloads | 66 views

NVIDIA Canary-Qwen-2.5B is an open speech-recognition model that currently tops the Hugging Face Open ASR Leaderboard with a mean word error rate (WER) of 5.63%. What makes it notable is not just the accuracy number but the architecture behind it: Canary-Qwen is a Speech-Augmented Language Model (SALM) that fuses a dedicated speech encoder with a general-purpose large language model, letting a single model both transcribe audio and reason about the resulting text.

## A New Architecture for ASR

Traditional automatic speech recognition systems are built solely to map audio to text. Canary-Qwen takes a different path. It pairs a FastConformer speech encoder with an unmodified Qwen3-1.7B LLM decoder, connecting the two through a linear projection and LoRA adaptation. Because the language model component is left intact, the same checkpoint can operate in two distinct modes, an approach that blurs the line between a speech recognizer and a language model.

### ASR Mode

In ASR mode, the model performs straightforward transcription, producing text with proper punctuation and capitalization. This is the mode measured on the Open ASR Leaderboard, where it reaches 5.63% mean WER and as low as 1.61% WER on LibriSpeech Clean.

### LLM Mode

Because the decoder is a real Qwen language model, Canary-Qwen can also operate in LLM mode, reasoning over the transcript it just produced. That enables downstream tasks such as summarization and question answering directly on spoken content, without bolting on a separate model. A single checkpoint can move from raw audio to a structured, useful answer.

## Technical Specifications

Canary-Qwen-2.5B carries roughly 2.5 billion parameters and was trained on 234,500 hours of public English speech data. It is an English-only model, a deliberate scope choice that concentrates its capacity on a single language and helps explain its leaderboard-topping accuracy. Inference runs at 418 RTFx (an inverse real-time factor: seconds of audio processed per second of compute), meaning it transcribes far faster than real time on appropriate NVIDIA hardware.

## Accuracy and Robustness

Beyond the headline WER figures, the model is evaluated for robustness to noise across varying signal-to-noise ratios, an important property for real-world audio that is rarely studio-clean. NVIDIA also reports fairness evaluation across gender and age using the Casual Conversations dataset, reflecting growing attention to bias in speech systems. The combination of strong clean-speech accuracy and noise resilience makes it a credible choice for production transcription workloads.

## Deployment and Tooling

Canary-Qwen is built for the NVIDIA NeMo toolkit, which handles both inference and any further training or fine-tuning. NeMo integration means the model slots into an established ecosystem of speech tooling, data pipelines, and optimized inference paths on NVIDIA GPUs. The model is distributed through Hugging Face, where it has more than 55,000 downloads, and it is positioned as ready for commercial deployment.

## Why It Matters

For years, the open ASR landscape was defined largely by Whisper and its ecosystem. Canary-Qwen represents a newer generation that not only edges ahead on accuracy benchmarks but also rethinks what a speech model can do by embedding a genuine language model in the loop. The SALM design points toward a future where transcription and comprehension are not separate stages but a single, unified capability.

## Pros and Cons

The strengths are clear: leaderboard-leading accuracy, dual ASR and LLM operation, strong noise robustness, very high inference throughput, and a path to commercial use.

The trade-offs are equally clear. The model is English-only, which rules it out for multilingual deployments. The CC-BY-4.0 license is permissive but carries attribution requirements that differ from those of Apache or MIT. It depends on the NVIDIA NeMo toolkit and NVIDIA GPUs, so it is less portable than framework-agnostic alternatives. And at 2.5B parameters it is heavier than compact ASR models designed for edge devices.

## Who Should Use Canary-Qwen

Canary-Qwen-2.5B is a strong fit for teams building English-language transcription products that demand top-tier accuracy, for applications that want to summarize or answer questions about spoken content in one model, and for organizations already invested in the NVIDIA NeMo and GPU ecosystem. Developers who need multilingual coverage or lightweight on-device inference should look elsewhere, but for high-accuracy English ASR with built-in language understanding, it sets a new bar.
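For readers who want to see what the WER figures quoted in this article measure, here is a minimal sketch of word error rate: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a model's output, divided by the number of reference words. The function name and sample sentences are illustrative assumptions, not part of NeMo or the leaderboard's tooling, and benchmark pipelines typically normalize casing and punctuation before scoring, which this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of six reference words is dropped.
print(f"{word_error_rate('the cat sat on the mat', 'the cat sat on mat'):.2%}")  # 16.67%
```

A mean WER of 5.63% therefore corresponds to roughly 5 to 6 word-level errors per 100 reference words, averaged over the benchmark's test sets.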

Key Features

Speech-Augmented Language Model (SALM) fusing a FastConformer encoder with a Qwen LLM
Tops the Hugging Face Open ASR Leaderboard at 5.63% mean WER
Dual operation: ASR transcription mode and LLM reasoning mode
LLM mode enables summarization and Q&A directly on transcripts
Trained on 234.5k hours of public English speech data
High throughput at 418 RTFx for far-faster-than-real-time transcription
Noise-robust across varying SNR with gender/age fairness evaluation
NVIDIA NeMo toolkit integration, ready for commercial deployment
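
To put the 418 RTFx throughput figure in concrete terms, the sketch below assumes the common definition of RTFx as audio duration divided by processing time. The helper names are hypothetical, and the quoted figure depends on the hardware and batch settings used for the benchmark, so treat the result as a rough estimate rather than a guarantee.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def estimated_processing_seconds(audio_seconds: float, rtfx_value: float) -> float:
    """Estimated compute time for a recording at a given RTFx."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_seconds / rtfx_value


# One hour of audio at the reported 418 RTFx.
one_hour = 3600.0
print(round(estimated_processing_seconds(one_hour, 418.0), 2))  # 8.61
```

Under that assumption, an hour-long recording would take on the order of nine seconds to transcribe on comparable hardware.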