Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
SenseVoice is the MIT-licensed multilingual speech understanding foundation model from Alibaba's FunAudioLLM team. With 8,500+ GitHub stars and the v1.0.0 release on May 25, 2026, it has emerged as the highest-throughput open-source ASR option for teams that need more than transcription. The non-autoregressive architecture processes 10 seconds of audio in around 70 milliseconds, which is roughly 15 times faster than Whisper-Large while delivering competitive accuracy across 50+ languages.

## What SenseVoice Actually Does

The project bundles four speech understanding capabilities into a single end-to-end model: automatic speech recognition (ASR), speech emotion recognition (SER), audio event detection (AED), and spoken language identification (LID). A single forward pass returns a transcript, an emotion label, audio event tags such as laughter or music, and the detected language, which collapses what would normally be a multi-model pipeline into one inference call. The model was trained on more than 400,000 hours of multilingual audio, which is the largest public training set among open ASR projects of this class.

## Non-Autoregressive Architecture

The headline differentiator is that SenseVoice does not decode token by token. The non-autoregressive design predicts the full transcript in parallel, which is what produces the 15x speedup over Whisper-Large and the 70ms-per-10s throughput. For real-time products such as live captioning, voice agents, and meeting transcription this changes the deployment math: a single GPU can serve many more concurrent streams than a Whisper-based stack at comparable accuracy.

## 50+ Language Coverage with Rich Asian Language Support

SenseVoice covers more than 50 languages with particularly strong results on Chinese (Mandarin and Cantonese), English, Japanese, and Korean, where it matches or beats Whisper-Large on standard benchmarks.
For teams building products for Asian markets, this is a meaningful advantage over Whisper, whose accuracy on Cantonese and Japanese has historically lagged English. Spoken language identification is built in, so multilingual deployments do not need a separate language detector in front of the model.

## Speaker Diarization and Timestamps

The May 2026 update added speaker diarization, which lets users separate who said what in multi-speaker recordings without bolting on a second model. CTC timestamp alignment, added in late 2024, produces word-level timing for caption rendering and downstream search. Together, these two features close the gap with commercial ASR APIs for podcast and meeting use cases.

## SenseVoice-Small for Production, Large for Reference

The project ships two model sizes. SenseVoice-Small is the production target: it is the model behind the 15x-faster-than-Whisper-Large throughput figure and the size most teams actually deploy. SenseVoice-Large is positioned as a reference checkpoint for benchmarking and research. ONNX and libtorch exports are supported, so the small model deploys on CPU-only edge servers and embedded devices in addition to GPU inference.

## Limitations

The non-autoregressive design that powers the speed advantage also makes streaming partial transcripts harder to surface than with autoregressive models, so live captioning use cases need a chunked-inference wrapper rather than a true streaming decoder. Code-switching across more than two languages within a single utterance is still weaker than language-pure transcription, which matters for multilingual call centers. The code license inherits from FunASR (MIT), but the pretrained model weights are released under the Model License Agreement on ModelScope, so commercial users should review the weight license terms separately from the code license.
Finally, accuracy on low-resource languages outside the 50+ headline list drops noticeably, so projects targeting smaller language markets should benchmark before committing.

Within those caveats, SenseVoice is the strongest open-source choice in 2026 for teams that need fast, multilingual ASR with built-in emotion and event detection, particularly for Asian-language-heavy workloads where Whisper has historically been weakest.
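
The chunked-inference wrapper mentioned under Limitations could look roughly like the sketch below. It is not part of SenseVoice or FunASR: the function names `iter_chunks` and `transcribe_chunked`, the 10-second chunk length, the 1-second overlap, and the `transcribe` callable (a stand-in for whatever model call a deployment uses) are all assumptions made for illustration. The idea is simply to cut the incoming audio into overlapping fixed-length windows and run one non-autoregressive pass per window, so partial results can be surfaced as each window completes.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def iter_chunks(
    samples: Sequence[float],
    sample_rate: int,
    chunk_seconds: float = 10.0,   # assumed window size, not a SenseVoice setting
    overlap_seconds: float = 1.0,  # assumed overlap between consecutive windows
) -> Iterator[Tuple[float, Sequence[float]]]:
    """Yield (start_time_in_seconds, chunk) pairs covering the whole signal."""
    chunk_len = int(chunk_seconds * sample_rate)
    hop = chunk_len - int(overlap_seconds * sample_rate)
    if chunk_len <= 0 or hop <= 0:
        raise ValueError("chunk_seconds must be positive and larger than overlap_seconds")

    start = 0
    while start < len(samples):
        end = min(start + chunk_len, len(samples))
        yield start / sample_rate, samples[start:end]
        if end == len(samples):
            break  # the final (possibly shorter) window has been emitted
        start += hop


def transcribe_chunked(
    samples: Sequence[float],
    sample_rate: int,
    transcribe: Callable[[Sequence[float]], str],  # hypothetical model call
    chunk_seconds: float = 10.0,
    overlap_seconds: float = 1.0,
) -> List[Tuple[float, str]]:
    """Run one inference call per window and return (start_time, text) pairs."""
    return [
        (start_time, transcribe(chunk))
        for start_time, chunk in iter_chunks(
            samples, sample_rate, chunk_seconds, overlap_seconds
        )
    ]
```

Because consecutive windows overlap, words near a boundary can appear in two partial transcripts. This sketch does not merge them; a real wrapper would need some de-duplication step, for example one based on word-level timestamps.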