Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
OmniVoice is an Apache-2.0 open-source text-to-speech system from the k2-fsa team that claims the broadest language coverage of any zero-shot TTS model to date, supporting 600+ languages with voice cloning and voice design. Released on 2026-03-31 and now at 6,410 GitHub stars with 946 forks just under two months later, OmniVoice has become the standard reference for multilingual open TTS thanks to a novel diffusion-language-model-style architecture that runs at a real-time factor of 0.025, roughly 40x faster than real-time on a single GPU.

## Why 600+ Languages Matters

Most open TTS systems support a handful of high-resource languages, usually English, Mandarin, and a small set of European languages. Models that claim broader coverage tend to do so via XPhone-style universal phonemizers that produce muffled, accented output for under-represented languages. OmniVoice instead trains directly on a massively multilingual speech corpus aligned through the icefall toolkit (the same lineage as Sherpa-ONNX and k2), reaching languages with as little as a few hours of public-domain audio. The result is intelligible speech across long-tail languages such as Swahili, Tagalog, Quechua, and Welsh, where the open ecosystem previously offered nothing usable.

## Diffusion Language Model Architecture

The project describes its backbone as a "diffusion language model-style architecture," combining the autoregressive token-prediction structure of an LLM-based TTS with a diffusion objective over continuous acoustic latents. The team frames it as "clean, streamlined, and scalable," trading the discrete-codec complexity of VALL-E-style stacks for a single end-to-end network. The same backbone handles voice cloning from 3 to 10 seconds of reference audio and voice design from speaker attributes (gender, age, pitch, accent, whisper style), with non-verbal symbols and explicit pronunciation correction available for fine-grained control.

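Because cloning is described as working from 3 to 10 seconds of reference audio, it can help to check a clip's length before passing it to the model. The following is a minimal sketch using only Python's standard `wave` module. The helper names `wav_duration_seconds` and `check_reference_clip` and the two constants are hypothetical and are not part of OmniVoice's API. It assumes an uncompressed PCM WAV file.

```python
import wave

# Bounds taken from the 3 to 10 second guidance in the text above.
# These constants are illustrative, not OmniVoice settings.
MIN_REF_SECONDS = 3.0
MAX_REF_SECONDS = 10.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
    return frames / float(rate)


def check_reference_clip(path: str) -> bool:
    """Hypothetical helper: True if the clip length is in the suggested range."""
    duration = wav_duration_seconds(path)
    return MIN_REF_SECONDS <= duration <= MAX_REF_SECONDS
```

A length check like this only covers the "very short" failure case. Noisy or non-speech clips would need a separate check.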
## 40x Faster Than Real-Time

The RTF of 0.025 means OmniVoice generates one second of audio in roughly 25 milliseconds, which makes it suitable for interactive applications even without batching. Hardware acceleration extends well beyond CUDA: NVIDIA GPUs, Apple Silicon via the MPS backend, and Intel Arc GPUs via the XPU backend are all first-class targets. That breadth matches the k2-fsa team's long-standing focus on edge and CPU-friendly speech, and it positions OmniVoice as one of the few open TTS stacks that runs comfortably on M-series Macs without quality compromises.

## Voice Cloning and Voice Design

Voice cloning expects a 3-10 second reference clip and operates zero-shot, with no per-speaker fine-tuning required. The documentation is candid about the trade-offs: cross-lingual cloning carries over an accent from the reference audio's source language, and voice design reliability degrades on lower-resource languages where the speaker-attribute space is less well covered. For high-resource languages, the design controls produce surprisingly distinct output styles, which is useful for narration, character voices, and accessibility tools that need configurable rather than identity-bound voices.

## Install and Integration

OmniVoice is implemented entirely in Python and distributed via PyPI (`pip install omnivoice`) and GitHub, with pretrained weights on Hugging Face. The latest release (0.1.5, April 2026) ships with a CLI, a Python API, and Gradio examples. Integration with the broader k2 / icefall / sherpa ecosystem means it composes naturally with k2-fsa's existing ASR models for two-way speech pipelines, and it can be deployed alongside Sherpa-ONNX for low-latency on-device serving.

## Limitations

The license is liberal (Apache 2.0) but the project warns that voice cloning carries the usual deepfake risks and does not ship watermarking. Quality on the very longest tail of the 600+ language list is intelligible rather than natural, and reference-audio length sensitivity means out-of-distribution clips (very short, noisy, or non-speech) can degrade cloning. The model is also not yet optimized for very long-form (multi-minute) single-pass synthesis, which is VibeVoice's territory, and OmniVoice is best used in the 5-60 second-per-clip regime that most TTS applications actually need.
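
To make the real-time factor concrete for the 5-60 second clip range, the sketch below applies the standard definition, RTF = synthesis time / audio duration, to the quoted value of 0.025. The function names are illustrative. The results are rough estimates that assume the quoted RTF holds on the target hardware and scales linearly with clip length, which the source does not guarantee.

```python
def synthesis_seconds(audio_seconds: float, rtf: float = 0.025) -> float:
    """Estimated compute time, from RTF = synthesis time / audio duration."""
    return audio_seconds * rtf


def speedup_over_real_time(rtf: float = 0.025) -> float:
    """How many seconds of audio are produced per second of compute."""
    return 1.0 / rtf


if __name__ == "__main__":
    # Clip lengths spanning the regime discussed in the text above.
    for clip in (1, 5, 60):
        millis = synthesis_seconds(clip) * 1000
        print(f"{clip:>3} s of audio -> ~{millis:.0f} ms of compute")
    print(f"speedup: ~{speedup_over_real_time():.0f}x real-time")
```

Under these assumptions, a 60-second clip would take about 1.5 seconds of compute. This is consistent with the "40x faster than real-time" figure, since 1 / 0.025 = 40.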