VibeVoice

Microsoft · MIT

TTS · 47.4K Stars · 5.3K Forks · 126 views

VibeVoice is Microsoft's open-source frontier voice AI family, packaging a 7B-parameter ASR model, a 1.5B-parameter long-form TTS model, and a 0.5B streaming TTS model under a single MIT-licensed repository. Released to coincide with an ICLR 2026 Oral, the project has already passed 47,000 GitHub stars and 5,200 forks, putting it well ahead of every other text-to-speech repository on GitHub Trending this month. The headline capability is generating up to 90 minutes of multi-speaker synthesized speech in one pass, with up to 4 distinct speakers, a regime that no prior open-weights model has approached.

## 7.5 Hz Continuous Tokenizers

The technical core of VibeVoice is a pair of continuous speech tokenizers, one acoustic and one semantic, that run at an ultra-low 7.5 Hz frame rate. Conventional discrete-codec TTS pipelines (EnCodec, SoundStream, DAC) operate at 50 to 100 tokens per second, which forces long-form generation through a prohibitively long context window. By compressing speech to 7.5 Hz of continuous latents and pairing the LLM backbone with a small diffusion head that adds the fine-grained acoustic detail, VibeVoice can keep an entire 90-minute conversation in context without quality collapse. This is what unlocks the long-form podcast and audiobook use cases that competing systems handle only by stitching short segments together.

## Three Models, One Stack

VibeVoice-TTS is the 1.5B production-quality long-form model, covering English, Chinese, and a multilingual head with 11 English style voices and 9 additional language variants. VibeVoice-Streaming is a 0.5B sibling tuned for roughly 300 ms first-audio latency and 10-minute generation budgets, aimed at agent and call-center deployments. VibeVoice-ASR is the 7B recognition counterpart that closes the loop: it transcribes up to 60 minutes of audio in a single forward pass with speaker identification, word-level timestamps, and customizable hotword recognition. Sharing the same tokenizer and a unified codebase makes round-trip speech-LLM-speech pipelines noticeably simpler than gluing Whisper and a separate TTS model together.

## Long-Form Multi-Speaker Quality

The demo set on Microsoft's project page emphasizes podcast-style content with two to four voices, dialogue tags, natural turn-taking, and consistent speaker identity across an hour-plus of audio. Internal evaluations reported in the accompanying paper show VibeVoice-TTS outperforming VALL-E 2, NaturalSpeech 3, and the previous SOTA open models on long-form MOS and speaker similarity metrics, while keeping generation within real time on a single H100. For developers, the practical takeaway is that multi-speaker conditioning is built into the model rather than tacked on through prompt engineering.

## Open Weights, Cautious Release

All three models ship with weights on Hugging Face under the MIT license, with reference inference code, example notebooks, and a Gradio demo in the repository. Microsoft has paired the release with an unusually direct responsible-use notice: the model card explicitly warns against commercial deployment, flags risks around deepfake generation, bias inheritance from base LLMs, and unpredictable outputs, and asks users to deploy in a lawful manner. There is no built-in watermarking, so downstream adopters are expected to add provenance and consent mechanisms themselves.

## Positioning Against Competitors

VibeVoice slots in above Fish Speech and CosyVoice on long-form quality and above Dia on speaker count and duration, while remaining lighter to run than commercial systems like ElevenLabs Studio. Its main weak spots are language coverage (narrower than OmniVoice's 600+ languages), the lack of a fine-tuning recipe in the initial release, and the explicit non-commercial guidance, which complicates production adoption. For teams building podcasts, audiobooks, accessibility tooling, or research on long-form speech, however, it is now the obvious open baseline.
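
The frame-rate argument in the tokenizer section can be checked with simple arithmetic. The sketch below is not part of the VibeVoice codebase; `frames_for` is a hypothetical helper, and it assumes one sequence position per tokenizer frame. It ignores text tokens, and it ignores the fact that discrete codecs often emit several codebook tokens per frame, so the codec figures are, if anything, an underestimate.

```python
def frames_for(minutes: float, frame_rate_hz: float) -> int:
    """Number of audio frames produced for `minutes` of speech.

    Hypothetical helper for illustration: assumes one sequence
    position per frame and counts nothing but audio frames.
    """
    return round(minutes * 60 * frame_rate_hz)


if __name__ == "__main__":
    duration_min = 90  # the long-form limit quoted in the article

    # Frame rates taken from the article: 7.5 Hz for VibeVoice,
    # 50 to 100 tokens per second for conventional discrete codecs.
    for label, rate in [
        ("7.5 Hz continuous latents", 7.5),
        ("50 Hz discrete codec", 50.0),
        ("100 Hz discrete codec", 100.0),
    ]:
        print(f"{label}: {frames_for(duration_min, rate):,} frames")

    # Prints:
    # 7.5 Hz continuous latents: 40,500 frames
    # 50 Hz discrete codec: 270,000 frames
    # 100 Hz discrete codec: 540,000 frames
```

Under these assumptions, 90 minutes of speech occupies about 40,500 positions at 7.5 Hz, against 270,000 to 540,000 at codec rates, which is roughly a 7x to 13x reduction in sequence length. The same helper gives 27,000 frames for the 60-minute ASR limit and 4,500 frames for the 10-minute streaming budget.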

Key Features

Long-form TTS generating up to 90 minutes of speech with up to 4 distinct speakers in one pass
7.5 Hz continuous acoustic and semantic tokenizers for efficient ultra-long-context synthesis
VibeVoice-Streaming variant delivering ~300 ms first-audio latency for real-time agent use cases
VibeVoice-ASR sibling transcribing up to 60 minutes with speaker ID, timestamps, and hotwords
Three model sizes (0.5B streaming, 1.5B TTS, 7B ASR) sharing one tokenizer and codebase
11 English style voices plus 9 multilingual variants with multi-speaker dialogue conditioning
ICLR 2026 Oral accompanied by open weights on Hugging Face under MIT license
Reference Gradio demo, inference notebooks, and Python API in the official repository

Related Projects

GPT-SoVITS

ChatTTS

Bark

VoxCPM2