Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
VoxCPM2 is OpenBMB's tokenizer-free, diffusion autoregressive text-to-speech model that has crossed 22,000 GitHub stars by combining studio-grade multilingual synthesis with first-class voice design and cloning. Built on a MiniCPM-4 backbone with a 2B-parameter LocEnc to TSLM to RALM to LocDiT pipeline that operates entirely inside an AudioVAE V2 latent space, it bypasses the discrete audio tokens that constrain most modern TTS systems and emits native 48kHz studio audio at a 6.25Hz language-model token rate.

## Why Tokenizer-Free Matters

Most open TTS stacks (XTTS, F5-TTS, CosyVoice, GLM-4-Voice) commit to a residual or codec tokenizer up front, which caps prosody resolution and introduces audible artifacts at edges. VoxCPM2 instead lets the language model predict continuous latents that a diffusion head turns directly into waveforms. The benefit shows up in the numbers: 1.84% WER on Seed-TTS-eval English, 0.97% CER on the hard Chinese test, and a 1.68% average error rate across a 30-language ASR sweep. On a single RTX 4090 the model hits an RTF of about 0.30, dropping to 0.13 when served via Nano-vLLM.

## Three Modes of Voice Control

The project ships three operating modes that fit different production needs. Voice Design synthesizes a brand new speaker purely from a natural-language description with no reference audio, which is useful for game NPCs, branded assistants, or anonymized narration. Controllable Voice Cloning takes a short audio clip and lets you steer emotion, pace, and expression on top of the cloned timbre. Ultimate Cloning uses a reference audio plus its transcript and continues the voice with what the project calls "every vocal nuance preserved", making it suitable for audiobook continuation and long-form podcast generation.
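
To put the throughput figures quoted earlier (48kHz output, a 6.25Hz token rate, and RTFs of 0.30 and 0.13) in concrete terms, here is a minimal arithmetic sketch. It assumes the usual definition of real-time factor as compute time divided by audio duration; the function names are illustrative and not part of the VoxCPM2 codebase.

```python
SAMPLE_RATE_HZ = 48_000   # native output sample rate quoted in the article
TOKEN_RATE_HZ = 6.25      # language-model token rate quoted in the article


def samples_per_token(sample_rate_hz=SAMPLE_RATE_HZ, token_rate_hz=TOKEN_RATE_HZ):
    """Audio samples covered by one language-model step."""
    return sample_rate_hz / token_rate_hz


def tokens_for(duration_s, token_rate_hz=TOKEN_RATE_HZ):
    """Language-model steps needed for a clip of the given length."""
    return duration_s * token_rate_hz


def synthesis_time(duration_s, rtf):
    """Compute time implied by an RTF, assuming RTF = compute time / audio duration."""
    return duration_s * rtf


clip_s = 10.0
print(f"samples per token: {samples_per_token():.0f}")
print(f"tokens for {clip_s:.0f}s of audio: {tokens_for(clip_s)}")
for label, rtf in [("RTX 4090, default", 0.30), ("Nano-vLLM", 0.13)]:
    print(f"{label}: ~{synthesis_time(clip_s, rtf):.2f}s to synthesize {clip_s:.0f}s")
```

Under these assumptions, each language-model step accounts for 7,680 output samples, and an RTF below 1.0 means audio is produced faster than it plays back.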
## Languages and Dialects

VoxCPM2 covers 30 languages including English, Mandarin, Japanese, Korean, French, German, Spanish, and Arabic, plus 9 Chinese dialects: Sichuan, Cantonese, Wu, Northeast, Henan, Shaanxi, Shandong, Tianjin, and Minnan. The dialect coverage is unusual in the open ecosystem and is one reason the project has picked up traction in Chinese consumer apps and accessibility tools where standard Mandarin TTS sounds out of place.

## Training Scale and Hardware Bar

Under the hood the model is trained on more than 2 million hours of multilingual speech and weighs in at 2 billion parameters. It needs only about 8GB of VRAM to run inference, which keeps it on a single mid-range GPU and even within reach of consumer 12GB cards. Parameter-efficient fine-tuning via LoRA works with 5-10 minutes of audio, making it cheap to add custom voices to existing deployments.

## Production Deployment

The repository ships an OpenAI-compatible HTTP endpoint via Nano-vLLM and vLLM-Omni, so existing TTS clients written against the OpenAI audio API can swap in VoxCPM2 with a base URL change. Streaming generation is supported, which matters for interactive agents that need to start speaking before the full utterance is computed. Weights are mirrored on both Hugging Face and ModelScope.

## Where It Fits

VoxCPM2 is the right pick when you need open, commercial-friendly multilingual TTS with both ad-hoc voice creation and high-fidelity cloning. The Apache 2.0 license makes it directly usable in commercial products, the latency is good enough for real-time agents, and the dialect coverage opens markets that other open TTS projects skip.

## Limitations

The 8GB VRAM requirement and 2B parameter count mean it is heavier than smaller streaming TTS systems like Piper, so embedded deployment is harder. The voice design mode is impressive but still less reliable than direct reference cloning for matching very specific target voices.
And while 30 languages are supported, the WER gap between top-resource languages (English, Mandarin) and lower-resource ones in the 30-language sweep is still meaningful, so production use in those languages benefits from light fine-tuning.
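
The base-URL swap described under Production Deployment can be sketched as follows. This builds, but does not send, a request in the shape of the OpenAI audio speech route (`POST /audio/speech` with `model`, `input`, `voice`, and `response_format` fields). The server address, model name, and voice name below are hypothetical placeholders, and whether a given VoxCPM2 server accepts exactly these fields is an assumption to check against the repository's own documentation.

```python
import json
import urllib.request


def build_speech_request(base_url, text, model, voice, api_key="not-needed"):
    """Build (without sending) an OpenAI-style text-to-speech request."""
    payload = {
        "model": model,            # hypothetical model name on the local server
        "input": text,
        "voice": voice,            # hypothetical voice identifier
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Hypothetical local deployment; only the base URL differs from a hosted setup.
req = build_speech_request(
    base_url="http://localhost:8000/v1",
    text="Hello from a local TTS server.",
    model="voxcpm2",
    voice="default",
)
print(req.get_method(), req.full_url)

# To actually send it against a running server (not executed here):
# with urllib.request.urlopen(req) as resp, open("out.wav", "wb") as f:
#     f.write(resp.read())
```

Keeping the request construction separate from the network call makes it easy to point the same client code at a different backend by changing only `base_url`.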