Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
VoxCPM is a tokenizer-free Text-to-Speech system from OpenBMB that directly generates continuous speech representations through an end-to-end diffusion autoregressive architecture. Rather than converting audio into discrete tokens, VoxCPM models speech as a continuous signal, which lets it sidestep the quantization bottleneck that limits many modern TTS systems and produce noticeably more natural, expressive output. The project has crossed 28,900 GitHub stars and topped GitHub Trending across multiple releases.

## VoxCPM2: The 2B Leap

VoxCPM2 is the latest major release and a substantial upgrade over the original 0.5B model. It scales the architecture to roughly 2 billion parameters trained on more than 2 million hours of multilingual speech data, and it now supports 30 languages, a dedicated Voice Design mode, controllable voice cloning, and 48kHz studio-quality audio output. Built on a MiniCPM-4 backbone, it sits at the intersection of OpenBMB's language-model and speech research.

## Tokenizer-Free Architecture

Most contemporary TTS pipelines convert audio into discrete acoustic tokens, generate those tokens with a language model, and then decode them back to a waveform. That discretization step throws away fine-grained acoustic detail. VoxCPM instead predicts continuous speech representations directly using a diffusion autoregressive approach, preserving subtle prosodic and timbral information that token-based systems tend to flatten. The result is synthesis that captures natural rhythm, emphasis, and emotional nuance more faithfully.

## Key Capabilities

### 30-Language Multilingual Support

VoxCPM2 accepts input text in any of 30 supported languages (including English, Chinese, Japanese, Korean, Arabic, Hindi, Spanish, French, German, Russian, and many more) and synthesizes directly without requiring an explicit language tag. It additionally handles a range of Chinese dialects such as Cantonese, Sichuanese, and Shanghainese.

### Voice Design

One of the standout additions in VoxCPM2 is Voice Design: the ability to create a brand-new voice from a natural-language description alone. A developer can specify gender, age, tone, emotion, and pace in plain text and obtain a matching voice without supplying any reference audio. This is a meaningful step beyond conventional cloning, which always requires a sample of the target speaker.

### Controllable Voice Cloning

VoxCPM2 can clone a voice from a short reference clip while offering style guidance to steer emotion, pace, and expression, all while preserving the speaker's original timbre. For maximum fidelity, an "Ultimate Cloning" mode accepts both reference audio and its transcript so the model continues seamlessly from the reference, reproducing timbre, rhythm, emotion, and style.

### 48kHz High-Quality Audio

The model accepts 16kHz reference audio and directly outputs 48kHz studio-quality audio through AudioVAE V2's asymmetric encode/decode design, which includes built-in super-resolution. No external upsampler is needed to reach high-fidelity output.

## Performance and Deployment

VoxCPM2 supports real-time streaming with a real-time factor (RTF) as low as roughly 0.3 on an NVIDIA RTX 4090, and around 0.13 when accelerated by Nano-vLLM or vLLM-Omni. vLLM-Omni provides official omni-modal serving with PagedAttention and an OpenAI-compatible API, making the model practical to deploy behind a production endpoint. Installation is a single pip command (`pip install voxcpm`), and the project ships a Python API, CLI, web demo, and ReadTheDocs documentation.

## Context-Aware Synthesis

Beyond raw voice generation, VoxCPM automatically infers appropriate prosody and expressiveness from the text content itself. Punctuation, sentence structure, and semantic cues influence the delivery, so the same voice can sound measured in a formal passage and lively in dialogue without manual tuning.

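To make the real-time factor figures from the Performance and Deployment section concrete, here is a small arithmetic sketch. It assumes the usual definition of RTF (time spent generating divided by the duration of the audio produced). The helper names are illustrative only and are not part of the VoxCPM package.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(rtf: float, audio_seconds: float) -> float:
    """Wall-clock seconds needed to generate audio_seconds of speech at a given RTF."""
    return rtf * audio_seconds


# Approximate figures quoted in the article; real numbers depend on hardware and settings.
RTF_RTX_4090 = 0.3
RTF_ACCELERATED = 0.13

clip_seconds = 60.0  # one minute of speech
for label, rtf in [("RTX 4090", RTF_RTX_4090), ("Nano-vLLM / vLLM-Omni", RTF_ACCELERATED)]:
    seconds = synthesis_time(rtf, clip_seconds)
    print(f"{label}: about {seconds:.1f} s to generate {clip_seconds:.0f} s of audio "
          f"(~{1 / rtf:.1f}x faster than real time)")
```

Any RTF below 1.0 means audio is produced faster than it plays back, which is what makes streaming feasible. At the quoted figures, a one-minute clip would take roughly 18 seconds on the RTX 4090 and under 8 seconds with the accelerated backends.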
## Licensing and Ecosystem

VoxCPM2 releases both weights and code under the permissive Apache-2.0 license, which allows commercial use subject to the license's notice and attribution terms. Model weights are distributed on Hugging Face and ModelScope, and the project maintains an active community through Discord and Feishu, plus an accompanying technical report on arXiv. The momentum is evident: VoxCPM-0.5B hit #1 on Hugging Face Trending and VoxCPM1.5 reached #1 on GitHub Trending before this 2B release.

## Limitations

At 2B parameters, VoxCPM2 is considerably heavier than lightweight TTS models, so CPU-only inference is impractical for real-time use and a capable GPU is recommended. The diffusion autoregressive approach, while high quality, is more compute-intensive per second of audio than purely autoregressive token models. As with any high-fidelity voice-cloning system, the controllable cloning features raise clear potential for misuse, and the project includes risk and limitation guidance that deployers should heed. Some of the 30 supported languages and dialects will inevitably have less training coverage than the flagship English and Chinese voices.

## Who Should Use VoxCPM2

VoxCPM2 is well suited to teams building multilingual voice products, content creators who need natural narration across many languages, developers who want to design custom voices from a text brief rather than recordings, and researchers exploring tokenizer-free speech synthesis. Its Apache-2.0 license and OpenAI-compatible serving path make it a realistic foundation for commercial deployment, not just experimentation.
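
As a rough illustration of what an OpenAI-compatible serving path could look like from the client side, the sketch below builds a request shaped like OpenAI's `/v1/audio/speech` endpoint using only the Python standard library. Everything specific here is an assumption rather than documented VoxCPM2 behavior: the base URL, the served model name, the voice value, and whether a given vLLM-Omni deployment exposes this exact route. Check the project's serving documentation before relying on any of it.

```python
import json
import urllib.request

# Assumptions (not taken from the VoxCPM2 docs): a server listening locally,
# a served model called "voxcpm2", and a route mirroring OpenAI's speech API.
BASE_URL = "http://localhost:8000/v1"  # hypothetical endpoint
MODEL_NAME = "voxcpm2"                 # hypothetical served model name


def build_speech_request(text: str, voice: str = "default") -> urllib.request.Request:
    """Build (but do not send) a POST request in the OpenAI speech-API shape."""
    payload = {"model": MODEL_NAME, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str) -> None:
    """Send the request and save the returned audio bytes (needs a running server)."""
    with urllib.request.urlopen(build_speech_request(text)) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())


# Building the request needs no server, so the payload can be inspected offline.
req = build_speech_request("Hello from a tokenizer-free TTS model.")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Keeping request construction separate from the network call means the payload can be checked without a GPU or a running server. `synthesize` is only useful once a compatible endpoint is actually up.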