VoxCPM

OpenBMB | Apache-2.0

Audio | 19.1K Stars | 2.3K Forks | 101 views

VoxCPM is OpenBMB's open-source text-to-speech system built around a tokenizer-free, diffusion-autoregressive design. The current release, VoxCPM2, is a 2-billion-parameter multilingual model that generates 48 kHz studio-quality audio and supports voice cloning across 30 languages. With more than 19,000 GitHub stars and over 2,200 forks, it has become one of the most active open TTS projects of 2026. The code is released under Apache-2.0 and the weights are available on both Hugging Face and ModelScope, which makes the model usable in commercial products without restrictive terms.

## Architecture

VoxCPM2 departs from the discrete-codec pattern that dominates open-source TTS. Instead of quantizing audio into a fixed token vocabulary, the model operates directly in the continuous latent space of AudioVAE V2. It uses a four-stage pipeline: a Local Encoder (LocEnc), a Text-Speech Language Model (TSLM), a Residual Acoustic Language Model (RALM), and a Local Diffusion Transformer (LocDiT) that produces the final acoustic features. Removing the discrete token bottleneck is the main reason VoxCPM2 reaches 48 kHz output quality without the buzz and metallic artifacts that smaller codec-based systems often exhibit. Reference audio is accepted at 16 kHz, and the model performs implicit super-resolution as part of synthesis, so users do not need to provide high-sample-rate reference clips.

## Voice Capabilities

VoxCPM2 exposes three voice control modes. Voice design lets users specify a voice purely through a text description ("a calm middle-aged female narrator with a slight Tokyo accent"), which is useful when no reference recording exists. Controllable cloning combines a short reference clip with style guidance text to steer prosody, emotion, or speaking rate. Ultimate cloning takes a reference clip plus a matching transcript and preserves fine vocal detail, including breath patterns and characteristic prosodic tics.
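
The distinction between the three modes comes down to which inputs the caller supplies. The sketch below illustrates that mapping only; it is not the VoxCPM Python API. The `VoiceRequest` class, its field names, and the `select_mode` helper are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceRequest:
    """Hypothetical container for the inputs a caller might supply."""
    text: str
    voice_description: Optional[str] = None     # free-text description of the voice
    reference_wav: Optional[str] = None         # path to a short reference clip
    style_text: Optional[str] = None            # guidance on prosody, emotion, or rate
    reference_transcript: Optional[str] = None  # transcript matching the reference clip


def select_mode(req: VoiceRequest) -> str:
    """Map the supplied inputs to one of the three modes described above."""
    # Reference clip plus matching transcript: highest-fidelity cloning.
    if req.reference_wav and req.reference_transcript:
        return "ultimate_cloning"
    # Reference clip plus style guidance text: steerable cloning.
    if req.reference_wav and req.style_text:
        return "controllable_cloning"
    # Text description only, no reference recording.
    if req.voice_description and not req.reference_wav:
        return "voice_design"
    raise ValueError("inputs do not match any of the three voice modes")


print(select_mode(VoiceRequest(text="Hello.", voice_description="a calm narrator")))
```

The precedence here (transcript first, then style text) is an assumption made for the sketch; the source does not say how the real interface resolves a request that supplies both.
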
These three modes cover the practical range from generic narration to high-fidelity voice replication.

## Language Coverage

The model supports 30 languages, including Arabic, Chinese, English, French, German, Japanese, Korean, Spanish, Portuguese, Russian, and Vietnamese, along with nine regional Chinese dialects (Cantonese, Shanghainese, Sichuanese, and others). The multilingual coverage is broader than most other open TTS releases at this parameter scale and is supported by zero-shot cross-lingual voice transfer.

## Deployment and Performance

VoxCPM2 ships with three inference paths. Standard PyTorch inference runs on a single RTX 4090 at a real-time factor (RTF) of around 0.3. Nano-vLLM acceleration drops the RTF to roughly 0.13 on the same hardware, which is fast enough for interactive applications. A vLLM-Omni backend exposes an OpenAI-compatible API endpoint that can sit behind existing TTS-consuming services without code changes on the client side.

The repository ships a Web UI, a Python API, CLI tools, and Docker images for self-hosted deployment. A live demo runs on Hugging Face Spaces for evaluation without local setup. Fine-tuning recipes are provided for both full supervised fine-tuning and LoRA adapters, which lowers the cost of producing custom voices for a specific product or persona.

## Streaming and Context Awareness

The TSLM stage supports streaming generation, so the first audio chunk can begin playing before the full text has been processed. This matters for conversational agents and voice assistants where time-to-first-audio determines whether the interaction feels responsive. Context awareness is built into the architecture, which means VoxCPM2 modulates prosody based on surrounding text rather than rendering each sentence in isolation, producing more natural-sounding paragraph-length synthesis than codec-based alternatives at the same scale.

## Use Cases

VoxCPM2 fits a range of practical scenarios.
Audiobook and podcast pipelines benefit from the 48 kHz output quality and long-form prosody. Voice agents and assistants benefit from streaming output and low RTF. Localization teams use the multilingual zero-shot cloning to produce dubs that retain the original speaker's voice across languages. Content creators use voice design mode to generate distinct voices for characters in narrated content without needing a voice actor. Research groups working on speech synthesis have an unusually transparent reference implementation to study, since the architecture, training notes, and weights are all open.

## Limitations

The 2B parameter count puts VoxCPM2 above what most consumer hardware can run comfortably. A modern discrete GPU is effectively required for usable latency. The model does not include speech recognition, voice activity detection, or related ASR components, so production voice agents will need to pair VoxCPM2 with a separate ASR model. Although coverage spans 30 languages, the quality varies, with English, Chinese, and the larger European languages clearly stronger than the long tail. Voice design mode without a reference clip is more variable than controllable cloning with a reference, so applications that need consistent voices across sessions should prefer the cloning modes. Finally, the tokenizer-free design that produces the high audio quality also makes intermediate states harder to inspect than discrete-token models, which can complicate debugging of synthesis artifacts.

## Who Should Use VoxCPM

VoxCPM2 is a strong choice for teams that need open-source TTS at near-commercial quality and want freedom from per-character pricing or rate limits. Multilingual product teams benefit from the broad language coverage and consistent voice transfer. Audiobook and podcast producers benefit from the 48 kHz output and long-form prosody.
Voice agent developers benefit from streaming output, OpenAI-compatible serving, and a permissive license that allows commercial deployment without negotiation.
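
To make "OpenAI-compatible serving" concrete, here is a minimal sketch of the kind of request a client might send. It assumes the backend mirrors the OpenAI speech route (`/v1/audio/speech`) and its `model`, `input`, `voice`, and `response_format` fields. The base URL, port, model name `voxcpm2`, and voice name `default` are placeholders, not values taken from the VoxCPM or vLLM-Omni documentation, so check the server's own docs before relying on them.

```python
import json
import urllib.request


def build_speech_request(base_url: str, text: str, voice: str = "default",
                         model: str = "voxcpm2") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    # Field names follow the OpenAI speech API; the values are placeholders.
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",  # assumed route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_speech_request("http://localhost:8000", "Hello from a self-hosted TTS server.")
print(req.get_method(), req.full_url)

# Sending requires a running server, so it is left commented out:
# with urllib.request.urlopen(req) as resp, open("out.wav", "wb") as f:
#     f.write(resp.read())
```

If the route and fields do match, a service that already calls an OpenAI-style speech endpoint would only need its base URL changed, which is the "no client-side code changes" property described above.
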

Key Features

Tokenizer-free diffusion-autoregressive architecture (LocEnc, TSLM, RALM, LocDiT)
2B parameters producing 48 kHz studio-quality audio output
30 languages plus nine Chinese dialects with zero-shot cross-lingual cloning
Three voice modes: voice design, controllable cloning, ultimate cloning
Streaming generation with low time-to-first-audio latency
Real-time factor of ~0.13 on RTX 4090 via Nano-vLLM acceleration
OpenAI-compatible API endpoint through the vLLM-Omni backend
Full SFT and LoRA fine-tuning recipes for custom voice training
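
As a rough illustration of what the real-time factor figures mean in practice: RTF is processing time divided by the duration of the audio produced, so values below 1.0 are faster than real time. The sketch below applies the two RTF values quoted for the RTX 4090 to a 60-second clip. The helper name `synthesis_seconds` and the clip length are illustrative choices, and actual timings will vary with hardware and settings.

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate compute time from RTF = processing time / audio duration."""
    return audio_seconds * rtf


# RTF values as quoted for a single RTX 4090.
for label, rtf in [("PyTorch", 0.3), ("Nano-vLLM", 0.13)]:
    seconds = synthesis_seconds(60.0, rtf)
    print(f"{label}: about {seconds:.1f} s to synthesize 60 s of audio")
```

By this estimate, a minute of audio takes about 18 seconds on the standard PyTorch path and about 7.8 seconds with Nano-vLLM.
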