Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
MOSS-TTS is an open-source speech and sound generation model family developed by the OpenMOSS team and MOSI.AI. Rather than shipping a single monolithic TTS model, the project releases a coordinated set of specialized models that cover production-grade text-to-speech, multi-speaker dialogue, real-time voice agents, voice design from text prompts, sound effect generation, and an ultra-light on-device variant. By late May 2026 the repository has crossed 2,500 GitHub stars on the back of strong Seed-TTS-eval results and a viral v1.5 release that extends multilingual support to 31 languages.

## A Family, Not a Single Model

The core idea behind MOSS-TTS is that different production scenarios demand different architectural trade-offs. MossTTSDelay uses multi-head parallel RVQ prediction with delay-pattern scheduling and powers the 8B flagship MOSS-TTS-v1.5 with long-context stability. MossTTSLocal employs time-synchronous RVQ blocks with a depth transformer at 1.7B parameters, making it the right choice for flexible deployments. MossTTSRealtime, also at 1.7B, accepts hierarchical text-audio inputs and achieves 180ms TTFB for live voice agents. Around these sit MOSS-TTSD for dialogue, MOSS-VoiceGenerator for prompt-driven voice design, MOSS-SoundEffect-v2.0 for sound effects, and MOSS-TTS-Nano at 0.1B for CPU-only inference.

## Unified Audio Tokenizer

At the foundation of the family is the MOSS-Audio-Tokenizer, a 32-layer residual vector quantizer that compresses 24 kHz audio to a 12.5 Hz frame rate with bitrates configurable between 0.125 and 4 kbps. This shared tokenizer lets every model in the family interoperate on the same discrete audio interface, and OpenMOSS reports that it leads open-source tokenizers on reconstruction quality at comparable bitrates.
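
The tokenizer figures quoted above are mutually consistent under one simple assumption, which the short sketch below checks. It assumes that each of the 32 RVQ layers uses a 1,024-entry codebook (10 bits per code). That size is inferred from the stated bitrates and is not given in the text, and the function and constant names are illustrative only.

```python
import math

SAMPLE_RATE_HZ = 24_000   # input audio sample rate (from the text)
FRAME_RATE_HZ = 12.5      # token frames per second (from the text)
NUM_RVQ_LAYERS = 32       # residual quantizer depth (from the text)
CODEBOOK_SIZE = 1024      # ASSUMPTION: inferred from the bitrates, not stated in the text


def bitrate_bps(active_layers: int, codebook_size: int = CODEBOOK_SIZE) -> float:
    """Bits per second when only the first `active_layers` RVQ layers are kept."""
    if not 1 <= active_layers <= NUM_RVQ_LAYERS:
        raise ValueError(f"active_layers must be in 1..{NUM_RVQ_LAYERS}")
    bits_per_code = math.log2(codebook_size)
    return FRAME_RATE_HZ * active_layers * bits_per_code


# Number of raw audio samples summarized by one token frame.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ

print(samples_per_frame)   # 1920.0
print(bitrate_bps(1))      # 125.0  (0.125 kbps, the stated lower bound)
print(bitrate_bps(32))     # 4000.0 (4 kbps, the stated upper bound)
```

If the assumption holds, the configurable bitrate corresponds to how many RVQ layers are kept: one layer yields 0.125 kbps and all 32 yield 4 kbps.
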
## Multilingual and Expressive Control

MOSS-TTS-v1.5 covers 31 languages, including Mandarin, Cantonese, English, Arabic, Hindi, Japanese, Korean, Russian, Spanish, Swahili, Thai, and Vietnamese, with explicit code-switching support. Fine-grained controls let users feed Pinyin and phoneme overrides, set token-level duration, and insert explicit pause markers such as `[pause 3.2s]`. For dialogue scenarios, MOSS-TTSD-v1.0 was reported to outperform closed-source systems like Doubao and Gemini 2.5 Pro in subjective evaluations.

## Real-Time Voice Agents

MOSS-TTS-Realtime is built specifically for conversational stacks. Combined with an upstream LLM, it reaches roughly 377ms from prompt to first audio byte, with an RTF of 0.51 on consumer-class GPUs. This is the latency band where voice agents start to feel natural rather than walkie-talkie-like, and it lands in the open-source ecosystem with an Apache 2.0 license attached.

## Deployment Options

The project supports a PyTorch reference path for maximum compatibility, an SGLang backend that delivers roughly 3x faster generation throughput, and a llama.cpp backend that runs on 8GB GPUs without PyTorch via quantized GGUF weights and ONNX audio codecs. mlx-audio support brings the smaller models to Apple Silicon. Gradio demos ship alongside each model, and the community has built ComfyUI nodes, an OpenAI-compatible API wrapper, and podcast generation pipelines.

## Where It Fits

MOSS-TTS is a credible open alternative to closed voice stacks for teams that need controllable multilingual TTS, low-latency voice agents, or sound design pipelines without sending audio to a third party. The family approach means operators do not have to pick between quality and latency: the right MOSS model usually exists for each constraint, and they all speak the same audio token vocabulary.

## Limitations

Running the full family is not free. The 8B flagship needs serious GPU memory, and squeezing the realtime model below 200ms TTFB still benefits from accelerators. Some parts of the documentation and the model card licensing terms are evolving rapidly, so production adopters should pin specific revisions. As with most modern TTS systems, voice cloning capability raises the usual responsible-use questions, and the maintainers explicitly call on users to respect consent and local regulation.
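
For capacity planning, the latency figures quoted in this article can be turned into a rough budget. The sketch below is back-of-the-envelope arithmetic only. It assumes that the 377 ms prompt-to-first-audio figure includes the 180 ms TTS TTFB (the article does not say how the two measurements relate, and they may come from different setups) and that RTF means synthesis time divided by audio duration. All names are illustrative.

```python
TTS_TTFB_MS = 180          # realtime model time-to-first-byte (from the article)
END_TO_END_MS = 377        # prompt to first audio byte with an upstream LLM (from the article)
RTF = 0.51                 # real-time factor on consumer-class GPUs (from the article)


def upstream_budget_ms(end_to_end_ms: int, tts_ttfb_ms: int) -> int:
    """Time left for the LLM and glue code, ASSUMING the two figures are nested."""
    return end_to_end_ms - tts_ttfb_ms


def synthesis_seconds(audio_seconds: float, rtf: float = RTF) -> float:
    """Compute time needed to synthesize `audio_seconds` of speech at a given RTF."""
    return audio_seconds * rtf


def keeps_up_with_playback(rtf: float = RTF) -> bool:
    """An RTF below 1.0 means audio is generated faster than it is played back."""
    return rtf < 1.0


print(upstream_budget_ms(END_TO_END_MS, TTS_TTFB_MS))   # 197
print(round(synthesis_seconds(10.0), 2))                # 5.1
print(keeps_up_with_playback())                         # True
```

Read this way, roughly 197 ms would remain for the upstream LLM and transport, and a 10-second utterance would take about 5.1 seconds of compute, so streaming playback should not stall once the first audio arrives.
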