Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
## Introduction

MOSS-TTS Family is an open-source speech and sound generation model family developed by MOSI.AI and the OpenMOSS team at Shanghai AI Laboratory. With five specialized, production-ready models ranging from 1.7B to 8B parameters, MOSS-TTS covers a broad range of audio generation capabilities: stable long-form speech synthesis, multi-speaker dialogue generation, voice design from text prompts, real-time streaming TTS, and even sound effect generation.

While most open-source TTS projects focus on a single model for general speech synthesis, MOSS-TTS takes a fundamentally different approach: it offers a complete family of models, each optimized for a distinct production scenario. This makes it one of the most comprehensive open-source audio generation suites available today.

## Architecture and Models

The MOSS-TTS Family consists of five distinct models:

| Model | Parameters | Focus |
|-------|-----------|-------|
| MOSS-TTS | 8B | Flagship TTS with voice cloning and phoneme control |
| MOSS-TTSD v1.0 | 8B | Expressive multi-speaker dialogue synthesis |
| MOSS-VoiceGenerator | 1.7B | Text-prompt voice design without reference audio |
| MOSS-TTS-Realtime | 1.7B | Low-latency real-time voice agents |
| MOSS-SoundEffect | 8B | Environmental sound effect generation |

**MOSS-TTS** is the flagship model, supporting zero-shot voice cloning, ultra-long speech generation, token-level duration control, and multilingual synthesis across 20 languages, including Chinese, English, Japanese, Korean, German, French, Spanish, Arabic, and Russian.

**MOSS-TTSD** focuses on multi-speaker dialogue generation with expressive prosody. In evaluations, it outperformed leading closed-source models in naturalness and speaker consistency for conversational scenarios.

**MOSS-VoiceGenerator** is particularly innovative, allowing users to create entirely new speaker timbres from free-form text descriptions without any reference audio.
Instead of cloning an existing voice, you simply describe the voice characteristics you want.

**MOSS-TTS-Realtime** is optimized for interactive voice agents that require low-latency, continuous speech generation across multi-turn conversations, making it ideal for chatbot and voice assistant applications.

**MOSS-SoundEffect** generates environmental audio, including nature sounds, urban noise, musical instruments, and other effects, with precise category and duration control.

## Key Capabilities

- **20-Language Multilingual Support**: Native synthesis across Chinese, English, German, Spanish, French, Japanese, Italian, Korean, Russian, Arabic, and 10 additional languages, with natural prosody preserved.
- **PyTorch-Free Inference**: Supports llama.cpp + ONNX Runtime for deployment without PyTorch dependencies, enabling lighter production environments and edge deployment.
- **SGLang Backend**: Integration with SGLang for accelerated inference, achieving roughly 3x higher generation throughput than standard PyTorch inference.
- **FlashAttention 2 Optimization**: Native support for FlashAttention 2, reducing memory footprint and increasing inference speed for the larger 8B-parameter models.
- **Quantized GGUF Weights**: Pre-quantized weights available in GGUF format for efficient CPU and mixed CPU/GPU inference on consumer hardware.
- **Fine-Grained Control**: Phoneme-level pronunciation control, token-level duration adjustment, and Pinyin-level synthesis control for precise customization of speech output.

## Limitations

With 912 GitHub stars, MOSS-TTS has a smaller community than projects like Chatterbox or Coqui TTS, which means fewer community-contributed improvements and integrations. The flagship 8B-parameter models require substantial GPU memory for inference, making the smaller 1.7B Realtime and VoiceGenerator variants more practical for resource-constrained deployments.
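The GPU-memory point above can be made concrete with a standard back-of-the-envelope rule: fp16 weights take 2 bytes per parameter, so weight storage alone (ignoring activations, KV cache, and framework overhead, which add more on top) scales linearly with model size. This is a generic sizing sketch, not a measurement of MOSS-TTS itself:

```python
def fp16_weight_gib(params_billion: float) -> float:
    """Rough memory needed just to hold fp16 weights (2 bytes per parameter).

    Ignores activations, KV cache, and framework overhead, so real
    requirements are higher; this is only a lower-bound rule of thumb.
    """
    bytes_total = params_billion * 1e9 * 2  # 2 bytes per fp16 parameter
    return bytes_total / 2**30              # convert bytes to GiB

# 8B flagship vs. the 1.7B Realtime/VoiceGenerator variants
print(f"8B   ~ {fp16_weight_gib(8.0):.1f} GiB")  # ~14.9 GiB
print(f"1.7B ~ {fp16_weight_gib(1.7):.1f} GiB")  # ~3.2 GiB
```

By this estimate, the 8B models need roughly 15 GiB for weights alone, pushing them past most consumer GPUs, while the 1.7B variants fit comfortably in about 3 GiB, which is why the quantized GGUF route matters for the larger models.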
Documentation is primarily in Chinese with English translations, which may create friction for non-Chinese-speaking developers. The sound effect model, while unique, has limited category coverage compared to dedicated audio generation tools. Real-time streaming latency, though improved, still depends heavily on hardware configuration.

## Who Should Use This

MOSS-TTS is ideal for teams building comprehensive voice AI products that need multiple audio capabilities under a single framework. Voice assistant developers benefit from the Realtime variant's low-latency design. Game and media studios can leverage the VoiceGenerator for character voice creation and SoundEffect for environmental audio. Multilingual content platforms that require synthesis across Asian and European languages will find the 20-language support particularly valuable. Researchers exploring TTS architectures benefit from the variety of model designs and the PyTorch-free inference options.