Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
## Introduction

Qwen3-TTS is an open-source text-to-speech model series developed by Alibaba Cloud's Qwen team, offering stable, expressive, and streaming speech generation with free-form voice design and vivid voice cloning capabilities. With 9,600+ GitHub stars and 1,200+ forks, it represents a significant step forward in open-source multilingual TTS technology, covering 10 major languages with models ranging from 0.6B to 1.7B parameters.

The TTS landscape has long been dominated by proprietary systems from major cloud providers. Qwen3-TTS challenges this by open-sourcing a production-grade system that matches or exceeds commercial alternatives in naturalness and expressiveness, built on a novel discrete multi-codebook language model architecture.

## Architecture and Design

Qwen3-TTS introduces a discrete multi-codebook LM architecture for end-to-end speech modeling, paired with a custom acoustic tokenizer called Qwen3-TTS-Tokenizer-12Hz. This tokenizer compresses audio into discrete tokens at 12Hz, significantly lower than typical 50-75Hz tokenizers, enabling longer context windows and more efficient generation.

| Component | Purpose |
|-----------|---------|
| Qwen3-TTS-Tokenizer-12Hz | Custom acoustic compression at 12Hz |
| Multi-Codebook LM | End-to-end speech modeling with discrete tokens |
| Dual-Track Streaming | Hybrid system for low-latency generation |
| Voice Design Module | Natural language voice description to speech |

The dual-track hybrid streaming generation system achieves 97ms latency for streaming applications, making it practical for real-time conversational AI and voice assistant deployments.

## Model Variants

The series ships in three functional variants across two sizes:

**Base Models (0.6B / 1.7B)**: Core TTS models for standard text-to-speech conversion with high naturalness and stability. The 0.6B variant targets edge deployment, while the 1.7B delivers maximum quality.
**CustomVoice Models (0.6B / 1.7B)**: Specialized for voice cloning from as little as 3 seconds of reference audio. These models capture speaker timbre, prosody, and speaking style with minimal input.

**VoiceDesign Models (0.6B / 1.7B)**: Enable voice creation through natural language descriptions rather than reference audio. Describe the desired voice characteristics in text, and the model generates speech matching that description.

## Key Capabilities

**10-Language Support**: Native coverage of Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian with consistent quality across all languages.

**97ms Streaming Latency**: The dual-track streaming system enables real-time applications with sub-100ms first-byte latency, suitable for conversational AI and live interactions.

**3-Second Voice Cloning**: Clone any voice from just 3 seconds of reference audio, capturing essential speaker characteristics while maintaining generation quality.

**Natural Language Voice Design**: Create entirely new voices by describing desired characteristics in plain text, such as "a warm, mature female voice with a slight British accent."

**9 Premium Timbres**: Built-in collection of high-quality pre-designed voices for immediate use without cloning or design.

**Instruction-Based Control**: Fine-grained control over speech characteristics through natural language instructions, including speaking rate, emotion, emphasis, and style adjustments.
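To make the benefit of the 12Hz tokenizer concrete, here is a back-of-the-envelope comparison (plain arithmetic, not Qwen3-TTS code) of how many acoustic frames a minute of audio produces at 12Hz versus a typical 50Hz tokenizer. This ignores the number of codebooks per frame, which scales both counts equally:

```python
# Acoustic frames produced per minute of audio at a given frame rate.
def frames_per_minute(frame_rate_hz: int) -> int:
    return frame_rate_hz * 60

qwen3_12hz = frames_per_minute(12)    # 720 frames per minute of audio
typical_50hz = frames_per_minute(50)  # 3000 frames per minute of audio

# With a fixed context window, the 12Hz tokenizer fits roughly 4x
# more audio than a 50Hz tokenizer.
ratio = typical_50hz / qwen3_12hz
print(qwen3_12hz, typical_50hz, round(ratio, 2))
```

The same fixed token budget therefore covers about four times as much audio at 12Hz, which is what makes longer context windows practical.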
## Developer Integration

Qwen3-TTS integrates through standard Python APIs:

```bash
pip install qwen3-tts
```

Basic generation with a pre-designed voice:

```python
from qwen3_tts import Qwen3TTS

model = Qwen3TTS.from_pretrained("Qwen/Qwen3-TTS-1.7B")
audio = model.generate(
    text="Welcome to Qwen3 text to speech.",
    voice="premium_01",
)
audio.save("output.wav")
```

Voice cloning from a 3-second sample:

```python
audio = model.clone_and_generate(
    text="Speaking with a cloned voice.",
    reference_audio="speaker.wav",
)
```

## Limitations

While Qwen3-TTS delivers strong results across its 10 supported languages, quality varies between languages, with Chinese and English receiving the most training attention. The 12Hz tokenizer, while efficient, can occasionally produce artifacts in very rapid speech segments. Voice design from text descriptions is creative but less precise than voice cloning for matching specific target speakers.

The 1.7B model requires significant GPU memory for inference, and the 0.6B variant trades noticeable quality for efficiency. Streaming-mode quality is slightly below non-streaming generation. The license terms should be reviewed carefully for commercial deployment scenarios.

## Who Should Use This

Qwen3-TTS is well-suited for developers building multilingual voice applications, particularly those serving Asian and European language markets. Teams needing real-time streaming TTS with sub-100ms latency will find the dual-track architecture compelling. Content creators wanting to design custom voices through natural language descriptions benefit from the VoiceDesign variants. Researchers exploring discrete speech tokenization and multi-codebook architectures will find the technical approach novel and well-documented.
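A practical footnote to the cloning example in the Developer Integration section: since cloning expects at least 3 seconds of reference audio, a small pre-flight duration check can reject too-short clips before spending time on inference. This sketch uses only the Python standard library and is independent of the qwen3-tts package; the 3-second threshold is the figure quoted above:

```python
import wave

MIN_REFERENCE_SECONDS = 3.0  # minimum clip length cited for voice cloning

def reference_audio_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def is_long_enough(path: str) -> bool:
    """Check a reference clip meets the minimum cloning duration."""
    return reference_audio_seconds(path) >= MIN_REFERENCE_SECONDS
```

Running this check before calling `clone_and_generate` gives a clearer error than whatever the model would produce from an under-length clip.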
microsoft
Open-source frontier voice AI for TTS and ASR
resemble-ai
Family of SoTA open-source TTS models by Resemble AI with zero-shot voice cloning, 23+ language support, and paralinguistic controls across 350M-500M parameter variants.