Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
Qwen3-TTS is an open-source text-to-speech series from the Qwen team at Alibaba Cloud. Released as 0.6B and 1.7B models, it delivers ultra-high-quality, human-like speech generation with voice cloning, voice design, and natural language-based voice control.

## Why Qwen3-TTS Matters

Open-source TTS has long forced a trade-off between naturalness, latency, and language coverage. Qwen3-TTS targets all three at once, offering one of the most extensive feature sets available in an openly licensed speech model, and it does so with a compact architecture that runs efficiently rather than requiring a heavyweight pipeline.

## Powerful Speech Representation

The series is built on the self-developed Qwen3-TTS-Tokenizer-12Hz, which achieves efficient acoustic compression and high-dimensional semantic modeling of speech. It preserves paralinguistic information and acoustic environment detail, enabling high-speed, high-fidelity reconstruction through a lightweight non-DiT architecture.

## End-to-End Multi-Codebook Architecture

Qwen3-TTS uses a discrete multi-codebook language-model architecture for full-information, end-to-end speech modeling. This bypasses the information bottlenecks and cascading errors of traditional LM-plus-DiT schemes, improving versatility, generation efficiency, and the overall quality ceiling.

## Extreme Low-Latency Streaming

A Dual-Track hybrid streaming architecture lets a single model handle both streaming and non-streaming generation. It can emit the first audio packet right after a single character is input, with end-to-end synthesis latency as low as 97 ms, fast enough for real-time interactive applications.

## Multilingual Voice Control

Qwen3-TTS covers 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian) plus multiple dialectal voice profiles.
It supports voice cloning, free-form voice design, and instruction-driven control of timbre, emotion, and prosody, adaptively adjusting tone and rhythm from text semantics. The project ships vLLM support and fine-tuning recipes under the Apache-2.0 license.
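
The first-packet latency figure quoted in the streaming section is the kind of number you can check against any streaming synthesizer. The sketch below shows one way to time it, under stated assumptions: `fake_stream_tts` is a hypothetical stand-in that yields dummy byte packets and is not the Qwen3-TTS API, and `time_to_first_packet` is an illustrative helper name. To measure a real model, you would replace the stand-in with whatever iterator your actual client exposes.

```python
import time
from typing import Iterator, Tuple


def fake_stream_tts(text: str, chunk_delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS engine (NOT the Qwen3-TTS API).

    Yields one dummy packet per input character, sleeping briefly before
    each one to imitate per-packet synthesis time.
    """
    for _ in text:
        time.sleep(chunk_delay_s)
        yield b"\x00" * 320  # placeholder bytes, not real audio


def time_to_first_packet(stream: Iterator[bytes]) -> Tuple[float, int]:
    """Return (seconds until the first packet arrived, total packets received)."""
    start = time.perf_counter()
    first_latency = None
    count = 0
    for _packet in stream:
        if first_latency is None:
            # Record the delay only once, when the first packet shows up.
            first_latency = time.perf_counter() - start
        count += 1
    if first_latency is None:
        raise ValueError("stream produced no packets")
    return first_latency, count
```

Calling `time_to_first_packet(fake_stream_tts("Hello"))` returns the delay before the first packet and the packet count. Because the generator is lazy, the timer starts before any synthesis work begins, so the measurement covers the full wait the caller would experience.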