Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
## Introduction

YuE (pronounced "yeah") is an open-source foundation model for full-song music generation, an open answer to closed systems like Suno.ai. From a set of lyrics and a genre or style prompt, YuE generates a complete, several-minute song with coordinated vocal and instrumental tracks. Developed by a research collaboration anchored at HKUST and the M-A-P community, the project has crossed 6,200 GitHub stars and become a reference point for anyone studying or building open lyrics-to-music systems, precisely because it ships the architecture, weights, and recipes that commercial alternatives keep closed.

## What It Is

YuE is an LLM-based, two-stage music generation system:

- **Stage 1 (YuE-s1)**: A 7B-parameter language model that generates semantic music tokens from the lyrics and conditioning prompt, establishing melody, structure, and vocal content.
- **Stage 2 (YuE-s2)**: A 1B-parameter model that refines and upsamples the Stage 1 output into higher-fidelity acoustic tokens.

This decomposition mirrors how large language models separate planning from realization: the big model decides what the song should be, and the smaller model renders it cleanly. An additional YuE-upsampler handles further audio refinement.

## Key Capabilities

### Lyrics-to-Song Generation

YuE takes written lyrics and produces a full song in which the generated vocals actually sing those lyrics over a matching instrumental backing: not a hummed melody or an instrumental loop, but a coordinated vocal-plus-accompaniment mix.

### Two Inference Modes

The model supports Chain-of-Thought (CoT) reasoning for generation from scratch and In-Context Learning (ICL) for conditioning on a reference. ICL mode enables voice cloning and music style transfer, letting a user steer the output toward the timbre or genre of a provided audio sample.

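
The split between the two modes, and the order of the stages described earlier, can be sketched in a few lines. This is only an illustration under stated assumptions: `GenerationRequest`, `plan_pipeline`, and the stage labels are hypothetical names invented here, not YuE's actual API or command-line interface. The sketch assumes a request counts as ICL whenever a reference clip is supplied and as CoT otherwise.

```python
from dataclasses import dataclass
from typing import List, Optional

# NOTE: every name in this block is hypothetical and exists only to
# illustrate the flow described in the text; it is not YuE's real interface.


@dataclass(frozen=True)
class GenerationRequest:
    lyrics: str
    genre_prompt: str
    reference_audio: Optional[str] = None  # path to a reference clip (ICL only)

    @property
    def mode(self) -> str:
        """Assume ICL when a reference is supplied, otherwise CoT from scratch."""
        return "icl" if self.reference_audio else "cot"


def plan_pipeline(request: GenerationRequest) -> List[str]:
    """Return the ordered stages a request would pass through."""
    if not request.lyrics.strip():
        raise ValueError("lyrics must not be empty")
    return [
        f"stage1:{request.mode}",  # 7B model: semantic tokens from lyrics + prompt
        "stage2",                  # 1B model: refine into acoustic tokens
        "upsampler",               # further audio refinement
    ]
```

With these assumptions, a request without `reference_audio` is planned as a CoT run, while passing a reference path switches only the first stage to ICL; the second stage and the upsampler are unchanged in both cases.
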
### Single- and Dual-Track Control

Generation supports single-track output (full mix, vocal-only, or instrumental-only) as well as dual-track ICL modes, giving producers separable stems to work with rather than a single baked render.

### Multilingual, Multi-Genre Coverage

YuE generates across English, Mandarin, Cantonese, Japanese, and Korean, spanning a range of musical genres rather than a single style, which makes it useful well beyond English-language pop.

## Model Variants

The project ships six primary checkpoints covering its supported languages (English, Japanese/Korean, and Mandarin) in both CoT and ICL modes, plus the dedicated upsampler model. All weights are distributed through Hugging Face.

## Hardware Requirements

Full-song generation is computationally heavy. A 24GB GPU can handle up to roughly two sessions (for example, one verse and one chorus), while generating complete multi-section songs (four or more sessions) calls for 80GB-class hardware such as an H800 or A100, or a multi-GPU RTX 4090 setup. For throughput context, an H800 generates about 30 seconds of music in 150 seconds, while an RTX 4090 takes around 360 seconds for the same output.

## License and Ecosystem

YuE's models and weights were released under the Apache-2.0 license in January 2025, permitting commercial use. The maintainers explicitly encourage artists and creators to sample, incorporate, and even monetize the outputs, recommending attribution to "YuE by HKUST/M-A-P." The work is backed by a research coalition including HKUST, M-A-P, Moonshot AI, ByteDance, 01.ai, and Geely, with an accompanying technical report published on arXiv in March 2025.

## Why It Matters

Full-song generation with intelligible, on-lyric vocals has been one of the most visible capability gaps between open-source and commercial music AI. Tools like Suno demonstrated the demand, but their closed nature left researchers and builders unable to study, fine-tune, or self-host the technology.
YuE narrows that gap with a transparent, permissively licensed architecture that produces vocals-plus-instrumentals end to end, making it a foundation others can extend rather than a black-box API.

## Limitations

The headline constraint is compute: producing a complete song requires high-end accelerators and substantial generation time, which puts real-time or consumer-laptop use out of reach for now. Quality varies across the supported languages and genres, with less-represented styles receiving weaker coverage than the flagship English output. As with any expressive vocal-synthesis system, the voice-cloning capability in ICL mode carries clear potential for misuse, and outputs still benefit from human curation and post-production before release.
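
To put the compute constraint in numbers, the throughput figures quoted under Hardware Requirements can be extrapolated to longer durations. The sketch below assumes generation time scales linearly with audio length, which is a simplification rather than a measured property of YuE, and it ignores the memory limits that cap how many sessions a 24GB card can handle. The function and dictionary names are illustrative.

```python
# Wall-clock seconds needed to generate 30 s of audio, as quoted in the
# Hardware Requirements section.
SECONDS_PER_30S_CLIP = {"H800": 150, "RTX 4090": 360}


def estimate_generation_seconds(gpu: str, audio_seconds: float) -> float:
    """Linearly extrapolate generation time (an assumption, not a benchmark)."""
    compute_per_audio_second = SECONDS_PER_30S_CLIP[gpu] / 30
    return compute_per_audio_second * audio_seconds


if __name__ == "__main__":
    for gpu in SECONDS_PER_30S_CLIP:
        minutes = estimate_generation_seconds(gpu, 180) / 60
        print(f"{gpu}: about {minutes:.0f} minutes for a 3-minute song")
```

Under that linear assumption, each second of audio costs about 5 seconds of compute on an H800 and about 12 seconds on an RTX 4090, so a three-minute song works out to roughly 15 and 36 minutes respectively.
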