Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
SongGeneration is the official open-source code repository for LeVo, Tencent AI Lab's high-quality song generation system built around multi-preference alignment. With 1,654 GitHub stars and 198 forks, the project packages a full pipeline for turning structured lyrics and style tags into complete songs (vocals, accompaniment, or both) and aims squarely at commercial-grade output from an open release.

## What LeVo Is

LeVo is a music foundation model that generates full songs from text. Unlike one-shot clip generators, it is designed to handle song-level structure: users supply lyrics organized by section labels such as [intro], [verse], [chorus], [bridge], and [outro], along with a description of gender, genre, emotion, and instrumentation expressed as comma-separated tags. From that input the system produces a coherent arrangement up to four minutes thirty seconds long, and can emit a full mixed song, pure instrumental music, vocals-only a cappella, or separated tracks for downstream editing.

## A Hybrid LM-Diffusion Architecture

The technical core is a hybrid language-model-plus-diffusion design. A component the authors call LeLM acts as a hierarchical language model that manages global musical structure and performance details, processing what the project describes as Mixed Tokens for semantics and Dual-Track Tokens for parallel modeling of vocals and accompaniment. A separate diffusion component then synthesizes the high-fidelity acoustic details. This split, a planner that decides structure and a renderer that produces audio frames, mirrors a broader 2026 trend in audio AI toward decoupling musical composition from waveform synthesis, and it lets LeVo model vocals and backing tracks as parallel streams rather than a single entangled signal.

## Multi-Preference Alignment

The defining contribution, reflected in the paper title "High-Quality Song Generation with Multi-Preference Alignment," is the training pipeline.
LeVo uses a three-stage process: supervised fine-tuning on high-quality songs, a large-scale offline Direct Preference Optimization (DPO) stage using roughly 200,000 positive/negative pairs, and a semi-online DPO stage with periodic updates driven by aesthetic scores. Preference alignment, familiar from text LLM training, is applied here to musical quality: the model is taught not just to produce valid audio but to prefer outputs that human and aesthetic scoring rate as better across multiple dimensions, including overall quality, melody, arrangement, sound quality, and structure.

## Reported Quality

The project reports a phoneme error rate (PER) of 8.55% for lyric accuracy, which it positions favorably against a cited 12.4% for a commercial reference. As with all self-reported benchmarks, these figures should be read as the authors' own evaluation rather than independent results, but the emphasis on lyric intelligibility is notable: getting sung words to be clearly understandable is a long-standing weakness of generative music systems, and a dedicated PER metric signals that the team treats it as a first-class objective.

## Multilingual and Practical Use

The current v2-large model supports multiple languages including Chinese, English, Spanish, and Japanese. In practice, input is provided as JSON Lines (.jsonl) files where lyrics use English punctuation only, sections are separated by semicolons, and sentences end with periods; descriptions are comma-separated tags rather than full sentences. Runtime flags such as `--low_mem`, `--not_use_flash_attn`, `--separate`, `--vocal`, and `--bgm` let users tune memory usage and control which stems are produced. The project requires Python 3.8.12 or newer and CUDA 11.8 or newer, and provides a Docker container, Hugging Face models, and an interactive demo to ease setup.

## Model Variants and Footprint

LeVo ships in several variants that trade length and quality against GPU memory.
Smaller base configurations target roughly 10-16 GB of GPU memory for shorter songs, while the Large and v2-large configurations target the 22-28 GB range for full four-and-a-half-minute generations, with real-time factors below 1.0, meaning generation is faster than playback. This tiering lets users with mid-range GPUs run shorter outputs while reserving the largest model for users with high-memory cards.

## Pros, Cons, and Outlook

LeVo's strengths are its song-level structure handling, dual-track vocal/accompaniment modeling, multilingual support, and a genuinely novel preference-alignment training recipe backed by a published paper. The trade-offs are real: the largest models demand 22-28 GB of GPU memory, the input format is strict and unforgiving of punctuation mistakes, and the repository's license is marked NOASSERTION on GitHub, meaning prospective commercial users must read the bundled LICENSE file carefully rather than assuming a standard permissive grant. Backed by Tencent AI Lab and centered on a credible research contribution, SongGeneration is one of the more academically grounded entries in the 2026 open-source song-generation landscape, and its multi-preference alignment approach is likely to influence how future systems are tuned for musical quality.
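
## Example: Preparing an Input File

Because the input format is strict about punctuation, it may help to generate the JSON Lines file programmatically rather than by hand. The following is a minimal sketch of the rules described under Multilingual and Practical Use: sections joined by semicolons, sentences ending in periods, English punctuation only, and descriptions as comma-separated tags. The field names `idx`, `gt_lyric`, and `descriptions`, the exact spacing around the semicolon, and the helper functions themselves are assumptions made for illustration and are not confirmed by the text above; check the repository README for the authoritative schema.

```python
import json

# Fullwidth punctuation that would break the "English punctuation only" rule.
FULLWIDTH_PUNCT = set("，。；：！？、")


def format_lyrics(sections):
    """Join (label, sentences) pairs into a single lyric string.

    Each sentence ends with a period; sections are separated by semicolons.
    A section with no sentences (e.g. an instrumental intro) is just its label.
    """
    parts = []
    for label, sentences in sections:
        for s in sentences:
            bad = FULLWIDTH_PUNCT.intersection(s)
            if bad:
                raise ValueError(f"non-English punctuation in {s!r}: {sorted(bad)}")
        # Normalize so every sentence ends with exactly one period.
        body = " ".join(s.strip().rstrip(".") + "." for s in sentences)
        parts.append(f"[{label}] {body}".strip())
    return " ; ".join(parts)  # separator spacing is an assumption


def make_entry(idx, sections, tags):
    """Build one record; the key names here are assumed, not confirmed."""
    return {
        "idx": idx,
        "gt_lyric": format_lyrics(sections),
        "descriptions": ", ".join(tags),  # tags, not full sentences
    }


def to_jsonl(entries):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in entries) + "\n"


example = make_entry(
    "demo_01",
    [
        ("intro", []),
        ("verse", ["City lights are fading", "I walk alone"]),
        ("chorus", ["Sing it out loud"]),
    ],
    ["female", "pop", "sad", "piano"],
)
print(to_jsonl([example]), end="")
```

Validating punctuation before writing the file surfaces formatting mistakes early, before any GPU time is spent on generation.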