Open Source
Explore the latest AI open-source projects from GitHub and Hugging Face.
Dia is a 1.6B-parameter open-weights text-to-speech model from Nari Labs that specializes in something most TTS systems still struggle with: generating ultra-realistic two-speaker dialogue in a single pass. Released under the Apache 2.0 license, the project has 19,294 GitHub stars and 1,683 forks, and a follow-up Dia2 model has already shipped with refined prosody and faster inference. Where conventional TTS stacks synthesize speakers one line at a time and rely on post-processing to glue them together, Dia models the entire dialogue jointly, so turn-taking, overlap, and emotional consistency emerge from the same forward pass.

## Dialogue, Not Just Speech

The core interface uses inline speaker tags, `[S1]` and `[S2]`, that must alternate correctly through the transcript. A typical prompt looks like `[S1] Did you read the new paper? [S2] (laughs) Of course I did.` Dia handles the cadence, breath placement, and prosodic interaction between the two speakers natively, including realistic interruptions and back-channels. This is qualitatively different from concatenating two single-speaker XTTS or Fish-Speech outputs, where the speakers tend to sound like they are reading separate scripts.

## Non-Verbal Vocalizations

Dia accepts a small but expressive vocabulary of non-verbal cues inside the transcript: `(laughs)`, `(clears throat)`, `(sighs)`, `(gasps)`, `(coughs)`, `(singing)`, and similar. These tokens render as actual vocalizations rather than text, which is the missing ingredient for game NPCs, audio drama, character voiceover, and accessibility content where pure speech feels flat. The README is candid that these tokens sometimes yield unexpected output and benefit from generation-time seed control or rerolls.

## Voice Cloning Without Fine-Tuning

Voice conditioning works through audio prompts: the user supplies a short reference clip and Dia matches its timbre, pacing, and emotional register across the generated dialogue.
There is no per-voice fine-tuning step (the model is fully zero-shot), which puts Dia in the same usage category as XTTS-v2 and Fish-Speech, but with the distinguishing advantage of conditioning two voices simultaneously from two reference clips. Speaker consistency across long generations does require seed fixing or audio prompts, since the model can drift without conditioning anchors.

## Hardware Footprint

The model has been tested on PyTorch 2.0+ with CUDA 12.6. On a single RTX 4090 it reaches roughly 2.1x real-time at bfloat16 with `torch.compile`, requiring about 4.4 GB of VRAM. That puts Dia within reach of consumer GPUs and makes it viable for indie game studios, podcast tooling, and creator workflows that cannot justify hosted inference. CPU support is documented as pending, and quantized or MLX builds for Apple Silicon are community-driven rather than first-party.

## Trade-Offs and Constraints

Dia's specialization comes with sharp edges. Generation is English-only (there is no multilingual head), and the prompt length sweet spot is roughly 5 to 20 seconds of equivalent audio, so chapter-length narration requires chunking. The model is also strictly two-speaker; three- or four-way conversation requires interleaving multiple generations. Compared to VibeVoice's 90-minute multi-speaker mode or OmniVoice's 600-language coverage, Dia is the deliberately narrow specialist that nails the one thing most general models still mishandle: natural English dialogue with non-verbal cues.

## Ecosystem and Adoption

The Apache 2.0 license, modest hardware footprint, and clear dialogue-focused niche have made Dia a popular component in larger pipelines: it is increasingly bundled into audio-drama generators, podcast-from-blog tools, AI tabletop game masters, and language-learning conversation engines.
Nari Labs maintains an inference repository, a Hugging Face Space for quick previews, and a Discord for community contributions, and the upgrade path from the original Dia to Dia2 is largely a weight swap. For builders who specifically need lifelike English dialogue rather than monologue TTS, Dia remains the open default as of May 2026.
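
The two transcript constraints described above, alternating `[S1]`/`[S2]` tags and a sweet spot of roughly 5 to 20 seconds per prompt, can be handled before any audio is generated. The following is a minimal sketch of that preprocessing, not part of Dia's own API: the function names `split_turns` and `chunk_turns` are hypothetical, and the speaking rate of 2.5 words per second is an assumed rule of thumb used only to estimate audio length from text.

```python
import re

# Assumed average speaking rate; this is NOT a constant taken from Dia.
WORDS_PER_SECOND = 2.5


def split_turns(transcript):
    """Split a tagged transcript into (speaker, text) turns.

    Raises ValueError if a tag other than S1/S2 appears or if the same
    speaker has two turns in a row.
    """
    # re.split with one capture group yields:
    # [text_before_first_tag, tag, text, tag, text, ...]
    parts = re.split(r"\[(S\d+)\]", transcript)
    turns = [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts), 2)]
    if not turns:
        raise ValueError("no speaker tags found")
    for speaker, _ in turns:
        if speaker not in ("S1", "S2"):
            raise ValueError(f"unsupported speaker tag [{speaker}]")
    for (prev, _), (cur, _) in zip(turns, turns[1:]):
        if prev == cur:
            raise ValueError(f"[{cur}] appears twice in a row")
    return turns


def estimate_seconds(text):
    """Rough audio length of a turn, based on the assumed speaking rate."""
    return len(text.split()) / WORDS_PER_SECOND


def chunk_turns(turns, max_seconds=20.0):
    """Greedily pack whole turns into chunks of at most max_seconds.

    A single turn longer than max_seconds is kept intact as its own chunk.
    """
    chunks, current, used = [], [], 0.0
    for speaker, text in turns:
        cost = estimate_seconds(text)
        if current and used + cost > max_seconds:
            chunks.append(" ".join(current))
            current, used = [], 0.0
        current.append(f"[{speaker}] {text}")
        used += cost
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    prompt = "[S1] Did you read the new paper? [S2] (laughs) Of course I did."
    for chunk in chunk_turns(split_turns(prompt), max_seconds=20.0):
        print(chunk)
```

Splitting only at turn boundaries keeps each chunk a valid tagged transcript. The sketch does not address voice drift between chunks; as noted earlier, a fixed seed or audio prompts would still be needed for consistent speakers across separately generated segments.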