Dia - Open Source | Evermx

Dia

Nari Labs · Apache-2.0

TTS · 19.3K Stars · 1.7K Forks · 93 views

Dia is a 1.6B-parameter open-weights text-to-speech model from Nari Labs that specializes in something most TTS systems still struggle with: generating ultra-realistic two-speaker dialogue in a single pass. Released under the Apache 2.0 license, the project has racked up 19,294 GitHub stars and 1,683 forks, and a follow-up Dia2 model has already shipped with refined prosody and faster inference. Where conventional TTS stacks synthesize speakers one line at a time and rely on post-processing to glue them together, Dia models the entire dialogue jointly so turn-taking, overlap, and emotional consistency emerge from the same forward pass.

## Dialogue, Not Just Speech

The core interface uses inline speaker tags, `[S1]` and `[S2]`, that must alternate correctly through the transcript. A typical prompt looks like `[S1] Did you read the new paper? [S2] (laughs) Of course I did.` Dia handles the cadence, breath placement, and prosodic interaction between the two speakers natively, including realistic interruptions and back-channels. This is qualitatively different from concatenating two single-speaker XTTS or Fish-Speech outputs, where speakers always sound like they are reading separate scripts.

## Non-Verbal Vocalizations

Dia accepts a small but expressive vocabulary of non-verbal cues inside the transcript: `(laughs)`, `(clears throat)`, `(sighs)`, `(gasps)`, `(coughs)`, `(singing)`, and similar. These tokens render as actual vocalizations rather than text, which is exactly the missing ingredient for game NPCs, audio drama, character voiceover, and accessibility content where pure speech feels flat. The README is candid that these tokens sometimes yield unexpected output and benefit from generation-time seed control or rerolls.

## Voice Cloning Without Fine-Tuning

Voice conditioning works through audio prompts: the user supplies a short reference clip and Dia matches its timbre, pacing, and emotional register across the generated dialogue. There is no per-voice fine-tuning step (the model is fully zero-shot), which puts Dia in the same usage category as XTTS-v2 and Fish-Speech but with the unique advantage of conditioning two voices simultaneously from two reference clips. Speaker consistency across long generations does require seed fixing or audio prompts, since the model can drift without conditioning anchors.

## Hardware Footprint

The model has been tested on PyTorch 2.0+ with CUDA 12.6, and on a single RTX 4090 it reaches roughly 2.1x real-time at bfloat16 with `torch.compile`, requiring about 4.4 GB of VRAM. That puts Dia within reach of consumer GPUs and makes it viable for indie game studios, podcast tooling, and creator workflows that cannot justify hosted inference. CPU support is documented as pending, and quantized or MLX builds for Apple Silicon are community-driven rather than first-party.

## Trade-Offs and Constraints

Dia's specialization comes with sharp edges. Generation is English-only (there is no multilingual head), and the prompt length sweet spot is roughly 5 to 20 seconds of equivalent audio, so chapter-length narration requires chunking. The model is also strictly two-speaker; three- or four-way conversation requires interleaving multiple generations. Compared to VibeVoice's 90-minute multi-speaker mode or OmniVoice's 600-language coverage, Dia is the deliberately narrow specialist that nails the one thing (natural English dialogue with non-verbal cues) that most general models still mishandle.

## Ecosystem and Adoption

The Apache 2.0 license, modest hardware footprint, and clear dialogue-focused niche have made Dia a popular component in larger pipelines: it is increasingly bundled into audio-drama generators, podcast-from-blog tools, AI tabletop game masters, and language-learning conversation engines. Nari Labs maintains an inference repository, a Hugging Face Space for quick previews, and a Discord for community contributions, and the upgrade path from Dia 1 to Dia2 is largely a weight swap. For builders who specifically need lifelike English dialogue rather than monologue TTS, Dia remains the open default in May 2026.
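
The tagging and chunking constraints described above lend themselves to a small pre-processing step before any text reaches the model. The following is a minimal sketch in plain Python, not part of Dia itself: the helper names (`parse_turns`, `find_cues`, `chunk_turns`, `render`) are hypothetical, and the speaking rate of 2.5 words per second used to estimate audio length is an assumption chosen for illustration, not a figure from the project.

```python
import re
from typing import List, Tuple

# Hypothetical helpers for preparing a Dia-style transcript; none of this is Dia's API.
TAG_RE = re.compile(r"\[(S[12])\]")   # inline speaker tags [S1] / [S2]
CUE_RE = re.compile(r"\(([a-z ]+)\)")  # non-verbal cues such as (laughs)

Turn = Tuple[str, str]  # (speaker, text)


def parse_turns(transcript: str) -> List[Turn]:
    """Split a tagged transcript into turns and check that speakers alternate."""
    # With one capture group, re.split yields [prefix, tag, text, tag, text, ...].
    parts = TAG_RE.split(transcript)
    if parts[0].strip():
        raise ValueError("transcript must start with a speaker tag")
    turns = [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts), 2)]
    if not turns:
        raise ValueError("no speaker tags found")
    for (prev, _), (cur, _) in zip(turns, turns[1:]):
        if prev == cur:
            raise ValueError(f"speaker {cur} appears twice in a row")
    return turns


def find_cues(text: str) -> List[str]:
    """Return the non-verbal cues that appear in parentheses."""
    return CUE_RE.findall(text)


def estimate_seconds(text: str, words_per_second: float = 2.5) -> float:
    """Rough duration estimate; the speaking rate is an assumed value."""
    spoken = CUE_RE.sub(" ", text)  # cues are not counted as spoken words
    return len(spoken.split()) / words_per_second


def chunk_turns(turns: List[Turn], max_seconds: float = 20.0,
                words_per_second: float = 2.5) -> List[List[Turn]]:
    """Greedily group whole turns so each chunk stays near the length budget."""
    chunks: List[List[Turn]] = []
    current: List[Turn] = []
    total = 0.0
    for turn in turns:
        cost = estimate_seconds(turn[1], words_per_second)
        if current and total + cost > max_seconds:
            chunks.append(current)
            current, total = [], 0.0
        current.append(turn)
        total += cost
    if current:
        chunks.append(current)
    return chunks


def render(chunk: List[Turn]) -> str:
    """Serialize turns back into the inline-tag format."""
    return " ".join(f"[{speaker}] {text}" for speaker, text in chunk)
```

Splitting only on turn boundaries keeps each speaker's line intact, at the cost of letting a single very long turn exceed the budget. A chunk produced this way may also begin with `[S2]`; whether the model tolerates that is not covered by the text above and would need checking against the project's own documentation.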

Key Features

Single-pass two-speaker dialogue generation via [S1]/[S2] inline speaker tags
Native non-verbal vocalizations: laughs, sighs, coughs, gasps, clears throat, singing
Zero-shot voice cloning from short audio prompts with two simultaneous voices
1.6B-parameter open-weights model with a refined Dia2 follow-up release
Apache 2.0 license enabling commercial use and derivative work
Runs at ~2.1x real-time on a single RTX 4090 with ~4.4 GB VRAM at bfloat16
Active Hugging Face Space and Discord community for previews and contributions
Tight focus on dialogue rather than monologue, complementary to VibeVoice and OmniVoice
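
The throughput and memory figures in the list above can be turned into a rough planning estimate. This is a back-of-the-envelope sketch, assuming that "2.1x real-time" means about 2.1 seconds of audio produced per second of wall-clock time; the function names are hypothetical, and real numbers will vary with hardware, dtype, and compilation warm-up.

```python
def wall_clock_seconds(audio_seconds: float, realtime_factor: float = 2.1) -> float:
    """Estimate generation time, assuming the factor is audio seconds per wall-clock second."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


def fits_in_vram(available_gb: float, required_gb: float = 4.4) -> bool:
    """Check a GPU's free memory against the quoted ~4.4 GB bfloat16 footprint."""
    return available_gb >= required_gb


if __name__ == "__main__":
    # A 20-second dialogue chunk under the assumed 2.1x factor.
    print(f"{wall_clock_seconds(20.0):.1f} s of compute for 20 s of audio")
    print("8 GB card fits:", fits_in_vram(8.0))
```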

Related Projects

GPT-SoVITS · RVC-Boss · MIT · TTS · 58.9K Stars · 6.4K Forks · 53 views

VibeVoice

ChatTTS

Bark