Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
Voicebox is a local-first, open-source AI voice studio that brings the entire voice input/output stack onto a single machine. Positioned as a free alternative to ElevenLabs and WisprFlow combined, it lets users clone a voice from a few seconds of reference audio, synthesize speech across 23 languages and seven TTS engines, dictate into any text field with a global hotkey, and give MCP-aware AI agents a voice the user actually owns. With more than 33,000 GitHub stars under the MIT license, it has quickly become a reference point for privacy-respecting voice tooling.

## A Complete Local Voice Loop

The two dominant cloud incumbents each cover only half of the voice loop: ElevenLabs handles output (text-to-speech), while WisprFlow handles input (dictation). Voicebox unifies both halves and runs them locally. Audio captures, cloned voice data, and the underlying models never leave the user's machine, which makes the project attractive for anyone with privacy, compliance, or offline requirements. A bundled local LLM ties the two halves together, powering refinement modes and per-profile personas without any external API call.

## Seven TTS Engines Under One Roof

Rather than betting on a single model, Voicebox ships seven switchable TTS engines, each with different strengths. Qwen3-TTS (0.6B and 1.7B) delivers high-quality multilingual cloning with natural-language delivery instructions like "speak slowly" or "whisper." Qwen CustomVoice offers curated preset voices with no reference audio required. LuxTTS is a lightweight English engine that runs at roughly 150x realtime on CPU using about 1GB of VRAM. Chatterbox Multilingual covers the broadest language range, including Arabic, Hindi, Swahili, Turkish, and more, while Chatterbox Turbo adds paralinguistic emotion tags such as [laugh], [sigh], and [gasp]. Kokoro and HumeAI TADA round out the lineup with additional preset voices and expressive styles.
## Voice Cloning, Dictation, and Agent Speech

Voicebox supports zero-shot voice cloning from a short sample as well as 50+ curated preset voices. On the input side, a global dictation hotkey with push-to-talk and toggle modes feeds Whisper-based speech-to-text into any application, with accessibility-verified auto-paste on macOS. Perhaps its most forward-looking feature is agent voice output: a single MCP tool call (voicebox.speak) lets any MCP-aware agent (Claude Code, Cursor, or Cline) respond aloud in a cloned voice. Free-form personas can be attached to any voice profile, then invoked through Compose, Rewrite, or Respond modes driven by the bundled local LLM.

## Production Features and Native Performance

Beyond raw synthesis, Voicebox includes post-processing effects (pitch shift, reverb, delay, chorus, compression, filters), unlimited-length generation via auto-chunking with crossfade, and a multi-track Stories editor for podcasts, conversations, and narration. It exposes a REST API plus a built-in MCP server so developers can wire voice I/O into their own apps and agents. The desktop application is built with Tauri and Rust rather than Electron, giving it a small footprint and native performance, and it runs across macOS (MLX/Metal), Windows (CUDA), Linux, AMD ROCm, and Intel Arc.

## Considerations

As a young and fast-moving project, Voicebox is still filling gaps: Linux currently requires building from source because pre-built binaries are not yet available, and the breadth of seven engines means newcomers face some choice paralysis when picking the right model for a task. Local inference quality and speed also depend heavily on available GPU or accelerator hardware. Even so, for users who want a single, private, end-to-end voice studio without recurring cloud subscriptions, Voicebox covers an unusually complete feature surface.
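
For readers curious how "auto-chunking with crossfade" (mentioned under Production Features) can work in principle, here is a minimal sketch. It is not Voicebox's actual implementation: the function names `split_text` and `crossfade_concat`, the sentence-based splitting rule, and the linear fade are all assumptions chosen for illustration. The idea is to split long text into short chunks a TTS engine can handle, synthesize each chunk separately, and blend a few overlapping samples at every seam so the joins do not click.

```python
import re


def split_text(text, max_chars=200):
    """Group sentences into chunks of at most roughly max_chars characters.

    Hypothetical helper: a single sentence longer than max_chars is kept whole.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def crossfade_concat(chunks, overlap):
    """Join lists of audio samples, blending `overlap` samples at each seam.

    Uses a linear fade (an assumption): the outgoing chunk fades out while
    the incoming chunk fades in over the overlapping region.
    """
    if not chunks:
        return []
    out = list(chunks[0])
    for chunk in chunks[1:]:
        n = min(overlap, len(out), len(chunk))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming chunk
            out[start + i] = out[start + i] * (1.0 - w) + chunk[i] * w
        out.extend(chunk[n:])
    return out
```

In a real pipeline each text chunk from `split_text` would be passed to a TTS engine, and the resulting sample arrays would be handed to `crossfade_concat`. Each seam shortens the total output by `overlap` samples, since the overlapping regions are merged rather than appended.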