Open-LLM-VTuber

Multimodal · 9.0K Stars · 1.1K Forks · 71 views

Open-LLM-VTuber is a Python project that turns any LLM into a hands-free voice companion with a Live2D animated avatar, running locally on Windows, macOS, and Linux. With 9,000+ GitHub stars and 1,100+ forks, it sits at the intersection of voice agents, on-device LLM use, and the long tail of AI companion tooling. It is one of the few open projects that wires real-time ASR, LLM inference, TTS, and an expressive 2D avatar into a single configurable pipeline.

## What It Actually Ships

The project is not a chat UI bolted onto Whisper. It is a modular voice-loop runtime where each stage is swappable. Speech recognition can be Sherpa-onnx, FunASR, Faster-Whisper, Whisper.cpp, Azure ASR, or Groq Whisper. LLM inference can be Ollama, vLLM, LM Studio, a local GGUF model, or a remote OpenAI/Gemini/Claude/Mistral/DeepSeek endpoint. Speech synthesis can be Sherpa-onnx, MeloTTS, Coqui-TTS, GPTSoVITS, Fish Audio, Edge TTS, Azure TTS, or pyttsx3. The Live2D layer takes emotion tags emitted by the LLM and maps them to facial expressions on a customizable model.

## Voice Interruption and Visual Perception

Two features set the runtime apart from generic voice chat scripts. The first is voice interruption with echo handling: the system actively prevents the model from hearing its own TTS output, so users can cut the assistant off mid-sentence without triggering a feedback loop. The second is visual perception, which lets the agent take screenshots, share the screen, or read from a webcam, then pass those frames to a multimodal backend for grounding. Combined with persistent chat history, this turns the avatar from a tech demo into something usable as a coding companion, for language practice, or in accessibility scenarios.

## Local-First, Offline-Capable

The defining design goal is that the entire stack can run without an internet connection.
With Ollama or a local GGUF model handling generation, Sherpa-onnx or Faster-Whisper handling ASR, and Sherpa-onnx or MeloTTS handling synthesis, the loop never touches a remote API. That is unusual in the AI companion space, where most consumer apps assume a cloud LLM and a cloud TTS endpoint. For users who care about latency, privacy, or running on a plane, this is the actual differentiator.

## Two Frontends

A browser-based web client handles the bulk of usage, but the project also ships a desktop client with a transparent pet mode that lets the Live2D model float over the user's desktop. Character appearance, persona prompt, and voice can all be swapped, including bring-your-own Live2D models and cloned voices via the GPTSoVITS or Fish Audio backends.

## Realistic Use Cases

The most credible deployments are language practice partners, coding ride-alongs that read the screen and narrate, and accessibility front-ends that turn voice into structured LLM calls. The project does not pretend to ship a finished commercial product, but the modular config means a competent engineer can stand up a working private voice agent over a weekend.

## Limitations

Long-term memory is currently disabled, which is a real gap for a companion-style use case where session continuity is the point. The license is listed as Other (NOASSERTION) rather than a clean OSI license, which complicates redistribution and commercial reuse. Live2D model assets themselves are not redistributable under most commercial licenses, so users who want a custom avatar will need to source one separately. Finally, the runtime is fundamentally compute-bound on consumer hardware: a quantized 7B model plus Whisper plus a neural TTS engine will saturate a mid-range GPU and push response latency to a second or more, which is workable but not yet at the level of a polished commercial voice assistant.
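To see where that latency goes on a particular machine, each stage can be timed separately. The following is a generic timing sketch, not part of the project: `timed_stage`, the `timings` dictionary, and the placeholder stage bodies are hypothetical names standing in for real ASR, LLM, and TTS calls.

```python
import time
from contextlib import contextmanager

# Wall-clock duration of each stage, in seconds, keyed by stage name.
timings: dict[str, float] = {}


@contextmanager
def timed_stage(name: str):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start


# Placeholder work only; swap each body for a real backend call.
with timed_stage("asr"):
    text = "hello"
with timed_stage("llm"):
    reply = text.upper()
with timed_stage("tts"):
    audio = reply.encode("utf-8")

total = sum(timings.values())
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.3f} ms")
print(f"total: {total * 1000:.3f} ms")
```

Comparing the per-stage numbers shows which backend to swap first when the total drifts past a second.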
As a reference architecture for what an open, local, multimodal voice agent actually looks like end to end, Open-LLM-VTuber is one of the most complete public examples available.
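To make the swappable-stage idea concrete, here is a minimal sketch of an ASR, LLM, and TTS loop with emotion tags pulled out for the avatar layer. It is an illustration under stated assumptions, not the project's code: the `ASR`, `LLM`, and `TTS` protocols, the `VoiceLoop` and `Turn` classes, the stub backends, and the bracketed `[joy]` tag format are all hypothetical, and the real interfaces and tag syntax may differ.

```python
import re
from dataclasses import dataclass
from typing import Protocol


class ASR(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def reply(self, prompt: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


# Assumed tag format: the LLM emits tags such as "[joy]" inline in its reply.
EMOTION_TAG = re.compile(r"\[(\w+)\]")


@dataclass
class Turn:
    user_text: str
    spoken_text: str     # reply with emotion tags stripped, sent to TTS
    emotions: list[str]  # tags in order of appearance, for the avatar layer
    audio: bytes


@dataclass
class VoiceLoop:
    asr: ASR
    llm: LLM
    tts: TTS

    def run_turn(self, audio: bytes) -> Turn:
        user_text = self.asr.transcribe(audio)
        raw_reply = self.llm.reply(user_text)
        emotions = EMOTION_TAG.findall(raw_reply)
        # Remove the tags, then collapse the leftover whitespace.
        spoken = " ".join(EMOTION_TAG.sub(" ", raw_reply).split())
        return Turn(user_text, spoken, emotions, self.tts.synthesize(spoken))


# Stub backends so the sketch runs without any models installed.
class EchoASR:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class CannedLLM:
    def reply(self, prompt: str) -> str:
        return f"[joy] You said: {prompt} [neutral]"


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


loop = VoiceLoop(asr=EchoASR(), llm=CannedLLM(), tts=BytesTTS())
turn = loop.run_turn(b"hello")
print(turn.emotions, turn.spoken_text)
```

Because `VoiceLoop` depends only on the three protocols, replacing a stub with a wrapper around a real backend changes one constructor argument and nothing else.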