Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
RealtimeSTT is an MIT-licensed Python library that has become the de facto building block for live speech interfaces in open-source projects. With 9,900+ GitHub stars and a v1.0.2 release on May 31, 2026, it sits between the microphone and a transcription engine and gives developers the parts that are tedious to build correctly: voice activity detection, wake word activation, real-time partial transcripts, and a FastAPI streaming server.

## What RealtimeSTT Actually Does

The library does not ship its own ASR model. Instead it wires a microphone (or any audio stream) into a pluggable transcription backend, with faster_whisper as the recommended default. Around that core it adds the production glue: a VAD layer to detect when the user starts and stops speaking, an optional wake word layer to trigger recording only on a keyword, and a callback system that surfaces intermediate and final transcripts to the host application. Most voice assistants and live caption projects end up reimplementing this combination, which is why the project has become the standard dependency for this layer.

## Dual VAD: WebRTC and Silero

Voice activity detection determines when the recorder starts capturing and when the user has finished speaking. RealtimeSTT supports both WebRTC VAD, which is lightweight and CPU-friendly, and Silero VAD, which is more accurate in noisy environments. Developers can switch between them based on the deployment target, or use them in combination for cascaded sensitivity. Tunable parameters expose the silence thresholds and post-speech padding that determine perceived latency.

## Wake Word Activation

Optional wake word support lets the library stay dormant until a keyword is spoken. Two providers are supported: Porcupine for commercial-grade keyword detection and OpenWakeWord for fully open-source operation.
This is the layer that turns RealtimeSTT into the foundation for always-on voice assistants without keeping the transcription engine loaded continuously.

## Pluggable Transcription Engines

The transcription layer is engine-agnostic. faster_whisper is the recommended default for accuracy and speed, but the library also wraps whisper.cpp, OpenAI Whisper, Moonshine, sherpa-onnx, Kroko-ONNX, and several transformer-based backends. This means a project can start with faster_whisper on a development machine and swap to Moonshine for a Raspberry Pi deployment without rewriting the audio pipeline.

## FastAPI Streaming Server

The v1 line ships a FastAPI server that exposes RealtimeSTT over WebSockets with multi-user session isolation. Each connected browser gets its own recorder, VAD state, and transcription queue, which makes the server suitable for multi-tenant deployments without a separate per-user process. The server is the path most teams take when integrating RealtimeSTT into a web-based product rather than embedding it in a desktop application.

## Event Callbacks

The API exposes lifecycle callbacks for recording start and stop, VAD activation, partial transcripts, final transcripts, and wake word detection. This is what allows the library to drive UI affordances (microphone indicators, partial text rendering, end-of-utterance signals) without polling. For voice agent applications the partial transcript callback is what enables the model to start reasoning before the user finishes speaking.

## Limitations

RealtimeSTT is a glue library, not an ASR model. Transcription quality is bounded by the chosen engine, and tuning faster_whisper for a specific accent, language, or noise profile is the developer's responsibility. PortAudio is a hard dependency on Linux and macOS, which adds a system package step to deployment and occasionally breaks in container builds.
Multi-user scaling via the FastAPI server is bounded by the underlying GPU or CPU running the ASR engine, so high-concurrency deployments still need a separate model-serving layer behind the streaming front end. The library does not implement diarization, so multi-speaker meeting transcription requires a separate post-processing stage. Finally, wake word accuracy is gated by the chosen provider and the training audio for the keyword, so custom hot words may need substantial tuning before they hit production reliability.

Within those caveats, RealtimeSTT is the most efficient way in 2026 to add voice-in to a Python project without writing the audio plumbing from scratch.
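
To make the earlier point about silence thresholds and perceived latency concrete, here is a minimal, library-independent sketch of an end-of-utterance state machine driven by per-frame VAD decisions. It is not RealtimeSTT's implementation or API: the class `UtteranceDetector`, the parameters `frame_ms` and `post_speech_silence_ms`, and the callback names are hypothetical, chosen only to show how a longer silence window tolerates pauses at the cost of a later stop signal.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class UtteranceDetector:
    """Toy end-of-utterance state machine (illustrative, not RealtimeSTT code)."""

    frame_ms: int = 30                  # assumed duration of one audio frame
    post_speech_silence_ms: int = 600   # assumed silence needed to close an utterance
    on_start: Optional[Callable[[int], None]] = None  # hypothetical callback
    on_stop: Optional[Callable[[int], None]] = None   # hypothetical callback
    _recording: bool = field(default=False, init=False)
    _silence_ms: int = field(default=0, init=False)

    def feed(self, frame_index: int, is_speech: bool) -> None:
        """Consume one VAD decision and fire callbacks on state changes."""
        if is_speech:
            # Any speech frame resets the silence counter.
            self._silence_ms = 0
            if not self._recording:
                self._recording = True
                if self.on_start:
                    self.on_start(frame_index)
        elif self._recording:
            # Accumulate trailing silence only while an utterance is open.
            self._silence_ms += self.frame_ms
            if self._silence_ms >= self.post_speech_silence_ms:
                self._recording = False
                self._silence_ms = 0
                if self.on_stop:
                    self.on_stop(frame_index)


def run_demo(post_speech_silence_ms: int) -> List[Tuple[str, int]]:
    """Replay a fixed speech/silence pattern and collect the emitted events."""
    events: List[Tuple[str, int]] = []
    detector = UtteranceDetector(
        frame_ms=30,
        post_speech_silence_ms=post_speech_silence_ms,
        on_start=lambda i: events.append(("start", i)),
        on_stop=lambda i: events.append(("stop", i)),
    )
    # 10 speech frames, a 5-frame pause, 10 more speech frames, then 40 silent frames.
    flags = [True] * 10 + [False] * 5 + [True] * 10 + [False] * 40
    for index, flag in enumerate(flags):
        detector.feed(index, flag)
    return events


if __name__ == "__main__":
    print(run_demo(600))  # long window: the short pause stays inside one utterance
    print(run_demo(120))  # short window: the same pause splits it into two
```

With the 600 ms window the 150 ms pause is absorbed and the stop event arrives 20 frames after speech ends. With the 120 ms window the stop event arrives after 4 frames, but the pause splits the input into two utterances. This is the trade-off the tunable silence parameters expose.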