Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
speaches is an open-source, self-hosted server that speaks the OpenAI Audio API on the wire while running fully local, faster-whisper-powered speech-to-text underneath. Released by speaches-ai under the MIT license, the project has crossed 3,300 GitHub stars and 398 forks by becoming the no-friction drop-in for teams who want OpenAI Whisper, OpenAI TTS, and OpenAI translation endpoints without sending audio to OpenAI. The same server also exposes piper and Kokoro for text-to-speech, so a single container handles transcription, translation, and synthesis behind one OpenAI-shaped URL.

## Why a Drop-in OpenAI Audio Replacement Matters in 2026

The Whisper API is, by an enormous margin, the most copied audio interface in the industry. Almost every voice tool released since 2023 either targets it directly or supports it as a fallback. That is exactly what speaches exploits. Existing client code that already calls `audio.transcriptions.create` keeps working unchanged; you just point the base URL at the speaches container. There is no SDK to learn, no proprietary schema to map, and no vendor-side privacy review to pass before a pilot. For regulated industries, regions with data-residency rules, or anyone trying to cut per-minute audio costs, that compatibility is the entire pitch.

## Streaming Transcription Without the Whisper Latency Penalty

Classic Whisper is batch-oriented and not built for live captioning. speaches leans on faster-whisper, a CTranslate2 reimplementation of Whisper that runs roughly 4x faster than the reference model on the same GPU and as much as 2x faster on CPU. On top of that, the server exposes a true streaming transcription endpoint, so partial transcripts arrive while the speaker is still talking. Combined with the project's Docker images for CUDA, CPU, and ROCm, the same deployment handles batch podcast jobs at night and live meeting captions during the day.

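
As a rough illustration of the drop-in claim, the sketch below builds the multipart request that an OpenAI-style client sends to the transcriptions endpoint, using only the Python standard library. The base URL, port, and model id are assumptions for illustration rather than values confirmed by the project, and `silent_wav`, `build_transcription_request`, and `transcribe` are hypothetical helper names.

```python
import io
import json
import urllib.request
import uuid
import wave

# Assumption: a speaches container is reachable here; host and port are placeholders.
SPEACHES_BASE_URL = "http://localhost:8000/v1"
# Assumption: a faster-whisper model id that the server has available.
MODEL_ID = "Systran/faster-whisper-small"


def silent_wav(seconds: float = 1.0, rate: int = 16000) -> bytes:
    """Return a mono 16-bit WAV of silence, used only as stand-in audio."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


def build_transcription_request(
    base_url: str, audio: bytes, model: str
) -> urllib.request.Request:
    """Build an OpenAI-style multipart POST to <base_url>/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    parts = [
        # Plain form field carrying the model name.
        f'--{boundary}\r\nContent-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n".encode(),
        # File field carrying the raw audio bytes.
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
            'filename="audio.wav"\r\nContent-Type: audio/wav\r\n\r\n'
        ).encode() + audio + b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ]
    # Add an Authorization header here if the deployment requires an API key.
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/audio/transcriptions",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(base_url: str, audio: bytes, model: str = MODEL_ID) -> str:
    """Send the request and return the "text" field of the JSON response."""
    request = build_transcription_request(base_url, audio, model)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)["text"]
```

With a container running, `transcribe(SPEACHES_BASE_URL, silent_wav())` would exercise the endpoint. The point of the sketch is that `base_url` is the only value that changes when moving between a hosted API and a local server, assuming the server accepts the same `model` and `file` form fields.
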
## Three Audio Workloads, One Container

Most teams that adopt speaches do so to replace three separate services with one process. Transcription is handled by faster-whisper, with model selection from tiny through large-v3 controlled by an environment variable. Translation reuses Whisper's built-in translation mode, exposed through the OpenAI translations endpoint. Speech generation is provided by piper for low-latency neural voices and Kokoro for higher-quality natural reads, both reachable through the OpenAI TTS API. The unified surface removes the usual reverse-proxy gymnastics that come with mixing three different audio vendors.

## Deployment, Docker, and Docker Compose

The repository ships first-class Docker and Docker Compose support, which is reflected in its topic tags. A single `docker compose up` brings the server online, downloads the configured Whisper model on first run, and exposes the OpenAI-compatible HTTP API on a local port. The project documents GPU passthrough for NVIDIA and AMD, CPU-only fallbacks for small models, and environment-variable knobs for concurrency, model caching, and quantization. Production deployments typically front speaches with nginx or a load balancer and scale horizontally by adding more containers.

## Cost and Privacy Math

For a team transcribing thousands of hours of audio per month, the OpenAI Whisper API at its current per-minute rate can run into five figures monthly. A single mid-range GPU running speaches with faster-whisper large-v3 can match that throughput at the cost of the hardware plus electricity. Audio never leaves the network boundary, which removes the BAA/DPA paperwork that often blocks healthcare, finance, and legal teams from using cloud audio APIs at all.

## Limitations

speaches is honest about what it is and is not. It is not a Whisper accuracy improvement: the underlying model is still Whisper, so any benchmark deltas are about throughput and integration, not WER on hard accents.
The OpenAI Audio API surface is also a moving target, and speaches lags new endpoints by some weeks. Speaker diarization is not built in, and for very low-latency live captioning workloads the project recommends pairing with a VAD frontend rather than relying on the streaming endpoint alone.

## Who Should Use speaches

speaches is the right choice for engineering teams that already have OpenAI Audio API clients in production and want to flip a base URL to bring inference in-house, for regulated industries that cannot send raw audio to a U.S. cloud provider, and for cost-sensitive operators running large batch transcription workloads. Hobbyists who want a single Docker container that handles transcription, translation, and TTS through one familiar API will also find speaches faster to adopt than wiring three separate projects together.
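
For cost-sensitive operators, the break-even reasoning behind the cost argument can be sketched as a few lines of arithmetic. Every number below is a placeholder assumption, not a quoted price or a measured figure, and the function names are hypothetical. The model also ignores operations time and assumes the GPU can actually keep up with the audio volume.

```python
def hosted_api_monthly_usd(audio_hours: float, usd_per_audio_minute: float) -> float:
    """Monthly bill for a hosted API that charges per minute of audio."""
    return audio_hours * 60 * usd_per_audio_minute


def self_hosted_monthly_usd(
    gpu_price_usd: float,
    amortization_months: float,
    avg_power_watts: float,
    usd_per_kwh: float,
    hours_powered_on: float = 730.0,  # roughly one month of continuous uptime
) -> float:
    """Amortized hardware cost plus electricity for one always-on GPU host."""
    hardware = gpu_price_usd / amortization_months
    electricity = (avg_power_watts / 1000.0) * hours_powered_on * usd_per_kwh
    return hardware + electricity


def break_even_audio_hours(self_hosted_usd: float, usd_per_audio_minute: float) -> float:
    """Monthly audio volume above which self-hosting is cheaper than the API."""
    return self_hosted_usd / (60 * usd_per_audio_minute)


# Placeholder inputs: replace with your own quotes and measurements.
ASSUMED_USD_PER_AUDIO_MINUTE = 0.01
local = self_hosted_monthly_usd(
    gpu_price_usd=2400.0,
    amortization_months=24.0,
    avg_power_watts=500.0,
    usd_per_kwh=0.20,
)
threshold = break_even_audio_hours(local, ASSUMED_USD_PER_AUDIO_MINUTE)
print(f"self-hosted: ${local:.2f}/month, break-even at {threshold:.0f} audio hours/month")
```

Below the break-even volume the hosted API is cheaper on these terms, so the privacy and data-residency arguments carry the decision instead of cost.
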