Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
## Introduction

Speaches is an OpenAI API-compatible server for streaming transcription, translation, and speech generation. Positioned as "Ollama, but for TTS/STT models," Speaches has accumulated 3,100+ GitHub stars, 34 contributors, and an MIT license. The project bridges the gap between local speech model deployment and the standardized OpenAI API ecosystem, allowing any tool or SDK built for OpenAI's speech endpoints to work seamlessly with self-hosted models.

For teams that need speech capabilities without sending audio data to external APIs, Speaches provides a drop-in replacement that runs entirely on local infrastructure with both GPU and CPU support.

## Architecture and Design

Speaches is built around the principle of API compatibility. Rather than inventing a new interface, it mirrors OpenAI's speech endpoints exactly, so existing client code works without modification.

| Component | Technology | Purpose |
|-----------|------------|---------|
| Speech-to-Text | faster-whisper | Streaming transcription and translation |
| Text-to-Speech | Kokoro (TTS Arena #1) | High-quality speech synthesis |
| Text-to-Speech | Piper | Lightweight, fast speech generation |
| API Layer | FastAPI | OpenAI-compatible REST endpoints |
| Deployment | Docker / Docker Compose | Containerized deployment |

**Dynamic Model Loading**: Unlike static model servers, Speaches loads and offloads models on demand. Specify the model name in your API request, and it will be downloaded and loaded automatically. This eliminates the need for pre-configuration and reduces idle memory usage.

**Streaming Transcription**: Audio transcription is streamed via Server-Sent Events (SSE) as the audio is processed. There is no need to wait for the entire audio file to be transcribed before receiving results, which is critical for real-time applications.
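An SSE stream like this is consumed line by line on the client side. Below is a minimal sketch of that accumulation step, assuming the server emits OpenAI-style `data:` events whose JSON payload carries a `text` delta and a terminal `data: [DONE]` sentinel; the exact field names are assumptions for illustration, not taken from the Speaches docs:

```python
import json

def collect_transcript(sse_lines):
    """Accumulate text deltas from an SSE transcription stream.

    sse_lines: an iterable of decoded lines, e.g. the streamed response
    body of a POST to /v1/audio/transcriptions with streaming enabled.
    """
    parts = []
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives, comments, and event names
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(payload)
        if "text" in event:
            parts.append(event["text"])
    return "".join(parts)

# Example event stream as it might arrive over the wire:
demo = [
    'data: {"text": "Hello "}',
    'data: {"text": "world"}',
    "data: [DONE]",
]
print(collect_transcript(demo))  # -> Hello world
```

The key property for real-time use is that each delta is usable the moment it arrives; a UI can render partial transcripts instead of waiting for the accumulated result.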
## Key Capabilities

**Text-to-Speech**: Generate spoken audio from text using Kokoro, the #1 ranked model on TTS Arena, or Piper for lighter-weight synthesis. Supports multiple voices and languages.

**Speech-to-Text**: Transcribe audio in real-time using faster-whisper, an optimized Whisper implementation. Supports streaming, batch processing, and multilingual transcription.

**Audio-to-Audio**: Handle speech-to-speech interactions where audio input is processed and audio output is generated, enabling conversational AI applications.

**Chat Completions with Audio**: Generate spoken audio summaries from text inputs through the chat completions endpoint, combining language understanding with speech synthesis.

**Sentiment Analysis on Audio**: Process audio recordings to extract text-based analysis, enabling applications like call center analytics and meeting summarization.

**Realtime API**: WebSocket-based realtime API for low-latency speech interactions, matching OpenAI's Realtime API specification.

## Deployment

Speaches deploys via Docker with a single command:

```bash
docker compose up
```

The server then accepts requests at the standard OpenAI API endpoints:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1")

audio = client.audio.speech.create(
    model="kokoro",
    voice="af_heart",
    input="Hello from Speaches!",
)
```

## Limitations

The latest release (v0.9.0-rc.3) is still a release candidate, indicating the API surface may change. GPU support requires NVIDIA CUDA drivers and compatible hardware. Model quality depends on the underlying engines (faster-whisper, Kokoro, Piper) and may not match OpenAI's proprietary models in all scenarios. Dynamic model loading adds latency on first request as models are downloaded. The project has 89 open issues, reflecting active development but also unresolved edge cases. Real-time streaming adds complexity to error handling compared to batch processing.
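The same client shape works in the other direction, for speech-to-text. Here is a hedged sketch of transcribing a local file through the OpenAI-compatible transcriptions endpoint; the model name `Systran/faster-whisper-small` and the dummy API key are illustrative assumptions, and any faster-whisper checkpoint the server can fetch should behave the same way:

```python
def transcription_params(model, language=None):
    """Build keyword arguments for client.audio.transcriptions.create."""
    params = {"model": model}
    if language:
        params["language"] = language  # ISO 639-1 hint, e.g. "en"
    return params

def transcribe_file(path, model="Systran/faster-whisper-small"):
    """Send a local audio file to a running Speaches instance."""
    from openai import OpenAI  # deferred so the sketch imports without the SDK

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            file=audio_file,
            **transcription_params(model),
        )
    return result.text
```

Because models load on demand, the first call for a given checkpoint blocks while it downloads; subsequent calls reuse the already-loaded model.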
## Who Should Use This

Speaches is ideal for teams that need OpenAI-compatible speech APIs but must keep audio data on-premises for privacy or compliance reasons. Developers building voice-enabled applications who want a local development environment without API costs will find it invaluable. Organizations running Ollama for LLM inference can pair it with Speaches for a complete local AI stack covering both text and speech. Startups prototyping conversational AI products benefit from the zero-cost, self-hosted model.