



WhisperX

m-bain · BSD-2-Clause


STT · 22.7K Stars · 2.3K Forks · 2 views

WhisperX is an open-source automatic speech recognition (ASR) system that wraps OpenAI's Whisper to deliver fast transcription with accurate word-level timestamps and speaker diarization. With over 22,000 GitHub stars, it has become the go-to choice for developers who need not just *what* was said, but *exactly when* each word was spoken and *who* said it. That combination is the foundation for subtitles, meeting notes, and audio search.

## Word-Level Timestamps

Vanilla Whisper produces timestamps only at the segment level, which is often too coarse for subtitling or word-accurate editing. WhisperX adds a forced phoneme alignment stage using a separate wav2vec2 model, snapping each transcribed word to its precise position in the audio. The result is timing accurate enough to drive karaoke-style captions or to cut audio on exact word boundaries.

## 70x Real-Time Speed

Speed is a defining feature. By combining voice-activity detection (VAD) with batched inference on the faster-whisper backend, WhisperX transcribes audio at up to 70x real-time using the large-v2 model. VAD-based segmentation also reduces hallucination on silent passages, and the faster-whisper engine keeps memory modest: large-v2 runs in under 8 GB of GPU memory with a beam size of 5.

## Speaker Diarization

For multi-speaker recordings, WhisperX integrates pyannote-audio to label who spoke each segment. This makes it well suited to interviews, podcasts, and meetings, where separating speakers is as important as the transcript itself. The diarization output is merged back into the word-level transcript so every word carries both a timestamp and a speaker tag.

## Practical Use

WhisperX is distributed as a Python package and command-line tool, making it easy to drop into existing pipelines. It supports multiple languages, exports common subtitle formats such as SRT and VTT, and can run on consumer GPUs. The accompanying research paper documents the alignment and batching approach for those who want to understand the internals.

## Considerations

The code is released under the permissive BSD-2-Clause license, but the speaker diarization step depends on pyannote models that require accepting their own license and a Hugging Face access token, which adds a setup step. As a research-oriented wrapper, it also inherits Whisper's limitations on heavily accented or noisy audio. Even so, for anyone building transcription with precise timing and speaker labels, WhisperX is among the most capable and widely adopted open options available.
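
To make the merge step from the Speaker Diarization section concrete, here is a minimal, library-free sketch of how word timestamps and speaker turns could be combined and exported as SRT. It does not call WhisperX itself: `Word`, `Turn`, `assign_speakers`, and `to_srt` are hypothetical names invented for this illustration, and the "largest time overlap wins" rule is an assumption about how such a merge might work, not a description of the project's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds
    speaker: Optional[str] = None


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def assign_speakers(words: List[Word], turns: List[Turn]) -> List[Word]:
    """Tag each word with the speaker whose turn overlaps it the most."""
    for word in words:
        best_overlap = 0.0
        for turn in turns:
            overlap = min(word.end, turn.end) - max(word.start, turn.start)
            if overlap > best_overlap:
                best_overlap = overlap
                word.speaker = turn.speaker
    return words


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(words: List[Word]) -> str:
    """Group consecutive same-speaker words into numbered SRT cues."""
    cues: List[List[Word]] = []
    for word in words:
        if cues and cues[-1][-1].speaker == word.speaker:
            cues[-1].append(word)
        else:
            cues.append([word])
    blocks = []
    for index, cue in enumerate(cues, start=1):
        label = cue[0].speaker or "UNKNOWN"
        text = " ".join(w.text for w in cue)
        blocks.append(
            f"{index}\n{srt_time(cue[0].start)} --> {srt_time(cue[-1].end)}\n"
            f"[{label}] {text}"
        )
    return "\n\n".join(blocks) + "\n"


# Made-up sample data standing in for aligned words and diarization turns.
words = [
    Word("Hello", 0.00, 0.42),
    Word("there.", 0.48, 0.90),
    Word("Hi!", 1.30, 1.62),
]
turns = [Turn("SPEAKER_00", 0.0, 1.0), Turn("SPEAKER_01", 1.2, 2.0)]
srt = to_srt(assign_speakers(words, turns))
print(srt)
```

Starting a new cue at every speaker change keeps the example short. A real subtitle exporter would likely also cap cue length and duration, which this sketch leaves out.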