Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
CSM (Conversational Speech Model) is an open-source speech generation model released by Sesame AI Labs. It generates audio from text and audio inputs by producing RVQ (Residual Vector Quantization) audio codes: a Llama-backbone architecture models the sequence, and a smaller audio decoder outputs the final Mimi audio codes. The model has attracted 14,500 GitHub stars and 1,500 forks, with the 1B-parameter checkpoint hosted on Hugging Face under the Apache 2.0 license.

CSM stands out in the text-to-speech landscape for generating audio that sounds genuinely conversational. Early demos gained widespread attention for reproducing natural pauses, realistic intonation shifts, and speech rhythm that closely resembles actual human conversation. As of version 4.52.1, CSM is natively integrated into Hugging Face Transformers.

## Llama-Based Architecture for Speech

CSM's core innovation is applying the Llama language model architecture to audio generation. The system uses a large Llama backbone for high-level sequence modeling, then passes the output through a smaller, specialized audio decoder that produces Mimi audio codes. This two-stage architecture leverages the sequence-modeling strengths of transformer-based LLMs while keeping the audio decoding component lightweight and efficient.

## Context-Aware Speech Generation

CSM's output quality improves meaningfully when the model is given conversational context. By passing previous turns as audio context via the Segment object API, the model adapts its generated speech to better match the tone, pacing, and emotional register of the exchange.

## Multi-Speaker Support

CSM produces varied, distinguishable voices across speakers without requiring per-speaker fine-tuning. The model can generate audio for multiple characters in a conversation with consistent voice differentiation throughout the exchange.
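The RVQ audio codes mentioned above can be illustrated with a toy residual vector quantizer: each stage picks the codeword closest to whatever error the previous stage left behind, so a few small codebooks approximate a vector far better than one codebook alone. This is a conceptual sketch with made-up two-dimensional codebooks, not Mimi's actual (learned, much larger) codec:

```python
def nearest(codebook, residual):
    # Index of the codeword with the smallest squared distance to the residual.
    return min(range(len(codebook)),
               key=lambda i: sum((c - r) ** 2 for c, r in zip(codebook[i], residual)))

def rvq_encode(x, codebooks):
    # Each stage quantizes the residual left by the previous stage.
    residual, codes = list(x), []
    for cb in codebooks:
        i = nearest(cb, residual)
        codes.append(i)
        residual = [r - c for r, c in zip(residual, cb[i])]
    return codes

def rvq_decode(codes, codebooks):
    # Reconstruction is the sum of the selected codewords across stages.
    out = [0.0] * len(codebooks[0][0])
    for cb, i in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, cb[i])]
    return out

# Two stages: a coarse codebook, then a finer one for the leftover error.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]],
]
codes = rvq_encode([1.2, 0.3], codebooks)   # → [1, 3]
approx = rvq_decode(codes, codebooks)       # → [1.25, 0.25]
```

In CSM, the Llama backbone and audio decoder predict sequences of such per-frame code indices autoregressively, which the Mimi codec then turns back into a waveform.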
## Hugging Face Transformers Integration

Native integration into Hugging Face Transformers as of version 4.52.1 means CSM can be loaded through the standard Transformers API. For teams with existing Transformers-based pipelines, adding CSM-based speech synthesis requires minimal additional engineering.

## Accessible Hardware Requirements

At 1B parameters, CSM is sized to run on a mid-range CUDA-compatible GPU (tested on CUDA 12.4 and 12.6). This makes the model accessible to individual researchers and small teams without requiring cloud GPU clusters. Sesame's stated roadmap includes expansion to 20+ languages in future releases.
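The Transformers integration described above can be sketched as a small helper. The class and method names (`CsmForConditionalGeneration`, `processor.save_audio`, the `[0]` speaker-id prompt prefix) follow the Transformers CSM documentation, but treat the details as assumptions to verify against your installed version; the import is deferred so the helper can be defined without a GPU or the checkpoint present:

```python
def synthesize(text, out_path="example.wav", device="cuda"):
    """Generate speech for `text` with CSM via Hugging Face Transformers.

    Assumes transformers >= 4.52.1, a CUDA GPU, and network access to fetch
    the sesame/csm-1b checkpoint on first use.
    """
    # Deferred import: lets this helper be defined without transformers installed.
    from transformers import AutoProcessor, CsmForConditionalGeneration

    model_id = "sesame/csm-1b"
    processor = AutoProcessor.from_pretrained(model_id)
    model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

    inputs = processor(text, add_special_tokens=True).to(device)
    audio = model.generate(**inputs, output_audio=True)
    processor.save_audio(audio, out_path)
    return out_path

# CSM prompts prefix each utterance with its speaker id in brackets.
prompt = "[0]Hello from CSM."
```

Calling `synthesize(prompt)` would download the checkpoint and write `example.wav`; different bracketed speaker ids yield the distinguishable voices described in the multi-speaker section.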