Open Source
Explore the latest AI open-source projects from GitHub and HuggingFace.
Supertonic is an open-source, on-device text-to-speech engine from Supertone Inc. that ships a 99-million-parameter ONNX model capable of generating speech in 31 languages without ever calling a cloud API. Released under an MIT code license and an OpenRAIL-M model license, Supertonic has climbed to 6,000+ GitHub stars and 599 forks in roughly six months, becoming the reference implementation for compact, privacy-first TTS that runs everywhere from a Raspberry Pi to a WebGPU browser tab.

## What Supertonic Is

Most popular open TTS systems in 2026 sit in the 0.7B to 2B parameter range and assume a GPU or hosted inference endpoint. Supertonic takes the opposite bet, compressing a high-quality multilingual TTS model down to roughly 99M parameters so it can run natively via ONNX Runtime on phones, laptops, e-readers, and even microcontroller-class hardware. The result is a system where audio is generated entirely on the user's device, with no network round trip, no telemetry, and no recurring per-character billing.

## Architecture and Novel Techniques

Supertonic introduces two architectural ideas worth flagging. The first is self-purifying flow matching, a training procedure that lets the model learn cleanly even from noisy or imperfectly labeled speech corpora, which matters enormously when scaling to 31 languages with uneven data quality. The second is Length-Aware RoPE, a positional-encoding tweak in the cross-attention path between text and speech that improves alignment for long and complex utterances. Together they let the small parameter count punch well above its weight on word and character error rate benchmarks, with results competitive against open models like VoxCPM2 that are an order of magnitude larger.

## Real-World Reading Accuracy

The team puts particular emphasis on text normalization. Supertonic correctly reads financial expressions like $5.2M, phone numbers in the format (212) 555-0142 ext. 402, scientific units such as 2.3h and 30kph, and other strings that hosted services from major vendors still mispronounce. Expression tags like `<laugh>`, `<breath>`, and `<sigh>` can be embedded inline to inject natural prosody, giving long-form narration a more human feel than typical flat TTS output.

## Cross-Platform Deployment

Supertonic's most striking property is its deployment surface. The repository ships reference implementations in Swift for iOS and macOS, C# and C++ for Windows, Python, Rust, Go, and Java for Linux servers, Node.js for embedded JavaScript runtimes, and onnxruntime-web for browsers using WebGPU or WASM. A Raspberry Pi can hit roughly a 0.3x real-time factor, and the team has demonstrated Supertonic running offline on an Onyx Boox e-reader in airplane mode, a deployment target that effectively no other modern neural TTS can claim.

## Voice Builder and Customization

Launched in January 2026, the Voice Builder feature converts a user's recorded voice samples into a deployable, edge-native TTS profile that the user owns permanently. This is a meaningful contrast to cloud voice-cloning services that retain rights to the cloned voice or require ongoing subscription access. Voice Builder targets accessibility users, podcasters, and indie developers who want a personalized voice without surrendering it to a vendor.

## Limitations

Supertonic's compact size has real trade-offs. Naturalness still trails the largest hosted services on highly expressive, emotionally varied long-form reading, and the model can show occasional speaker-similarity drift on out-of-distribution voices. The OpenRAIL-M model license also adds use-case restrictions that pure MIT TTS projects do not, so commercial deployments need legal review. Native Android support is not yet first-class in the repository, with most mobile examples currently focused on iOS and Flutter.
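To make the text-normalization claim concrete, here is a minimal sketch of what a normalizer does before synthesis: rewriting symbolic strings into speakable words. The rules below are hypothetical illustrations, not Supertonic's actual normalizer, which covers far more cases across its 31 languages.

```python
import re

# Hypothetical magnitude table for financial shorthand like "$5.2M".
MAGNITUDES = {"K": "thousand", "M": "million", "B": "billion"}

def normalize(text: str) -> str:
    """Expand a few symbolic patterns into speakable words (toy rules)."""
    # "$5.2M" -> "5.2 million dollars"
    def money(m: re.Match) -> str:
        return f"{m.group(1)} {MAGNITUDES[m.group(2)]} dollars"
    text = re.sub(r"\$(\d+(?:\.\d+)?)([KMB])", money, text)
    # "30kph" -> "30 kilometers per hour"
    text = re.sub(r"(\d+)kph", r"\1 kilometers per hour", text)
    # "2.3h" -> "2.3 hours" (word boundary keeps "hit" etc. untouched)
    text = re.sub(r"(\d+(?:\.\d+)?)h\b", r"\1 hours", text)
    return text

print(normalize("Revenue hit $5.2M after a 2.3h drive at 30kph."))
# -> "Revenue hit 5.2 million dollars after a 2.3 hours drive at 30 kilometers per hour."
```

A production normalizer would also handle phone numbers, dates, ordinals, and locale-specific unit names, which is where small on-device models historically fell short.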
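The "0.3x real-time factor" figure for the Raspberry Pi is easy to misread, so here is the generic arithmetic (a standard definition, not code from the Supertonic repository): RTF is synthesis time divided by audio duration, so values below 1.0 mean faster than real time.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of audio produced.
    RTF < 1.0 means the engine generates speech faster than it plays back."""
    return synthesis_seconds / audio_seconds

# At RTF 0.3, a 10-second utterance takes about 3 seconds to synthesize.
rtf = real_time_factor(3.0, 10.0)
print(rtf)        # 0.3
print(rtf < 1.0)  # True: faster than real time
```

This is why a 0.3x RTF on a Raspberry Pi is a strong result: the device can stream audio continuously while staying well ahead of playback.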
- **microsoft** — Open-source frontier voice AI for TTS and ASR.
- **resemble-ai** — Family of SoTA open-source TTS models by Resemble AI with zero-shot voice cloning, 23+ language support, and paralinguistic controls across 350M-500M parameter variants.