dots.tts — Demo Page
dots.tts
A 2B-parameter fully continuous, end-to-end autoregressive text-to-speech system.
Abstract
dots.tts is a 2B-parameter fully continuous, end-to-end autoregressive (AR) text-to-speech system. The backbone pairs a semantic encoder, an LLM, and an autoregressive flow-matching acoustic head over a 48 kHz AudioVAE, with no discrete tokens anywhere in the pipeline.
dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94% / 1.30% / 6.60% and SIM scores of 81.0 / 77.1 / 79.5 on the zh / en / zh-hard test sets, respectively. It also attains the highest average speaker similarity (83.9) on the 24-language MiniMax multilingual benchmark. On other benchmarks, dots.tts consistently delivers state-of-the-art performance among open-source systems, with strong generation stability, voice cloning ability, and emotional expressiveness.
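For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how WER and SIM figures of this kind are commonly computed. It is not the evaluation code behind the reported numbers, and this page does not describe that pipeline. The sketch assumes WER is the edit distance between a reference transcript and an ASR transcript of the synthesized audio, divided by the reference length, and that SIM is the cosine similarity between two speaker embeddings scaled to 0-100. The function names, the toy transcripts, and the toy embeddings are all hypothetical; the ASR and speaker-embedding models that would produce the real inputs are out of scope.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer(ref_tokens, hyp_tokens):
    """Error rate in percent: edit distance divided by reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def sim(emb_a, emb_b):
    """Cosine similarity between two embedding vectors, scaled by 100."""
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(x * x for x in emb_a))
    norm_b = math.sqrt(sum(y * y for y in emb_b))
    return 100.0 * dot / (norm_a * norm_b)


# Assumption: English is scored on whitespace-separated words,
# Chinese on individual characters.
en_score = wer("the cat sat on the mat".split(),
               "the cat sat on a mat".split())
zh_score = wer(list("今天天气很好"), list("今天天气真好"))

# Toy 2-D vectors standing in for real speaker embeddings.
sim_score = sim([1.0, 0.0], [1.0, 1.0])

print(f"en WER: {en_score:.2f}%")
print(f"zh WER: {zh_score:.2f}%")
print(f"SIM: {sim_score:.1f}")
```

A benchmark score would aggregate these per-utterance values over a whole test set. Whether that aggregation is a mean of per-utterance rates or total errors over total reference tokens depends on the benchmark's own scripts, so it is left open here.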
Contents
Overview
Evaluation
Monolingual & Cross-Lingual Voice Cloning
Context-Aware Expressive Voice Cloning