owensong/Inflect-Nano-v1 · Hugging Face
Inflect-Nano-v1
Edit 06/17/2026 -- I'm really happy to see that this model is doing decently! If more people find it useful, I might consider training a v2 with a much larger budget.<br>Inflect-Nano is #3 trending on Hugging Face's TTS leaderboard! Can it get any higher? If you would like to see a v2, just like/favourite this model to help more people see it. Thank you to everyone for checking out this model!
Inflect-Nano-v1 is a tiny English text-to-speech model with 4.63M total inference parameters, including its vocoder.
It is not trying to beat large TTS models. It is a small, local, complete text-to-waveform stack built to test how far ultra-lightweight speech synthesis can go.
Highlights
4.63M parameters total
Includes the vocoder
24 kHz audio
Single English male voice
Runs locally with PyTorch
Built for tiny-model experiments, local assistants, embedded demos, and efficient inference research
Listen
Text<br>Audio
"Did the timing change?" she answered. "Then why did Logan leave?"
Who puts a parking meter next to an ER label?
Please say neighborhood, statistics, and anesthesiologist clearly, without rushing through the middle syllables.
I said 91, not 306, which is a very different number.
The inference path looked natural, but the decoder still needed a smoother transition before Marcus approved the final test.
The appointment moved to 1:25, the invoice was $674.96, and the archive was labeled 1998.
If Logan sounded uneasy, then it happened near Long Beach, and the pause has to carry that.
The word aluminum should not steal attention from the softer ending after entrepreneur.
Install
git clone https://huggingface.co/owensong/Inflect-Nano-v1<br>cd Inflect-Nano-v1<br>pip install -r requirements.txt
Generate Speech
python inference.py --text "Wait, are you actually being for real now?" --out sample.wav
CPU:
python inference.py --device cpu --text "Please say neighborhood clearly." --out sample_cpu.wav
With simple controls:
python inference.py \<br>--text "The appointment moved to 1:25." \<br>--length-scale 1.03 \<br>--pitch-scale 1.00 \<br>--energy-scale 1.00 \<br>--out sample_controlled.wav
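For batch generation from Python, the same command can be assembled programmatically. The sketch below uses only the flags shown above (`--text`, `--out`, `--device`, `--length-scale`, `--pitch-scale`, `--energy-scale`). The helper name `build_command` is hypothetical and not part of the repository, and the sketch assumes it runs from the repository root.

```python
import shlex
import sys


def build_command(text, out, device=None, length_scale=None,
                  pitch_scale=None, energy_scale=None):
    """Build an argv list for inference.py using only the documented flags."""
    cmd = [sys.executable, "inference.py", "--text", text, "--out", out]
    optional = {
        "--device": device,
        "--length-scale": length_scale,
        "--pitch-scale": pitch_scale,
        "--energy-scale": energy_scale,
    }
    for flag, value in optional.items():
        if value is not None:  # leave unset flags to the script's defaults
            cmd += [flag, str(value)]
    return cmd


cmd = build_command("The appointment moved to 1:25.", "sample_controlled.wav",
                    length_scale=1.03)
print(shlex.join(cmd[1:]))  # shell-quoted form, without the interpreter path
# To actually run it from the repo root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing an argument list instead of a single shell string avoids quoting problems when the text contains quotes, dollar signs, or question marks, as several of the Listen prompts do.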
Local Gradio demo:
python app.py
Model Size
Part<br>Parameters
Acoustic model<br>3.465M
Vocoder generator<br>1.167M
Total inference stack<br>4.632M
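As a quick sanity check on the table, the two parts can be summed and converted to an approximate weight footprint. The storage precision of the released checkpoints is not stated here, so the byte widths below are assumptions for illustration, and the figures ignore any file-format overhead.

```python
# Parameter counts from the table above, in millions.
PARAMS_M = {"acoustic": 3.465, "vocoder": 1.167}

total_m = sum(PARAMS_M.values())


def weights_megabytes(params_millions, bytes_per_param):
    """Raw weight storage in MB (10**6 bytes), ignoring file-format overhead."""
    return params_millions * bytes_per_param


print(f"total: {total_m:.3f}M parameters")
# Assumed precisions; the checkpoints' actual dtype is not documented here.
for name, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{name}: ~{weights_megabytes(total_m, nbytes):.1f} MB")
```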
The model files are:
weights/inflect_nano_v1_acoustic.pt<br>weights/inflect_nano_v1_vocoder.pt
Repo Layout
weights/ model weights<br>examples/ audio examples<br>assets/ README banner<br>inflect_nano/ runtime model code<br>third_party/tiny_tts_frontend/ vendored text frontend used for English G2P/token IDs<br>inference.py simple CLI inference<br>app.py local Gradio demo
The model itself is in weights/. The vendored frontend is included only so the released model can reproduce the same text normalization and tokenization path.
What Makes It Different
Many small TTS projects depend on a separate larger vocoder. Inflect-Nano-v1 includes the vocoder in the published inference stack, so the full text-to-waveform path stays under 5M parameters.
Pipeline:
text<br>-> English text frontend<br>-> compact FastSpeech-style acoustic model<br>-> 80-bin mel spectrogram<br>-> small Snake HiFi-GAN-style vocoder<br>-> 24 kHz waveform
Architecture
The acoustic model is a compact non-autoregressive FastSpeech-style network. It predicts duration, pitch, energy, and brightness, then decodes an 80-bin mel spectrogram.
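FastSpeech-style models usually turn predicted durations into a frame sequence with a length regulator: each token's encoder vector is repeated for as many frames as its duration. The sketch below shows that general idea in plain Python. It is not taken from the Inflect-Nano code, and the assumption that `--length-scale` multiplies durations before rounding is an illustration of the usual convention, not a confirmed detail of this model.

```python
def length_regulate(frames, durations, length_scale=1.0):
    """Repeat each token's vector by its scaled, rounded duration."""
    if len(frames) != len(durations):
        raise ValueError("frames and durations must have the same length")
    expanded = []
    for vector, duration in zip(frames, durations):
        repeats = max(0, int(duration * length_scale + 0.5))  # round half up
        expanded.extend([vector] * repeats)
    return expanded


tokens = ["h", "e", "l", "o"]      # stand-ins for encoder output vectors
durations = [2.0, 3.0, 1.0, 4.0]   # hypothetical predicted frames per token
print(len(length_regulate(tokens, durations)))        # 10 frames
print(len(length_regulate(tokens, durations, 1.5)))   # 16 frames, slower speech
```

Because the expansion is a simple repeat, the decoder can process all frames in parallel, which is what makes this family of models non-autoregressive.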
The vocoder is a small Snake-activation HiFi-GAN-style generator trained for 24 kHz waveform reconstruction.
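For readers unfamiliar with it, the Snake activation is commonly written as x + sin²(αx)/α, which adds a periodic component that suits waveform generation. The scalar sketch below shows the function itself. In vocoders α is typically a learned parameter, often one per channel, and the value used here is only illustrative; how Inflect-Nano parameterizes it is not described in this card.

```python
import math


def snake(x, alpha=1.0):
    """Snake activation: x + sin^2(alpha * x) / alpha."""
    return x + (math.sin(alpha * x) ** 2) / alpha


# alpha=1.0 is an illustrative value, not the model's trained setting.
for x in (-1.0, 0.0, 0.5, 2.0):
    print(f"snake({x:+.1f}) = {snake(x):+.4f}")
```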
Main settings:
Setting<br>Value
Sample rate<br>24 kHz
Mel bins<br>80
Acoustic hidden size<br>168
Encoder layers
Decoder layers
Vocoder upsample rates<br>8, 8, 2, 2
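The upsample rates and sample rate together imply a frame rate for the mel spectrogram. The arithmetic below assumes, as is usual for HiFi-GAN-style generators, that the hop length equals the product of the upsample rates. The card does not state the hop length directly, so treat the result as derived, not documented.

```python
from math import prod

SAMPLE_RATE = 24_000
UPSAMPLE_RATES = (8, 8, 2, 2)

# Assumption: each mel frame is upsampled to prod(rates) waveform samples.
hop_length = prod(UPSAMPLE_RATES)
frames_per_second = SAMPLE_RATE / hop_length


def mel_frames_for(seconds):
    """Approximate number of mel frames behind `seconds` of audio."""
    return round(seconds * frames_per_second)


print(hop_length)            # 256
print(frames_per_second)     # 93.75
print(mel_frames_for(4.0))   # 375
```

Under that assumption, four seconds of speech corresponds to an 80 x 375 mel spectrogram passed from the acoustic model to the vocoder.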
Good For
Tiny local TTS experiments
Offline assistant prototypes
Efficient inference research
Embedded speech demos
Browser/WASM-style exploration
A baseline for sub-5M TTS work
Not Good For
Production narration
Accessibility-critical output
Voice cloning
Multilingual speech
High-fidelity audiobook generation
Matching large modern TTS systems
Limitations
This is a very small experimental model. It can sound robotic, buzzy, or unstable, especially on difficult unseen text. Long prompts and unusual phrasing are less reliable. The vocoder is also a clear quality bottleneck.
Use it as a tiny-model research/demo release, not as a production TTS engine.
License
Apache-2.0.
This repository includes a small third-party English text frontend for tokenization/G2P compatibility. Its license is included at third_party/tiny_tts_frontend/LICENSE.