Inflect-Nano, a 4.63M-parameter local TTS model with its own vocoder


Inflect-Nano-v1

Edit 06/17/2026 -- I'm really happy to see that this model is doing decently! If more people find it useful, I might consider training a v2 with a much larger budget.

Inflect-Nano is #3 trending on Hugging Face's TTS leaderboard! Can it get any higher? If you would like to see a v2, just like/favourite this model to help more people see it. Thank you to everyone for checking out this model!

Inflect-Nano-v1 is a tiny English text-to-speech model with 4.63M total inference parameters, including its vocoder.

It is not trying to beat large TTS models. It is a small, local, complete text-to-waveform stack built to test how far ultra-lightweight speech synthesis can go.

Highlights

4.63M parameters total

Includes the vocoder

24 kHz audio

Single English male voice

Runs locally with PyTorch

Built for tiny-model experiments, local assistants, embedded demos, and efficient inference research

Listen

Sample texts:

"Did the timing change?" she answered. "Then why did Logan leave?"

Who puts a parking meter next to an ER label?

Please say neighborhood, statistics, and anesthesiologist clearly, without rushing through the middle syllables.

I said 91, not 306, which is a very different number.

The inference path looked natural, but the decoder still needed a smoother transition before Marcus approved the final test.

The appointment moved to 1:25, the invoice was $674.96, and the archive was labeled 1998.

If Logan sounded uneasy, then it happened near Long Beach, and the pause has to carry that.

The word aluminum should not steal attention from the softer ending after entrepreneur.

Install

git clone https://huggingface.co/owensong/Inflect-Nano-v1
cd Inflect-Nano-v1
pip install -r requirements.txt

Generate Speech

python inference.py --text "Wait, are you actually being for real now?" --out sample.wav

CPU:

python inference.py --device cpu --text "Please say neighborhood clearly." --out sample_cpu.wav

With simple controls:

python inference.py \
  --text "The appointment moved to 1:25." \
  --length-scale 1.03 \
  --pitch-scale 1.00 \
  --energy-scale 1.00 \
  --out sample_controlled.wav
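To synthesize several lines with the same control settings, the CLI can be driven from a short Python script. The sketch below is not part of the repository: `build_command` and `synthesize_all` are hypothetical helpers, and it assumes the `--device` flag can be combined with the scale flags shown above. It uses only flags that appear in this README and must be run from the repository root.

```python
import subprocess
import sys
from pathlib import Path


def build_command(text, out_path, length_scale=1.0, pitch_scale=1.0,
                  energy_scale=1.0, device=None):
    """Build the argv list for one inference.py call (README flags only)."""
    cmd = [
        sys.executable, "inference.py",
        "--text", text,
        "--length-scale", f"{length_scale:.2f}",
        "--pitch-scale", f"{pitch_scale:.2f}",
        "--energy-scale", f"{energy_scale:.2f}",
        "--out", str(out_path),
    ]
    if device is not None:
        # Assumption: --device may be passed together with the scale flags.
        cmd += ["--device", device]
    return cmd


def synthesize_all(lines, out_dir="out", **controls):
    """Run inference.py once per text line, writing line_000.wav, line_001.wav, ..."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, text in enumerate(lines):
        wav_path = out_dir / f"line_{i:03d}.wav"
        # check=True raises CalledProcessError if inference.py exits non-zero.
        subprocess.run(build_command(text, wav_path, **controls), check=True)
```

Calling `synthesize_all(["First line.", "Second line."], length_scale=1.03, device="cpu")` would start one process per line, so the weights are reloaded each time. That is acceptable for a handful of prompts but wasteful for large batches.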

Local Gradio demo:

python app.py

Model Size

| Part | Parameters |
| --- | --- |
| Acoustic model | 3.465M |
| Vocoder generator | 1.167M |
| Total inference stack | 4.632M |
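As a rough guide to what these counts mean in memory, the arithmetic below converts the parameter totals into raw weight sizes at a few common precisions. The README does not state the precision of the released checkpoints, so the byte widths are illustrative assumptions, and the actual `.pt` files may be larger because of metadata or extra stored tensors.

```python
# Parameter counts from the table above.
ACOUSTIC_PARAMS = 3.465e6
VOCODER_PARAMS = 1.167e6


def weights_megabytes(n_params, bytes_per_param):
    """Raw weight storage in MB (1 MB = 1e6 bytes), ignoring file overhead."""
    return n_params * bytes_per_param / 1e6


total = ACOUSTIC_PARAMS + VOCODER_PARAMS

# Byte widths are assumptions for illustration, not the released format.
for name, width in (("float32", 4), ("float16", 2), ("int8", 1)):
    print(f"{name}: {weights_megabytes(total, width):.2f} MB")
```

At 4 bytes per parameter the full stack comes to roughly 18.5 MB of weights.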

The model files are:

weights/inflect_nano_v1_acoustic.pt
weights/inflect_nano_v1_vocoder.pt

Repo Layout

weights/                         model weights
examples/                        audio examples
assets/                          README banner
inflect_nano/                    runtime model code
third_party/tiny_tts_frontend/   vendored text frontend used for English G2P/token IDs
inference.py                     simple CLI inference
app.py                           local Gradio demo

The model itself is in weights/. The vendored frontend is included only so the released model can reproduce the same text normalization and tokenization path.

What Makes It Different

Many small TTS projects depend on a separate larger vocoder. Inflect-Nano-v1 includes the vocoder in the published inference stack, so the full text-to-waveform path stays under 5M parameters.

Pipeline:

text
-> English text frontend
-> compact FastSpeech-style acoustic model
-> 80-bin mel spectrogram
-> small Snake HiFi-GAN-style vocoder
-> 24 kHz waveform

Architecture

The acoustic model is a compact non-autoregressive FastSpeech-style network. It predicts duration, pitch, energy, and brightness, then decodes an 80-bin mel spectrogram.
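In FastSpeech-style models, predicted durations are usually applied by a length regulator that repeats each token's hidden vector once per output frame. The sketch below shows that general technique in plain Python. It is not taken from the Inflect-Nano code, and treating `--length-scale` as a multiplier on the predicted durations is an assumption based on common convention.

```python
def length_regulate(encoder_frames, durations, length_scale=1.0):
    """Expand per-token vectors to per-frame vectors.

    Each token's vector is repeated round(duration * length_scale) times,
    so a larger length_scale yields more frames and slower speech.
    """
    expanded = []
    for frame, dur in zip(encoder_frames, durations):
        repeats = max(0, round(dur * length_scale))
        expanded.extend([frame] * repeats)
    return expanded


# Toy example: strings stand in for hidden vectors.
tokens = ["h", "e", "l", "o"]
durations = [2, 3, 1, 4]  # predicted frames per token

frames = length_regulate(tokens, durations)
print(len(frames))  # 10 frames, the sum of the durations
```

Because the expansion is a simple repeat, the decoder can produce every mel frame in parallel, which is what makes the model non-autoregressive.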

The vocoder is a small Snake-activation HiFi-GAN-style generator trained for 24 kHz waveform reconstruction.
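For readers unfamiliar with the Snake activation, its usual definition is x + sin²(αx)/α, which adds a periodic term that suits waveform generation. The scalar sketch below follows that published formula. Whether Inflect-Nano uses this exact form, and whether α is learned per channel, is not stated here, so treat both as assumptions.

```python
import math


def snake(x, alpha=1.0):
    """Snake activation: x + sin^2(alpha * x) / alpha.

    alpha sets the frequency of the periodic term; in vocoders it is
    often a learned per-channel parameter (an assumption for this model).
    """
    return x + (math.sin(alpha * x) ** 2) / alpha


for value in (-1.0, 0.0, 0.5, 2.0):
    print(value, snake(value))
```

Unlike ReLU, the function never flattens out: its derivative is 1 + sin(2αx), which is non-negative, so the output never decreases as the input grows.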

Main settings:

| Setting | Value |
| --- | --- |
| Sample rate | 24 kHz |
| Mel bins | 80 |
| Acoustic hidden size | 168 |
| Encoder layers | |
| Decoder layers | |
| Vocoder upsample rates | 8, 8, 2, 2 |
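The upsample rates determine how many audio samples the vocoder emits per mel frame. The arithmetic below assumes, as in a standard HiFi-GAN generator, that the product of the upsample rates equals the mel hop length and that no other resampling takes place. `mel_frames_for` is a hypothetical helper for illustration.

```python
import math

SAMPLE_RATE = 24_000          # Hz, from the settings table
UPSAMPLE_RATES = (8, 8, 2, 2)  # from the settings table

# Each mel frame is upsampled by every stage in turn: 8 * 8 * 2 * 2 = 256.
hop_length = math.prod(UPSAMPLE_RATES)

# Mel frames the acoustic model must produce per second of audio.
frames_per_second = SAMPLE_RATE / hop_length


def mel_frames_for(seconds):
    """Mel frames needed to cover a clip of the given duration."""
    return math.ceil(seconds * frames_per_second)


print(hop_length)           # 256
print(frames_per_second)    # 93.75
print(mel_frames_for(2.0))  # 188
```

Under these assumptions each mel frame covers about 10.7 ms of audio.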

Good For

Tiny local TTS experiments

Offline assistant prototypes

Efficient inference research

Embedded speech demos

Browser/WASM-style exploration

A baseline for sub-5M TTS work

Not Good For

Production narration

Accessibility-critical output

Voice cloning

Multilingual speech

High-fidelity audiobook generation

Matching large modern TTS systems

Limitations

This is a very small experimental model. It can sound robotic, buzzy, or unstable, especially on difficult unseen text. Long prompts and unusual phrasing are less reliable. The vocoder is also a clear quality bottleneck.

Use it as a tiny-model research/demo release, not as a production TTS engine.

License

Apache-2.0.

This repository includes a small third-party English text frontend for tokenization/G2P compatibility. Its license is included at third_party/tiny_tts_frontend/LICENSE.
