Serving Local AI on My Jetson Through Durable Streams


With local AI feeling more and more practical, I wanted to self-host my own models, run my workloads without any third-party provider in the mix, and look into serving a local model to a few users reliably. NVIDIA's Jetson series is a great starting point, and I went with the Jetson Orin Nano Super kit, aka “The most affordable generative AI supercomputer”! It has 1024 CUDA cores and 32 tensor cores and is rated at 67 TOPS (trillion operations per second), which should be good enough for my little experiment: a small text-to-speech app powered by Kokoro-82M, a neural text-to-speech model.

The app grew out of a simple need: I don't always want to read a lot of text and would often rather hear it. So I want something where I select some text, pick a voice, and get a link that I can come back to later or share with people. For now that means pasting text into a page. Eventually I'd want something even more lazy-proof, which would just be a nicer frontend on top of the same core app. Beyond the app itself, I want to land on a small reference architecture for local inference: a self-contained serving layer that exposes a clean API, so the same setup can back a web app, a CLI, or another service without rework.

Try it out at streamtts.dev (it is self-hosted on my Jetson! 😉).

Not a normal Request/Response API

The simplest way to architect this would be:

    POST /generate
    wait
    return audio.mp3

Inference is slower than a normal web request. Kokoro on this Jetson can produce speech faster than realtime, but it is still a GPU job. A minute of audio can take many seconds of compute. A cold first sentence can be slower while the model stack warms up. If multiple users submit at once, a blocking request turns into a line of sockets waiting on the GPU.

The output is also naturally incremental. TTS does not need to finish the entire paragraph before the listener hears anything. The model can generate one sentence, encode that sentence to MP3, append it somewhere, and move on. If I force the whole thing into a single response body, I throw away the best property of the workload.
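The sentence-at-a-time loop described above can be sketched roughly as follows. This is only an illustration: `split_sentences` is a deliberately naive splitter, and `synthesize` is a hypothetical stand-in for the model and MP3 encoder, not Kokoro's actual API.

```python
import re
from typing import Iterator, Tuple


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def synthesize(sentence: str) -> bytes:
    """Hypothetical placeholder for 'run the model, encode to MP3'.

    It just returns the UTF-8 bytes so the sketch runs without a GPU.
    """
    return sentence.encode("utf-8")


def generate_incrementally(text: str) -> Iterator[Tuple[int, str, bytes]]:
    """Yield (index, sentence, audio_bytes) as soon as each sentence is done."""
    for index, sentence in enumerate(split_sentences(text)):
        yield index, sentence, synthesize(sentence)


# Each chunk is available before the following sentence has been processed.
chunks = list(generate_incrementally("Hello there. How are you? Fine!"))
```

Because the generator yields per sentence, whatever consumes it can append or play the first chunk while later sentences are still being computed.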

And I want the result to be shareable. The user should be directed to a link immediately where they can "await" the model to produce all the bytes. If they open it while the Jetson is still working, they should hear the prefix and then follow the live edge.
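To make "hear the prefix, then follow the live edge" concrete, here is a minimal in-memory sketch. `LiveLog` and its methods are hypothetical names for illustration, and nothing here is persisted; it only shows the catch-up-then-wait reading pattern.

```python
import threading
from typing import Iterator, List


class LiveLog:
    """In-memory log that readers can replay and then follow (not durable)."""

    def __init__(self) -> None:
        self._records: List[bytes] = []
        self._closed = False
        self._cond = threading.Condition()

    def append(self, chunk: bytes) -> None:
        with self._cond:
            self._records.append(chunk)
            self._cond.notify_all()  # wake readers waiting at the live edge

    def close(self) -> None:
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def follow(self, start: int = 0) -> Iterator[bytes]:
        """Yield existing records from `start`, then block for new ones."""
        pos = start
        while True:
            with self._cond:
                while pos >= len(self._records) and not self._closed:
                    self._cond.wait()
                if pos >= len(self._records):
                    return  # log is closed and fully consumed
                chunk = self._records[pos]
            yield chunk
            pos += 1


log = LiveLog()
log.append(b"sentence 0")
log.append(b"sentence 1")  # the prefix already exists when the reader arrives


def finish() -> None:
    log.append(b"sentence 2")  # produced while the reader is following
    log.close()


producer = threading.Thread(target=finish)
producer.start()
heard = list(log.follow())
producer.join()
```

A reader that opens the link late gets the same bytes in the same order as one that was present from the start.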

If we start with request-response, we end up adding a pile of infrastructure like:

queue

database for job bookkeeping

object storage for the finished file

retry logic

dedupe logic

cleanup process

All of this is reasonable. But together, it is a lot for one basic promise:

    accept work now
    produce output later
    let readers follow along

The request feels like the wrong lifetime for this. I want the inference job to work seamlessly across network disruptions. I also do not want a dropped browser tab to kill a running generation. Thus the output should have an identity before it is complete, and readers should be able to start at the beginning, catch up to the tail, or come back later and replay the same bytes!
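One way to picture an output that has an identity before it is complete: `submit` returns a name right away, and a separate worker fills in the output on its own schedule. This is a sketch under assumptions; `submit`, `worker`, `outputs`, and the `<eos>` marker are illustrative names, and an in-memory dict stands in for real storage.

```python
import queue
import threading
import uuid

outputs: dict[str, list[bytes]] = {}  # output name -> chunks appended so far
jobs: queue.Queue = queue.Queue()     # pending (name, text) pairs


def submit(text: str) -> str:
    """Register an output name and enqueue the job without waiting for it."""
    name = uuid.uuid4().hex[:12]
    outputs[name] = []
    jobs.put((name, text))
    return name


def worker() -> None:
    """Runs independently of any client connection."""
    while True:
        name, text = jobs.get()
        try:
            for sentence in text.split(". "):
                # Placeholder for synthesis: store the sentence bytes.
                outputs[name].append(sentence.encode("utf-8"))
            outputs[name].append(b"<eos>")
        finally:
            jobs.task_done()


threading.Thread(target=worker, daemon=True).start()

name = submit("First sentence. Second sentence")  # returns immediately
jobs.join()  # only so this demo can inspect the finished output
```

The caller holds `name` from the moment of submission, so closing the tab or losing the connection has no effect on the worker.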

In summary, I want:

    submit work
    get an output stream immediately
    worker appends model output
    client awaits the stream

All of this can be cleanly abstracted over durable streams. A stream is an ordered sequence of records, where a record is just some bytes (here, a chunk of audio plus a little metadata). Durable means every record is persisted, so nothing is lost and a reader can come back later and replay the exact same bytes. Putting the two together, we get a simple but powerful building block.
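As a toy illustration of "ordered" plus "durable", the sketch below keeps a stream as an append-only file of length-prefixed records. `FileStream` is a made-up class for this example, not how any particular stream store is implemented, and it ignores concurrency between writers.

```python
import os
import struct
import tempfile
from typing import Iterator, Tuple


class FileStream:
    """Append-only log file: 4-byte big-endian length, then the record body."""

    def __init__(self, path: str) -> None:
        self.path = path
        open(path, "ab").close()  # create the file if it does not exist

    def append(self, body: bytes) -> int:
        """Persist one record and return its sequence number."""
        seq_num = sum(1 for _ in self.read(0))  # O(n) count, fine for a sketch
        with open(self.path, "ab") as f:
            f.write(struct.pack(">I", len(body)) + body)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the bytes to disk
        return seq_num

    def read(self, from_seq: int = 0) -> Iterator[Tuple[int, bytes]]:
        """Replay records in order, starting at sequence number `from_seq`."""
        with open(self.path, "rb") as f:
            seq = 0
            while True:
                header = f.read(4)
                if len(header) < 4:
                    return
                (length,) = struct.unpack(">I", header)
                body = f.read(length)
                if len(body) < length:
                    return  # incomplete trailing record, stop here
                if seq >= from_seq:
                    yield seq, body
                seq += 1


path = os.path.join(tempfile.mkdtemp(), "cast.log")
writer = FileStream(path)
writer.append(b"sentence 0")
writer.append(b"sentence 1")

# A fresh object on the same path sees the same bytes in the same order.
replayed = list(FileStream(path).read(from_seq=1))
```

Reopening the file gives the replay property for free: the reader does not need the writer to still be alive.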

Append records to the tail, and readers can start at the head, seek to a known sequence number, or sit at the tail and wait for the next record to arrive. A stream store gives you named timelines:

    APPEND record
    READ from seq_num=N
    TAIL for live records

Each record is the unit of progress. A record has a sequence number, timestamp, headers, and a body. StreamTTS does not need much more structure than that. We represent records like so:

    headers:
      e: audio
      i: 3
      d: 4210
      t: "sentence text"
    body:

    # e = event type
    # i = index
    # d = duration (ms)
    # t = sentence text

And the output will be shaped like:

    pub/casts/4LwnHZDl_vFC
      seq 0  meta
      seq 1  start
      seq 2  audio  sentence 0
      seq 3  audio  sentence 1
      seq 4  audio  sentence 2
      seq 5  eos    # end of stream
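The record shape and stream layout above could be modelled along these lines. The header keys (`e`, `i`, `d`, `t`) and event names come from the layout shown; the `Record` class, `build_cast`, `collect_audio`, and the fake synthesizer are assumptions made for the sketch, and the `meta` and `start` records are left without extra headers because their contents are not specified here.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Record:
    seq_num: int
    timestamp: float
    headers: dict[str, str]
    body: bytes = b""


def build_cast(
    sentences: List[str],
    synthesize: Callable[[str], Tuple[bytes, int]],
) -> List[Record]:
    """Produce meta, start, one audio record per sentence, then eos."""
    records: List[Record] = []

    def append(headers: dict[str, str], body: bytes = b"") -> None:
        records.append(Record(len(records), time.time(), headers, body))

    append({"e": "meta"})
    append({"e": "start"})
    for i, sentence in enumerate(sentences):
        audio, duration_ms = synthesize(sentence)
        append({"e": "audio", "i": str(i), "d": str(duration_ms), "t": sentence}, audio)
    append({"e": "eos"})
    return records


def collect_audio(records: List[Record], from_seq: int = 0) -> bytes:
    """Concatenate audio bodies from `from_seq` until the eos record."""
    out = bytearray()
    for record in records[from_seq:]:
        if record.headers["e"] == "eos":
            break
        if record.headers["e"] == "audio":
            out += record.body
    return bytes(out)


def fake_synthesize(sentence: str) -> Tuple[bytes, int]:
    """Hypothetical stand-in: returns sentence bytes and a made-up duration."""
    return sentence.encode("utf-8"), 1000


cast = build_cast(["a", "b", "c"], fake_synthesize)
events = [record.headers["e"] for record in cast]
```

A reader only needs the `e` header to decide what to do with a record: play `audio` bodies in order and stop at `eos`.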
