How to Build an ML Framework in Rust, from Scratch, in a Weekend


Mar 6, 2026

ZML is an ML framework written in Zig, and it’s the only ML framework that fully focuses on inference. But how does it work, and what does it mean to focus on inference rather than training?

I wanted to understand the stack properly, so I built a minimal version of ZML in Rust. This post is a guided tour of what sits underneath, and how you can build your own ML framework to run a real model.

I’ll start by walking through how ZML works under the hood, covering the end-to-end pipeline from model definition to StableHLO to PJRT. Once we have a solid picture of what the result should look like, we’ll build a toy, educational ML framework. The first version will be ugly, and the last part of this post is devoted to cleaning it up.

What we’ll cover:

Part 1: Reading — how ZML works under the hood

Part 2: Building — building a toy framework that emits StableHLO

Part 3: Running — compiling and executing via PJRT

Part 4: Cleaning — making the API feel like a real framework

Demo — running SmolLM2-135M end to end

By the end of this post, we’ll have a toy, educational ML framework that runs the SmolLM2-135M model, something like this:

❯ ./target/release/examples/smollm2 chat
No compiled artifact given, compiling from scratch (seq=256)...
Compiled in 728.99ms
Loaded weights in 91.99ms

SmolLM2-135M-Instruct ready. Type a message and press Enter. Type "exit" to quit.

You> Hi 👋
Assistant> Hello! I'm here to help with any creative writing needs. What's on your mind?
[24 prompt tokens, 19 generated | TTFT 306ms | 4.5 tok/s]

I assume you’re comfortable with programming and have a basic understanding of neural networks: you know what a forward pass, a matmul, and a weight tensor are. You don’t need to be an expert, but I’ll cut to the chase pretty quickly.

Short intro to ZML

ZML is an ML framework for inference, written in Zig. It sits at roughly the same level as PyTorch, but it is optimized for inference only. That constraint shapes many of its design decisions and makes it feel quite different from PyTorch.

To understand what ZML is trying to solve, it helps to look at where PyTorch is less natural as an inference runtime. PyTorch defaults to eager execution: ops run immediately as your Python code executes them. That is excellent for model development and experimentation, where flexibility matters most. But for production inference, you often end up introducing a second stage such as export, tracing, or compilation, and that handoff is where edge cases and operational complexity tend to appear.

Inference has a different nature: the goal is not to support an interactive research workflow, but to produce a stable, reusable artifact that runs predictably in production. This is why ZML is built around a graph-compile-run pipeline: you first stage the computation as a graph, then compile that graph into an accelerator-specific executable, and finally run that executable with real inputs. In other words, the model definition is treated less like code to execute line by line, and more like a specification that can be analyzed, optimized, and compiled ahead of time.

That model fits inference well, where we usually care more about predictability, startup behavior, hardware targeting, and repeatable performance than maximum flexibility at runtime.
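To make the graph-compile-run split concrete, here is a minimal sketch in Rust. It is not ZML's API: `Graph`, `Executable`, `Op`, and `build_affine` are hypothetical names, the values are scalars rather than tensors, and "compile" only validates and freezes the graph instead of producing device code.

```rust
/// One node in the staged computation graph (hypothetical toy IR).
#[derive(Debug, Clone, Copy)]
enum Op {
    Input(usize),      // index into the inputs passed at run time
    Add(usize, usize), // indices of earlier nodes
    Mul(usize, usize),
}

/// Stage 1: building the graph only records ops; nothing is computed.
#[derive(Default)]
struct Graph {
    nodes: Vec<Op>,
}

impl Graph {
    fn push(&mut self, op: Op) -> usize {
        self.nodes.push(op);
        self.nodes.len() - 1
    }
    fn input(&mut self, index: usize) -> usize {
        self.push(Op::Input(index))
    }
    fn add(&mut self, a: usize, b: usize) -> usize {
        self.push(Op::Add(a, b))
    }
    fn mul(&mut self, a: usize, b: usize) -> usize {
        self.push(Op::Mul(a, b))
    }

    /// Stage 2: turn the graph into a reusable artifact.
    /// A real backend would optimize and emit device code here;
    /// this sketch only checks operand references and freezes the op list.
    fn compile(self, output: usize) -> Result<Executable, String> {
        let mut num_inputs: usize = 0;
        for (i, op) in self.nodes.iter().enumerate() {
            match *op {
                Op::Input(k) => num_inputs = num_inputs.max(k + 1),
                Op::Add(a, b) | Op::Mul(a, b) => {
                    if a >= i || b >= i {
                        return Err(format!("node {} uses a node defined later", i));
                    }
                }
            }
        }
        if output >= self.nodes.len() {
            return Err("output node does not exist".to_string());
        }
        Ok(Executable { nodes: self.nodes, output, num_inputs })
    }
}

/// The frozen result of compilation; it can be run many times.
#[derive(Debug)]
struct Executable {
    nodes: Vec<Op>,
    output: usize,
    num_inputs: usize,
}

impl Executable {
    /// Stage 3: run with real inputs.
    fn run(&self, inputs: &[f32]) -> Result<f32, String> {
        if inputs.len() != self.num_inputs {
            return Err(format!("expected {} inputs, got {}", self.num_inputs, inputs.len()));
        }
        let mut values = Vec::with_capacity(self.nodes.len());
        for op in &self.nodes {
            let v = match *op {
                Op::Input(k) => inputs[k],
                Op::Add(a, b) => values[a] + values[b],
                Op::Mul(a, b) => values[a] * values[b],
            };
            values.push(v);
        }
        Ok(values[self.output])
    }
}

/// Stages y = x * w + b once and returns the compiled executable.
fn build_affine() -> Result<Executable, String> {
    let mut g = Graph::default();
    let x = g.input(0);
    let w = g.input(1);
    let b = g.input(2);
    let xw = g.mul(x, w);
    let y = g.add(xw, b);
    g.compile(y)
}
```

Calling `build_affine()` once and then `run(&[x, w, b])` repeatedly mirrors the split described above: errors in the graph structure surface at compile time, and only input mismatches can fail at run time.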

If you’re coming from a PyTorch world, three things stand out in ZML:

Graph-first: the model is staged as a computation graph up front, rather than executed eagerly op by op.

Model and weights are separate: you can validate and compile the model structure without loading the actual weights. When weights are tens of gigabytes, that makes iteration much faster.

Compiled for the target: ZML is designed to compile the model into an executable for a specific backend and deployment target, which makes it feel much closer to a serving stack than a research environment.
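The second point, keeping model structure apart from weights, can be sketched as follows. This is an illustration under assumptions, not how ZML implements it: `ModelSpec`, `LinearSpec`, `Shape`, the parameter names, and the toy dimensions are all hypothetical. The structure is validated from declared shapes alone, and loaded weights are checked against it later.

```rust
use std::collections::HashMap;

/// Declared shape of a parameter; no data is attached.
#[derive(Debug, Clone, PartialEq)]
struct Shape(Vec<usize>);

/// A linear layer described only by its dimensions and parameter name.
struct LinearSpec {
    weight_name: String,
    in_features: usize,
    out_features: usize,
}

/// Model structure: a chain of linear layers. It holds no weight data.
struct ModelSpec {
    layers: Vec<LinearSpec>,
}

impl ModelSpec {
    /// Checks that layer dimensions chain together and returns the output
    /// width. This needs no weights at all.
    fn validate(&self, input_features: usize) -> Result<usize, String> {
        let mut width = input_features;
        for layer in &self.layers {
            if layer.in_features != width {
                return Err(format!(
                    "{} expects {} features, got {}",
                    layer.weight_name, layer.in_features, width
                ));
            }
            width = layer.out_features;
        }
        Ok(width)
    }

    /// The parameter shapes a weight file would have to provide.
    fn expected_shapes(&self) -> HashMap<String, Shape> {
        self.layers
            .iter()
            .map(|l| (l.weight_name.clone(), Shape(vec![l.out_features, l.in_features])))
            .collect()
    }

    /// Later step: compare the shapes found in a weight file with the spec.
    fn check_weights(&self, loaded: &HashMap<String, Shape>) -> Result<(), String> {
        for (name, want) in self.expected_shapes() {
            match loaded.get(&name) {
                Some(got) if *got == want => {}
                Some(got) => {
                    return Err(format!("{}: expected {:?}, got {:?}", name, want.0, got.0))
                }
                None => return Err(format!("{}: missing from weight file", name)),
            }
        }
        Ok(())
    }
}

/// A toy two-layer spec with made-up sizes (8 -> 32 -> 8).
fn example_spec() -> ModelSpec {
    ModelSpec {
        layers: vec![
            LinearSpec { weight_name: "up.weight".to_string(), in_features: 8, out_features: 32 },
            LinearSpec { weight_name: "down.weight".to_string(), in_features: 32, out_features: 8 },
        ],
    }
}
```

Here `example_spec().validate(8)` succeeds without a single weight in memory, so a shape bug shows up before any multi-gigabyte file is read.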

As someone who has worked on inference for several years, this resonated with me immediately. If you’re curious how a stack like this can even be built, follow along: we’ll start with a high-level overview of ZML, then build out the core pieces ourselves.

Part 1: Reading

The stack

There’s a big stack involved here, and it’s not immediately obvious what does what. But nothing here is difficult per se; there’s just a lot of it.

The short version is:

Write the model in Zig, lower it to StableHLO (an MLIR dialect), then let OpenXLA compile and run it through PJRT.

Before we unpack each layer, here’s the short explanation of each piece:

StableHLO is an IR (intermediate representation) that describes ML computations. It’s a dialect of MLIR, so in many places, including this post, the two names are used interchangeably.

PJRT is the runtime API that compiles and executes those computations on actual hardware.

OpenXLA is the umbrella project...
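To give a feel for what "lowering to StableHLO" produces, here is a small Rust sketch that builds the textual form of a one-op module by string formatting. `TensorType` and `emit_add_module` are hypothetical helpers for illustration, not part of ZML or any StableHLO library, and the output has not been run through a StableHLO parser here, so treat it as an approximation of the format.

```rust
/// Element type and static dimensions of a tensor (hypothetical helper).
struct TensorType {
    dims: Vec<usize>,
    dtype: &'static str,
}

impl TensorType {
    /// Renders the MLIR tensor type syntax, e.g. `tensor<2x3xf32>`.
    /// A rank-0 tensor renders as `tensor<f32>`.
    fn to_mlir(&self) -> String {
        let mut parts: Vec<String> = self.dims.iter().map(|d| d.to_string()).collect();
        parts.push(self.dtype.to_string());
        format!("tensor<{}>", parts.join("x"))
    }
}

/// Emits a module whose `main` function adds two tensors of the same type
/// using the `stablehlo.add` op.
fn emit_add_module(ty: &TensorType) -> String {
    let t = ty.to_mlir();
    format!(
        "module {{\n  func.func @main(%arg0: {t}, %arg1: {t}) -> {t} {{\n    %0 = stablehlo.add %arg0, %arg1 : {t}\n    func.return %0 : {t}\n  }}\n}}\n",
        t = t
    )
}
```

For a `2x3` tensor of `f32`, the emitted function body contains the line `%0 = stablehlo.add %arg0, %arg1 : tensor<2x3xf32>`. Text like this is the kind of artifact that gets handed to the compiler, which is why a framework can be written in any language that can produce it.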
