Slow Token, Fast Action – Learning in Robotics


Atoms to Algorithms


Slow Token, Fast Action Thursday, June 4, 2026 · Learning

Jaimin Jun 04, 2026

Figure’s Helix 02 ran an 8-hour autonomous shift at a Brookfield residential site in May with a stack that, written down, looks almost insulting in its simplicity. A 7-billion-parameter vision-language model in the head asks itself ten times a second what the scene contains and what the human is asking for, and emits a single dense intent vector. An 80-million-parameter network reads that vector and the current camera frames and decides on the next joint targets twenty times faster than the first network can think. A third network, ten million parameters at a thousand hertz, is the only thing keeping the robot upright. Three networks at three speeds is the dominant humanoid control architecture of 2026.

This week walked the path from research policy to deployed robot. Monday separated the inner-loop fight between classical model-predictive control and end-to-end neural policies. Tuesday gave the neural side a data-scaling argument. Wednesday explained the sim-to-real machinery that puts those policies on hardware. Today is one floor up: the policy itself is no longer a single network. It is a hierarchy. And the shape of that hierarchy is suddenly the hottest argument in robotics research.

How it actually works

The split has a clean justification once you see it. The slow tier handles language and long-horizon context: “clear the table, the cup is on the left, the bowl has milk in it, do not knock it over.” That work needs a pretrained vision-language model, a billion-plus-parameter transformer with tens-to-hundreds of milliseconds of latency. The fast tier handles the next half-second of motion, every five to ten milliseconds, faster than a humanoid arm can drift. The big model cannot meet that latency. A small, fast network can. Figure names this split System 1 and System 2, borrowing Kahneman’s language for human cognition. System 2 reasons; System 1 acts. The Helix 02 release in January 2026 added System 0 below them, a tiny network at a thousand hertz to handle balance and posture. That third layer replaced 109,504 lines of hand-tuned C++ in one go. Three networks at three timescales fit inside an inference budget the new generation of robot computers (NVIDIA’s Jetson Thor, mostly) is built to handle.
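To make the three timescales concrete, here is a minimal sketch of a multi-rate loop on a simulated 1 kHz clock. The three rates come from the figures above (10 Hz, twenty times that, and 1,000 Hz). The function names, the scalar "intent" and "posture" values, and the stub math are all assumptions for illustration and have nothing to do with Figure's actual code.

```python
from dataclasses import dataclass

# Rates taken from the article; everything else here is illustrative.
S0_HZ = 1000   # balance and posture tier
S1_HZ = 200    # joint-target tier, twenty times the slow tier
S2_HZ = 10     # vision-language intent tier


def system2(scene: str) -> float:
    """Stub slow tier: collapse the scene into one 'intent' number."""
    return float(len(scene))


def system1(intent: float, frame: float) -> float:
    """Stub fast tier: turn intent plus the latest frame into a joint target."""
    return 0.01 * intent + frame


def system0(target: float, posture: float) -> float:
    """Stub balance tier: proportional correction toward the target."""
    return 0.5 * (target - posture)


@dataclass
class TierCounts:
    s2: int = 0
    s1: int = 0
    s0: int = 0


def run(ticks: int, scene: str = "cup on the left") -> TierCounts:
    """Step a simulated 1 kHz clock; slower tiers fire on every Nth tick."""
    counts = TierCounts()
    intent, target, posture = 0.0, 0.0, 0.0
    for tick in range(ticks):
        if tick % (S0_HZ // S2_HZ) == 0:   # every 100 ticks
            intent = system2(scene)
            counts.s2 += 1
        if tick % (S0_HZ // S1_HZ) == 0:   # every 5 ticks
            target = system1(intent, frame=0.0)
            counts.s1 += 1
        posture += system0(target, posture)  # every tick
        counts.s0 += 1
    return counts


counts = run(ticks=1000)   # one simulated second
print(counts)              # TierCounts(s2=10, s1=200, s0=1000)
```

The point of the sketch is the ratio: in one simulated second the slow tier runs 10 times while the balance tier runs 1,000 times, always against the most recent intent it has. A real stack would run the tiers asynchronously rather than in lockstep on one clock.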

NVIDIA calls almost the same idea an Action Cascade. The GR00T N1.7 release in April 2026 ships a 3-billion-parameter model in two halves: a vision-language reasoner that emits short, abstract action tokens, and a diffusion transformer that takes those tokens plus live joint states and produces motor commands. The two halves are trained together, in one pass, with the gradient flowing through both. That detail matters. The reasoner is not handed an instruction manual; it learns what abstract tokens the motor network can carry out, and the motor network learns to denoise the tokens the reasoner tends to emit. They learn each other’s language.
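What "the gradient flowing through both" means can be shown with a deliberately tiny toy: two scalar weights standing in for the two halves, trained on one loss. This is not GR00T's architecture (no transformer, no diffusion). The names `w_reasoner` and `w_motor`, the linear model, and the numbers are all assumptions chosen so the chain rule fits on two lines.

```python
def forward(w_reasoner, w_motor, obs, joint_state):
    """Toy two-stage policy: reasoner makes a token, motor half turns it into an action."""
    token = w_reasoner * obs                 # stand-in for an abstract action token
    action = w_motor * token + joint_state   # motor half sees token plus joint state
    return token, action


def train(steps=100, lr=0.1):
    """Joint gradient descent on both weights against one squared-error loss."""
    w_reasoner, w_motor = 0.5, 0.5
    obs, joint_state, target = 1.0, 0.1, 2.0
    for _ in range(steps):
        token, action = forward(w_reasoner, w_motor, obs, joint_state)
        err = action - target
        # Chain rule on loss = err**2: the same error signal reaches both halves.
        grad_motor = 2.0 * err * token
        grad_reasoner = 2.0 * err * w_motor * obs
        w_motor -= lr * grad_motor
        w_reasoner -= lr * grad_reasoner
    _, action = forward(w_reasoner, w_motor, obs, joint_state)
    return w_reasoner, w_motor, (action - target) ** 2


w_reasoner, w_motor, loss = train()
print(w_reasoner, w_motor, loss)
```

Neither weight is told what the token should be. The reasoner's update depends on the current motor weight and the motor's update depends on the current token, which is the toy version of the two halves learning each other's language.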

Physical Intelligence, the startup behind π0 and π0.5, takes a softer version of the same idea. There is no separate fast network. The same model runs twice per step: once to predict a semantic subtask label (“pick up the cutting board”), then again to predict the actual motor commands conditioned on that label. The hierarchy lives in the decoding schedule, not the architecture. The same lab also shipped Hi Robot, which goes the other way: a separate planner VLM that decomposes a complex prompt into English commands, fed into the original π0 as the executor. Two policies, not co-trained, talking in English. Reported win: 40 percent better instruction accuracy than asking GPT-4o to do the whole task. (Models stronger than GPT-4o have shipped since, so better options are now available.)
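The difference between the two designs is easiest to see as control flow. The sketch below uses stub classes only: `ToyPolicy`, `toy_planner`, and their string and number outputs are hypothetical stand-ins, not Physical Intelligence's API. The soft version calls one policy object twice per step. The hard version passes plain English strings from a separate planner to an executor.

```python
from typing import Callable, List, Tuple

Action = Tuple[float, float]


class ToyPolicy:
    """Hypothetical single model exposing two decoding passes over shared weights."""

    def __init__(self) -> None:
        self.passes: List[str] = []

    def predict_subtask(self, prompt: str, observation: str) -> str:
        # Pass 1: choose a semantic subtask label from the prompt and the scene.
        self.passes.append("subtask")
        return f"pick up the {observation}"

    def predict_action(self, subtask: str, observation: str) -> Action:
        # Pass 2: motor commands conditioned on the label (fake numbers).
        self.passes.append("action")
        return (float(len(subtask)), float(len(observation)))


def soft_hierarchy_step(policy: ToyPolicy, prompt: str, observation: str):
    """Soft hierarchy: the same model runs twice within one control step."""
    subtask = policy.predict_subtask(prompt, observation)
    action = policy.predict_action(subtask, observation)
    return subtask, action


def toy_planner(prompt: str) -> List[str]:
    """Stub planner: split a compound prompt into English commands."""
    return [part.strip() for part in prompt.split(",") if part.strip()]


def hard_hierarchy(planner: Callable[[str], List[str]], executor: ToyPolicy,
                   prompt: str, observation: str) -> List[Tuple[str, Action]]:
    """Hard hierarchy: a separate planner hands English commands to the executor."""
    return [(cmd, executor.predict_action(cmd, observation))
            for cmd in planner(prompt)]


policy = ToyPolicy()
subtask, action = soft_hierarchy_step(policy, "clean the kitchen", "cutting board")
print(subtask, policy.passes)

steps = hard_hierarchy(toy_planner, ToyPolicy(),
                       "pick up the cup, wipe the table", "table")
print([cmd for cmd, _ in steps])
```

In the soft version the label never leaves the model that produced it. In the hard version the interface is a list of strings, which is why the planner and the executor can be trained, swapped, or upgraded independently.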

So there are three camps. End-to-end co-training, where the boundary is an implementation detail of one model. Soft hierarchy, where the boundary lives inside one network’s decoding pass. Hard modular hierarchy, where two networks talk in plain English.

A new April 2026 paper, Libra-VLA, reframes the argument as a tuning problem: performance follows an inverted-U on action-decomposition granularity, peaking at a specific middle point. Too much abstraction at the top and the bottom cannot ground the intent. Too little and the top has no work to do.

And the counter-argument worth naming: Toyota Research Institute’s Large Behavior Models are monolithic. One network, one timescale, one action chunk every 1.6 seconds. TRI’s bet is that the diffusion process itself absorbs the multi-timescale structure (early denoising steps look like planning, late steps look like refining) and that forcing the split into the architecture is premature. The headline result is 80 percent less data to learn new tasks. That is the evidence the dual-system camp has to answer.

New this week

NVIDIA released GR00T N1.7 in April as an open,...
