The Token at the Seam – Learning in Robotics


The Token at the Seam - by Jaimin - Atoms to Algorithms



The Token at the Seam
Friday, June 5, 2026 · Learning

Jaimin · Jun 05, 2026


Yesterday ended at the seam between a robot’s slow, thinking network and its fast, acting one: a short stream of numbers that one emits and the other reads. Here is what that stream costs. A 7-joint robot arm controlled fifty times a second produces 350 little numbers every second. Write each one down as its own word, the way the first generation of robot brains did, and one second of motion takes more words than this paragraph. Compress the whole second the way a JPEG compresses a photo, and it takes a handful. Or skip words entirely and draw the motion as a smooth curve. Three ways to spell a movement. Today’s robot foundation models split exactly along that choice, and it is the cleanest way to tell the families apart.

This closes our Learning week. Monday asked whether a robot should solve math at runtime or memorize the answers in its weights. Tuesday followed the training data to camera-wearing humans. Wednesday explained how policies trained in simulation survive contact with reality. Thursday split the robot’s brain into a slow reasoner and a fast actor. Today is the language they speak across that split.

How it actually works

The big models that power chatbots do one thing: predict the next word from a fixed vocabulary. The models steering robots are built from the same parts, which creates an awkward problem. A robot action is not a word. It is a list of numbers, one per joint, refreshed dozens of times per second. Every robot foundation model needs a rule for turning continuous motion into something word-like, and that rule, the action tokenizer, has quietly become one of the most consequential design choices in robotics.

The first answer was simple: chop each number’s range into 256 levels and give each level a name. Google DeepMind’s RT-2 did this in 2023, literally borrowing 256 of the rarest words in its language model’s vocabulary and reassigning them to motor positions. OpenVLA, the open-source landmark of 2024, kept the same scheme and showed a 7-billion-parameter open model could beat a 55-billion-parameter closed one. But the scheme has a flaw that shows up exactly when robots get good: at high control rates, each action is nearly identical to the last, so the model can score well in training by lazily repeating the previous token while learning almost nothing about motion.
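
To make the 256-level scheme concrete, here is a minimal Python sketch of per-number binning. It is an illustration, not RT-2’s or OpenVLA’s actual code: the normalized range of -1 to 1, the function names, and the toy sine-wave joint motion are all assumptions made for the example.

```python
import math

NUM_BINS = 256
LOW, HIGH = -1.0, 1.0              # assumed normalized action range
BIN_WIDTH = (HIGH - LOW) / NUM_BINS

def tokenize(value: float) -> int:
    """Map one continuous action value to a bin index in [0, NUM_BINS - 1]."""
    clipped = min(max(value, LOW), HIGH)
    fraction = (clipped - LOW) / (HIGH - LOW)
    return min(int(fraction * NUM_BINS), NUM_BINS - 1)

def detokenize(token: int) -> float:
    """Map a bin index back to the center of its bin."""
    return LOW + (token + 0.5) * BIN_WIDTH

JOINTS = 7
CONTROL_HZ = 50
tokens_per_second = JOINTS * CONTROL_HZ   # one token per joint per control step

# Hypothetical smooth motion for a single joint, sampled for one second.
trajectory = [0.5 * math.sin(math.pi * k / CONTROL_HZ) for k in range(CONTROL_HZ)]
tokens = [tokenize(x) for x in trajectory]

# Largest change between neighboring tokens: a measure of how little
# new information each step carries at this control rate.
max_jump = max(abs(a - b) for a, b in zip(tokens, tokens[1:]))

print(tokens_per_second, "tokens per second for the whole arm")
print("largest step between neighboring tokens:", max_jump, "of", NUM_BINS, "bins")
```

For this toy motion, neighboring tokens never sit more than a few bins apart out of 256, which is the redundancy described above: guessing "about the same as last time" is already a strong prediction.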

The second generation split into two camps. Physical Intelligence’s FAST tokenizer borrows a fifty-year-old idea from image compression: transform the motion into frequency space, where smooth movement collapses into a few meaningful coefficients, and spell those instead. Same robot, same data, up to five times faster training. The other camp, led by the same lab’s π0 model, refuses to spell at all: a separate “action expert” head starts from noise and bends it into a smooth motion curve, no vocabulary involved. Drawing beats spelling for producing fluid 50-times-a-second motion, but it teaches the language part of the brain more slowly.
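
The frequency-space idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the FAST implementation: it uses a hand-written discrete cosine transform (the transform behind JPEG), a hypothetical rounding step of 0.05, and a made-up smooth reach for one joint. The real tokenizer applies further compression to the rounded coefficients, which is omitted here.

```python
import math

def dct(signal):
    """Orthonormal DCT-II: time samples -> frequency coefficients."""
    n = len(signal)
    out = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out

def idct(coeffs):
    """Inverse of the orthonormal DCT-II above."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out

CONTROL_HZ = 50
STEP = 0.05   # hypothetical quantization step, chosen only for illustration

def smooth_reach(t):
    """Made-up smooth move from 0 to 0.8 over one second (t in [0, 1])."""
    return 0.8 * (10 * t**3 - 15 * t**4 + 6 * t**5)

trajectory = [smooth_reach(i / (CONTROL_HZ - 1)) for i in range(CONTROL_HZ)]

coeffs = dct(trajectory)
quantized = [round(c / STEP) for c in coeffs]      # integers to be "spelled"
nonzero = sum(1 for q in quantized if q != 0)

reconstructed = idct([q * STEP for q in quantized])
rms_error = math.sqrt(
    sum((a - b) ** 2 for a, b in zip(trajectory, reconstructed)) / CONTROL_HZ
)

print("samples in one second:", len(trajectory))
print("nonzero coefficients after rounding:", nonzero)
print("RMS reconstruction error:", rms_error)
```

On this toy curve, most of the 50 coefficients round to zero, so one second of one joint is described by a handful of integers instead of 50, while the rebuilt curve stays close to the original. A coarser step would keep fewer coefficients at the cost of a rougher reconstruction.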

The 2026 consensus is sneakier than either camp: do both. Train the big network on compressed tokens, because tokens are how transformers learn fastest, while a separate drawing head learns continuous motion behind a one-way gate that keeps its lessons from leaking back and damaging the language brain. Physical Intelligence calls this knowledge insulation. NVIDIA’s GR00T models land in nearly the same place from the other direction. Across competing labs the answer is converging: tokens are for learning, curves are for acting, and the seam between yesterday’s two systems turns out to be a training trick rather than a product anyone can own.

New this week

NVIDIA used its COMPUTEX keynote week to ship the numbers behind its open-robot-model strategy: GR00T models have been downloaded 274,000 times, the companion simulation dataset has passed 10 million downloads, and the newest GR00T 1.7 model, pretrained on 20,000 hours of human point-of-view video, is now commercially licensed. The action vocabulary is being given away; the computers that run it are not. (NVIDIA blog)

Two papers posted to arXiv this week push opposite ends of the spelling question. BlockVLA speeds up token-by-token robot models 3.3x by denoising whole blocks of tokens in parallel (arXiv 2605.13382). RotVLA proposes that the vocabulary should not describe any particular robot’s joints at all, but live in an abstract geometric space learned partly from human videos, so one vocabulary can drive many different bodies (arXiv 2605.13403).

What to notice

The visualization draws one second of robot arm motion spelled three ways: a long row of per-instant tokens, a short row of compressed frequency tokens, and an unbroken curve. The thing to notice is that the difference is entirely about what the learning system is asked to predict. The robot’s motors...

