Twenty Thousand Hours Without a Robot – Perception and Training – Robotics



Atoms to Algorithms


Tuesday, June 2, 2026 · Learning

Jaimin · Jun 02, 2026


In February, NVIDIA’s robotics group published an equation almost no one expected to see this decade. The relationship between how much human video you train a robot policy on and how well that policy performs fits a near-perfect straight line on a log scale, with an R-squared of 0.9983. The number that matters is not the line. It is the input: 20,854 hours of human-shot first-person video, collected with zero robots in the loop. Yesterday’s post mentioned a startup called Shift, which plans to feed on exactly this kind of data.

Bercan (@bercankilic)

Today, we are launching shift. Starting in NYC, we are bridging the economy of today into the AI economy where all services, goods, and leisure will be affordable, and humanity will progress towards abundance. Please enjoy your free home cleaning and join shift for a lot more!

4:59 PM · May 28, 2026 · 308K Views


That is the answer to a question yesterday’s issue raised about whether neural policies, the side of the architectural fight that pays its bill in training data rather than runtime compute, can actually scale. The bottleneck through 2025 was that real robot demonstrations come out at roughly three robot-hours per day per machine, and a foundation model wants orders of magnitude more. Wearable data collection rigs, hand-held grippers, exoskeletons, and head-mounted cameras are how the field is trying to break that ceiling.
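To put that ceiling in numbers, here is a back-of-the-envelope sketch using only the two figures above: roughly three robot-hours per day per machine, and the 20,854 hours of video in the NVIDIA result. The helper names and the one-year deadline are illustrative assumptions, not anything from the paper.

```python
import math

# Figures quoted in the article.
TARGET_HOURS = 20_854        # hours of human video in the NVIDIA result
ROBOT_HOURS_PER_DAY = 3      # rough demonstration yield per robot per day


def machine_days(target_hours: float, hours_per_day: float) -> float:
    """Days a single robot would need to record target_hours of demonstrations."""
    return target_hours / hours_per_day


def fleet_size_for_deadline(target_hours: float, hours_per_day: float, days: int) -> int:
    """Hypothetical helper: robots needed to hit target_hours within `days` days."""
    return math.ceil(target_hours / (hours_per_day * days))


if __name__ == "__main__":
    days = machine_days(TARGET_HOURS, ROBOT_HOURS_PER_DAY)
    print(f"{days:,.0f} machine-days, about {days / 365:.1f} machine-years")
    # Assumed deadline of one year, chosen only for illustration.
    print(fleet_size_for_deadline(TARGET_HOURS, ROBOT_HOURS_PER_DAY, 365), "robots for one year")
```

At that rate a single robot would need about 6,951 days, roughly 19 years, to match the dataset, or a fleet of 20 machines running every day for a year. That gap is the ceiling the wearable rigs are aimed at.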

How it actually works

Start with the parallel-jaw gripper, the simplest end effector on a production robot. Two fingers, one degree of freedom. A team at Stanford and Columbia, led by Cheng Chi in Shuran Song’s lab, took that gripper, mounted a GoPro at the wrist, and gave it to a human (this is the Universal Manipulation Interface, or UMI). The human walks through a task. The GoPro records first-person video. The gripper geometry is mechanically identical to the robot’s. When the data comes out, the camera angle, the visual scene, and the gripper state look exactly like what the robot would see at deployment. The robot was never in the room. The hands were a person’s.

The wrinkle is that a parallel-jaw gripper is the easy case. A five-finger dexterous hand has twenty-something articulated joints and no clean mapping from the human hand to the robot’s. The same lab’s answer is DexUMI, an exoskeleton you wear like an oversized glove, with an encoder at every articulated joint. The encoders read the operator’s joint angles directly instead of guessing them from a camera. A software step then paints the robot hand into the recorded video so the policy sees what the robot would see during real deployment. The reported success rate after this transfer is 86 percent across two different robot hand platforms.

[Figure: Skill Capture Glove]

What changes is the price. A teleoperation station with leader and follower arms, the standard since Aloha, runs forty to sixty thousand dollars per workstation. An exoskeleton-based capture rig like AirExo-2 costs about six hundred dollars in parts. That is a ratio of roughly seventy to a hundred to one. The bottleneck stops being how many stations you can afford and starts being how many environments you can get into.

Which is where NVIDIA’s EgoScale paper lands. The recipe has three stages. Pretrain a vision-language-action policy on 20,854 hours of human video, mapping wrist motion and retargeted hand joint angles into a common 22-DOF space. Mid-train on a small slice of carefully aligned paired human-and-robot data so the model learns robot-specific quirks. Post-train on a handful of demonstrations per task. The result is the log-linear scaling law plus 54 percent better task success than no pretraining on a 22-joint dexterous hand, with the learned motor prior transferring cleanly to lower-DOF hands too. Reading this, I come away thinking that since 2022 nearly everything in robotics has been moving into the realm of training and learning.

Figure, the humanoid startup, made the same bet at a different layer. Their Project Go-Big partnership with Brookfield gives Figure access to over 100,000 residential units, half a billion square feet of office space, and 160 million square feet of logistics space, all environments where humans can wear capture gear and just move through their day. Figure trained Helix entirely on this human video data and showed in September 2025 that the resulting robot responds to commands like “go to the fridge” without any robot demonstrations in the training set. One network now outputs both upper-body manipulation and base navigation, end to end, from camera pixels and spoken language.

The production-ready version of this is hybrid. Toyota Research’s Large Behavior Models, published in Science Robotics earlier this year, train on a mixture that explicitly includes 32 hours of UMI data alongside 468 hours of in-house...
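For readers who want to see what a "log-linear scaling law" with a high R-squared means mechanically, here is a minimal sketch of the fit. The data points are synthetic and made up purely for illustration; they are not EgoScale's numbers, and the article does not say which metric the paper fits, so `scores` stands in for any scalar metric (a validation loss, say). The function name `fit_log_linear` is my own.

```python
import math


def fit_log_linear(hours, scores):
    """Least-squares fit of score = a + b * log10(hours).

    Returns (a, b, r_squared).
    """
    xs = [math.log10(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx                      # slope per tenfold increase in data
    a = mean_y - b * mean_x            # intercept at 1 hour (log10 = 0)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, scores))
    ss_tot = sum((y - mean_y) ** 2 for y in scores)
    return a, b, 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # SYNTHETIC points, invented for this example only.
    hours = [100, 1_000, 10_000, 20_854]
    scores = [1.01, 0.79, 0.61, 0.53]
    a, b, r2 = fit_log_linear(hours, scores)
    print(f"score = {a:.3f} + {b:.3f} * log10(hours), R^2 = {r2:.4f}")
```

The slope `b` is the change in the metric for every tenfold increase in training hours. An R-squared near 1 says the points barely stray from that line, which is what makes the curve useful for extrapolating to larger datasets.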
