500 Hours of Humanoid Teleop, Recorded in Real Homes

Weekly Physical AI Roundup
Jay Chia
Jun 25, 2026

BitRobot, Unitree, and Hugging Face put HIW-500 on the Hub this week: more than 500 hours of Unitree G1 teleoperation recorded in actual homes across Southeast Asia, over 23,000 episodes with subtask and language annotations. Open humanoid data at this scale, collected outside the lab, has been rare, and LeRobot has added G1 support so it lands within reach of anyone with the robot.

A real-home humanoid dataset you can actually pull

HIW-500, from BitRobot Network with Unitree and Hugging Face, is 500+ hours of human teleoperation on a Unitree G1 collected across real homes in Southeast Asia. The card lists more than 23,000 episodes annotated with 161 subtask labels, 148,000+ subtask annotations, and language instructions, which makes it usable for imitation and VLA training and for long-horizon household tasks rather than single skills. What sets it apart is the setting: most open humanoid data is lab-staged or simulated, and real-home teleoperation at this volume is hard to come by. The LeRobot team has re-encoded the set into LeRobot format, cutting it from roughly 10 TB to about 2 TB for streaming, and has added Unitree G1 support. The card still lists that format as coming soon, though, so for now the data comes as raw ROS bag / MCAP, and the license isn’t stated. It’s pitched as the largest open humanoid teleop dataset to date; even setting that claim aside, it’s one of the more useful artifacts to land this week.
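For a rough sense of what those headline numbers imply per episode, here is a minimal back-of-envelope sketch. It uses only the figures quoted above, which are lower bounds or approximations ("500+", "more than 23,000", "148,000+", "roughly 10 TB"), so the outputs are estimates rather than statistics from the dataset itself. The function and key names are made up for illustration.

```python
# Back-of-envelope estimates from the HIW-500 figures quoted in the text.
# Inputs are lower bounds or rough sizes, so treat the outputs as ballpark values.

HOURS = 500                     # "500+ hours"
EPISODES = 23_000               # "more than 23,000 episodes"
SUBTASK_ANNOTATIONS = 148_000   # "148,000+ subtask annotations"
RAW_TB = 10                     # "roughly 10 TB" raw
LEROBOT_TB = 2                  # "about 2 TB" re-encoded


def dataset_estimates(hours, episodes, annotations, raw_tb, encoded_tb):
    """Return rough per-episode and storage ratios from headline numbers."""
    return {
        "mean_episode_seconds": hours * 3600 / episodes,
        "annotations_per_episode": annotations / episodes,
        "size_reduction_factor": raw_tb / encoded_tb,
        "raw_gb_per_hour": raw_tb * 1000 / hours,
    }


estimates = dataset_estimates(HOURS, EPISODES, SUBTASK_ANNOTATIONS, RAW_TB, LEROBOT_TB)
for name, value in estimates.items():
    print(f"{name}: {value:.1f}")
```

With these inputs the mean episode comes out a bit over a minute, with roughly six subtask annotations each, which fits the long-horizon framing. The re-encode is about a fivefold size reduction.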

Research

Vesta: one model for localization, navigation, and planning

NVIDIA GEAR’s Vesta folds localization, spatial reasoning, navigation, and long-horizon planning into a single model with a multimodal memory harness, instead of the usual multi-model stack where errors cascade between components. It reports beating individual specialist baselines by more than 20% and improving real-world task success by over 35%, though the numbers are single-setup and there’s no code or weights out yet, so read it for the architectural bet.
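One way to see why cascading errors matter in a multi-model stack: if stages fail independently, end-to-end success is the product of the per-stage success rates. The sketch below uses made-up rates, not anything reported for Vesta, and the independence assumption is a simplification.

```python
# Illustration only: the per-stage success rates are hypothetical,
# not figures from the Vesta paper, and stages are assumed independent.
from math import prod


def pipeline_success(stage_success_rates):
    """Probability that every stage in a sequential stack succeeds."""
    return prod(stage_success_rates)


# Four specialist modules (e.g. localization, spatial reasoning,
# navigation, planning), each assumed to succeed 90% of the time.
stack = [0.9, 0.9, 0.9, 0.9]
print(f"end-to-end success: {pipeline_success(stack):.3f}")
```

Under these assumptions, four 90% stages yield about 66% end to end, which is the kind of compounding a single-model design is betting it can avoid.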

HumanScale: human video as a better pretraining source than teleop

HumanScale runs a controlled comparison of egocentric human video against teleoperated robot data as a pretraining source, with post-training and validation held fixed. Given the same amount of pretraining data, the human-video models reach 24% lower validation loss on real-robot action prediction and markedly higher real-task success, which supports pretraining on human video and then adapting with a small amount of labeled robot data. It still uses robot data, just much less of it.

CAIP: pulling paired vision and action out of human video

CAIP (a Berkeley and NVIDIA group including Fan, Zhu, Malik, and Darrell) goes after the same data scarcity from a different angle. Control policies need vision paired with actions, which robot trajectories provide but cannot supply at pretraining scale, so CAIP treats 3D hand poses extracted from egocentric video as a stand-in for end-effector actions and learns an action-image representation from 32,000 hours of human video plus only 88 hours of robot data, beating encoders like DINOv2 and R3M. Read next to HumanScale, the two reach the same place by different routes: human video as the scalable substrate, with robot data kept small.

Two routes to dexterous hand data without a robot

Two papers attack hand-data collection from opposite directions. A Berkeley group’s Do as I Do reconstructs hand-object interaction from ordinary monocular RGB human video and retargets it onto multi-fingered robot hands. Wh0 instead generates the data, using video world models to synthesize a 50,000-episode egocentric hand dataset, and reports lifting zero-shot success on 18 unseen real tasks from 8.3% to 38.9%.

AutoDex: a self-running grasp-data rig

AutoDex is a fully automated setup that generates, physically executes, and labels dexterous grasps with no human in the loop, using a 20-camera rig to handle occlusion. It collected 3,593 trials across two hands and 100 objects at about 4.8x teleoperation throughput, and policies trained on it hit 76% real-world success against 34% for sim-only data, since the rig certifies which grasps actually held.

OpenHLM: a recipe for whole-body humanoid loco-manipulation

OpenHLM maps language and pixels directly to all of a humanoid’s degrees of freedom rather than splitting upper and lower body into separate controllers, and the value is in the ablations. It finds joint-based teleoperation beats other interfaces, wheeled-platform pretraining transfers to legged humanoids, and the resulting policy reportedly beats GR00T N1.6 and Ψ₀ with under half the demonstration time.

Does fine-tuning a VLM into a VLA make it forget?

This benchmark introduces Act2Answer, which...
