Robotics Teams Are Rebuilding the Data Stack from Scratch

The data layer tax for robot learning — Rerun


Written by Nikolaus West, 1 month ago

Scaling laws are starting to work for robotics, producing capabilities that were unthinkable just a few years ago. End-to-end models predict robot actions directly from sensor inputs, which simplifies the on-robot software but makes everything from data collection to training dramatically harder. LLM teams scaled on mature data infrastructure to improve performance through fast iteration on data. Robotics teams are trying to scale without it.

Most teams build data tooling from scratch because existing infrastructure wasn't designed for the multi-rate, multimodal data that powers robot learning. Across the data journey from collection to training, common operations are harder and slower than they should be. The cumulative cost, in iteration speed, engineering focus, and GPU utilization, is what we'll call the data layer tax. Reducing this tax is a major lever for moving faster and scaling in the race towards what looks like the biggest market the world has ever seen.

Architecturally, the data layer owns storing, modeling, and accessing data. The data layer for Physical AI is still immature, and the cost is visible at every stage of the pipeline. If you're building or investing in robot learning, this post is a map of where that tax comes from. We'll walk backwards from evaluation to collection, showing how requirements cascade upstream and why the tax compounds with data scale, source variety, and curation sophistication.

Policy evaluation

An extensive set of “evals” is core to any LLM team’s ability to make rapid progress. Evaluating robot behavior is much more difficult, which has a cascading effect on the entire pipeline. For robotics teams, even a small real-world evaluation of a trained policy takes hours or days of robot trials, plus careful design and operations. This means that making rapid progress against extensive, repeatable, and fast evals isn’t feasible in robotics.

Teams instead rely on proxy metrics that score data quality directly: reward models that assess task progress, 3D reconstruction quality as a signal for calibration correctness, or simply an estimate of how jerky the trajectories are. These tell you whether individual episodes or samples look good or bad, not whether they produce a better policy.

Since real evaluation runs are harder to do, it’s important to study each one deeply.
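As an illustration of the simplest proxy metric mentioned above, trajectory jerkiness, here is a minimal sketch. It assumes a single joint position sampled at a uniform time step `dt` and scores the episode by the root-mean-square of the third finite difference. The function name `rms_jerk` and the choice of RMS as the summary statistic are assumptions made for this example, not something the post prescribes.

```python
import math


def rms_jerk(positions, dt):
    """Root-mean-square jerk of a uniformly sampled 1-D trajectory.

    Jerk is the third time derivative of position. It is approximated here
    with the third forward finite difference divided by dt**3.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    if len(positions) < 4:
        raise ValueError("need at least 4 samples to estimate jerk")

    jerks = [
        (positions[i + 3] - 3 * positions[i + 2] + 3 * positions[i + 1] - positions[i])
        / dt**3
        for i in range(len(positions) - 3)
    ]
    return math.sqrt(sum(j * j for j in jerks) / len(jerks))


# Usage: a steady ramp versus the same ramp with a small alternating wobble.
dt = 0.02  # hypothetical 50 Hz joint-state stream
smooth = [0.01 * i for i in range(50)]
wobbly = [0.01 * i + (0.001 if i % 2 else -0.001) for i in range(50)]
print(rms_jerk(smooth, dt), rms_jerk(wobbly, dt))
```

A score like this can rank or filter episodes, but as the text notes it only says whether a recording looks smooth, not whether training on it yields a better policy.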
Many important decisions come from researchers who are deeply steeped in the data, watching eval rollouts and using their intuition about the full system to decide how to proceed.

From a data infrastructure perspective, evaluation looks a lot like collection. You record model inputs, outputs, and targets along with metadata like model version, subtask, and environment configuration. Researchers then review large numbers of rollouts, aggregate them by metrics, and drill into specific recordings. Tracing a rollout back to the training data that caused it often requires manual detective work across disconnected tools and formats. Every point of friction adds up to slower iteration and to insights that never feed back into training better policies.

Model training

Robot behavior learning shares many foundations with other machine learning tasks. What changes is that these models output actions over time. This added time dimension drastically increases the complexity of the data layer that supports training, for two key reasons: sample construction and video compression.

Sample construction during training

When training large models, we must feed the expensive GPUs data fast enough to maximize utilization. Researchers steer the behavior of the model by selecting what data to include and how to sample it. They often want to combine multiple datasets when training a model. It’s common to use a weighted combination of multiple datasets, and sometimes even per-time-step weights for sampling probability or loss weighting.

Consider training a vision-language-action model (VLA) with action chunking, like ACT or pi0.5. A humanoid robot model could consume three video streams from head and wrist cameras, positions and velocities from 30+ joints, gripper states, and a language instruction.

Figure: Architecture overview of the GR00T N1 model by Nvidia.

Each training sample in a batch starts from a single time step from one of the episodes in the dataset.
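To make the weighted dataset mixture concrete, the following sketch picks the starting point of one training sample: first a dataset according to its mixture weight, then an episode, then a time step. The dataset names, the weights, and the choice to pick episodes in proportion to their length (so every time step within a dataset is equally likely) are all assumptions for illustration. Real pipelines may use per-time-step weights instead, as the text mentions.

```python
import random


def sample_start(datasets, weights, rng):
    """Pick (dataset_name, episode_index, time_step) for one training sample.

    datasets: dict mapping dataset name -> list of episode lengths (in steps)
    weights:  dict mapping dataset name -> relative mixture weight
    rng:      a random.Random instance, passed in for reproducibility
    """
    names = list(datasets)
    name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

    episode_lengths = datasets[name]
    # Weight episodes by length so that each time step in the chosen
    # dataset has the same probability of being the sample start.
    episode = rng.choices(range(len(episode_lengths)), weights=episode_lengths, k=1)[0]
    step = rng.randrange(episode_lengths[episode])
    return name, episode, step


# Usage with two hypothetical datasets mixed 80/20.
datasets = {"teleop": [400, 250, 600], "sim": [1000, 1000]}
weights = {"teleop": 0.8, "sim": 0.2}
rng = random.Random(0)
batch_starts = [sample_start(datasets, weights, rng) for _ in range(8)]
print(batch_starts)
```

Because the mixture weights are decoupled from dataset size, a small but valuable dataset can be oversampled without duplicating it on disk. The cost is that the storage layer must serve random reads across all datasets quickly enough to keep the GPUs busy.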
For a basic VLA, the sample itself consists of camera frames from each view, the robot's current state, and a chunk of future actions, often the next 50-100 time steps. Somewhere between recording on the robot and training, these inputs all need to be time-aligned, which is a common source of subtle bugs.

Figure: Basic sample construction for VLA model training. Note that target actions are taken by looking forward in time from the sample starting point.

In this case, a naive...
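The basic sample construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular framework. Each stream carries its own sorted timestamps. Alignment takes the most recent observation at or before the sample time. Chunks that run past the end of the episode are padded by repeating the last action and flagged with a mask. The episode layout and the names `latest_at` and `build_sample` are hypothetical.

```python
from bisect import bisect_right


def latest_at(timestamps, t):
    """Index of the last entry in sorted `timestamps` that is <= t."""
    i = bisect_right(timestamps, t) - 1
    if i < 0:
        raise ValueError("no observation at or before the requested time")
    return i


def build_sample(episode, step, chunk_size):
    """Build one training sample starting at action index `step`."""
    actions = episode["actions"]
    if not 0 <= step < len(actions):
        raise IndexError("step is outside the episode")
    t = episode["action_times"][step]

    # Observations: the latest frame per camera and the latest robot state.
    frames = {
        cam: stream["frames"][latest_at(stream["times"], t)]
        for cam, stream in episode["cameras"].items()
    }
    state = episode["states"][latest_at(episode["state_times"], t)]

    # Targets: look forward in time from the sample starting point.
    chunk = list(actions[step:step + chunk_size])
    valid = len(chunk)
    # Assumption: pad short chunks with the final action and mask them out.
    chunk += [chunk[-1]] * (chunk_size - valid)
    mask = [True] * valid + [False] * (chunk_size - valid)
    return {"frames": frames, "state": state, "actions": chunk, "mask": mask}


# Usage: timestamps in integer milliseconds, streams at different rates.
episode = {
    "action_times": list(range(0, 200, 20)),  # 10 actions, 50 Hz
    "actions": [[float(i)] for i in range(10)],
    "state_times": list(range(0, 200, 10)),  # 20 states, 100 Hz
    "states": list(range(20)),
    "cameras": {"head": {"times": [0, 33, 66, 100, 133, 166],
                         "frames": ["f0", "f1", "f2", "f3", "f4", "f5"]}},
}
sample = build_sample(episode, step=4, chunk_size=4)
print(sample)
```

Taking the latest observation at or before the sample time avoids leaking future information into the model inputs, one of the subtle alignment bugs that is easy to introduce when streams run at different rates.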
