Macrodata Refiner – infrastructure for the robotics data loop

Tmpod · 1 pts · 0 comments

Macrodata Labs: every strong model starts with great data<br>Macrodata Labs helps robotics teams turn raw physical-world data into better training datasets. Refiner, our open-source data processing framework, lets you build pipelines locally in Python, then scale the same pipeline on managed cloud compute.<br>$ pip install macrodata-refiner

[Animated pipeline diagram: sources hdf5 (14.2 GB), zarr (6.8 GB), lerobot (24.0 GB); stages read (132/s), annotate (8/s), filter (12.7%), write (3.4 MB/s); destinations s3, hf, gcs]

// 01<br>Introducing Refiner: focus on data, not infrastructure

COMPOSABLE BY DESIGN Define pipelines from simple primitives. Refiner handles scale, orchestration, and everything in between.

NATIVELY MULTIMODAL Process robot episodes with trajectories, camera streams, audio, and language in one pipeline. Refiner handles streaming IO, sharding, and native data formats.

BUILT FOR MODELS Deploy open-source models or bring your own API. Async execution, smart batching, parallelism, and retries are handled either way, locally or at cloud scale.
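The async execution, batching, and retry behavior described above can be pictured with plain asyncio. This is a minimal sketch of the general pattern, not Refiner's implementation or API: `call_model`, `with_retries`, `run_batched`, and all of their parameters are hypothetical names chosen for illustration.

```python
import asyncio


async def call_model(batch, attempts_seen):
    """Hypothetical model call: fails once per batch, then succeeds."""
    key = tuple(batch)
    attempts_seen[key] = attempts_seen.get(key, 0) + 1
    if attempts_seen[key] == 1:
        raise RuntimeError("transient failure")
    return [f"label-{item}" for item in batch]


async def with_retries(batch, attempts_seen, max_attempts=3):
    """Retry a batch up to max_attempts times, re-raising the last error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await call_model(batch, attempts_seen)
        except RuntimeError:
            if attempt == max_attempts:
                raise
            await asyncio.sleep(0)  # yield; a real system would back off here


async def run_batched(items, batch_size=2, max_concurrency=2):
    """Split items into batches and process them with bounded concurrency."""
    attempts_seen = {}
    semaphore = asyncio.Semaphore(max_concurrency)
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

    async def worker(batch):
        async with semaphore:
            return await with_retries(batch, attempts_seen)

    results = await asyncio.gather(*(worker(b) for b in batches))
    # Flatten per-batch outputs back into one list, preserving input order.
    return [label for batch_labels in results for label in batch_labels]


labels = asyncio.run(run_batched([1, 2, 3, 4, 5]))
```

The semaphore caps how many batches are in flight at once, and `asyncio.gather` returns results in submission order, so the labels line up with the inputs even when individual batches are retried.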

Read any format · Annotate with VLMs · Model workflows · Large-scale dedup

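import refiner as mdr

(
    # Read every HDF5 file in the Hugging Face dataset.
    mdr.read_hdf5(
        "hf://datasets/nvidia/ALOHA-Cosmos-Policy/**/*.hdf5",
        groups="/",
        datasets={"action": "action", "observation.state": "observations/qpos"},
    )
    # Convert to robot rows at 25 fps for an ALOHA robot.
    .to_robot_rows(fps=25, robot_type="aloha")
    # Write the result in LeRobot format to S3.
    .write_lerobot("s3://robots/aloha-lerobot")
)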

// 02<br>Get more from robotics data

Annotating tasks with gemini-3.5-flash (batch 1 of 6)

IDX     States     Actions    Video  Task
001284  [280, 14]  [280, 7]
001285  [140, 14]  [140, 7]
001286  [320, 14]  [320, 7]
001287  [210, 14]  [210, 7]
001288  [260, 14]  [260, 7]

INGEST ANY FORMAT Read and convert Parquet, HDF5, MCAP, Zarr, RLDS, and LeRobot without custom scripts or slow local downloads.

SUBTASK ANNOTATIONS & HAND-TRACKING Use optimized pipelines for timestamped subtask annotation and ego-vision hand tracking across robot episodes.

REWARD MODELS Estimate task-completion progress with reward models such as Robometer, then use those scores to weight the frames that matter most.
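One way to picture turning progress scores into frame weights is sketched below. The weighting rule (positive progress gain plus a small floor) and the name `frame_weights` are assumptions made for illustration; the page does not say how Refiner or Robometer compute weights, and the scores here are made up.

```python
def frame_weights(progress, floor=0.1):
    """Turn per-frame progress scores into normalized sampling weights.

    Assumption: frames where progress increases matter most, so each frame's
    raw weight is its positive progress gain plus a small floor.
    """
    if not progress:
        return []
    # The first frame has no predecessor, so its gain is zero.
    gains = [0.0] + [max(0.0, b - a) for a, b in zip(progress, progress[1:])]
    raw = [gain + floor for gain in gains]
    total = sum(raw)
    return [weight / total for weight in raw]


# Made-up progress scores for a five-frame episode.
scores = [0.0, 0.1, 0.1, 0.6, 1.0]
weights = frame_weights(scores)
```

The floor keeps frames with no progress gain from being dropped entirely, so idle segments are down-weighted rather than excluded.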


// 03<br>Scale instantly with launch_cloud()

ONE LINE TO CLOUD .launch_local() becomes .launch_cloud(). Scale the same pipeline without rewriting code, changing data formats, or rebuilding your local workflow.

INSTANT CPU & GPU ACCESS Run many shards across managed CPU and GPU workers without reservations or machine provisioning. Macrodata Labs handles orchestration, scheduling, and worker lifecycle.

PAY FOR WHAT YOU USE Resources attach when work starts and release when it finishes. You pay for the compute your jobs actually consume, without idle cluster overhead.
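Running many shards across a pool of workers can be sketched as a two-step plan: assign each input file to a shard, then deal shards out to workers. This is an illustrative sketch only; the hashing scheme, the round-robin assignment, the file names, and the functions `shard_of` and `plan` are assumptions, not a description of how Macrodata Labs schedules work.

```python
import hashlib


def shard_of(path, num_shards):
    """Map a file path to a stable shard index by hashing the path string."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def plan(paths, num_shards, num_workers):
    """Group paths into shards, then deal shards to workers round-robin."""
    shards = {i: [] for i in range(num_shards)}
    for path in paths:
        shards[shard_of(path, num_shards)].append(path)
    workers = {w: [] for w in range(num_workers)}
    for shard_id in range(num_shards):
        workers[shard_id % num_workers].append(shard_id)
    return shards, workers


paths = [f"episode_{i:04d}.hdf5" for i in range(100)]  # hypothetical file names
shards, workers = plan(paths, num_shards=64, num_workers=8)
```

Hashing the path keeps the assignment deterministic across runs, so a retried shard sees the same files it saw the first time.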

Switching pipeline.launch_local() to pipeline.launch_cloud() runs the same pipeline on 5 × H100 GPUs: 8m 00s locally becomes 48s, which is 10× faster (a 900% speedup). Throughput is listed as rising from 5 MB/s to 100 GB/s, at about $0.27 per run, billed per second.
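The figures quoted above can be checked with a little arithmetic. The sketch below uses only the numbers on the page; the implied per-GPU-hour rate assumes the $0.27 covers GPU time alone, which the page does not state, and the throughput ratio assumes 1 GB = 1000 MB.

```python
local_seconds = 8 * 60   # 8m 00s
cloud_seconds = 48
gpus = 5
cost_per_run = 0.27      # USD

# Speedup from the run times, and the same figure as a percentage increase.
speedup = local_seconds / cloud_seconds
percent_increase = (speedup - 1) * 100

# Total GPU time consumed, and the hourly rate that would imply.
gpu_seconds = gpus * cloud_seconds
implied_rate_per_gpu_hour = cost_per_run / gpu_seconds * 3600

# Ratio of the quoted throughput figures (100 GB/s vs 5 MB/s).
throughput_ratio = (100 * 1000) / 5

print(f"speedup: {speedup:.0f}x (+{percent_increase:.0f}%)")
print(f"GPU-seconds: {gpu_seconds}")
print(f"implied rate: ${implied_rate_per_gpu_hour:.2f} per GPU-hour")
print(f"throughput ratio: {throughput_ratio:.0f}x")
```

The run times give a 10× speedup, matching the 900% figure, while the quoted throughput numbers differ by a much larger factor, so the two may be measuring different things.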

// 04<br>Supervise in real time

1089 frames/s · 153 MB/s · 15 tasks queued

Live logs<br>Job: robotics-demo ID: 019e68d3...eda75<br>Status: pending Kind: cloud<br>Stage 0 robotics_transform 0/64 shards · workers run=0 done=0 tot=8

TRACE EVERY DATASET Inspect the DAG, transforms, launch settings, dependencies, and captured code behind each dataset build, so every output is traceable back to the run that produced it.

PINPOINT FAILURES See the stage, shard, worker, traceback, logs, and retry state for failures instead of piecing together what happened after the fact. Inspect it on the web platform or let your agent pull it through the CLI.

SURFACE BOTTLENECKS See whether a run is limited by decoding, model calls, writing, CPU, memory, network, or GPU before scaling it further.
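As a rough illustration of bottleneck detection, the sketch below picks the stage that accounts for the largest share of measured time. The function `find_bottleneck` and the timings are hypothetical, and attributing time by share assumes stages run one after another; a pipelined system would more likely compare per-resource utilization.

```python
def find_bottleneck(stage_seconds):
    """Return (stage, share) for the stage with the largest share of total time."""
    total = sum(stage_seconds.values())
    if total <= 0:
        raise ValueError("no time recorded")
    stage = max(stage_seconds, key=stage_seconds.get)
    return stage, stage_seconds[stage] / total


# Made-up per-stage timings in seconds for one shard.
timings = {"decode": 12.0, "model_calls": 30.0, "write": 6.0}
stage, share = find_bottleneck(timings)
```

If model calls dominate, as in this made-up example, adding more decode workers would not help much; the share gives a quick sense of where extra resources would pay off.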

// GET STARTED<br>Get more from your robotics data.<br>Use Refiner to start extracting more signal from your data today, or reach out to discuss your robotics data challenges directly.<br>$ pip install macrodata-refiner
