ENPIRE: Agentic Robot Policy Self-Improvement in the Real World


Abstract

Achieving dexterous robotic manipulation in the real world relies heavily on human supervision and algorithmic engineering, which is a central bottleneck in the pursuit of general physical intelligence. Although emerging coding agents can generate code to automate algorithm search, their successes remain largely confined to digital environments. We conjecture that the missing abstraction to automate robotics research is a repeatable feedback loop for real-world policy improvement: reset the scene, execute a policy, verify the outcome, and refine the next iteration. To bridge this gap, we introduce ENPIRE, a harness framework for coding agents that instantiates this physical feedback routine with four core modules: an Environment module (EN) for automatic reset and verification, a Policy Improvement module (PI) that launches policy refinement, a Rollout module (R) to evaluate policies with single or multiple physical robots operating in parallel, and an Evolution module (E) in which coding agents analyze logs, consult the literature, and improve training infrastructure and algorithm code to address failure modes. This closed-loop system transforms real-world robot learning into a controllable optimization procedure that agents can manage, minimizing human effort while allowing fair ablations across training recipes and agent variants. Powered by ENPIRE, frontier coding agents can autonomously develop a policy that achieves a 99% success rate on challenging, dexterous manipulation tasks in the real world, such as PushT, organizing pins into a pin box, and using a cutter to cut a zip tie. Coding agents can improve policies with various PI regimes, such as heuristic learning, tool calling, behavior cloning, and offline or online RL. Moreover, ENPIRE can be significantly accelerated on a robot fleet, and we propose two metrics, Mean Robot Utilization (MRU) and Mean Token Utilization (MTU), to measure the efficiency of multi-agent physical autoresearch. We also include simulation results in RoboCasa. Our findings suggest a practical and scalable path toward autonomously advancing robotics in the real world.

Learned Manipulation Policy

Policies trained with ENPIRE reach a 99% pass@8 success rate across the showcased manipulation tasks.
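The page does not spell out how pass@8 is computed. One common reading, borrowed from code-generation benchmarks, is the probability that at least one of k attempts succeeds. The sketch below uses the standard unbiased estimator 1 - C(n-c, k)/C(n, k) under that assumption. The function name `pass_at_k` and the trial counts are hypothetical, and ENPIRE's own definition may differ.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one success in k attempts).

    n: total recorded trials, c: successful trials, k: attempts per evaluation.
    Assumes the "at least one of k" reading of pass@k; not confirmed by the source.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k trials must include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(pass_at_k(n=16, c=4, k=8))
```

Under this reading, pass@8 can be far higher than the single-attempt success rate, so the two numbers should not be compared directly.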

Push T

Pin Insertion

GPU Insertion

Tie Ziptie

Cut Ziptie

ENPIRE runs fully autonomously on real robots. Working only through the automated reset and verification interface, a team of coding agents proposes algorithmic hypotheses (heuristic learning, behavior cloning, offline and online RL), tests them against the real-world success rate, and keeps the changes that move it. The idea tree below traces that search as a hypothesis git-tree (one branch per agent, one node per idea tried), plotted on the same wall-clock-time axis as the success-rate curve, so you can see which ideas moved the curve upward.

[Interactive figure: idea tree with ideas I1 to I86. Vertical axis: team-avg best success rate (0% to 100%). Horizontal axis: research wall-clock time (0 to 3 h). Each dot is an idea; a green ring marks an idea that raised the team-avg score; a green curve marks cross-agent inspiration.]

Highlighted ideas and their gains in team-avg success rate:

I16 Online RL mix Demo: +3.8 pp

I37 BC regularization: +10.8 pp

I56 Tweak BC term weight: +0.4 pp

I66 Tune batch size 1024→512: +0.9 pp

I76 Compensate controller: +1.3 pp

Figure 1: Each coding agent explores its own branch of ideas, one lane per branch. Every dot is an idea it tried; a green ring marks an idea that raised the team’s average success rate, and green curves trace cross-agent inspiration. The lower panel tracks the team’s average success rate climbing over research wall-clock time.
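The bookkeeping behind such an idea tree can be sketched as a small data structure. This is a minimal illustration, not ENPIRE's actual format: the `Idea` fields, the function name, the idea IDs, and all numbers are hypothetical. It marks the ideas that would get a "green ring", meaning those that raised the best team-average score seen so far.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Idea:
    idea_id: str                 # hypothetical ID, e.g. "A1"
    agent: str                   # branch (lane) the idea belongs to
    title: str
    parent: Optional[str]        # previous idea on the same branch, if any
    team_avg_after: float        # team-average success rate (%) after the idea
    inspired_by: Optional[str] = None  # cross-agent inspiration, if any


def ideas_that_raised_score(
    ideas: List[Idea], baseline: float = 0.0
) -> List[Tuple[str, float]]:
    """Return (idea_id, gain in pp) for ideas that beat the best score so far.

    `ideas` must be in chronological (wall-clock) order.
    """
    best = baseline
    raised = []
    for idea in ideas:
        if idea.team_avg_after > best:
            raised.append((idea.idea_id, round(idea.team_avg_after - best, 1)))
            best = idea.team_avg_after
    return raised


if __name__ == "__main__":
    # Made-up history for two agents; not taken from the figure.
    history = [
        Idea("A1", "agent-a", "add BC regularizer", None, 40.0),
        Idea("B1", "agent-b", "larger batch", None, 38.0),
        Idea("B2", "agent-b", "reuse regularizer", "B1", 45.5, inspired_by="A1"),
    ]
    print(ideas_that_raised_score(history, baseline=30.0))
```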

ENPIRE System

EN (Environment): Construct reset, safety, verification, and logging interfaces the agent can call.

PI (Policy Improvement): Generate and revise policy code from rewards, videos, traces, and failure cases.

R (Rollout): Run budgeted robot trials and preserve the state, action, video, and result for audit.

E (Evolution): Compare branches, reuse successful recipes, and prune hypotheses that fail on hardware.
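How the four modules might compose into one loop can be sketched without hardware. Everything here is an assumption made for illustration: the "robot" is a random simulator whose success probability depends on a single made-up parameter `gain`, the proposal step is a random perturbation rather than a coding agent, and none of the function names come from ENPIRE.

```python
import random
from typing import List, Tuple


def run_trial(gain: float, rng: random.Random) -> bool:
    """Stand-in for reset -> execute -> verify on a robot (EN).

    Success becomes more likely as `gain` approaches 1.0 (a made-up model).
    """
    p_success = max(0.0, 1.0 - abs(gain - 1.0))
    return rng.random() < p_success


def rollout(gain: float, budget: int, rng: random.Random) -> float:
    """Run a fixed budget of trials and return the success rate (R)."""
    return sum(run_trial(gain, rng) for _ in range(budget)) / budget


def improve(
    iterations: int = 20, budget: int = 50, seed: int = 0
) -> Tuple[float, float, List[float]]:
    """Propose a change, evaluate it, keep it only if the rate improves."""
    rng = random.Random(seed)
    best_gain = 0.2
    best_rate = rollout(best_gain, budget, rng)
    history = [best_rate]
    for _ in range(iterations):
        candidate = best_gain + rng.uniform(-0.2, 0.2)  # E: propose a variant
        rate = rollout(candidate, budget, rng)          # PI + R: evaluate it
        if rate > best_rate:                            # keep what moves the metric
            best_gain, best_rate = candidate, rate
        history.append(best_rate)
    return best_gain, best_rate, history


if __name__ == "__main__":
    gain, rate, _ = improve()
    print(f"best gain={gain:.2f}, success rate={rate:.2f}")
```

Because each success rate is a noisy estimate from a finite trial budget, a real harness would likely re-evaluate a promising candidate before accepting it.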

Figure labels: Construct Environment, Policy Improvement; signals exchanged: Action, Obs, Reward.

env.py

    class InsertionEnv:

        def reset(self):
            # TODO: auto task reset
            pick_and_place(obj, target)
            go_home()
            ...

        def get_reward(self, obs, act):
            # TODO: scalar reward
            mask = sam3(obs['left'])
            pos = boundlsdf(obs, mask)
            ...

        def get_observation(self):
            ...

        def step(self, act):
            ...
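To make the interface concrete, here is a hardware-free stand-in that uses the same four method names as the env.py skeleton. It is a sketch under stated assumptions: the one-dimensional "peg" state, the tolerance, the `(obs, reward, done)` return value of `step`, and the helper names `run_episode` and `proportional_policy` are all hypothetical, and the perception calls from the skeleton are replaced by direct access to the simulated position.

```python
import random
from typing import Callable, Dict, List


class ToyInsertionEnv:
    """1-D toy task: drive a peg position to a target within a tolerance."""

    def __init__(self, target: float = 0.0, tol: float = 0.01, seed: int = 0):
        self.target, self.tol = target, tol
        self.rng = random.Random(seed)
        self.pos = 0.0

    def reset(self) -> Dict[str, float]:
        # Stand-in for an automatic scene reset: random starting offset.
        self.pos = self.rng.uniform(-0.1, 0.1)
        return self.get_observation()

    def get_observation(self) -> Dict[str, float]:
        return {"pos": self.pos}

    def get_reward(self, obs: Dict[str, float], act: float) -> float:
        # Scalar reward: 1.0 once the peg is within tolerance of the target.
        return 1.0 if abs(obs["pos"] - self.target) <= self.tol else 0.0

    def step(self, act: float):
        self.pos += act
        obs = self.get_observation()
        reward = self.get_reward(obs, act)
        return obs, reward, reward == 1.0


def run_episode(env, policy: Callable, max_steps: int = 20) -> List[dict]:
    """Run one budgeted episode and keep every transition for later audit."""
    obs = env.reset()
    log = []
    for _ in range(max_steps):
        act = policy(obs)
        next_obs, reward, done = env.step(act)
        log.append({"obs": obs, "act": act, "reward": reward})
        obs = next_obs
        if done:
            break
    return log


def proportional_policy(obs: Dict[str, float], gain: float = 0.5) -> float:
    """Move a fixed fraction of the remaining distance toward target 0.0."""
    return -gain * obs["pos"]


if __name__ == "__main__":
    episode = run_episode(ToyInsertionEnv(seed=0), proportional_policy)
    print(len(episode), episode[-1]["reward"])
```

Keeping the full transition log per episode mirrors the Rollout module's stated goal of preserving state, action, and result for audit.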

Human User

Coding Agent

Tool APIs: Perception, Planning, Control

ENPIRE Environment

01 Literature review: PLD, RL-Token, CaP-X

02 Propose algorithm variant: Heuristics, Off2On RL, Code-as-policy, BC

03 Optimize Infra: Data Sampler, Param Sweep

04 Summarize experiment...
