InSight: Self-Guided Skill Acquisition via Steerable VLAs | Stanford Multi-Robot Systems Lab
InSight:<br>Self-Guided Skill Acquisition<br>via Steerable VLAs
Maggie Wang<sup>1</sup>,<br>Lars Osterberg<sup>1</sup>,<br>Stephen Tian<sup>1</sup>,<br>Ola Shorinwa<sup>2</sup>,<br>Jiajun Wu<sup>1</sup>,<br>Mac Schwager<sup>1</sup>
<sup>1</sup> Stanford University<br><sup>2</sup> Princeton University
InSight makes a VLA steerable at the primitive-action level,
then uses a VLM to identify and acquire the primitives a new task requires,
with no human demonstrations of the target skill.
Bottle pouring: 96% vs 16% (CaP-X)<br>Twist-then-pour (14 primitives): 80% vs 4% (CaP-X)<br>Base skills retained: 100%
Abstract
Vision-language-action (VLA) models can learn manipulation skills from demonstrations,<br>but their capabilities are bounded by the skills in the training data. We present<br>InSight, a framework that unlocks autonomous skill acquisition by rendering VLAs<br>steerable at the primitive-action level (e.g., “move gripper to the bowl”,<br>“lift upward”, “pour the bottle”). InSight consists of two primary<br>stages: (1) an automated segmentation pipeline that partitions demonstrations into<br>labeled primitives via VLM plan decomposition and end-effector poses to enable VLA<br>primitive steerability, and (2) a VLM-guided data flywheel that identifies missing<br>primitives required to accomplish a novel task, autonomously attempts demonstrations of<br>the missing primitives with VLM-proposed low-level control, and automatically labels, stores,<br>and integrates successful demonstrations into the VLA training set. We evaluate InSight<br>across simulation and real-world manipulation tasks, including block flipping, drawer<br>closing, sweeping, twisting, and pouring, without any human demonstrations of these<br>target skills. Once learned, these primitives can be composed to execute novel,<br>long-horizon tasks without additional human demonstrations. Our findings demonstrate that<br>primitive steerability provides a practical foundation for continual skill acquisition in<br>VLA policies.
Motivation
Consider a robot on Mars, trained only to scoop rocks. When a<br>dust storm<br>coats its solar panels, it must sweep them clean, a behavior it was never shown. A VLA can only perform the skills in<br>its demonstrations, and acquiring a new one, through more data or reinforcement learning, is costly<br>to repeat for every task.
Yet new skills are rarely fully novel: they recombine primitives the policy already knows. Sweeping<br>and scooping share approach and lowering, differing only in a lateral push; flipping a block reuses<br>pick-and-place's grasp-and-lift and adds a rotation. A standard VLA already encodes these<br>primitives, but entangles them in a single task instruction, so they cannot be steered individually.
InSight makes the primitives steerable and uses a VLM as an active agent: not just a<br>test-time planner over a fixed skill set, but one that flags the primitive a task is missing, drives<br>the robot to acquire it, and has it trained back into the policy. Acquired skills then persist and<br>recombine for future tasks, enabling continual learning.
How InSight Works
Stage 1<br>Primitive steerability
Human demonstrations are automatically segmented into primitive-labeled trajectories by aligning<br>a VLM-generated plan with gripper transitions and end-effector motion. Fine-tuning on these labels<br>produces a VLA that can be steered one primitive at a time.
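The segmentation idea above can be illustrated with a minimal sketch. It is not the InSight pipeline: it assumes a demonstration is given as a list of end-effector positions plus a per-frame gripper-open flag, cuts the trajectory wherever the gripper state flips, tags each span with the axis of largest net displacement, and then attaches plan steps to segments in order. The names `Segment`, `segment_demo`, and `attach_plan_labels`, the in-order alignment rule, and the `1e-3` motion threshold are all hypothetical; in InSight the plan steps come from a VLM.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Position = Tuple[float, float, float]
AXES = ("x", "y", "z")


@dataclass
class Segment:
    start: int  # first frame index (inclusive)
    end: int    # last frame index (exclusive)
    label: str  # motion cue or primitive name


def gripper_boundaries(gripper_open: Sequence[bool]) -> List[int]:
    """Return the frame indices at which the gripper state flips."""
    return [t for t in range(1, len(gripper_open))
            if gripper_open[t] != gripper_open[t - 1]]


def dominant_motion(positions: Sequence[Position], start: int, end: int) -> str:
    """Label a span by the axis with the largest net displacement."""
    deltas = [positions[end - 1][i] - positions[start][i] for i in range(3)]
    axis = max(range(3), key=lambda i: abs(deltas[i]))
    if abs(deltas[axis]) < 1e-3:  # assumed threshold for "no motion"
        return "hold"
    sign = "+" if deltas[axis] > 0 else "-"
    return f"move {sign}{AXES[axis]}"


def segment_demo(positions: Sequence[Position],
                 gripper_open: Sequence[bool]) -> List[Segment]:
    """Cut a demonstration at gripper transitions and tag each span."""
    cuts = [0] + gripper_boundaries(gripper_open) + [len(positions)]
    return [Segment(s, e, dominant_motion(positions, s, e))
            for s, e in zip(cuts, cuts[1:]) if e > s]


def attach_plan_labels(segments: Sequence[Segment],
                       plan_steps: Sequence[str]) -> List[Segment]:
    """Assign plan steps to segments in order (assumed alignment rule)."""
    if len(segments) != len(plan_steps):
        raise ValueError("plan and segmentation disagree on step count")
    return [Segment(s.start, s.end, step)
            for s, step in zip(segments, plan_steps)]
```

A real alignment would need to handle plans and segmentations that disagree on the number of steps; this sketch simply raises an error in that case.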
Stage 2<br>Skill acquisition
For a novel task, the VLM flags any primitive gap, drives a low-level controller to attempt it,<br>and verifies success with a VLM oracle. Successful rollouts are labeled, stored, and used to<br>retrain the VLA, forming a data flywheel that grows the skill set.
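The flywheel loop described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: `attempt` stands in for the VLM-parameterized low-level controller, `verify` stands in for the VLM success oracle, and `budget` is a hypothetical per-primitive attempt limit. All three, along with the function names, are placeholders introduced here. Retraining the VLA on the collected rollouts is outside the scope of the sketch.

```python
from typing import Any, Callable, Dict, List, Sequence, Set


def missing_primitives(plan: Sequence[str], known: Set[str]) -> List[str]:
    """Return plan steps the policy cannot yet perform, in first-seen order."""
    gaps: List[str] = []
    for step in plan:
        if step not in known and step not in gaps:
            gaps.append(step)
    return gaps


def collect_demonstrations(
    plan: Sequence[str],
    known: Set[str],
    attempt: Callable[[str], Any],       # placeholder: runs one rollout
    verify: Callable[[str, Any], bool],  # placeholder: success oracle
    budget: int = 10,                    # assumed attempts per primitive
) -> Dict[str, List[Any]]:
    """Attempt each missing primitive and keep only verified rollouts."""
    dataset: Dict[str, List[Any]] = {}
    for primitive in missing_primitives(plan, known):
        for _ in range(budget):
            rollout = attempt(primitive)
            if verify(primitive, rollout):
                # Successful rollouts are stored under the primitive's label
                # so they can later be added to the training set.
                dataset.setdefault(primitive, []).append(rollout)
    return dataset
```

Storing rollouts under the primitive name mirrors the labeling step in the text: each stored trajectory already carries the instruction it would be paired with during fine-tuning.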
Stage 1. A demonstration is split into labeled primitives using gripper-state and dominant-motion cues.
Stage 2. The VLM identifies and parameterizes a missing primitive, the robot executes it, and a VLM oracle verifies success.
Implementation. InSight fine-tunes a π0.5 VLA with LoRA, and uses Gemini 3 Flash as the<br>VLM across four roles: demonstration segmentation, task planning, primitive-gap proposal, and<br>image-based success checking. The framework is agnostic to the underlying VLA.
Acquiring New Skills
Each skill below is acquired with no human demonstrations of that skill. Starting from<br>demonstrations of a different task, InSight identifies the missing primitives, practices them,<br>and folds them back into the policy. On the left is what the robot was trained on; on the right is the<br>new skill it acquires.
Block flipping from pick-and-place demos only · 8× speed
Trained on: pick-and-place
→
Acquired (no human demos): block flipping (rotate-block primitive)
With no human demonstrations of the flip, InSight practices the missing rotate-block primitive<br>and climbs to 75% block-flip success over 479 rollouts. An RL baseline (SAC) given the same<br>budget never completes a flip (0%...