InSight: Self-Guided Skill Acquisition via Steerable VLAs

InSight: Self-Guided Skill Acquisition via Steerable VLAs | Stanford Multi-Robot Systems Lab

Maggie Wang¹, Lars Osterberg¹, Stephen Tian¹, Ola Shorinwa², Jiajun Wu¹, Mac Schwager¹

¹ Stanford University, ² Princeton University


InSight makes a VLA steerable at the primitive-action level, then uses a VLM to identify and acquire the primitives a new task requires, with no human demonstrations of the target skill.

Bottle pouring: 96% vs 16% (CaP-X)

Twist-then-pour (14 primitives): 80% vs 4% (CaP-X)

Base skills retained: 100%

Abstract

Vision-language-action (VLA) models can learn manipulation skills from demonstrations, but their capabilities are bounded by the skills in the training data. We present InSight, a framework that unlocks autonomous skill acquisition by rendering VLAs steerable at the primitive-action level (e.g., “move gripper to the bowl”, “lift upward”, “pour the bottle”). InSight consists of two primary stages: (1) an automated segmentation pipeline that partitions demonstrations into labeled primitives via VLM plan decomposition and end-effector poses to enable VLA primitive steerability, and (2) a VLM-guided data flywheel that identifies missing primitives required to accomplish a novel task, autonomously attempts demonstrations of the missing primitives with VLM-proposed low-level control, and automatically labels, stores, and integrates successful demonstrations into the VLA training set. We evaluate InSight across simulation and real-world manipulation tasks, including block flipping, drawer closing, sweeping, twisting, and pouring, without any human demonstrations of these target skills. Once learned, these primitives can be composed to execute novel, long-horizon tasks without additional human demonstrations. Our findings demonstrate that primitive steerability provides a practical foundation for continual skill acquisition in VLA policies.


Motivation

Consider a robot on Mars, trained only to scoop rocks. When a dust storm coats its solar panels, it must sweep them clean, a behavior it was never shown. A VLA can only perform the skills in its demonstrations, and acquiring a new one, through more data or reinforcement learning, is costly to repeat for every task.

Yet new skills are rarely fully novel: they recombine primitives the policy already knows. Sweeping and scooping share approach and lowering, differing only in a lateral push; flipping a block reuses pick-and-place's grasp-and-lift and adds a rotation. A standard VLA already encodes these primitives, but entangles them in a single task instruction, so they cannot be steered individually.

InSight makes the primitives steerable and uses a VLM as an active agent: not just a test-time planner over a fixed skill set, but one that flags the primitive a task is missing, drives the robot to acquire it, and retrains the policy on the result. Acquired skills then persist and recombine for future tasks, enabling continual learning.

How InSight Works

Stage 1: Primitive steerability

Human demonstrations are automatically segmented into primitive-labeled trajectories by aligning a VLM-generated plan with gripper transitions and end-effector motion. Fine-tuning on these labels produces a VLA that can be steered one primitive at a time.
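The segmentation idea can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a demonstration is a list of end-effector positions plus a per-timestep gripper flag, cuts wherever the (gripper state, dominant motion axis) cue changes, and pairs the resulting segments one-to-one with plan steps that a VLM is assumed to have produced. The names `Segment`, `dominant_motion`, `segment_demo`, `label_segments`, the `eps` threshold, and the toy data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Position = Tuple[float, float, float]
AXES = ("x", "y", "z")


@dataclass
class Segment:
    start: int            # index of the first step in the segment (inclusive)
    end: int              # index one past the last step (exclusive)
    gripper_closed: bool  # gripper state during the segment
    motion: str           # dominant motion cue, e.g. "+x", "-z", or "still"


def dominant_motion(p0: Position, p1: Position, eps: float = 1e-3) -> str:
    """Name the axis with the largest displacement between two poses."""
    deltas = [b - a for a, b in zip(p0, p1)]
    i = max(range(3), key=lambda k: abs(deltas[k]))
    if abs(deltas[i]) < eps:  # eps is an assumed noise threshold
        return "still"
    return ("+" if deltas[i] > 0 else "-") + AXES[i]


def segment_demo(positions: Sequence[Position],
                 gripper_closed: Sequence[bool]) -> List[Segment]:
    """Cut a demonstration wherever the (gripper state, motion) cue changes."""
    # One cue per step t -> t+1; the gripper state is read at the end of the step.
    cues = [
        (gripper_closed[t + 1], dominant_motion(positions[t], positions[t + 1]))
        for t in range(len(positions) - 1)
    ]
    segments: List[Segment] = []
    start = 0
    for t in range(1, len(cues) + 1):
        if t == len(cues) or cues[t] != cues[start]:
            segments.append(Segment(start, t, *cues[start]))
            start = t
    return segments


def label_segments(segments: Sequence[Segment],
                   plan: Sequence[str]) -> List[Tuple[str, Segment]]:
    """Pair plan steps with segments in order (assumes a one-to-one match)."""
    if len(segments) != len(plan):
        raise ValueError("plan and segmentation disagree on the number of primitives")
    return list(zip(plan, segments))


# Toy demonstration: move along x, lower, close the gripper, lift.
POSITIONS = [(0.0, 0.0, 0.2), (0.1, 0.0, 0.2), (0.2, 0.0, 0.2), (0.2, 0.0, 0.1),
             (0.2, 0.0, 0.0), (0.2, 0.0, 0.0), (0.2, 0.0, 0.1), (0.2, 0.0, 0.2)]
GRIPPER_CLOSED = [False, False, False, False, False, True, True, True]
# Stand-in for a VLM-generated plan; the wording is illustrative only.
PLAN = ["move gripper to the block", "lower gripper", "close gripper", "lift upward"]

if __name__ == "__main__":
    for label, seg in label_segments(segment_demo(POSITIONS, GRIPPER_CLOSED), PLAN):
        print(label, seg)
```

Each (label, segment) pair would then serve as one primitive-level training example. Real trajectories are noisier, so a practical version would need smoothing and a more tolerant alignment than the strict one-to-one pairing assumed here.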

Stage 2: Skill acquisition

For a novel task, the VLM flags any primitive gap, drives a low-level controller to attempt it, and verifies success with a VLM oracle. Successful rollouts are labeled, stored, and used to retrain the VLA, forming a data flywheel that grows the skill set.
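The flywheel loop can be sketched as follows, under loud assumptions: the VLM planner, the low-level controller, and the VLM oracle are replaced by plain Python callables (`propose_plan`, `attempt`, `verify`), retraining is reduced to adding the primitive to a set, and the `max_attempts` and `min_successes` parameters are hypothetical rather than values reported by the authors.

```python
from typing import Any, Callable, Dict, List, Sequence, Set

Rollout = Dict[str, Any]


def find_gaps(plan: Sequence[str], known: Set[str]) -> List[str]:
    """Plan steps the policy cannot yet be steered to perform, in plan order."""
    gaps: List[str] = []
    for step in plan:
        if step not in known and step not in gaps:
            gaps.append(step)
    return gaps


def acquire(task: str,
            known: Set[str],
            dataset: List[Rollout],
            propose_plan: Callable[[str], Sequence[str]],
            attempt: Callable[[str, int], Rollout],
            verify: Callable[[str, Rollout], bool],
            max_attempts: int = 20,
            min_successes: int = 3) -> Set[str]:
    """Run one pass of the flywheel and return the enlarged primitive set."""
    learned = set(known)
    for primitive in find_gaps(propose_plan(task), known):
        successes: List[Rollout] = []
        for i in range(max_attempts):
            rollout = attempt(primitive, i)      # controller stand-in
            if verify(primitive, rollout):       # oracle stand-in
                successes.append({"label": primitive, **rollout})
            if len(successes) >= min_successes:
                break
        if len(successes) >= min_successes:
            dataset.extend(successes)  # labeled and stored
            learned.add(primitive)     # stands in for retraining the VLA
    return learned


# Toy stand-ins, all hypothetical.
KNOWN = {"grasp block", "lift upward", "place block"}


def toy_plan(task: str) -> List[str]:
    return ["grasp block", "lift upward", "rotate block", "place block"]


def toy_attempt(primitive: str, i: int) -> Rollout:
    # Each attempt tries a larger rotation, mimicking a parameter sweep.
    return {"attempt": i, "angle_deg": 30.0 * i}


def toy_verify(primitive: str, rollout: Rollout) -> bool:
    # Accept rotations within 30 degrees of a 90 degree target.
    return abs(rollout["angle_deg"] - 90.0) <= 30.0


if __name__ == "__main__":
    data: List[Rollout] = []
    print(acquire("flip the block", KNOWN, data, toy_plan, toy_attempt, toy_verify))
    print(data)
```

Passing the three roles in as callables keeps the loop independent of any particular VLM or controller. Only verified rollouts reach the dataset, so failed attempts never become training data.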

Stage 1. A demonstration is split into labeled primitives using gripper-state and dominant-motion cues.

Stage 2. The VLM identifies and parameterizes a missing primitive, the robot executes it, and a VLM oracle verifies success.

Implementation. InSight fine-tunes a π0.5 VLA with LoRA, and uses Gemini 3 Flash as the VLM across four roles: demonstration segmentation, task planning, primitive-gap proposal, and image-based success checking. The framework is agnostic to the underlying VLA.
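For readers unfamiliar with LoRA, the general form of the update is shown below. This is the standard LoRA formulation, not a detail taken from this page: the symbols are the usual ones, and the rank r, the scale α, and the layers InSight adapts are not stated here. The pretrained weight W_0 stays frozen and only the low-rank factors A and B are trained.

```latex
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Because only A and B are updated, the number of trainable parameters per adapted layer falls from d·k to r·(d + k).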


Acquiring New Skills

Each skill below is acquired with no human demonstrations of that skill. Starting from demonstrations of a different task, InSight identifies the missing primitives, practices them, and folds them back into the policy. On the left is what the robot was trained on; on the right is the new skill it acquires.

Block flipping from pick-and-place demos only · 8× speed

Trained on: pick-and-place → Acquired (no human demos): block flipping (rotate-block primitive)

With no human demonstrations of the flip, InSight practices the missing rotate-block primitive and climbs to 75% block-flip success over 479 rollouts. An RL baseline (SAC) given the same budget never completes a flip (0%...
