AFUN: Towards an Affordance Foundation Model for Functionality Understanding
Zhaoning Wang1,*,
Yi Zhong1,*,
Jiawei Fu2,
Henrik I. Christensen2,
Jun Gao1,3
1University of Michigan
2University of California, San Diego
3NVIDIA
*Equal contribution
Where + How for Functionality Understanding
A single forward pass predicts both a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact).
SOTA Affordance Segmentation
+23.9 / +26.3 mean gIoU/cIoU over the best baseline, across 8 test sets from 4 affordance benchmarks.
Largest Public Affordance Data
One of the largest public affordance datasets to date: robot, human egocentric, simulation, and real-world scan data.
We present AFUN, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels.
Abstract
Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open and unstructured real-world environments. Yet, building an affordance foundation model that not only understands where and how the interaction should happen, but also generalizes across diverse environments, objects, and tasks, remains a long-standing research challenge. Existing methods typically address only part of this challenge, either localizing task-relevant regions without specifying executable motion, or predicting motion but with limited scalability.
In this paper, we present AFUN, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels.
We evaluate AFUN on three aspects. For affordance segmentation, AFUN outperforms all baselines by a large margin across 8 test sets from 4 benchmarks, improving mean gIoU/cIoU by +23.9/+26.3. For contact-point prediction, it predicts substantially more accurate points, with a 12.7–61.3% hit-rate gain over the best baseline. For 3D motion, it achieves the best performance on all three test sets. AFUN can be deployed for real-world robot manipulation without embodiment-specific finetuning, demonstrating its ability to adapt to open-world affordance tasks.
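The two segmentation metrics reported above can be sketched as follows. This is a minimal illustration that assumes the definitions commonly used in language-conditioned segmentation benchmarks (gIoU as the mean of per-image IoUs, cIoU as cumulative intersection over cumulative union); the page does not spell out its exact evaluation protocol, and the function name `giou_ciou` and the empty-mask convention are assumptions made for this example.

```python
import numpy as np

def giou_ciou(preds, gts):
    """Compute (gIoU, cIoU) for paired lists of binary masks.

    Assumed definitions (not confirmed by the source):
      gIoU = mean over images of per-image IoU
      cIoU = sum of intersections / sum of unions over the whole set
    """
    per_image_iou = []
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        inter = int(np.logical_and(pred, gt).sum())
        union = int(np.logical_or(pred, gt).sum())
        # Assumed convention: empty prediction on an empty target scores 1.
        per_image_iou.append(inter / union if union > 0 else 1.0)
        total_inter += inter
        total_union += union
    giou = float(np.mean(per_image_iou))
    ciou = total_inter / total_union if total_union > 0 else 1.0
    return giou, ciou

# Two toy 2x2 examples: one half-correct prediction, one exact match.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gts = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
giou, ciou = giou_ciou(preds, gts)
```

Because cIoU pools pixels across the whole test set, it weights large masks more heavily, while gIoU treats every image equally. That difference is one reason benchmarks tend to report both.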
Prediction Results
AFUN predictions across diverse scenes, shown side by side for every language query in each scene. Points inside the predicted affordance mask are highlighted in red, and the predicted trajectory runs from yellow (contact) to blue (end).
Real-Robot Deployment
Without any robot-specific finetuning, AFUN predicts a precise functional mask and 3D motion that the robot uses to plan and execute manipulation in the real world. The same model generalizes across object categories, language instructions, and embodiments, suggesting a practical path toward open-world affordance models that unify functionality perception with executable action.
Method Overview
Given an RGB-D observation and a language task description, AFUN jointly predicts where to interact (a task-conditional functional segmentation mask) and how to interact (a 3D post-contact motion represented as a Bézier spline curve). The model routes pretrained vision–language features through lightweight metaqueries into a segmentation decoder for the mask and a curve head for the 3D motion. It combines strong visual–language, segmentation, and 3D geometric priors with lightweight trainable modules, which enables joint mask and motion prediction without finetuning the large backbones.
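To make the curve representation concrete, the sketch below samples 3D points along a single Bézier segment from its control points using the Bernstein basis. It is only an illustration of the representation: the curve degree, the number of control points, the example coordinates, and the name `bezier_curve` are assumptions, and the page does not describe how AFUN's curve head parameterizes its output.

```python
import numpy as np
from math import comb

def bezier_curve(control_points, num_samples=20):
    """Sample points on a Bezier curve defined by (n+1) 3D control points.

    Uses the Bernstein form: B(t) = sum_i C(n, i) t^i (1-t)^(n-i) P_i.
    """
    ctrl = np.asarray(control_points, dtype=float)   # shape (n+1, 3)
    n = len(ctrl) - 1
    t = np.linspace(0.0, 1.0, num_samples)           # shape (T,)
    basis = np.stack(
        [comb(n, i) * t**i * (1.0 - t) ** (n - i) for i in range(n + 1)],
        axis=1,
    )                                                # shape (T, n+1)
    return basis @ ctrl                              # shape (T, 3)

# Hypothetical control points (metres); the first one plays the role
# of the contact point and the last one the end of the motion.
ctrl = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.1], [0.0, 0.1, 0.2], [0.0, 0.2, 0.2]]
curve = bezier_curve(ctrl, num_samples=3)
```

A Bézier curve always passes through its first and last control points, so a head that regresses control points fixes the start and end of the motion directly while the interior points shape the path.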
Data Pipeline
We build a unified data pipeline that converts heterogeneous robot, human egocentric, simulation, and real-world scan data into a shared affordance schema with language task phrases, functional masks, and object-centric 3D motion labels. Rather than approximating object motion via hand or gripper proxies, we track the object itself through depth-fused mask propagation, yielding on-object 3D trajectories at scale and producing one of the largest public affordance datasets to...
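One ingredient of on-object 3D tracking can be sketched as follows: lift the pixels of a per-frame object mask into camera space with the depth map, then take their centroid to form a trajectory. This is a simplified illustration under a pinhole camera assumption, not the pipeline described above. The mask propagation step is taken as given, and the function names and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical.

```python
import numpy as np

def mask_centroid_3d(depth, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels with a pinhole model; return their 3D centroid.

    Returns None when the mask contains no pixel with valid (positive) depth.
    """
    depth = np.asarray(depth, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    valid = mask & (depth > 0)
    if not valid.any():
        return None
    v, u = np.nonzero(valid)          # pixel rows (v) and columns (u)
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1).mean(axis=0)

def object_trajectory(depths, masks, fx, fy, cx, cy):
    """Stack per-frame centroids into an (N, 3) array, skipping frames without valid depth."""
    points = [mask_centroid_3d(d, m, fx, fy, cx, cy) for d, m in zip(depths, masks)]
    return np.array([p for p in points if p is not None])

# Toy frame: constant depth of 2 m, mask covering a single pixel at row 1, column 3.
depth = np.full((4, 4), 2.0)
mask = np.zeros((4, 4), dtype=bool)
mask[1, 3] = True
centroid = mask_centroid_3d(depth, mask, fx=2.0, fy=2.0, cx=1.0, cy=1.0)
traj = object_trajectory([depth, np.zeros((4, 4))], [mask, mask], 2.0, 2.0, 1.0, 1.0)
```

Tracking the object's own pixels in this way avoids the offset that arises when a hand or gripper position is used as a stand-in for the object.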