AFUN: Towards an Affordance Foundation Model for Functionality Understanding


Zhaoning Wang¹,*, Yi Zhong¹,*, Jiawei Fu², Henrik I. Christensen², Jun Gao¹,³

¹University of Michigan; ²University of California, San Diego; ³NVIDIA

*Equal contribution


Where + How for Functionality Understanding

A single forward pass predicts both a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact).

SOTA Affordance Segmentation

+23.9 / +26.3 mean gIoU/cIoU over the best baseline, across 8 test sets from 4 affordance benchmarks.

Largest Public Affordance Data

One of the largest public affordance datasets to date: robot, human egocentric, simulation, and real-world scan data.


Abstract

Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open and unstructured real-world environments. Yet, building an affordance foundation model that not only understands where and how the interaction should happen, but also generalizes across diverse environments, objects, and tasks, remains a long-standing research challenge. Existing methods typically address only part of this challenge, either localizing task-relevant regions without specifying executable motion, or predicting motion but with limited scalability.

In this paper, we present AFUN, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels.

We evaluate AFUN on three aspects. For affordance segmentation, AFUN outperforms all baselines by a large margin across 8 test sets from 4 benchmarks, improving mean gIoU/cIoU by +23.9/+26.3. For contact-point prediction, it predicts substantially more accurate points, with a 12.7–61.3% hit-rate gain over the best baseline. For 3D motion, it achieves the best performance on all three test sets. AFUN can be deployed for real-world robot manipulation without finetuning for the robot embodiment, demonstrating its ability to adapt to open-world affordance tasks.
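The segmentation numbers above are reported as gIoU/cIoU. As a minimal sketch, assuming the convention common in referring-segmentation benchmarks (gIoU as the mean of per-image IoUs, cIoU as cumulative intersection over cumulative union), the two metrics can be computed as follows. This page does not state the exact definitions used, and the function names and the empty-mask convention here are illustrative assumptions.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count intersection and union pixels of two same-sized binary masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter, union


def giou_ciou(preds: Sequence[Mask], targets: Sequence[Mask]) -> Tuple[float, float]:
    """Return (gIoU, cIoU) under the assumed definitions described above."""
    per_image = []
    total_inter = total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        # Assumption: an empty prediction on an empty target counts as IoU = 1.
        per_image.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image) / len(per_image)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou
```

Because cIoU pools pixels over the whole test set, it weights large objects more heavily, while gIoU treats every image equally; reporting both gives a more balanced picture.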


Prediction Results

AFUN predictions across diverse scenes, shown side by side for every language query in each scene. Points inside the predicted affordance mask are highlighted in red, and the trajectory runs from yellow (contact) to blue (end).


Real-Robot Deployment

Without any robot-specific finetuning, AFUN predicts a precise functional mask and 3D motion that the robot uses to plan and execute manipulation in the real world. The same model generalizes across object categories, language instructions, and embodiments, suggesting a practical path toward open-world affordance models that unify functionality perception with executable action.


Method Overview

Given an RGB-D observation and a language task description, AFUN jointly predicts where to interact (a task-conditional functional segmentation mask) and how to interact (a 3D post-contact motion represented as a Bézier spline curve). The model routes pretrained vision–language features through lightweight metaqueries into a segmentation decoder for the mask and a curve head for the 3D motion. This design leverages strong vision–language, segmentation, and 3D geometric priors through lightweight trainable modules, enabling joint mask and motion prediction without finetuning the large backbones.
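To make the curve representation concrete, the sketch below evaluates a single 3D Bézier segment from its control points with de Casteljau's algorithm and samples it into waypoints. It is an illustration only: the page does not specify the curve degree, the number of segments, or how the curve head parameterizes its output, so the control points and function names here are hypothetical.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def de_casteljau(control_points: Sequence[Point3], t: float) -> Point3:
    """Evaluate a Bezier segment of any degree at parameter t in [0, 1]."""
    points = [tuple(p) for p in control_points]
    # Repeatedly interpolate between neighbouring points until one remains.
    while len(points) > 1:
        points = [
            tuple((1.0 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(points, points[1:])
        ]
    return points[0]


def sample_curve(control_points: Sequence[Point3], num_samples: int = 5) -> List[Point3]:
    """Sample evenly spaced parameters, from the contact point to the end point."""
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    return [
        de_casteljau(control_points, i / (num_samples - 1))
        for i in range(num_samples)
    ]


# Hypothetical control points (metres); the first is the contact point.
example_controls = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 1.0, 1.0)]
waypoints = sample_curve(example_controls, num_samples=5)
```

A compact set of control points like this describes a smooth motion with few parameters, and the sampled waypoints can be handed to a motion planner.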

Data Pipeline

We build a unified data pipeline that converts heterogeneous robot, human egocentric, simulation, and real-world scan data into a shared affordance schema with language task phrases, functional masks, and object-centric 3D motion labels. Rather than approximating object motion via hand or gripper proxies, we track the object itself through depth-fused mask propagation, yielding on-object 3D trajectories at scale and producing one of the largest public affordance datasets to...
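One way to picture how per-frame object masks and depth can yield an on-object 3D trajectory is to back-project the masked pixels through a pinhole camera model and track their centroid over time. The sketch below is not the authors' pipeline: the use of a centroid, the intrinsics names (`fx`, `fy`, `cx`, `cy`), and the handling of invalid depth are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Point3 = Tuple[float, float, float]
Grid = Sequence[Sequence[float]]  # row-major image: grid[v][u]


def masked_centroid_3d(depth: Grid, mask: Grid, fx: float, fy: float,
                       cx: float, cy: float) -> Optional[Point3]:
    """Back-project masked pixels with valid depth and return their 3D centroid."""
    sx = sy = sz = 0.0
    count = 0
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, m) in enumerate(zip(depth_row, mask_row)):
            if not m or z <= 0.0:  # assumption: non-positive depth is invalid
                continue
            # Pinhole back-projection into camera coordinates.
            sx += (u - cx) * z / fx
            sy += (v - cy) * z / fy
            sz += z
            count += 1
    if count == 0:
        return None
    return (sx / count, sy / count, sz / count)


def object_trajectory(frames: Sequence[Tuple[Grid, Grid]], fx: float, fy: float,
                      cx: float, cy: float) -> List[Point3]:
    """Build a trajectory from (depth, mask) frames, skipping frames with no valid pixels."""
    trajectory = []
    for depth, mask in frames:
        centroid = masked_centroid_3d(depth, mask, fx, fy, cx, cy)
        if centroid is not None:
            trajectory.append(centroid)
    return trajectory
```

Tracking points on the object itself, rather than on a hand or gripper, keeps the resulting motion label independent of whoever or whatever performed the interaction.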
