IMAGIN-4D: Image-Guided Controllable Interaction Generation

Anonymous Submission

Anonymous Authors

Image-conditioned 4D HOI generation. Given a text prompt, object geometry, object waypoints, and a reference image, IMAGIN-4D synthesizes a 4D human-object interaction sequence. Text and waypoints specify the action and object trajectory, but leave fine-grained interaction details such as pose, contact, and layout ambiguous. We resolve this ambiguity with a reference image that specifies the interaction snapshot. To test whether IMAGIN-4D follows this visual evidence, we keep the text prompt, object geometry, and waypoints fixed, and mirror only the reference image. IMAGIN-4D generates different motions that satisfy the corresponding snapshot: body pose, object pose, contact, and body-object layout change consistently with the mirrored reference. This is achieved through spatio-temporal image conditioning, which separates spatial cues for the depicted interaction state from frame-aware cues for the surrounding motion. Unlike single-token image conditioning, this preserves fine-grained visual evidence while generating the HOI sequence.

Abstract

Generating human-object interactions (HOI) is central to character animation, robotics, AR/VR, and embodied AI. Recent HOI generation methods synthesize motion from text, object geometry, and sparse waypoints, controlling action semantics and object trajectories. However, these signals underspecify interaction: the same prompt and trajectory can produce different grasps, approach directions, body poses, object poses, contacts, and body-object layouts. We address this ambiguity by using a reference image as a visual specification of the desired interaction snapshot. Yet a single global image representation conflates distinct cues and conditions all frames on identical visual evidence.

We therefore introduce IMAGIN-4D, a diffusion-based HOI generator that decomposes image conditioning spatio-temporally. For spatial conditioning, IMAGIN-4D extracts supervised interaction-state tokens for body pose, object pose, body-object contact, and spatial relationships at the depicted frame. For temporal conditioning, it computes frame-aware tokens by querying image patches per generated frame, allowing sequence segments to attend to different visual cues from the same image. To balance image, text, and waypoint cues, IMAGIN-4D uses role-aware conditioning: text, waypoints, and interaction-state tokens use separate AdaLN streams, while frame-aware visual tokens cross-attend with motion tokens.
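To make the idea of separate AdaLN streams concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the linear projections, and especially the choice to sum the per-stream scales and shifts are assumptions made for illustration only.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token over its channel dimension (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaln_stream(cond, w_scale, w_shift):
    """Project one condition vector (C,) to a per-channel (scale, shift) pair."""
    return cond @ w_scale, cond @ w_shift

def role_aware_adaln(x, streams):
    """Modulate motion tokens x (T, D) with several condition streams.

    streams: list of (cond, w_scale, w_shift), one entry per role
    (e.g. text, waypoints, interaction-state tokens).
    ASSUMPTION: stream outputs are summed; the source does not say how
    the separate streams are combined.
    """
    scale = np.zeros(x.shape[-1])
    shift = np.zeros(x.shape[-1])
    for cond, w_scale, w_shift in streams:
        s, b = adaln_stream(cond, w_scale, w_shift)
        scale += s
        shift += b
    return layer_norm(x) * (1.0 + scale) + shift

# Hypothetical usage: three roles, each with its own projection weights.
rng = np.random.default_rng(0)
T, D, C = 5, 8, 4
motion_tokens = rng.normal(size=(T, D))
streams = [
    (rng.normal(size=C), 0.1 * rng.normal(size=(C, D)), 0.1 * rng.normal(size=(C, D)))
    for _ in range(3)  # text, waypoints, interaction state
]
modulated = role_aware_adaln(motion_tokens, streams)
```

Giving each role its own projection keeps one condition from sharing parameters with another, which is one plausible reading of "role-aware" conditioning.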

Since HOI motion datasets lack paired images, we build a synthetic motion-to-image rendering pipeline from FullBodyManipulation (FBM) and introduce an image-adherence metric to evaluate whether generated motions match the reference snapshot. Experiments on FBM and BEHAVE show that IMAGIN-4D improves fine-grained interaction control over single-token and uniformly image-conditioned baselines while preserving waypoint-following and motion quality.

Method Overview

Given a reference image $\mathcal{I}$, text prompt $y$, object geometry $\mathcal{O}$, and sparse waypoints $\mathcal{W}$, IMAGIN-4D generates a 4D human-object motion sequence. A frozen image encoder extracts patch tokens $\mathbf{P}$ from $\mathcal{I}$. The Spatially Factorized Image Encoder (SFIE) reads these patches with role-specific queries and produces supervised latent tokens for contact $\boldsymbol{\kappa}$, human pose $\boldsymbol{\rho}$, object pose $\boldsymbol{\xi}$, and body-object spatial relation $\boldsymbol{\nu}$. These tokens are trained to match role-autoencoder latents derived from the paired motion sequence. Their concatenated summary $\bar{\boldsymbol{\zeta}}$ predicts the reference frame $\hat{t}$ depicted by the image. In parallel, the Frame-Aware Image Encoder re-queries $\mathbf{P}$ with frame- and text-conditioned queries to produce per-frame visual tokens $\boldsymbol{\mu}_t$. The motion denoiser routes conditions by role: base conditioning, waypoint features, and window-gated spatial image evidence modulate transformer layers through separate AdaLN streams, while $\boldsymbol{\mu}_t$ enters through late cross-attention. Sampling-time guidance improves image adherence.
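The two ways of reading the patch tokens $\mathbf{P}$ can be sketched with plain single-head cross-attention. This is an illustrative sketch under stated assumptions, not the SFIE or Frame-Aware Image Encoder themselves: the random queries stand in for learned ones, there are no key/value projections, and forming frame queries by adding a frame embedding to a text embedding is a hypothetical choice.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, patches):
    """Single-head attention: queries (Q, D) read patches (N, D).

    Returns the attended tokens (Q, D) and the attention weights (Q, N).
    """
    d = queries.shape[-1]
    weights = softmax(queries @ patches.T / np.sqrt(d), axis=-1)
    return weights @ patches, weights

rng = np.random.default_rng(0)
N, D, T = 16, 8, 6              # patches, channels, generated frames
P = rng.normal(size=(N, D))     # stand-in for frozen-encoder patch tokens

# Spatial branch: one query per role (contact, human pose, object pose,
# body-object relation) yields one token per role.
role_queries = rng.normal(size=(4, D))
role_tokens, _ = cross_attend(role_queries, P)

# Temporal branch: one query per generated frame, conditioned on text.
# ASSUMPTION: frame and text embeddings are simply added.
frame_embed = rng.normal(size=(T, D))
text_embed = rng.normal(size=(D,))
frame_queries = frame_embed + text_embed
mu, attn = cross_attend(frame_queries, P)   # mu[t] plays the role of mu_t
```

Because every frame has its own query, different frames can place their attention on different patches of the same image, which is what distinguishes this from conditioning all frames on one global image token.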

How Image Controls Motion Generation

[Figure: three examples, each pairing a reference image and its mirrored version with the corresponding generated motion (Ours). Prompts: "The person moves the table on the floor"; "Pull the trashcan and set it down"; "Push the large box and set it down".]
Mirrored-reference consistency test. We horizontally mirror only the reference image at inference time while keeping the text prompt, object geometry, and waypoints fixed. The generated contact side and body-object...
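The input manipulation in this test is simple to reproduce. The sketch below shows only the horizontal flip of the reference image; the helper name and the (H, W, C) array layout are assumptions, and the generator itself is not modeled.

```python
import numpy as np

def mirror_horizontally(image):
    """Flip an (H, W, C) image left-to-right; rows and channels are untouched."""
    return image[:, ::-1, :].copy()

# Toy reference image: the leftmost column is white, everything else black.
reference = np.zeros((2, 3, 3), dtype=np.uint8)
reference[:, 0, :] = 255

mirrored = mirror_horizontally(reference)
# Text prompt, object geometry, and waypoints would be passed unchanged;
# only `reference` is swapped for `mirrored` at inference time.
```

Mirroring twice recovers the original image, so any change in the generated motion can be attributed to the flipped visual evidence rather than to other inputs.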
