IMAGIN-4D: Image-Guided Controllable Interaction Generation
Anonymous Submission
Anonymous Authors
Image-conditioned 4D HOI generation. Given a text prompt, object geometry, object waypoints, and a reference image, IMAGIN-4D synthesizes a 4D human-object interaction sequence. Text and waypoints specify the action and object trajectory, but leave fine-grained interaction details such as pose, contact, and layout ambiguous. We resolve this ambiguity with a reference image that specifies the interaction snapshot. To test whether IMAGIN-4D follows this visual evidence, we keep the text prompt, object geometry, and waypoints fixed, and mirror only the reference image. IMAGIN-4D generates different motions that satisfy the corresponding snapshot: body pose, object pose, contact, and body-object layout change consistently with the mirrored reference. This is achieved through spatio-temporal image conditioning, which separates spatial cues for the depicted interaction state from frame-aware cues for the surrounding motion. Unlike single-token image conditioning, this preserves fine-grained visual evidence while generating the HOI sequence.
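The mirrored-reference setup described in this caption can be illustrated with a minimal sketch. It only shows how two condition sets that differ in the reference image alone might be built; the dictionary keys, the toy list-of-rows image, and the helper names `mirror_horizontally` and `make_mirrored_pair` are assumptions for illustration and are not taken from the IMAGIN-4D code.

```python
from typing import Dict, List

Image = List[List[int]]  # rows of pixel values; a stand-in for a real image array


def mirror_horizontally(image: Image) -> Image:
    """Flip every row left-to-right and return a new image."""
    return [list(reversed(row)) for row in image]


def make_mirrored_pair(conditions: Dict[str, object]) -> List[Dict[str, object]]:
    """Return two condition sets that differ only in the reference image."""
    mirrored = dict(conditions)  # shallow copy: text and waypoints stay shared
    mirrored["image"] = mirror_horizontally(conditions["image"])
    return [conditions, mirrored]


conditions = {
    "text": "Pull the trashcan and set it down",
    "waypoints": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # hypothetical object positions
    "image": [[1, 2, 3], [4, 5, 6]],                   # toy 2x3 "image"
}
original, mirrored = make_mirrored_pair(conditions)
```

Because every condition except the image is shared between the two sets, any difference between the two generated motions can be attributed to the reference image.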
Abstract
Generating human-object interactions (HOI) is central to character animation, robotics, AR/VR, and embodied AI. Recent HOI generation methods synthesize motion from text, object geometry, and sparse waypoints, controlling action semantics and object trajectories. However, these signals underspecify the interaction: the same prompt and trajectory can produce different grasps, approach directions, body poses, object poses, contacts, and body-object layouts. We address this ambiguity by using a reference image as a visual specification of the desired interaction snapshot. A single global image representation is not sufficient for this, because it conflates distinct cues and conditions all frames on identical visual evidence.
We therefore introduce IMAGIN-4D, a diffusion-based HOI generator that decomposes image conditioning spatio-temporally. For spatial conditioning, IMAGIN-4D extracts supervised interaction-state tokens for body pose, object pose, body-object contact, and spatial relationships at the depicted frame. For temporal conditioning, it computes frame-aware tokens by querying image patches per generated frame, allowing sequence segments to attend to different visual cues from the same image. To balance image, text, and waypoint cues, IMAGIN-4D uses role-aware conditioning: text, waypoints, and interaction-state tokens use separate AdaLN streams, while frame-aware visual tokens cross-attend with motion tokens.
Since HOI motion datasets lack paired images, we build a synthetic motion-to-image rendering pipeline from FullBodyManipulation (FBM) and introduce an image-adherence metric to evaluate whether generated motions match the reference snapshot. Experiments on FBM and BEHAVE show that IMAGIN-4D improves fine-grained interaction control over single-token and uniformly image-conditioned baselines while preserving waypoint-following and motion quality.
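The frame-aware conditioning mentioned in the abstract (querying the same image patches once per generated frame) can be sketched as plain scaled dot-product attention. This is a minimal illustration, not the authors' encoder: the function name `frame_aware_tokens`, the use of a single attention head, and the toy 2-dimensional patch and query vectors are all assumptions.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_aware_tokens(frame_queries: List[Vector], patches: List[Vector]) -> List[Vector]:
    """Return one visual token per frame: an attention-weighted mean of the patches."""
    dim = len(patches[0])
    tokens = []
    for q in frame_queries:
        # Scaled dot-product score between this frame's query and every patch.
        scores = [sum(qi * pi for qi, pi in zip(q, p)) / math.sqrt(dim) for p in patches]
        weights = softmax(scores)
        tokens.append([sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)])
    return tokens


patches = [[1.0, 0.0], [0.0, 1.0]]        # two toy patch tokens
frame_queries = [[4.0, 0.0], [0.0, 4.0]]  # hypothetical queries for two frames
tokens = frame_aware_tokens(frame_queries, patches)
```

In this toy setting the two frames put most of their attention on different patches, which is the behavior a single global image token cannot express.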
Method Overview
Given a reference image $\mathcal{I}$, text prompt $y$, object geometry $\mathcal{O}$, and sparse waypoints $\mathcal{W}$, IMAGIN-4D generates a 4D human-object motion sequence. A frozen image encoder extracts patch tokens $\mathbf{P}$ from $\mathcal{I}$. The Spatially Factorized Image Encoder (SFIE) reads these patches with role-specific queries and produces supervised latent tokens for contact $\boldsymbol{\kappa}$, human pose $\boldsymbol{\rho}$, object pose $\boldsymbol{\xi}$, and body-object spatial relation $\boldsymbol{\nu}$. These tokens are trained to match role-autoencoder latents derived from the paired motion sequence. Their concatenated summary $\bar{\boldsymbol{\zeta}}$ predicts the reference frame $\hat{t}$ depicted by the image. In parallel, the Frame-Aware Image Encoder re-queries $\mathbf{P}$ with frame- and text-conditioned queries to produce per-frame visual tokens $\boldsymbol{\mu}_t$. The motion denoiser routes conditions by role: base conditioning, waypoint features, and window-gated spatial image evidence modulate transformer layers through separate AdaLN streams, while $\boldsymbol{\mu}_t$ enters through late cross-attention. Sampling-time guidance improves image adherence.
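The role-aware routing above can be made concrete with a small sketch of adaptive layer normalization (AdaLN) driven by several condition streams. The overview does not state how the streams are combined or what shape the window gate takes, so two assumptions are made here and should be read as such: the per-stream scale and shift contributions are summed, and the gate is a hard 0/1 window around the reference frame $\hat{t}$. The names `adaln`, `window_gate`, and `image_stream` are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Stream = Tuple[Vector, Vector, float]  # (scale, shift, gate) for one condition role


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def window_gate(t: int, t_hat: int, half_width: int) -> float:
    """Assumed hard gate: 1 inside a window around the reference frame, else 0."""
    return 1.0 if abs(t - t_hat) <= half_width else 0.0


def adaln(x: Vector, streams: List[Stream]) -> Vector:
    """Modulate normalized features with the gated sum of all streams (assumption)."""
    normed = layer_norm(x)
    scale = [sum(g * s[i] for s, _, g in streams) for i in range(len(x))]
    shift = [sum(g * b[i] for _, b, g in streams) for i in range(len(x))]
    return [n * (1.0 + sc) + sh for n, sc, sh in zip(normed, scale, shift)]


def image_stream(t: int) -> Stream:
    """Toy spatial image evidence, active only near the reference frame t_hat = 10."""
    return ([0.2, 0.2, 0.2], [1.0, 1.0, 1.0], window_gate(t, t_hat=10, half_width=2))


x = [1.0, 2.0, 3.0]                                  # toy motion-token features
base = ([0.1, 0.1, 0.1], [0.0, 0.0, 0.0], 1.0)       # e.g. text / timestep stream
waypoint = ([0.0, 0.0, 0.0], [0.5, 0.5, 0.5], 1.0)   # waypoint stream
near = adaln(x, [base, waypoint, image_stream(11)])  # frame inside the window
far = adaln(x, [base, waypoint, image_stream(20)])   # frame outside the window
```

With this gating, frames far from $\hat{t}$ are modulated only by the base and waypoint streams, while frames near $\hat{t}$ also receive the spatial image evidence.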
How Image Controls Motion Generation
[Video comparison: for each prompt, the reference image (Ref. Img) and its mirrored version (Mirrored Ref. Img) are each shown next to the motion generated by our method (Ours). Prompts: "The person moves the table on the floor"; "Pull the trashcan and set it down"; "Push the large box and set it down".]
Mirrored-reference consistency test. We horizontally mirror only the reference image at inference time while keeping the text prompt, object geometry, and waypoints fixed. The generated contact side and body-object...