World from Motion: Generative Dynamic Gaussian Reconstruction from Monocular Video

Liyuan Zhu¹,², Shengyu Huang², Amrita Mazumdar², Tianye Li², Zan Gojcic², Gordon Wetzstein¹, Iro Armeni¹, Shalini De Mello², Alex Trevithick²

¹Stanford University, ²NVIDIA

Abstract

Generative 4D reconstruction from monocular video

World from Motion improves dynamic 3D Gaussian reconstructions by using a video generator as a controllable prior. We condition generation on a persistent 4D representation, sample new dynamic viewpoints, and distill the generated observations back into the reconstruction.

Video comparison: side-by-side overlay of the initial reconstruction and the WfM result. Scenes: Park, Basketball, Store, Dancer, Robot, Astronaut, Street, Gallery.

Interactive Viewer

Explore the dynamic Gaussian reconstructions

Browser-based 4D Gaussian previews with scene switching and camera controls.

Results

Quantitative Results

Table 1

State-of-the-art 4D Reconstruction

4D Reconstruction Benchmark on DyCheck

Method              Covisible mPSNR ↑   Covisible mSSIM ↑   Covisible mLPIPS ↓
Shape-of-Motion     17.32               0.598               0.296
MoSca               19.32               0.706               0.264
WorldTree           19.75               0.728               0.240
ViDAR               19.69               0.713               0.223
World-from-Motion   20.26               0.732               0.215
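The mPSNR columns are masked metrics: the error is averaged only over pixels selected by a covisibility mask, so regions never observed in the input video do not count. The following is a minimal sketch of that idea, assuming images scaled to [0, 1] and a precomputed boolean mask. The function name `masked_psnr` and the mask convention are illustrative assumptions, not the benchmark's reference implementation, and details such as per-frame averaging may differ.

```python
import numpy as np


def masked_psnr(pred, target, mask, max_val=1.0):
    """PSNR restricted to pixels where `mask` is True.

    pred, target: float arrays of shape (H, W, 3) with values in [0, max_val].
    mask: boolean array of shape (H, W); True marks covisible pixels (assumed convention).
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask selects no pixels")

    # Boolean indexing keeps only the masked pixels: result has shape (N, 3).
    squared_error = (pred[mask] - target[mask]) ** 2
    mse = squared_error.mean()
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))


# Toy usage: a large error outside the mask does not affect the score.
target = np.zeros((2, 2, 3))
pred = np.full((2, 2, 3), 0.1)
pred[1, 1] = 1.0                      # badly wrong pixel, but not covisible
mask = np.array([[True, True], [True, False]])
score = masked_psnr(pred, target, mask)
```

Under this definition, the unmasked PSNR of the toy example would be much lower, which is why masked and unmasked numbers should not be compared directly.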

Table 2

Conditioning on a persistent 4D representation produces the best camera control.

4D Novel-View Synthesis Benchmark on DyCheck

Method              mPSNR ↑   mSSIM ↑   mLPIPS ↓
ReCamMaster         10.96     0.262     0.755
GEN3C               12.06     0.260     0.679
TrajectoryCrafter   13.06     0.320     0.656
Vista4D             14.14     0.310     0.514
World-from-Motion   18.45     0.635     0.362

Table 3

The More Views We Sample, the Better Reconstruction We Get

Virtual-camera ablation with mPSNR, mSSIM, and mLPIPS from the paper table.

Virtual cameras   mPSNR ↑   mSSIM ↑   mLPIPS ↓
0                 18.69     0.696     0.272
1                 19.35     0.703     0.176
2                 19.52     0.706     0.184
4                 19.63     0.708     0.184
8                 19.78     0.711     0.181

Table 4

WfM improves 3D motion tracking

PCK@0.05 compares track quality on DyCheck.

Method          PCK@0.05 ↑
HyperNeRF       0.453
CoTracker       0.803
Gauss.Marbles   0.806
BootsTAPIR      0.779
MoSca           0.824
Ours            0.862
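PCK@0.05 (percentage of correct keypoints) counts a predicted track point as correct when it lies within 5% of the image size of its ground-truth location. The sketch below assumes the threshold is taken relative to the longer image side and that points are given in pixel coordinates. The function name `pck` and the optional `visible` argument are illustrative assumptions, not the benchmark's evaluation code.

```python
import numpy as np


def pck(pred_xy, gt_xy, image_hw, alpha=0.05, visible=None):
    """Fraction of keypoints within alpha * max(H, W) pixels of the ground truth.

    pred_xy, gt_xy: arrays of shape (N, 2) in pixel coordinates.
    image_hw: (height, width) of the image.
    visible: optional boolean array of shape (N,); only True entries are scored.
    """
    pred_xy = np.asarray(pred_xy, dtype=np.float64)
    gt_xy = np.asarray(gt_xy, dtype=np.float64)
    threshold = alpha * max(image_hw)

    # Euclidean distance between each predicted and ground-truth point.
    distance = np.linalg.norm(pred_xy - gt_xy, axis=-1)
    correct = distance <= threshold

    if visible is not None:
        visible = np.asarray(visible, dtype=bool)
        if not visible.any():
            raise ValueError("no visible keypoints to score")
        correct = correct[visible]
    return float(correct.mean())


# Toy usage on a 100 x 200 image, so the threshold is 0.05 * 200 = 10 pixels.
gt = [[0, 0], [50, 50], [100, 100], [10, 10]]
pred = [[3, 4], [50, 61], [100, 100], [10, 19]]   # distances: 5, 11, 0, 9
accuracy = pck(pred, gt, image_hw=(100, 200))
```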

Table 5

Reconstruction Guidance

Inference-time guidance on dense 4D scaffold rendering improves both fidelity and accuracy.

Figure: mPSNR (↑) and mLPIPS (↓) plotted against G-buffer guidance scale (1 to 5) for four settings: no guidance, CFG, APG (x0, thr=8, η=0), and APG (vel, thr=32, η=0.3).
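The compared settings can be read as two update rules, assuming CFG denotes classifier-free guidance and APG denotes adaptive projected guidance, with `thr` as a norm threshold and `η` as the weight on the parallel component. The sketch below follows the commonly published form of both rules for a single sample and omits APG's momentum term. How WfM wires these into its G-buffer conditioning is not described here, so the names `cond` and `uncond` (model predictions with and without the conditioning signal) are assumptions.

```python
import numpy as np


def cfg(cond, uncond, scale):
    """Classifier-free guidance: extrapolate from uncond toward cond."""
    cond = np.asarray(cond, dtype=np.float64)
    uncond = np.asarray(uncond, dtype=np.float64)
    return uncond + scale * (cond - uncond)


def apg(cond, uncond, scale, eta=0.0, norm_threshold=None):
    """Simplified adaptive projected guidance (no momentum, single sample)."""
    cond = np.asarray(cond, dtype=np.float64)
    uncond = np.asarray(uncond, dtype=np.float64)
    diff = cond - uncond

    # Optionally cap the norm of the guidance direction at norm_threshold.
    if norm_threshold is not None:
        norm = np.linalg.norm(diff)
        if norm > 0.0:
            diff = diff * min(1.0, norm_threshold / norm)

    # Split diff into components parallel and orthogonal to the conditional prediction.
    unit = cond / max(np.linalg.norm(cond), 1e-12)
    parallel = np.sum(diff * unit) * unit
    orthogonal = diff - parallel

    # eta down-weights the parallel part; eta=1 without a threshold recovers CFG.
    update = orthogonal + eta * parallel
    return cond + (scale - 1.0) * update


cond = np.array([1.0, 0.0])
uncond = np.array([0.0, 1.0])
guided_cfg = cfg(cond, uncond, scale=3.0)
guided_apg = apg(cond, uncond, scale=3.0, eta=0.0)
```

With `eta=0`, the guided prediction moves only orthogonally to `cond`, which is the property APG uses to limit the oversaturation associated with large guidance scales.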

Table 4

Generative Methods: CAT4D-Comparable Setting

Metrics reported in the CAT4D-comparable evaluation setting.

Method   mPSNR ↑   mSSIM ↑   mLPIPS ↓
CAT4D    18.24     0.666     0.227
Ours     19.89     0.715     0.197

Acknowledgements

We thank Yang Zheng, Zhengfei Kuang, Lior Yariv, and Jianhao Zheng for fruitful discussions. We also thank Yijia Weng and Jiahui Lei for providing evaluation details for MoSca, Kuan Heng Lin for providing Vista4D evaluation details, and Michal Nazarczuk and Eduardo Pérez-Pellitero for providing evaluation details for ViDAR. This website builds on the templates from RealmDreamer and CAT4D.

BibTeX

@misc{zhu2026worldfrommotion,
  title  = {World from Motion: Generative Dynamic Gaussian Reconstruction from Monocular Video},
  author = {Liyuan Zhu and Shengyu Huang and Amrita Mazumdar and Tianye Li and Zan Gojcic and Gordon Wetzstein and Iro Armeni and Shalini De Mello and Alex Trevithick},
  year   = {2026}
}
