World from Motion
World from Motion: Generative Dynamic Gaussian Reconstruction from Monocular Video
Liyuan Zhu¹,², Shengyu Huang², Amrita Mazumdar², Tianye Li², Zan Gojcic², Gordon Wetzstein¹, Iro Armeni¹, Shalini De Mello², Alex Trevithick²
¹Stanford University, ²NVIDIA
Abstract
Generative 4D reconstruction from monocular video
World from Motion improves dynamic 3D Gaussian reconstructions by using a video generator as a controllable prior. We condition generation on a persistent 4D representation, sample new dynamic viewpoints, and distill the generated observations back into the reconstruction.
Scenes: Park, Basketball, Store, Dancer, Robot, Astronaut, Street, Gallery
Side-by-side overlay of the initial reconstruction and the WfM result.
Interactive Viewer
Explore the dynamic Gaussian reconstructions
Browser-based 4D Gaussian previews with scene switching and camera controls.
Results
Quantitative Results
Table 1
State-of-the-art 4D Reconstruction
4D Reconstruction Benchmark on DyCheck
| Method | Covisible mPSNR ↑ | Covisible mSSIM ↑ | Covisible mLPIPS ↓ |
|---|---|---|---|
| Shape-of-Motion | 17.32 | 0.598 | 0.296 |
| MoSca | 19.32 | 0.706 | 0.264 |
| WorldTree | 19.75 | 0.728 | 0.240 |
| ViDAR | 19.69 | 0.713 | 0.223 |
| World-from-Motion | 20.26 | 0.732 | 0.215 |
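The "Covisible" metrics above are restricted to pixels that pass a covisibility mask, so regions never observed in the input video do not count against a method. The following is a minimal sketch of a masked PSNR under that reading. The function name `masked_psnr`, the `[0, 1]` image range, and the boolean mask convention are assumptions made for illustration; this is not the benchmark's reference implementation, and the per-scene averaging behind mPSNR is not shown.

```python
import numpy as np


def masked_psnr(pred, target, mask, max_val=1.0):
    """PSNR computed only over pixels where `mask` is True.

    pred, target: float arrays of shape (H, W, 3) with values in [0, max_val].
    mask: boolean array of shape (H, W); True marks pixels to evaluate.
    All names and conventions here are illustrative assumptions.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask selects no pixels")

    sq_err = (pred - target) ** 2      # per-pixel, per-channel squared error
    mse = sq_err[mask].mean()          # average over masked pixels and channels
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Errors outside the mask have no effect on the score, which is why masked numbers are not directly comparable to full-image PSNR.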
Table 2
Conditioning on a persistent 4D representation produces the best camera control.
4D Novel-View Synthesis Benchmark on DyCheck
| Method | mPSNR ↑ | mSSIM ↑ | mLPIPS ↓ |
|---|---|---|---|
| ReCamMaster | 10.96 | 0.262 | 0.755 |
| GEN3C | 12.06 | 0.260 | 0.679 |
| TrajectoryCrafter | 13.06 | 0.320 | 0.656 |
| Vista4D | 14.14 | 0.310 | 0.514 |
| World-from-Motion | 18.45 | 0.635 | 0.362 |
Table 3
The More Views We Sample, the Better Reconstruction We Get
Virtual-camera ablation with mPSNR, mSSIM, and mLPIPS from the paper table.
| Virtual cameras | mPSNR ↑ | mSSIM ↑ | mLPIPS ↓ |
|---|---|---|---|
| 0 | 18.69 | 0.696 | 0.272 |
| 1 | 19.35 | 0.703 | 0.176 |
| 2 | 19.52 | 0.706 | 0.184 |
| 4 | 19.63 | 0.708 | 0.184 |
| 8 | 19.78 | 0.711 | 0.181 |
Table 4
WfM improves the 3D motion
PCK@0.05 compares track quality on DyCheck.
| Method | PCK@0.05 |
|---|---|
| HyperNeRF | 0.453 |
| CoTracker | 0.803 |
| Gauss.Marbles | 0.806 |
| BootsTAPIR | 0.779 |
| MoSca | 0.824 |
| Ours | 0.862 |
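PCK (percentage of correct keypoints) counts a predicted keypoint as correct when it lands within a distance threshold of its ground-truth location. The sketch below assumes one common convention, where the threshold for PCK@0.05 is 0.05 times the longer image side. The function name `pck`, the `visible` argument, and that normalisation are illustrative assumptions; the exact protocol used for the numbers above may differ.

```python
import numpy as np


def pck(pred_kpts, gt_kpts, image_size, ratio=0.05, visible=None):
    """Fraction of keypoints within `ratio * max(image_size)` pixels of ground truth.

    pred_kpts, gt_kpts: arrays of shape (N, 2) in pixel coordinates.
    image_size: (height, width) of the image.
    visible: optional boolean array of shape (N,) selecting keypoints to score.
    The threshold convention is an assumption, not a confirmed protocol.
    """
    pred = np.asarray(pred_kpts, dtype=np.float64)
    gt = np.asarray(gt_kpts, dtype=np.float64)

    dist = np.linalg.norm(pred - gt, axis=-1)   # per-keypoint pixel error
    thresh = ratio * max(image_size)            # e.g. 0.05 * longer side
    correct = dist <= thresh

    if visible is not None:
        correct = correct[np.asarray(visible, dtype=bool)]
    if correct.size == 0:
        raise ValueError("no keypoints to evaluate")
    return float(correct.mean())
```

With this convention a 0.862 score would mean that about 86% of evaluated keypoints fall within the threshold.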
Table 5
Reconstruction Guidance
Inference-time guidance on dense 4D scaffold rendering improves both fidelity and accuracy.
Plots: mPSNR (↑) and mLPIPS (↓) against the G-Buffer guidance scale (1 to 5) for four settings: No guidance, CFG, APG (x0, thr=8, η=0), and APG (vel, thr=32, η=0.3).
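For readers unfamiliar with the CFG setting in this comparison, classifier-free guidance extrapolates from an unconditional model prediction toward a conditional one by a scalar guidance scale. The sketch below shows only that generic combination rule. The function name `cfg_combine` is hypothetical, how the G-buffer condition enters the video generator is not specified on this page, and the APG variants are not implemented here.

```python
import numpy as np


def cfg_combine(cond_pred, uncond_pred, scale):
    """Generic classifier-free guidance rule.

    Returns uncond + scale * (cond - uncond):
      scale = 0 gives the unconditional prediction,
      scale = 1 gives the conditional prediction,
      scale > 1 pushes further in the conditional direction.
    Illustrative only; not the page's specific guidance implementation.
    """
    cond = np.asarray(cond_pred, dtype=np.float64)
    uncond = np.asarray(uncond_pred, dtype=np.float64)
    return uncond + scale * (cond - uncond)
```

A scale of 1 reduces to plain conditional sampling, so a sweep over scales from 1 to 5 measures how much extra extrapolation helps or hurts.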
Table 4
Generative Methods: CAT4D-Comparable Setting
Metrics reported in the CAT4D-comparable evaluation setting.
| Method | mPSNR ↑ | mSSIM ↑ | mLPIPS ↓ |
|---|---|---|---|
| CAT4D | 18.24 | 0.666 | 0.227 |
| Ours | 19.89 | 0.715 | 0.197 |
Acknowledgements
We thank Yang Zheng, Zhengfei Kuang, Lior Yariv, and Jianhao Zheng for fruitful discussions. We also thank Yijia Weng and Jiahui Lei for providing evaluation details for MoSca, Kuan Heng Lin for providing Vista4D evaluation details, and Michal Nazarczuk and Eduardo Pérez-Pellitero for providing evaluation details for ViDAR. This website builds on the templates from RealmDreamer and CAT4D.
BibTeX
@misc{zhu2026worldfrommotion,
  title = {World from Motion: Generative Dynamic Gaussian Reconstruction from Monocular Video},
  author = {Liyuan Zhu and Shengyu Huang and Amrita Mazumdar and Tianye Li and Zan Gojcic and Gordon Wetzstein and Iro Armeni and Shalini De Mello and Alex Trevithick},
  year = {2026}
}