Déjà View: Looping Transformers for Multi-View 3D Reconstruction

TL;DR

Déjà View reconstructs camera poses and dense geometry from any number of views by applying the same transformer block in a looped fashion. A single checkpoint covers a range of step counts, so users can dial compute up or down at inference time. At only 117M parameters, Déjà View beats much larger feed-forward baselines while using 8–10× fewer parameters and 1.9–2.3× less compute (π³, Depth Anything 3 — G) on five reconstruction benchmarks spanning indoor and outdoor scenes.


Abstract

Recent feed-forward 3D reconstruction transformers have scaled to over a billion parameters, following the broader trend of increasing model capacity in computer vision. Yet emerging evidence suggests that contiguous transformer layers often behave like repeated applications of similar operations, and multi-view reconstruction transformers refine their predictions progressively across decoder depth. We posit that model depth partially buys iteration, paid for inefficiently in unique parameters, and instead make that iteration explicit in architecture.

Our model, Déjà View, applies a single looped transformer block recurrently to per-view features for $K$ refinement steps. Trained once, it exposes $K$ as an inference-time compute knob, matching or outperforming substantially larger feed-forward baselines across five reconstruction benchmarks spanning indoor, outdoor, object-centric, and driving scenes, while using a fraction of their parameters and comparable or lower compute. Importantly, the same looped block formulation outperforms an otherwise identical variant with independent per-step parameters under matched training data and compute, suggesting that explicit iteration is not merely a compute-efficient substitute for capacity but a stronger inductive bias for multi-view 3D reconstruction.

Déjà View. Given multiple input views (top-left), Déjà View reconstructs camera poses and consistent depth by repeatedly applying the same transformer block, with the number of refinement steps $K$ exposed as an inference-time compute knob. Decoding the intermediate state of a single $K=16$ forward pass at intermediate iterations $k \le 16$ shows progressively sharper geometry and more accurate camera poses (right; frustums are colored by per-camera error). Across five benchmarks (bottom-left), Déjà View matches or surpasses much larger feed-forward baselines at a small fraction of their parameter count (dot area).

Interactive viewer

Slide the iteration control inside the viewer to watch the same forward pass refine itself. Frustums are colored by per-camera pose error after Sim(3) alignment to ground-truth poses.

Method

Déjà View initializes per-view features from a pretrained DINOv2 encoder and applies a single transformer block (with frame and global attention sub-blocks) recurrently $K$ times to refine the state, with each application conditioned on its continuous time interval $(t_k, t_{k+1})$. Because $K$ is sampled per batch from $[K_\text{min}, K_\text{max}]$ during training, one trained checkpoint covers any step count in that range at inference. Two lightweight heads then decode the final state into per-view depth and ray maps.
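The control flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the uniform spacing of the time intervals on [0, 1], the uniform sampling of the step count, and every name here (`time_intervals`, `run_looped`, `sample_num_steps`, `toy_block`) are hypothetical. The toy block stands in for the transformer block only to show why conditioning on $(t_k, t_{k+1})$ lets one set of weights serve several step counts.

```python
import random
from typing import Callable, List, Tuple

State = List[float]
Block = Callable[[State, float, float], State]


def time_intervals(num_steps: int) -> List[Tuple[float, float]]:
    """Split [0, 1] into num_steps intervals (t_k, t_{k+1}).

    Uniform spacing is an assumption made for this sketch.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return [(k / num_steps, (k + 1) / num_steps) for k in range(num_steps)]


def run_looped(block: Block, state: State, num_steps: int) -> List[State]:
    """Apply the same block num_steps times and return every intermediate state."""
    trajectory = []
    for t_start, t_end in time_intervals(num_steps):
        state = block(state, t_start, t_end)  # same function (weights) at every step
        trajectory.append(state)
    return trajectory


def sample_num_steps(k_min: int, k_max: int, rng: random.Random) -> int:
    """Draw one step count per training batch (uniform sampling is an assumption)."""
    return rng.randint(k_min, k_max)


def toy_block(state: State, t_start: float, t_end: float) -> State:
    """Toy stand-in for the transformer block: move each value toward 1.0.

    The step size depends on the time interval, so the trajectory ends at the
    same place whether it is split into few or many steps.
    """
    fraction = (t_end - t_start) / (1.0 - t_start)
    return [x + fraction * (1.0 - x) for x in state]


if __name__ == "__main__":
    for steps in (2, 4, 16):
        final = run_looped(toy_block, [0.0, 0.5], steps)[-1]
        print(steps, final)
```

Because `run_looped` returns the whole trajectory, the intermediate states of a single pass can be inspected or decoded individually, which mirrors how the intermediate iterations of one forward pass are visualized on this page.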

Results

At 117M parameters, Déjà View achieves the highest average inlier ratio and pose AUC@30° across the five benchmarks while having the smallest parameter count of all compared models. Bubble area is proportional to parameter count.

[Plot: average inlier ratio (%) versus parameters (log scale); bubble area ∝ parameters. Models shown: Pi3 (959M), VGGT (1257M), DA3-L (356M), DA3-G (1201M), Déjà View (117M).]

[Plot: average pose AUC@30° (%) versus parameters (log scale). Models shown: Pi3 (959M), VGGT (1257M), DA3-L (356M), DA3-G (1201M), Déjà View (117M).]
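The 8–10× parameter saving quoted in the TL;DR can be checked against the parameter counts shown in the plots. The snippet below is only arithmetic on those counts; the dictionary and function names are illustrative and not part of any released code.

```python
# Parameter counts in millions, as listed in the plots on this page.
PARAMS_M = {
    "Pi3": 959,
    "VGGT": 1257,
    "DA3-L": 356,
    "DA3-G": 1201,
    "Déjà View": 117,
}


def param_ratios(counts: dict, reference: str) -> dict:
    """Return how many times larger each other model is than the reference model."""
    ref = counts[reference]
    return {name: n / ref for name, n in counts.items() if name != reference}


if __name__ == "__main__":
    for name, ratio in param_ratios(PARAMS_M, "Déjà View").items():
        print(f"{name}: {ratio:.1f}x the parameters of Déjà View")
```

For the two baselines named in the TL;DR, this gives roughly 8.2× for Pi3 and 10.3× for DA3-G.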

Qualitative comparison

The left viewer shows Déjà View. Use the dropdown on the right to compare against recent feed-forward baselines on the same sequence, and pick a different example using the strip below. All sequences are visualized after discarding the 25% of points with the lowest predicted confidence (where available).

Models: Déjà View · 117M; Depth Anything 3 — G · 1201M; Depth Anything 3 — L · 356M; π³ · 959M; VGGT · 1257M; VGGT-Ω · 1144M
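The confidence filter used for these visualizations (dropping the bottom 25% of points by predicted confidence) can be sketched as follows. This is an assumed, simplified version: the function name `keep_confident`, the rounding of the drop count, and the handling of ties are illustrative choices, not details taken from the viewer.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def keep_confident(
    points: Sequence[Point],
    confidences: Sequence[float],
    drop_fraction: float = 0.25,
) -> List[Point]:
    """Drop the lowest-confidence fraction of points, keeping the original order."""
    if len(points) != len(confidences):
        raise ValueError("points and confidences must have the same length")
    if not 0.0 <= drop_fraction <= 1.0:
        raise ValueError("drop_fraction must lie in [0, 1]")
    n_drop = int(len(points) * drop_fraction)  # rounds down (assumption)
    # Indices sorted from least to most confident; ties keep input order.
    order = sorted(range(len(points)), key=lambda i: confidences[i])
    keep = sorted(order[n_drop:])
    return [points[i] for i in keep]


if __name__ == "__main__":
    pts = [(float(i), 0.0, 0.0) for i in range(8)]
    conf = [0.9, 0.1, 0.8, 0.7, 0.2, 0.6, 0.5, 0.95]
    print(keep_confident(pts, conf))  # the points at indices 1 and 4 are removed
```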


Citation

@misc{burzio2026dejaview,
  title = {D\'ej\`a View: Looping Transformers for Multi-View 3D Reconstruction},
  author = {Burzio, Alessandro and Fischer, Tobias and Elflein, Sven and Zhou, Qunjie and de Lutio, Riccardo and Ren, Jiawei and Huang, Jiahui and Huang, Shengyu and Pollefeys, Marc and Leal-Taix\'e, Laura and Gojcic, Zan and Turki, Haithem},
  year = {2026},
  eprint = {2605.30215},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2605.30215},
}

Acknowledgements

We thank our colleagues at NVIDIA for valuable discussions and feedback.
