Visual Representation Learning via Temporal Differences

John Carmack (@ID_AA_Carmack) on X

Paper review: You Don’t Need Strong Assumptions: Visual Representation Learning via Temporal Differences temporal-difference-vision.github.io temporal-difference-vision.github.io/static/pdfs/td… @AlexiGlad @ninaddaithankar

The premise is that the more data you can use, the fewer inductive biases you should have. Starting with strong priors is helpful with limited data, but eventually, architectural priors will hinder learning true knowledge buried in sufficiently large datasets. That sounds correct.

Concretely, the ad hoc image cropping / masking / augmentations used in self-supervised representation learning all make assumptions about what is important in the images, and Appendix A gives examples where they can be harmful.

Figure 3 looks very compelling for this argument, but once you notice that the x-axis is log scale, it is sketchier: the anchoring values on the left are relative values from experiments on 0.1% of ImageNet, which I would expect to be quite high variance.
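For a rough sense of how small that anchor subset is, here is a back-of-the-envelope sketch. It assumes the standard ImageNet-1k training split (1,281,167 images, 1,000 classes); the post does not say which split the experiments used, so treat the numbers as illustrative only.

```python
# Assumption: ImageNet-1k training split; the post does not specify the split.
IMAGENET_1K_TRAIN_IMAGES = 1_281_167
IMAGENET_1K_CLASSES = 1_000

def subset_size(fraction, total=IMAGENET_1K_TRAIN_IMAGES):
    """Number of images in a subset holding `fraction` of the split."""
    return int(total * fraction)

images = subset_size(0.001)              # 0.1% of the split
per_class = images / IMAGENET_1K_CLASSES
print(images, round(per_class, 2))
```

Under that assumption the subset averages only slightly more than one image per class, which is consistent with expecting noisy results at the left edge of the plot.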

Instead of making multiple augmentations of an image for self supervision, this work uses sequential video frames as related image pairs for representation learning. They train two separate models: a frame encoder, and a “motion encoder” that takes the RGB subtraction between the sequential video frames to produce a delta vector. The models are jointly trained so that the first frame’s representation vector, added to the delta vector, will equal the second frame’s representation vector.
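The objective described above can be sketched in a few lines. This is only an illustration: the two encoders are hypothetical stand-ins (one shared linear map `W`), frames are tiny flat lists, and mean squared error is an assumed loss; the post only says the sum "will equal" the second representation.

```python
def matvec(w, x):
    """Apply a linear map given as a list of rows to a flat vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def td_loss(z1, delta, z2):
    """Mean squared error between (z1 + delta) and z2 (assumed loss form)."""
    return sum((a + d - b) ** 2 for a, d, b in zip(z1, delta, z2)) / len(z1)

# Hypothetical toy setup: frame encoder and motion encoder are the same linear map.
W = [[0.5, -1.0, 0.25], [1.0, 0.0, -0.5]]
frame1 = [0.2, 0.4, 0.6]
frame2 = [0.3, 0.1, 0.9]
diff = [b - a for a, b in zip(frame1, frame2)]  # the "RGB subtraction" input

z1 = matvec(W, frame1)    # first frame's representation
z2 = matvec(W, frame2)    # second frame's representation
delta = matvec(W, diff)   # delta vector from the motion encoder
loss = td_loss(z1, delta, z2)

# Degenerate case: encoders that output all zeros also satisfy the objective.
collapsed_loss = td_loss([0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
```

With a shared linear map the constraint holds up to floating-point rounding, and the all-zero case shows why the objective on its own needs some mechanism for avoiding collapse.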

They use a fairly substantial 0.25 second stride between images in the pair, noting that too small a stride results in near-zero differences in slow-moving scenes, while too large a stride gives incoherent pixel jumps.
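To make the stride concrete, here is a minimal sketch of turning a stride in seconds into frame-index pairs. The function names and the rounding choice are assumptions for illustration, not taken from the paper.

```python
def stride_in_frames(stride_s, fps):
    """Convert a temporal stride in seconds to a whole number of frames (at least 1)."""
    return max(1, round(stride_s * fps))

def frame_pairs(num_frames, fps, stride_s=0.25):
    """Index pairs (i, j) with j = i + stride, staying inside the clip."""
    step = stride_in_frames(stride_s, fps)
    return [(i, i + step) for i in range(num_frames - step)]

pairs = frame_pairs(num_frames=10, fps=24)
print(pairs)  # [(0, 6), (1, 7), (2, 8), (3, 9)]
```

At 24 fps a 0.25 second stride is 6 frames, so a 10-frame clip yields only four pairs.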

It looks to me like the model should be stride independent, and they could simultaneously train on many different strides, increasing the dataset diversity.
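A sketch of what training on many strides could look like at the data-sampling level. This illustrates the suggestion in the post, not something the paper is described as doing; the stride values and function name are hypothetical.

```python
import random

def sample_pair(num_frames, fps, strides_s, rng):
    """Pick a stride at random, then a start index that keeps the pair inside the clip."""
    stride_s = rng.choice(strides_s)
    step = max(1, round(stride_s * fps))
    if step >= num_frames:
        raise ValueError("clip too short for the sampled stride")
    i = rng.randrange(num_frames - step)
    return i, i + step, stride_s

rng = random.Random(0)             # seeded for repeatability
STRIDES_S = [0.125, 0.25, 0.5]     # hypothetical choices, not from the paper
samples = [sample_pair(240, 24, STRIDES_S, rng) for _ in range(5)]
```

Each clip then contributes pairs at several temporal scales instead of one, which is the added diversity the post refers to.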

I don’t like the DINO EMA teacher approach for avoiding collapse; I think SigReg would have been more direct.
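For readers unfamiliar with the EMA teacher idea, here is a minimal sketch of the generic update: the teacher's weights are an exponential moving average of the student's. Weights are flat lists and the momentum values are hypothetical; this is not the paper's code.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return teacher weights moved a small step toward the student weights."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

teacher = [1.0, 0.0]
student = [0.0, 1.0]
# One update with an illustrative momentum of 0.9.
teacher = ema_update(teacher, student, momentum=0.9)
```

The teacher only changes through this averaging step, so it lags behind the student and provides slowly moving targets.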

The LeWorldModel work uses sequential video frames and SigReg, but it just minimizes latent distances between neighboring frames; you really want to predict (the ‘P’ in JEPA) from one latent to the next. Linear extrapolation based on the previous frame kind of works, but some level of conditioning on the current latent should be better.
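The distinction between the two ideas can be sketched with toy latents. The first function is the neighboring-frame distance as the post describes it; the second is the linear extrapolation baseline, which continues the last step. Names and values are illustrative assumptions.

```python
def neighbor_distance(z_cur, z_next):
    """Squared distance between two latents (what a neighbor-distance objective minimizes)."""
    return sum((a - b) ** 2 for a, b in zip(z_cur, z_next))

def linear_extrapolate(z_prev, z_cur):
    """Predict the next latent by repeating the last step: z_cur + (z_cur - z_prev)."""
    return [2 * c - p for p, c in zip(z_prev, z_cur)]

# Toy latents moving at constant velocity.
z0, z1, z2 = [0.0, 0.0], [1.0, 2.0], [2.0, 4.0]
pred = linear_extrapolate(z0, z1)
error = neighbor_distance(pred, z2)   # prediction error of the extrapolation
gap = neighbor_distance(z1, z2)       # plain distance between neighbors
```

In this constant-velocity toy case extrapolation is exact while the neighbor distance is not zero; a learned predictor conditioned on the current latent would generalize the extrapolation step to motion that is not linear.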

Still, I’m unsure about the soundness of using frame subtraction to create the delta, since it has both frames entangled in it, so it really isn’t doing any kind of causal prediction. The architectural prior here is “only represent things that can be disentangled from a delta frame”, and I’m not sure that is universally valuable.
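The entanglement concern can be shown directly: given the first frame and the subtraction, the second frame is fully recoverable. This sketch assumes a signed, unclipped per-channel difference on 8-bit values; the paper's actual preprocessing may differ.

```python
def frame_diff(frame1, frame2):
    """Signed per-channel difference between two frames of 8-bit values."""
    return [b - a for a, b in zip(frame1, frame2)]

def reconstruct_second(frame1, diff):
    """Recover the second frame exactly from the first frame and the delta."""
    return [a + d for a, d in zip(frame1, diff)]

# Toy "frames" as flat lists of channel values.
frame1 = [12, 200, 35, 90]
frame2 = [15, 180, 35, 255]
diff = frame_diff(frame1, frame2)
recovered = reconstruct_second(frame1, diff)
```

Since the delta input already contains everything about the second frame that the first frame lacks, the motion encoder is observing the future frame rather than predicting it.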

In their limitations section, they note that scaling to larger video datasets did not help their performance, but they expect better datasets and hyperparameter tuning will.

temporal-difference-vision.github.io: You Don't Need Strong Assumptions: Visual Representation Learning via Temporal Differences. We introduce Temporal Difference in Vision (TDV), a self-supervised learning approach from video that relies only on the causal assumption that the past causes the future, matching or surpassing...

2:43 AM · Jun 18, 2026 · 18K Views
