Long-Form Video Understanding: Bottlenecks and Design Choices

Long-Form Video Understanding: Bottlenecks and Design Choices - Part 1

Yinghong Lan


A field guide to long-form video understanding, mapping the design space along two axes: memory, from discard to keep; and compute, from external agents to agentified models.

Yinghong Lan Jun 15, 2026

Bottlenecks: memory, compute, evaluation

Recently I have been hearing and reading seemingly contradictory opinions on understanding long-form video (tens of minutes to several hours), such as:

“Sweeping through the whole video is necessary - we should focus on making that as efficient as possible” vs. “there are many clever tricks to selectively retrieve - let’s explore those.”

“We should just keep improving MLLMs until they can handle everything” vs. “agents are the future of video understanding - let’s build more agent swarms.”

My thesis is that these views do not really disagree about what is true - they make different tradeoffs about where to spend a limited “budget”. Unlike a text document or a few dozen images, a two-hour video exceeds the memory and compute budget unless it is deliberately compressed, sampled, or retrieved from. The contradictions above reflect different design choices for solving this core challenge, and there is no consensus yet on a universally optimal design. Quite the contrary - two distinct axes of tradeoff are being actively explored:

The memory axis. When you cannot afford to attend to everything, do you throw information away and lean on adaptive retrieval - or do you keep all the information and compress the attention/KV-cache? Two different answers to the same memory ceiling.

The compute axis. When you cannot yet compute the answer accurately in one pass, do you buy accuracy with an agentic system that runs many inferences - or do you internalize that agentic behavior into the model itself, so it's a learned, native capability rather than an external orchestration loop?
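To make the “budget” concrete, here is a back-of-envelope sketch of what a two-hour video costs if nothing is compressed, sampled away, or retrieved. Every number in it (1 frame per second, 196 visual tokens per frame, a 32-layer model with 8 KV heads of dimension 128, 16-bit values) is an illustrative assumption, not a measurement of any particular model.

```python
def visual_tokens(duration_s: float, fps: float, tokens_per_frame: int) -> int:
    """Visual tokens produced by sampling at `fps` with no compression."""
    return int(duration_s * fps) * tokens_per_frame


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Uncompressed KV cache size: one key and one value per layer and head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# All numbers below are illustrative assumptions, not measurements.
two_hours = 2 * 60 * 60  # seconds
tokens = visual_tokens(two_hours, fps=1.0, tokens_per_frame=196)
cache = kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128)

print(f"{tokens:,} visual tokens")             # 1,411,200 visual tokens
print(f"{cache / 2**30:.1f} GiB of KV cache")  # 172.3 GiB of KV cache
```

Under these assumptions, even a modest 1 fps sampling rate yields about 1.4 million visual tokens and roughly 172 GiB of KV cache. Different models will give different numbers, but the order of magnitude is why both axes above exist.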

And there is a third bottleneck, which is, as always, evaluation. The problem is that some benchmarks do not control for the complexity and dependency of the tasks they bundle together, which makes “approach A beats approach B” claims much weaker than they look:

Complexity isn’t controlled. Even the benchmarks marketed as “long” rarely exceed an hour, while real production workloads often run for hours.

Dependency isn’t controlled. Plenty of “video” questions are anchored on a single frame or a few seconds, or are answerable from the transcript/subtitle alone, with no real long-range understanding required.
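One way to picture the dependency issue is a small audit that re-runs each benchmark question under ablated inputs. The sketch below is hypothetical: the three input conditions, the `QuestionResult` record, and the `dependency_report` helper are names invented here for illustration, not a procedure taken from any benchmark.

```python
from dataclasses import dataclass


@dataclass
class QuestionResult:
    """Per-question correctness under three (hypothetical) input conditions."""
    full_video: bool      # model saw the whole video
    subtitles_only: bool  # model saw only the transcript/subtitles
    single_frame: bool    # model saw a single frame


def dependency_report(results: list[QuestionResult]) -> dict[str, float]:
    """Fraction of questions answerable via a shortcut vs. needing the video."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to audit")
    # A question has a shortcut if an ablated input already answers it.
    shortcut = sum(r.subtitles_only or r.single_frame for r in results)
    # It needs long-range video if only the full-video run gets it right.
    needs_video = sum(
        r.full_video and not (r.subtitles_only or r.single_frame)
        for r in results
    )
    return {
        "shortcut_answerable": shortcut / n,
        "needs_long_range_video": needs_video / n,
    }


# Toy data, made up for illustration only.
toy = [
    QuestionResult(True, True, False),    # answerable from subtitles
    QuestionResult(True, False, True),    # answerable from one frame
    QuestionResult(True, False, False),   # genuinely needs the video
    QuestionResult(False, False, False),  # unanswered in every condition
]
report = dependency_report(toy)
print(report)  # {'shortcut_answerable': 0.5, 'needs_long_range_video': 0.25}
```

On this toy data only a quarter of the questions actually exercise long-range understanding, so a gap between two approaches on the full set would say little about that capability.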

To keep this writeup focused and readable, I’ll use the rest of it to survey the design choices along these two axes, and leave evaluation and benchmarks for long-form video understanding to a separate future writeup (Part 2).

Sidebar clarification: technically one could push a two-hour video into a long-context model like Gemini. But that does not work reliably for tasks that genuinely require long-form temporal understanding, e.g., “what is the story arc for the character who first appears in a blue coat between 20:02 and 20:22?” While it’s hard to know the full details of these closed models, it is reasonable to conjecture that they lean on subtitles/ASR, metadata, or a handful of frames for most questions. As shown in recent benchmarks built from movies and TV shows (InfiniBench, Ataallah et al. 2025), models can score quite well on certain tasks based purely on subtitles and metadata (the “dependency isn’t controlled” issue above). In summary, “feed the whole video to a frontier model” is not a silver bullet for long-form video understanding.

Memory design choice: discard vs. keep

Of course, the solutions are not as simple as keep nothing vs. keep everything - there is a full spectrum, running from aggressively throwing information away to keeping all of it and paying the cost somewhere else.

Aggressively discard: adaptive selection of frames, clips, and patches

Instead of uniform sampling, previous work has proposed selecting only what matters:

Lightweight learned selectors of frames (M-LLM Based Video Frame Selection, Hu et al. 2025)

Training-free key clip selection that keeps short coherent segments instead of isolated frames (From Frames to Clips, Sun et al. 2025)

RL for samplers (Temporal Sampling Policy Optimization, Tang et al. 2025) where an event-aware “temporal agent” is trained for keyframe selection

Reasoning-driven sampling that traverses coarse summaries, refines its focus, and halts once it has enough evidence (LongVideo-R1, Qiu et al. 2026)

Joint RL training of the sampler and the model (MSJoE, Tan et al. 2026)
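As a point of reference for the methods listed above, here is a minimal sketch of the simplest form of adaptive selection: score every frame against the question and keep the top k, compared with uniform sampling. It is a generic illustration, not the method of any cited paper. The embeddings are toy 2-D vectors standing in for whatever image/text encoder one would actually use, and `select_frames` and `uniform_sample` are hypothetical helper names.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def uniform_sample(num_frames: int, k: int) -> list[int]:
    """Baseline: k evenly spaced frame indices, ignoring the question."""
    k = min(k, num_frames)
    return [(i * num_frames) // k for i in range(k)]


def select_frames(frame_embs: list[list[float]], query_emb: list[float],
                  k: int) -> list[int]:
    """Keep the k frames most similar to the query, in temporal order."""
    scores = [cosine(f, query_emb) for f in frame_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy embeddings (assumed): only frames 2 and 4 are relevant to the query.
frames = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0],
          [0.0, 1.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.0]

print(uniform_sample(len(frames), 2))    # [0, 3]
print(select_frames(frames, query, 2))   # [2, 4]
```

With the same budget of two frames, uniform sampling misses both relevant frames in this toy case while the similarity-based selector finds them. The learned, RL-trained, and reasoning-driven selectors above can be read as progressively more capable replacements for the scoring step.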

There is a second, somewhat orthogonal design choice here: most selectors are “query conditioned”, meaning they will select frames / clips based on the user question;...
