Long-Form Video Understanding – Part 2: Evaluation and Benchmarks

Yinghong Lan

Why long-video benchmarks cluster in the corner that's cheap to grade - and why the judge that could grade the rest is stuck there too.

Yinghong Lan Jun 17, 2026

In my previous writeup, I demystified Multimodal LLM-as-a-Judge; in Part 1 on long-form video understanding, I discussed current bottlenecks and design choices. This post finishes the small trilogy, diving into evaluation and benchmarks, and tying the other two together. Before we start, let me define benchmarks and evaluation:

A benchmark is typically two things: what you are measuring - a set of tasks; and how you measure - a way to score answers against those tasks. It’s just one component of evaluation.

Evaluation is the harder question of whether that score is actually useful to you. A benchmark can mislead on either side:

Tasks may not require the capability you care about - e.g. answerable from one frame or the subtitles alone, when your use case needs real long-form understanding.

Scoring - e.g. LLM judges for open-ended answers - may not be calibrated to your use case.

The rest of this writeup dissects both sides, in three parts:

Benchmark survey: I categorize benchmarks by task and scoring complexity, and point out clear gaps and the reasons behind them.

Evaluation and Agentic Judge: there are clear gaps in open-ended, multi-hop questions - which is where a judge becomes essential. I connect back to my previous writeup.

Judge for training: I close with the training-time uses of a judge - for example, the agentic methods from Part 1.

Benchmarks for long-form video understanding

Two kinds of properties matter here:

Task axes - how hard the answer is to produce: where the evidence sits (scope), and how many reasoning steps it takes (depth).

Scoring axes - how hard the answer is to verify: whether it’s closed or open-ended, and whether it’s objective or subjective.

And, as we’ll see, the two can be quite correlated.

Task axes: evidence scope and compositional depth

The earlier survey by Tang et al. slices video tasks by granularity - abstract (e.g., video classification) vs. temporal (e.g., moment retrieval) vs. spatiotemporal (e.g., object tracking), along with varying degrees of language involvement. At a high level, this captures where the relevant evidence sits (scope). It’s a useful axis, but on its own it can mislead: where the evidence sits doesn’t tell you how hard the task is. I think there’s a second, equally important dimension: how much planning and decomposition a task takes (depth) - i.e., how many steps the model has to chain together. Take two questions that both need the whole video (same scope), and both expect closed, objective answers:

“Does anyone in a red coat ever appear?” - a presence check: find one match and stop - shallow.

“How many distinct people appear, counting each only once?” - scan the whole video, re-identify, dedup, and tally. Several dependent steps - deep.
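To make the contrast concrete, here is a toy sketch of the two questions, assuming the video has already been reduced to a list of per-frame detections. The `Detection` record, its `identity` field (standing in for the output of a re-identification step), and both function names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Detection:
    """Hypothetical per-frame detection; a real pipeline would produce these."""
    frame: int
    identity: str    # assumed output of a re-identification step
    coat_color: str


def red_coat_appears(detections: Iterable[Detection]) -> bool:
    """Shallow: a presence check that can stop at the first match."""
    return any(d.coat_color == "red" for d in detections)


def count_distinct_people(detections: Iterable[Detection]) -> int:
    """Deep: scan every detection, dedup by re-identified person, then tally."""
    seen = set()
    for d in detections:
        seen.add(d.identity)
    return len(seen)


# Toy data: person_a shows up twice and must be counted once.
toy_video = [
    Detection(0, "person_a", "blue"),
    Detection(40, "person_b", "red"),
    Detection(90, "person_a", "blue"),
    Detection(120, "person_c", "green"),
]

has_red_coat = red_coat_appears(toy_video)
num_people = count_distinct_people(toy_video)
```

Both functions read the same whole-video input, but the count is only meaningful if every detection was re-identified correctly before the tally. That dependency between steps is what the extra depth refers to.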

So I propose two task axes:

Evidence scope - where the relevant evidence sits: single frame → single clip → dispersed segments → whole video.

Compositional depth - how many steps the model must chain:

L0 - Perceive: the relevant moment is specified; just recognize what’s there.

L1 - Locate / retrieve: the target is specified; find whether and where it occurs.

L2 - Single-hop reason: find the relevant evidence and reason once to an answer.

L3 - Multi-hop / decompose: break into interdependent sub-questions, chain retrieve-then-reason steps, and aggregate.
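One way to record these two axes when tagging questions is a pair of ordered enums. This is a minimal sketch: the class and member names are my own, and placing the red-coat question at L1 is one reading of "presence check" rather than a label taken from any benchmark.

```python
from dataclasses import dataclass
from enum import IntEnum


class Scope(IntEnum):
    """Evidence scope, ordered from narrowest to widest."""
    SINGLE_FRAME = 0
    SINGLE_CLIP = 1
    DISPERSED_SEGMENTS = 2
    WHOLE_VIDEO = 3


class Depth(IntEnum):
    """Compositional depth, L0 through L3."""
    L0_PERCEIVE = 0
    L1_LOCATE = 1
    L2_SINGLE_HOP = 2
    L3_MULTI_HOP = 3


@dataclass(frozen=True)
class TaskTag:
    """Hypothetical annotation attached to one benchmark question."""
    question: str
    scope: Scope
    depth: Depth


red_coat = TaskTag(
    "Does anyone in a red coat ever appear?",
    Scope.WHOLE_VIDEO,
    Depth.L1_LOCATE,      # assumed: a presence check is a locate task
)
head_count = TaskTag(
    "How many distinct people appear, counting each only once?",
    Scope.WHOLE_VIDEO,
    Depth.L3_MULTI_HOP,
)
```

Because both enums are ordered, the two example questions compare as equal on scope and different on depth, which is the distinction a scope-only taxonomy would miss.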

Scoring axes: closed/open-ended and objective/subjective

To show that scoring is a separate dimension, take two questions of similar task difficulty:

“How many distinct people appear in the whole video?” - wide scope and multi-hop, but the answer is a number you can check automatically.

“Who are the distinct people, and what distinguishes each?” - similar underlying work, but now the answer is free-form text with no exact match; you need a mechanism to read and judge it.

So I propose two scoring axes:

Closed vs. open - can you auto-score it (exact match, tIoU), or do you need a model to read the answer (a judge)?

Objective vs. subjective - is there a single defensible answer, or legitimate variation? (“Is the cinematography good?” - even a perfect judge can’t guarantee agreement without more objective standards.)
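For the closed side, the two auto-scorers named above fit in a few lines. This is a sketch under simple assumptions: answers are plain strings, time spans are `(start, end)` pairs in seconds, and the whitespace and case normalization in `exact_match` is my choice, not a rule from any particular benchmark.

```python
from typing import Tuple


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the answers match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def temporal_iou(pred: Tuple[float, float], ref: Tuple[float, float]) -> float:
    """Temporal IoU: overlap of two (start, end) spans divided by their union."""
    intersection = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = (pred[1] - pred[0]) + (ref[1] - ref[0]) - intersection
    return intersection / union if union > 0 else 0.0


count_score = exact_match(" 7 ", "7")
span_score = temporal_iou((10.0, 20.0), (15.0, 25.0))
```

There is no equivalent function for "Who are the distinct people, and what distinguishes each?": no normalization makes two correct free-form descriptions string-equal, so something has to read the answer.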

It's worth noting that these axes together trace the line that separates verifiable from non-verifiable rewards in reinforcement learning.

Benchmark categorization

I categorize the major long-video benchmarks along all four axes below: scope and depth are a center-of-mass range (most benchmarks span several rungs); format and ground truth reflect each benchmark's primary scoring mode. (The full list of the 16 benchmarks...
