Long-Form Video Understanding – Part 2: Evaluation and Benchmarks



Yinghong Lan


Why long-video benchmarks cluster in the corner that's cheap to grade - and why the judge that could grade the rest is stuck there too.

Jun 17, 2026


In my previous writeup, I demystified Multimodal LLM-as-a-Judge; in Part 1 on long-form video understanding, I discussed current bottlenecks and design choices. This post finishes the small trilogy, diving into evaluation and benchmarks, and tying the other two together.<br>Before we start, let me define benchmarks and evaluation:


A benchmark is typically two things: what you are measuring - a set of tasks; and how you measure - a way to score answers against those tasks. It’s just one component of evaluation.

Evaluation is the harder question of whether that score is actually useful to you. A benchmark can mislead on either side:<br>Tasks may not require the capability you care about - e.g. answerable from one frame or the subtitles alone, when your use case needs real long-form understanding.

Scoring - e.g. LLM judges for open-ended answers - may not be calibrated to your use case.

The rest of this writeup dissects both:<br>Benchmark survey: I categorize benchmarks by task and scoring complexity, and point out clear gaps and the reasons behind them.

Evaluation and Agentic Judge: there are clear gaps in open-ended, multi-hop questions - which is where a judge becomes essential. I connect back to my previous writeup.

Judge for training: I close with the training-time uses of a judge - for example, the agentic methods from Part 1.

Benchmarks for long-form video understanding

Two different kinds of property matter here:<br>Task axes - how hard the answer is to produce: where the evidence sits (scope), and how many reasoning steps it takes (depth).

Scoring axes - how hard the answer is to verify: whether it’s closed or open-ended, and whether it’s objective or subjective.

And, as we’ll see, the two can be quite correlated.<br>Task axes: evidence scope and compositional depth

The earlier survey by Tang et al. slices video tasks by granularity - abstract (e.g., video classification) vs. temporal (e.g., moment retrieval) vs. spatiotemporal (e.g., object tracking), along with varying degrees of language involvement. At a high level, this captures where the relevant evidence sits (scope). It’s a useful axis, but on its own it can mislead: where the evidence sits doesn’t tell you how hard the task is.<br>I think there’s a second, equally important dimension: how much planning and decomposition a task takes (depth) - i.e., how many steps the model has to chain together. Take two questions that both need the whole video (same scope), and both expect closed, objective answers:<br>“Does anyone in a red coat ever appear?” - a presence check: find one match and stop - shallow.

“How many distinct people appear, counting each only once?” - scan the whole video, re-identify, dedup, and tally. Several dependent steps - deep.
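To make the shallow/deep contrast concrete, here is a minimal sketch that assumes the video has already been reduced to per-frame person detections. The `Detection` record, its `coat_color` field, and the `identity` label (standing in for the output of a re-identification step) are hypothetical names for illustration, not part of any benchmark.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Detection:
    """Hypothetical per-frame person detection (illustrative only)."""
    timestamp_s: float
    coat_color: str
    identity: str  # assumed output of a re-identification step


def red_coat_appears(detections: Iterable[Detection]) -> bool:
    # Shallow: any() short-circuits, so the scan stops at the first match.
    return any(d.coat_color == "red" for d in detections)


def count_distinct_people(detections: Iterable[Detection]) -> int:
    # Deep: every detection must be visited, re-identified, and
    # deduplicated before the final tally can be produced.
    seen = set()
    for d in detections:
        seen.add(d.identity)
    return len(seen)


detections = [
    Detection(3.0, "blue", "person_a"),
    Detection(41.5, "red", "person_b"),
    Detection(600.0, "blue", "person_a"),
    Detection(1750.2, "green", "person_c"),
]
```

Both functions read the same whole-video input, but the first can return after one hit while the second depends on every earlier step being right.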

So I propose two task axes:<br>Evidence scope - where the relevant evidence sits: single frame → single clip → dispersed segments → whole video.

Compositional depth - how many steps the model must chain:<br>L0 - Perceive: the relevant moment is specified; just recognize what’s there.

L1 - Locate / retrieve: the target is specified; find whether and where it occurs.

L2 - Single-hop reason: find the relevant evidence and reason once to an answer.

L3 - Multi-hop / decompose: break into interdependent sub-questions, chain retrieve-then-reason steps, and aggregate.
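The two task axes can be sketched as ordered tags attached to each question. The class and field names below are my own, and placing the red-coat question at L1 and the head-count question at L3 is my reading of the definitions above rather than a label from any benchmark.

```python
from dataclasses import dataclass
from enum import IntEnum


class Scope(IntEnum):
    """Where the relevant evidence sits, narrowest to widest."""
    SINGLE_FRAME = 0
    SINGLE_CLIP = 1
    DISPERSED_SEGMENTS = 2
    WHOLE_VIDEO = 3


class Depth(IntEnum):
    """How many steps the model must chain."""
    L0_PERCEIVE = 0
    L1_LOCATE = 1
    L2_SINGLE_HOP = 2
    L3_MULTI_HOP = 3


@dataclass(frozen=True)
class TaskTag:
    question: str
    scope: Scope
    depth: Depth


# Assumed placements, for illustration only.
red_coat = TaskTag(
    "Does anyone in a red coat ever appear?",
    Scope.WHOLE_VIDEO,
    Depth.L1_LOCATE,
)
head_count = TaskTag(
    "How many distinct people appear, counting each only once?",
    Scope.WHOLE_VIDEO,
    Depth.L3_MULTI_HOP,
)
```

Using `IntEnum` keeps the rungs comparable, so `red_coat.depth < head_count.depth` holds even though the two scopes are equal.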

Scoring axes: closed/open-ended and objective/subjective

To show that scoring is a separate dimension, take two questions of similar task difficulty:<br>“How many distinct people appear in the whole video?” - wide scope and multi-hop, but the answer is a number you can check automatically.

“Who are the distinct people, and what distinguishes each?” - similar underlying work, but now the answer is free-form text with no exact match; you need a mechanism to read and judge it.

So I propose two scoring axes:<br>Closed vs. open - can you auto-score it (exact match, tIoU), or do you need a model to read the answer (a judge)?

Objective vs. subjective - is there a single defensible answer, or legitimate variation? (“Is the cinematography good?” - even a perfect judge can’t guarantee agreement without more objective standards.)
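The closed end of the first axis is cheap because the checks are a few lines of code. Below is a minimal sketch of the two auto-scorers named above, reading tIoU as temporal intersection-over-union between a predicted and a reference time span. The function names and the case-insensitive normalization in `exact_match` are my assumptions; individual benchmarks define their own normalization rules.

```python
from typing import Tuple

Span = Tuple[float, float]  # (start_s, end_s)


def exact_match(prediction: str, reference: str) -> bool:
    # Assumed normalization: trim whitespace and ignore case.
    return prediction.strip().lower() == reference.strip().lower()


def temporal_iou(pred: Span, ref: Span) -> float:
    # Overlap between the two spans divided by their combined extent.
    pred_start, pred_end = pred
    ref_start, ref_end = ref
    intersection = max(0.0, min(pred_end, ref_end) - max(pred_start, ref_start))
    union = (pred_end - pred_start) + (ref_end - ref_start) - intersection
    return intersection / union if union > 0 else 0.0


def needs_judge(is_closed: bool) -> bool:
    # Open-ended answers have no exact target, so something must read them.
    return not is_closed
```

For example, a predicted span of 10-30 s against a reference of 20-40 s overlaps for 10 s out of a combined 30 s, giving a tIoU of 1/3. There is no equivalent closed-form check for "what distinguishes each person", which is why that question falls to a judge.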

Together, these axes trace the line that separates verifiable from non-verifiable rewards in reinforcement learning: closed, objective answers can be checked automatically, while open-ended or subjective ones need a judge.<br>Benchmark categorization

I categorize the major long-video benchmarks along all four axes below: scope and depth are a center-of-mass range (most benchmarks span several rungs); format and ground truth reflect each benchmark's primary scoring mode. (The full list of the 16 benchmarks...

