The first benchmark to test AI agents' video editing capability


AgenticVBench: Can AI agents do real-world post-production work?

May 2026

We gave the 7 best frontier models 100 expert-authored tasks across the four stages of post-production. The best agent barely crosses 30%. Human experts scored 89%.

100 tasks · 20 industry experts · 7 frontier models · 4 task families

Why this benchmark exists

Verification does not come for free.

RLVR (reinforcement learning with verifiable rewards) works in math and code because centuries of humanistic work built the verifiers; the bill was paid before we got there. Creative work hasn't paid that bill. AgenticVBench is what paying it looks like in film.

Leaderboard preview

Top 5 model × harness combinations.

| Rank | Agent | Avg | Repurpose | Sequencing | Repair | Assembly |
|---|---|---|---|---|---|---|
| · | Human experts (reference) | 88.5% | 95% | 90% | 88% | 81% |
| 1 | GPT-5.5 · Codex | 31.0% ± 4.0 | 30% | 26% | 30% | 38% |
| 2 | GPT-5.5 · OpenCode | 27.4% ± 3.5 | 27% | 20% | 27% | 37% |
| 4 | Claude Opus 4.7 · Claude Code | 22.1% ± 3.5 | 30% | 20% | 17% | 22% |
| 5 | GPT-5.5 · OpenClaw | 21.9% ± 2.9 | 20% | 29% | 21% | 18% |
| 6 | Claude Opus 4.7 · OpenClaw | 21.1% ± 3.4 | 18% | 19% | 25% | 22% |
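The page does not say how the Avg column is computed. As a rough consistency check, the sketch below assumes Avg is an unweighted mean over the four task families and recomputes it from the rounded per-family scores in the preview. The `macro_mean` helper and the `leaderboard` dictionary are illustrative names, not part of the benchmark's code.

```python
# Rows copied from the leaderboard preview:
# (reported Avg, [Repurpose, Sequencing, Repair, Assembly]).
leaderboard = {
    "Human experts": (88.5, [95, 90, 88, 81]),
    "GPT-5.5 · Codex": (31.0, [30, 26, 30, 38]),
    "GPT-5.5 · OpenCode": (27.4, [27, 20, 27, 37]),
    "Claude Opus 4.7 · Claude Code": (22.1, [30, 20, 17, 22]),
    "GPT-5.5 · OpenClaw": (21.9, [20, 29, 21, 18]),
    "Claude Opus 4.7 · OpenClaw": (21.1, [18, 19, 25, 22]),
}


def macro_mean(scores):
    """Unweighted mean over task families (an ASSUMED definition of Avg)."""
    return sum(scores) / len(scores)


# Recompute the average for every row from its per-family scores.
recomputed = {name: macro_mean(fam) for name, (_, fam) in leaderboard.items()}

for name, (reported, _) in leaderboard.items():
    print(f"{name}: reported {reported:.1f}%, recomputed {recomputed[name]:.2f}%")
```

Under this assumption every recomputed value lands within half a point of the reported Avg, which is what rounding the per-family scores to whole percentages would allow. That is consistent with an unweighted mean but does not confirm it.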

What the bench tests

Four task families spanning the real-world post-production workflow.

Authored by 20 industry experts averaging 6 years of post-production experience. Tasks span 30 minutes to one week of human work.

Assembly (18 tasks, 43 pp gap)
Given a storyboard with 3–6 slots and a shuffled pool of candidate clips, select the clip that matches each slot.
Best agent 38% · Human 81%

Repair (18 tasks, 59 pp gap)
Given a video with defects (frozen scene, scene swap, color drift, or audio noise), localize them and produce a fixed cut.
Best agent 30% · Human 88%

Sequencing (28 tasks, 61 pp gap)
Given a brief story overview and a shuffled set of clips, recover the correct narrative order.
Best agent 29% · Human 90%

Repurpose (36 tasks, 65 pp gap)
Given 4–150 minutes of source video and a creative brief, repurpose it into a short deliverable that follows the brief and preserves the story.
Best agent 30% · Human 95%
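The "pp gap" on each task family is the human score minus the best agent score, in percentage points. A minimal sketch of that arithmetic, using only the rounded figures shown for the four families (the `families` dictionary is an illustrative structure, not the benchmark's data format):

```python
# Figures copied from the four task-family cards (whole-number percentages).
families = {
    "Assembly": {"tasks": 18, "best_agent": 38, "human": 81},
    "Repair": {"tasks": 18, "best_agent": 30, "human": 88},
    "Sequencing": {"tasks": 28, "best_agent": 29, "human": 90},
    "Repurpose": {"tasks": 36, "best_agent": 30, "human": 95},
}

# Gap in percentage points: human score minus best agent score.
gaps = {name: f["human"] - f["best_agent"] for name, f in families.items()}

# The per-family task counts should add up to the benchmark's 100 tasks.
total_tasks = sum(f["tasks"] for f in families.values())

for name, gap in gaps.items():
    print(f"{name}: {gap} pp gap")
print(f"Total tasks: {total_tasks}")
```

Recomputed from the rounded percentages, Assembly, Sequencing, and Repurpose match the displayed gaps (43, 61, 65), while Repair comes out at 58 rather than the displayed 59. That one-point difference presumably comes from the page computing the gap on unrounded scores.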

The harness finding

The harness matters as much as the model.

Holding the model fixed and varying the harness shifts GPT-5.5's Assembly score by 20 percentage points, comparable to the gap between adjacent models on the leaderboard.

Most benchmarks today still rank models alone. The data here says that's wrong: agent performance is determined by both the model and the scaffolding around it, and reporting only the model misses the larger story.

Agent = model × harness.

GPT-5.5 on Assembly · score by harness

Codex: 38%
OpenCode: 37%
OpenClaw: 18%

Same model. 20-point swing.
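One way to make "agent = model × harness" concrete is to key scores by the (model, harness) pair rather than by model alone. The sketch below does this for the Assembly scores listed in the leaderboard preview and measures how far each model's score moves when only the harness changes. The variable names are illustrative, and only the five pairs shown in the preview are included.

```python
# Assembly scores (%) keyed by (model, harness), from the leaderboard preview.
assembly = {
    ("GPT-5.5", "Codex"): 38,
    ("GPT-5.5", "OpenCode"): 37,
    ("GPT-5.5", "OpenClaw"): 18,
    ("Claude Opus 4.7", "Claude Code"): 22,
    ("Claude Opus 4.7", "OpenClaw"): 22,
}

# Group scores by model so the harness is the only thing that varies.
by_model = {}
for (model, harness), score in assembly.items():
    by_model.setdefault(model, {})[harness] = score

# Swing = best harness score minus worst harness score for the same model.
harness_swing = {
    model: max(scores.values()) - min(scores.values())
    for model, scores in by_model.items()
}

for model, swing in harness_swing.items():
    print(f"{model}: {swing}-point swing across {len(by_model[model])} harnesses")
```

For GPT-5.5 this reproduces the 20-point swing (38% with Codex versus 18% with OpenClaw). The two Claude Opus 4.7 entries in the preview happen to tie on Assembly, so a per-model summary would hide a difference that only shows up once the harness is part of the key.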
