AgenticVBench: Can AI agents do real-world post-production work?<br>github ↗
May 2026<br>We gave the 7 best frontier models 100 expert-authored tasks across the four stages of post-production. The best agent barely crosses 30%. Human experts scored 89%.<br>Read the paper · Leaderboard · Code & data · Tasks · Discord<br>100<br>Tasks
20<br>Industry experts
7<br>Frontier models
4<br>Task families
Why this benchmark exists<br>Verification does not come for free.<br>RLVR works in math and code because centuries of humanistic work built the verifiers; the bill was paid before we got there. Creative work hasn't paid that bill. AgenticVBench is what paying it looks like in film.<br>Read the full essay →
Leaderboard preview<br>Top 5 model × harness combinations.
View full leaderboard →
Rank | Agent | Avg | Repurpose | Seq | Repair | Assembly
· | Human experts (reference) | 88.5% | 95% | 90% | 88% | 81%
1 | GPT-5.5 · Codex | 31.0% ± 4.0 | 30% | 26% | 30% | 38%
2 | GPT-5.5 · OpenCode | 27.4% ± 3.5 | 27% | 20% | 27% | 37%
4 | Claude Opus 4.7 · Claude Code | 22.1% ± 3.5 | 30% | 20% | 17% | 22%
5 | GPT-5.5 · OpenClaw | 21.9% ± 2.9 | 20% | 29% | 21% | 18%
6 | Claude Opus 4.7 · OpenClaw | 21.1% ± 3.4 | 18% | 19% | 25% | 22%
What the bench tests<br>Four task families spanning the real-world post-production workflow.<br>Authored by 20 industry experts averaging 6 years of post-production experience. Tasks span 30 minutes to one week of human work.<br>Assembly<br>18 tasks
43<br>pp gap
Given a storyboard with 3–6 slots and a shuffled pool of candidate clips, select the clip that matches each slot.<br>Best agent 38% · Human 81%
Repair<br>18 tasks
59<br>pp gap
Given a video with defects (frozen scene, scene swap, color drift, or audio noise), localize them and produce a fixed cut.<br>Best agent 30% · Human 88%
Sequencing<br>28 tasks
61<br>pp gap
Given a brief story overview and a shuffled set of clips, recover the correct narrative order.<br>Best agent 29% · Human 90%
Repurpose<br>36 tasks
65<br>pp gap
Given 4–150 minutes of source video and a creative brief, repurpose it into a short deliverable that follows the brief and preserves the story.<br>Best agent 30% · Human 95%
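The "pp gap" on each family card is the human score minus the best agent score, in percentage points. A minimal sketch of that arithmetic, using only the rounded percentages printed on the cards (the `families` dictionary and the `pp_gap` helper are illustrative names, not part of the benchmark code):

```python
# (best agent %, human expert %) per task family, copied from the cards above.
families = {
    "Assembly": (38, 81),
    "Repair": (30, 88),
    "Sequencing": (29, 90),
    "Repurpose": (30, 95),
}


def pp_gap(best_agent: int, human: int) -> int:
    """Gap in percentage points between human experts and the best agent."""
    return human - best_agent


gaps = {name: pp_gap(agent, human) for name, (agent, human) in families.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap} pp")
```

From the rounded figures this gives 43, 58, 61, and 65 pp. The Repair card shows 59; the one-point difference is presumably because the page computes the gap from unrounded scores, though that is an assumption.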
The harness finding<br>The harness matters as much as the model.<br>Holding the model fixed and varying the harness shifts GPT-5.5's Assembly score by 20 percentage points (38% with Codex versus 18% with OpenClaw), comparable to the gap between adjacent models on the leaderboard.<br>Most benchmarks today still report scores by model alone. The data here says that's wrong. Agent performance is determined by both the model and the scaffolding around it. Reporting only the model misses the larger story.<br>Agent = model × harness.
GPT-5.5 on Assembly · score by harness<br>Codex: 38%<br>OpenCode: 37%<br>OpenClaw: 18%
Same model. 20-point swing.
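The swing is the spread between the best and worst harness for one fixed model. A minimal sketch of that calculation on the three Assembly scores above (the `harness_swing` function is a hypothetical helper for illustration, not part of the AgenticVBench code):

```python
# GPT-5.5 Assembly scores (percent) by harness, copied from the chart above.
assembly_scores = {"Codex": 38, "OpenCode": 37, "OpenClaw": 18}


def harness_swing(scores: dict[str, int]) -> tuple[str, str, int]:
    """Return (best harness, worst harness, spread in percentage points)."""
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    return best, worst, scores[best] - scores[worst]


best, worst, swing = harness_swing(assembly_scores)
print(f"{best} vs {worst}: {swing} pp swing")
```

The same helper could be applied to any model and task family where more than one harness was evaluated.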