The first benchmark to test AI agents' video editing capabilities

AgenticVBench: Can AI agents do real-world post-production work?

May 2026

We gave the 7 best frontier models 100 expert-authored tasks across the four stages of post-production. The best agent barely crosses 30%. Human experts scored 89%.

100 tasks · 20 industry experts · 7 frontier models · 4 task families

Why this benchmark exists

Verification does not come for free. Reinforcement learning with verifiable rewards (RLVR) works in math and code because centuries of humanistic work built the verifiers; the bill was paid before we got there. Creative work hasn't paid that bill. AgenticVBench is what paying it looks like in film.

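Leaderboard preview

Top 5 model × harness combinations.

| Rank | Agent | Avg | Repurpose | Seq | Repair | Assembly |
|------|-------|-----|-----------|-----|--------|----------|
| · | Human experts (reference) | 88.5% | 95% | 90% | 88% | 81% |
| 1 | GPT-5.5 · Codex | 31.0% ± 4.0 | 30% | 26% | 30% | 38% |
| 2 | GPT-5.5 · OpenCode | 27.4% ± 3.5 | 27% | 20% | 27% | 37% |
| 4 | Claude Opus 4.7 · Claude Code | 22.1% ± 3.5 | 30% | 20% | 17% | 22% |
| 5 | GPT-5.5 · OpenClaw | 21.9% ± 2.9 | 20% | 29% | 21% | 18% |
| 6 | Claude Opus 4.7 · OpenClaw | 21.1% ± 3.4 | 18% | 19% | 25% | 22% |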
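The page does not say how the Avg column is computed. As a quick sanity check, and purely as an assumption, the sketch below compares each reported average with the unweighted mean of the four rounded family scores from the leaderboard preview. The `ROWS` structure and the `macro_average` and `deviations` helpers are illustrative names, not part of the benchmark's code.

```python
# Rows copied from the leaderboard preview: (agent, reported avg, family scores).
# Family order: Repurpose, Sequencing, Repair, Assembly.
ROWS = [
    ("Human experts", 88.5, (95, 90, 88, 81)),
    ("GPT-5.5 · Codex", 31.0, (30, 26, 30, 38)),
    ("GPT-5.5 · OpenCode", 27.4, (27, 20, 27, 37)),
    ("Claude Opus 4.7 · Claude Code", 22.1, (30, 20, 17, 22)),
    ("GPT-5.5 · OpenClaw", 21.9, (20, 29, 21, 18)),
    ("Claude Opus 4.7 · OpenClaw", 21.1, (18, 19, 25, 22)),
]


def macro_average(scores):
    """Unweighted mean of per-family scores (an assumed aggregation)."""
    return sum(scores) / len(scores)


def deviations(rows):
    """Map each agent to (assumed macro average - reported average)."""
    return {
        agent: round(macro_average(scores) - reported, 2)
        for agent, reported, scores in rows
    }


if __name__ == "__main__":
    for agent, delta in deviations(ROWS).items():
        print(f"{agent:32s} {delta:+.2f}")
```

Under this assumption every row lands within half a point of its reported average, which is consistent with the published averages having been computed from unrounded scores. It does not confirm the aggregation the authors actually used.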

What the benchmark tests

Four task families spanning the real-world post-production workflow. Authored by 20 industry experts averaging 6 years of post-production experience. Tasks span 30 minutes to one week of human work.

Assembly (18 tasks, 43 pp gap)
Given a storyboard with 3–6 slots and a shuffled pool of candidate clips, select the clip that matches each slot. Best agent: 38%. Human: 81%.

Repair (18 tasks, 59 pp gap)
Given a video with defects (frozen scene, scene swap, color drift, or audio noise), localize them and produce a fixed cut. Best agent: 30%. Human: 88%.

Sequencing (28 tasks, 61 pp gap)
Given a brief story overview and a shuffled set of clips, recover the correct narrative order. Best agent: 29%. Human: 90%.

Repurpose (36 tasks, 65 pp gap)
Given 4–150 minutes of source video and a creative brief, repurpose it into a short deliverable that follows the brief and preserves the story. Best agent: 30%. Human: 95%.
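The percentage-point gaps listed for each family are simply the human score minus the best agent score. The sketch below recomputes them from the rounded figures on this page; the `FAMILIES` dictionary and `pp_gap` helper are hypothetical names introduced only for illustration.

```python
# Per-family figures copied from the page (scores in percent).
FAMILIES = {
    "Assembly":   {"tasks": 18, "human": 81, "best_agent": 38, "listed_gap": 43},
    "Repair":     {"tasks": 18, "human": 88, "best_agent": 30, "listed_gap": 59},
    "Sequencing": {"tasks": 28, "human": 90, "best_agent": 29, "listed_gap": 61},
    "Repurpose":  {"tasks": 36, "human": 95, "best_agent": 30, "listed_gap": 65},
}


def pp_gap(human, best_agent):
    """Gap in percentage points between human and best-agent scores."""
    return human - best_agent


def recomputed_gaps(families):
    """Map each family name to its gap recomputed from rounded scores."""
    return {
        name: pp_gap(row["human"], row["best_agent"])
        for name, row in families.items()
    }


if __name__ == "__main__":
    total_tasks = sum(row["tasks"] for row in FAMILIES.values())
    print(f"total tasks: {total_tasks}")
    for name, gap in recomputed_gaps(FAMILIES).items():
        listed = FAMILIES[name]["listed_gap"]
        print(f"{name:11s} recomputed {gap} pp, listed {listed} pp")
```

Three of the four recomputed gaps match the listed values. Repair comes out at 58 rather than 59, presumably because the listed gap was derived from unrounded scores.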

The harness finding

The harness matters as much as the model. Holding the model fixed and varying the harness shifts GPT-5.5's Assembly score by 20 percentage points, comparable to the gap between adjacent models on the leaderboard. Most benchmarks today still rank models alone. The data here says that's wrong. Agent performance is determined by both the model and the scaffolding around it, and reporting only the model misses the larger story. Agent = model × harness.

GPT-5.5 on Assembly, score by harness:

Codex: 38%
OpenCode: 37%
OpenClaw: 18%

Same model. 20-point swing.
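The swing is the spread between the best and worst harness for a fixed model. As a minimal sketch, the snippet below applies that definition to the GPT-5.5 rows of the leaderboard preview for every task family; `GPT55_BY_HARNESS` and `harness_swing` are illustrative names, not part of the benchmark's tooling.

```python
# GPT-5.5 scores (percent) per harness, copied from the leaderboard preview.
GPT55_BY_HARNESS = {
    "Codex":    {"Repurpose": 30, "Sequencing": 26, "Repair": 30, "Assembly": 38},
    "OpenCode": {"Repurpose": 27, "Sequencing": 20, "Repair": 27, "Assembly": 37},
    "OpenClaw": {"Repurpose": 20, "Sequencing": 29, "Repair": 21, "Assembly": 18},
}


def harness_swing(scores_by_harness, family):
    """Best-minus-worst score across harnesses for one task family."""
    values = [scores[family] for scores in scores_by_harness.values()]
    return max(values) - min(values)


def all_swings(scores_by_harness):
    """Harness swing for every family present in the first harness entry."""
    families = next(iter(scores_by_harness.values())).keys()
    return {f: harness_swing(scores_by_harness, f) for f in families}


if __name__ == "__main__":
    for family, swing in all_swings(GPT55_BY_HARNESS).items():
        print(f"{family:11s} swing: {swing} pp")
```

On these rounded figures the Assembly swing is 20 points, matching the number quoted above, while the other three families swing by 9 to 10 points, so the size of the harness effect depends on the task family.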
