CEO-Bench
Can agents play the long game?
Authors
Haozhe Chen<br>Karthik Narasimhan<br>Zhuang Liu
Affiliation
Princeton University<br>Princeton University<br>Princeton University
Published
June 2026
Links
Code<br>Paper<br>Trajectories
We use cash balance as the performance metric. This plot shows cash balance over time for the best run of each model and for the rule-based baseline. *For Claude Fable 5, one run stopped because of a refusal. In the other two runs, requests sometimes fell back to Opus 4.8 because of refusals.
TL;DR
Today, agents execute individual tasks. Tomorrow, agents steer organizations toward long-term goals.
We introduce CEO-Bench to measure this steering intelligence. In CEO-Bench, agents operate a simulated AI startup for 500 days.
Most models struggle to finish above the $1M starting balance. Claude Fable 5, Claude Opus 4.8, and GPT-5.5 are the only evaluated models whose best run finishes above it, and only Claude Fable 5 does so in more than one run.
Introduction: From Task Intelligence to Steering Intelligence
A Story
Cupertino, 1997.
Apple was ninety days from bankruptcy. Inside a conference room at headquarters, the company's leaders faced the possibility that Apple might not survive.
Steve Jobs walked to the whiteboard and drew a simple grid: Consumer and Professional, Desktop and Portable. Four boxes to hold the whole company. He made the decision: Apple would build only for those four boxes.
It was a painful cut. Products disappeared. Teams were broken apart. But the decision gave Apple something it had lost: focus. The iMac came next. Then the iPod. Then the iPhone. A company near collapse became one of the most valuable companies in the world.
The Next Frontier: Steering Intelligence
Steve Jobs showed a kind of strategic intelligence that has appeared throughout history, driving some of humanity's most monumental achievements.
This kind of intelligence is fundamentally different from the intelligence of today's AI agents. Today, we build agents that are rapidly getting better at individual tasks such as coding and writing. To contribute more value, tomorrow's agents will need to steer organizations toward long-term goals.
We build CEO-Bench as a first step toward measuring Steering Intelligence.
Today, we measure an AI agent's ability to perform isolated tasks. The next frontier is measuring its ability to steer systems across long horizons toward distant goals.
Introducing CEO-Bench
In CEO-Bench, we aim to measure the combination of four core skills needed to steer systems through real-world challenges:
Navigating long horizons amid uncertainty
Acquiring information in noisy environments
Adapting to a changing world
Orchestrating multiple moving parts toward a coherent goal
We evaluate on a canonical real-world task: operating a simulated startup for 500 days.
We give agents $1M in starting cash and measure the cash balance at the end of the simulation as the performance metric. The agent operates through a programmable interface with access to business databases, company management tools, and social media. Outcomes are driven by a partially observable, noisy, and evolving market with delayed and coupled consequences.
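To make the metric concrete, here is a minimal sketch of how final cash balance could be summarized across runs. It is not the benchmark's actual scoring code: the `Run` record, the `summarize` helper, the model names, and every number in `example_runs` are hypothetical and chosen only for illustration. The only values taken from the text are the $1M starting balance and the use of end-of-simulation cash as the score.

```python
from dataclasses import dataclass

INITIAL_CASH = 1_000_000  # starting balance stated in the text, in dollars


@dataclass
class Run:
    """One simulated run (hypothetical record, not the benchmark's format)."""
    model: str
    daily_cash: list[float]  # cash balance at the end of each simulated day

    @property
    def final_cash(self) -> float:
        # The performance metric: cash balance at the end of the simulation.
        return self.daily_cash[-1]


def summarize(runs: list[Run]) -> dict[str, dict[str, float]]:
    """Per model: number of runs, best final cash, and runs ending above the start."""
    summary: dict[str, dict[str, float]] = {}
    for run in runs:
        entry = summary.setdefault(
            run.model,
            {"runs": 0, "best_final_cash": float("-inf"), "runs_above_initial": 0},
        )
        entry["runs"] += 1
        entry["best_final_cash"] = max(entry["best_final_cash"], run.final_cash)
        if run.final_cash > INITIAL_CASH:
            entry["runs_above_initial"] += 1
    return summary


# Made-up, shortened trajectories (three "days" instead of 500) for illustration.
example_runs = [
    Run("model-a", [1_000_000, 950_000, 1_200_000]),
    Run("model-a", [1_000_000, 800_000, 700_000]),
    Run("model-b", [1_000_000, 900_000, 400_000]),
]

summary = summarize(example_runs)
for model, stats in summary.items():
    print(model, stats)
```

In this toy data, `model-a` finishes above the starting balance in one of its two runs and `model-b` in none, which mirrors the two quantities reported above: the best run per model and the number of runs that end above $1M.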
How CEO-Bench Works
Running a startup requires coordinating many moving parts, which makes it a fitting canonical task for evaluating an agent's ability to steer complex decisions over a long horizon.
What an agent can do. Agents act weekly through 34 tools covering pricing, growth, product, operations, information acquisition, public communication, and enterprise sales.
In each simulated week, the agent can take an unlimited number of turns, acting through 34 tools in the categories shown in the table below. These categories cover pricing and plan design, growth and market expansion, product quality and research, reliability and support, information acquisition, public communication, and enterprise sales. Each tool accepts fine-grained structured arguments, so agents can compose a large space of possible policies.
Category<br>Actions<br>Example tools
Database query<br>Query 19 business SQL databases and conduct data analytics<br>query
Pricing and monetization<br>Set prices, usage quotas, discounts, and in-product ads<br>pricing.set_prices, pricing.set_usage_quotas
Growth and market expansion<br>Allocate targeted advertising spend and promotion across channels and customer groups<br>marketing.set_targeted_ad_spend, marketing.set_lead_promotion
Product quality and R&D<br>Choose model tiers, fund day-to-day development, and launch research projects<br>pricing.set_model_tiers, research.start_research_project
Operations and reliability<br>Buy infrastructure capacity and fund customer support<br>infrastructure.set_capacity_tier, analytics.set_targeted_ops_spend
Enterprise sales<br>Conduct multi-turn negotiations over price and plan with enterprise prospects and renewals<br>enterprise.send_enterprise_deal,...