CEO-Bench: Can AI run a simulated startup for 500 days?


CEO-Bench


Can agents play the long game?

Authors

Haozhe Chen, Karthik Narasimhan, Zhuang Liu

Affiliation

Princeton University (all authors)

Published

June 2026

Links

Code, Paper, Trajectories


We use cash balance as the performance metric. This plot shows cash balance over time for the best run of each model and for the rule-based baseline. *For Claude Fable 5, one run stopped due to refusal; in the other two runs, requests sometimes fell back to Opus 4.8 due to refusal.

TL;DR

Today, agents execute individual tasks. Tomorrow, agents steer organizations toward long-term goals.

We introduce CEO-Bench to measure this steering intelligence. In CEO-Bench, agents operate a simulated AI startup for 500 days.

Most models struggle to finish above the $1M starting balance. Claude Fable 5, Claude Opus 4.8, and GPT-5.5 are the only evaluated models that finish above the initial balance on their best run, and only Claude Fable 5 does so in more than one run.

Introduction: From Task Intelligence to Steering Intelligence

A Story

Cupertino, 1997.

Apple was ninety days from bankruptcy. Inside a conference room at headquarters, the company's leaders faced the possibility that Apple might not survive.

Steve Jobs walked to the whiteboard and drew a simple grid: Consumer and Professional, Desktop and Portable. Four boxes to hold the whole company. He made the decision: Apple would build only for those four boxes.

It was a painful cut. Products disappeared. Teams were broken apart. But the decision gave Apple something it had lost: focus. The iMac came next. Then the iPod. Then the iPhone. A company near collapse became one of the most valuable companies in the world.

The Next Frontier: Steering Intelligence

Steve Jobs showed a kind of strategic intelligence that has appeared throughout history, driving some of humanity's most monumental achievements.

This kind of intelligence is fundamentally different from the intelligence of today's AI agents. Today, we build agents that are rapidly getting better at individual tasks like coding and writing. To contribute more value, tomorrow's agents need to steer organizations toward long-term goals.

We build CEO-Bench as a first step toward measuring Steering Intelligence.

Today, we measure an AI agent's ability to perform isolated tasks. The next frontier is measuring its ability to steer systems across long horizons toward distant goals.

Introducing CEO-Bench

In CEO-Bench, we aim to measure the combination of four core skills needed to steer systems through real-world challenges:

Navigating long horizons amid uncertainty

Acquiring information in noisy environments

Adapting to a changing world

Orchestrating multiple moving parts toward a coherent goal

We evaluate on a canonical real-world task: operating a simulated startup for 500 days.

We give agents $1M in starting cash and use the cash balance at the end of the simulation as the performance metric. The agent operates through a programmable interface with access to business databases, company management tools, and social media. Outcomes are driven by a partially observable, noisy, and evolving market with delayed and coupled consequences.
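As a rough illustration of how this metric could be tallied across runs, here is a minimal sketch. The `Run` record, the `summarize` helper, and the model names are hypothetical, and the balances are placeholder numbers, not benchmark results. Only the $1M starting balance comes from the text.

```python
from dataclasses import dataclass

STARTING_CASH = 1_000_000  # $1M starting balance, as stated in the text


@dataclass
class Run:
    """One simulated 500-day run (hypothetical record type)."""
    model: str
    final_cash: float  # cash balance at the end of the simulation


def summarize(runs):
    """Per model: best final balance and number of runs ending above the start."""
    summary = {}
    for run in runs:
        entry = summary.setdefault(
            run.model, {"best": run.final_cash, "runs_above_start": 0}
        )
        entry["best"] = max(entry["best"], run.final_cash)
        if run.final_cash > STARTING_CASH:
            entry["runs_above_start"] += 1
    return summary


# Placeholder numbers for illustration only; these are NOT benchmark results.
example_runs = [
    Run("model-a", 1_250_000.0),
    Run("model-a", 900_000.0),
    Run("model-b", 400_000.0),
]
summary = summarize(example_runs)
for model, stats in summary.items():
    print(model, stats)
```

Tracking both the best run and the count of runs above the starting balance mirrors the two comparisons made in the TL;DR: which models ever finish above $1M, and which do so more than once.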

How CEO-Bench Works

Running a startup requires coordinating many moving parts, which makes it a fitting canonical task for evaluating an agent's ability to steer complex decisions over a long horizon.

What an agent can do. Agents act weekly through 34 tools covering pricing, growth, product, operations, information acquisition, public communication, and enterprise sales.

In each simulated week, the agent can take an unlimited number of turns, calling any of the 34 tools in the categories shown in the table below. These categories cover pricing and plan design, growth and market expansion, product quality and research, reliability and support, information acquisition, public communication, and enterprise sales. Each tool accepts fine-grained structured arguments, so agents can compose a large space of possible policies.

| Category | Actions | Example tools |
| --- | --- | --- |
| Database query | Query 19 business SQL databases and conduct data analytics | query |
| Pricing and monetization | Set prices, usage quotas, discounts, and in-product ads | pricing.set_prices, pricing.set_usage_quotas |
| Growth and market expansion | Allocate targeted advertising spend and promotion across channels and customer groups | marketing.set_targeted_ad_spend, marketing.set_lead_promotion |
| Product quality and R&D | Choose model tiers, fund day-to-day development, and launch research projects | pricing.set_model_tiers, research.start_research_project |
| Operations and reliability | Buy infrastructure capacity and fund customer support | infrastructure.set_capacity_tier, analytics.set_targeted_ops_spend |
| Enterprise sales | Conduct multi-turn negotiations over price and plan with enterprise prospects and renewals | enterprise.send_enterprise_deal,... |
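To make the weekly action loop concrete, here is a minimal sketch of how tool calls with structured arguments might be represented and collected week by week. The tool names are the examples listed in the table. Everything else is an assumption for illustration: the `ToolCall` record, the `validate` and `run_week` helpers, the argument names, the toy policy, and the choice to count a trailing partial week as a decision step. None of it is the benchmark's actual interface.

```python
import math
from dataclasses import dataclass, field

SIM_DAYS = 500
DAYS_PER_WEEK = 7
# Assumption: a trailing partial week still counts as a decision step.
NUM_WEEKS = math.ceil(SIM_DAYS / DAYS_PER_WEEK)

# Example tool names from the table; the benchmark exposes 34 tools in total.
KNOWN_TOOLS = {
    "query",
    "pricing.set_prices",
    "pricing.set_usage_quotas",
    "marketing.set_targeted_ad_spend",
    "marketing.set_lead_promotion",
    "pricing.set_model_tiers",
    "research.start_research_project",
    "infrastructure.set_capacity_tier",
    "analytics.set_targeted_ops_spend",
    "enterprise.send_enterprise_deal",
}


@dataclass
class ToolCall:
    """A single structured tool invocation (hypothetical shape)."""
    name: str
    args: dict = field(default_factory=dict)


def validate(call):
    """Reject calls to tools that are not in the known set."""
    if call.name not in KNOWN_TOOLS:
        raise ValueError(f"unknown tool: {call.name}")
    return call


def run_week(week, policy):
    """Collect and validate the tool calls a policy issues for one week."""
    return [validate(call) for call in policy(week)]


def toy_policy(week):
    # Hypothetical argument names; the real schemas are not given in the text.
    calls = [ToolCall("query", {"sql": "SELECT 1"})]
    if week == 0:
        calls.append(
            ToolCall("pricing.set_prices", {"plan": "pro", "monthly_usd": 20})
        )
    return calls


log = [run_week(week, toy_policy) for week in range(NUM_WEEKS)]
print(NUM_WEEKS, len(log[0]), len(log[1]))
```

In this sketch a policy is just a function from the week index to a list of calls. A real agent would instead decide each call after seeing the result of the previous one, since the text allows unlimited turns per week.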
