What 50k Runs of a 5-Line Eval Taught Us

June 19, 2026 by VS Code Eval Team, @code

Over the last six months, we have run the same tiny eval more than 50,000 times. It gives the VS Code agent one instruction: write a string to a file. No large codebase to understand, no test suite to debug, no architectural decision to make. It is our smoke test, a quick way to confirm that the end-to-end model interaction still works.

A task this simple gives us an immediate read on the health of the system: how reliably the agent finishes the work and what kinds of failures show up in practice. We didn't intend it to be more than that. But at this scale, it became a surprisingly rich source of insight into how models approach even the simplest request.

In our previous post, we introduced VSC-Bench, the offline evaluation suite we use to measure agent behavior in VS Code. In this blog post, we look at how models solve a simple task and what it tells us about efficiency, model selection, and the value of small, stable evals.

The five-line eval

A simple task is valuable precisely because it removes variables. When the work is unambiguous and the correct answer is fixed, anything that changes between runs comes from the model or the system around it, not from the task itself. That makes a small eval a sensitive instrument: it reacts to harness regressions, infrastructure incidents, and differences in model behavior, without the noise of a complex problem to interpret.

The say_hello task we use for this is built around that idea. Every run starts in the same empty workspace, with the same tools and the same fixed prompt, using our VS Code agent harness. The task asks the agent to "Add HELLO to HELLO.txt" and checks two assertions: that the file exists and that it contains the expected content.

promptSteps:
  - text: Add HELLO to HELLO.txt.
    assertions:
      - check: file_exists("HELLO.txt")
      - check: file_contains("HELLO.txt", "HELLO")
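To make the two assertions concrete, here is a minimal Python sketch of what checks like these might do. It is not the harness's actual code: the function bodies, the `workspace` parameter, and the `run_assertions` helper are assumptions made for illustration, and only the check names come from the eval definition.

```python
import tempfile
from pathlib import Path


def file_exists(workspace: Path, name: str) -> bool:
    """Return True if `name` is a regular file inside the workspace."""
    return (workspace / name).is_file()


def file_contains(workspace: Path, name: str, expected: str) -> bool:
    """Return True if the file exists and its text includes `expected`."""
    path = workspace / name
    return path.is_file() and expected in path.read_text(encoding="utf-8")


def run_assertions(workspace: Path) -> bool:
    """Hypothetical helper: evaluate both say_hello assertions."""
    return file_exists(workspace, "HELLO.txt") and file_contains(
        workspace, "HELLO.txt", "HELLO"
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp)
        print(run_assertions(ws))  # the workspace starts out empty
        (ws / "HELLO.txt").write_text("HELLO", encoding="utf-8")
        print(run_assertions(ws))  # after the file has been created
```

Under this reading, a substring check means a file containing `HELLO` plus a trailing newline would still pass, which is one reason different solution paths can reach the same outcome.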

Because say_hello runs as a smoke test before every benchmark suite, it quietly accumulated 50,974 runs across 30 models over six months. That volume turned a basic sanity check into a useful dataset on how differently models handle even the simplest work.

A developer doing this task would recognize that the workspace is empty, create HELLO.txt, and add the requested content. In the most direct VS Code agent path, this translates into a single create_file tool call with HELLO as the file content.

tool: create_file
args: {
  "filePath": "/path/to/workspace/HELLO.txt",
  "content": "HELLO"
}

Note: The VS Code eval harness includes the workspace state in the initial prompt context, so we treat existence checks as redundant work that the model should not need to perform.
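One way to recognize the direct path in a recorded run is to check that the trace consists of exactly one `create_file` call that writes HELLO to HELLO.txt. The sketch below assumes a simplified trace format. The `ToolCall` class, the `is_direct_path` function, and the tool names `read_file` and `apply_patch` in the usage lines are hypothetical. Only `create_file` and its `filePath` and `content` arguments come from the example above.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation in a run trace."""
    tool: str
    args: dict


def is_direct_path(calls: list[ToolCall]) -> bool:
    """True when the run is a single create_file call writing HELLO to HELLO.txt."""
    if len(calls) != 1:
        return False  # any extra planning, reading, or searching disqualifies the run
    call = calls[0]
    if call.tool != "create_file":
        return False
    path = PurePosixPath(call.args.get("filePath", ""))
    # Assumption: surrounding whitespace in the content is tolerated.
    return path.name == "HELLO.txt" and call.args.get("content", "").strip() == "HELLO"


if __name__ == "__main__":
    create = ToolCall(
        "create_file",
        {"filePath": "/path/to/workspace/HELLO.txt", "content": "HELLO"},
    )
    print(is_direct_path([create]))
    # "read_file" is an illustrative tool name, not one confirmed by the post.
    print(is_direct_path([ToolCall("read_file", {"filePath": "HELLO.txt"}), create]))
```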

How models solve say_hello

As expected, the say_hello task is easy enough that all models pass it most of the time. The interesting part is not whether they can do the work, but how they do it. Can the model recognize that this is a basic request that only requires a simple solution? Or does it still treat it like a complex problem that requires planning, exploration, and search?

To establish a baseline, we filtered for passing runs that used this one-tool-call path and looked at the lowest output-token counts in that group. Those runs averaged roughly 50 output tokens, including the tool-call structure. We then measured how often each model took that path.

One model takes the direct path every time. The broader trend is what stands out: a few models often take the direct path, most do so only occasionally, and five never do.

At the top, Model-A stands alone. It goes straight to file creation in 100% of passing runs, using a single tool call every time, with no planning or exploring first. Model-B and Model-C follow at 73% and 71%, respectively.

The large middle cluster, Model-D through Model-P, takes the direct path somewhere between 19% and 52% of the time. These models can recognize a simple task, but not consistently. In the remaining runs, they add a small step first, such as reading internal state or doing light workspace exploration, before creating the file.

Below them, Model-Q through Model-X rarely take the direct path, doing so in 0.2% to 6% of passing runs, with five models falling below 1%. For these models, extra work is the default. They almost always plan, explore, or search before producing the same five-character file.

At the bottom, five models, Model-Y through Model-AC, never take the direct path across thousands of passing runs. They always do something else first: plan, reach for a patch tool instead of simple file creation, search and plan, or narrate at length before creating the file. For them, even the simplest request triggers the full machinery of a complex one.
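The grouping above can be expressed as a small bucketing function over each model's direct-path rate. This is only a sketch. The tier labels and the 10% and 60% cutoffs are assumptions, chosen to fall in the gaps between the reported ranges (6% to 19%, and 52% to 71%). They are not thresholds stated by the analysis.

```python
def tier(direct_rate: float) -> str:
    """Map a direct-path rate (0.0 to 1.0) to an illustrative tier label."""
    if not 0.0 <= direct_rate <= 1.0:
        raise ValueError("direct_rate must be between 0.0 and 1.0")
    if direct_rate == 1.0:
        return "always"        # direct path in every passing run
    if direct_rate >= 0.60:    # assumed cutoff between 52% and 71%
        return "often"
    if direct_rate >= 0.10:    # assumed cutoff between 6% and 19%
        return "occasionally"
    if direct_rate > 0.0:
        return "rarely"
    return "never"             # no direct-path runs at all


if __name__ == "__main__":
    # Rates taken from the figures quoted in the text.
    for rate in (1.0, 0.73, 0.52, 0.19, 0.06, 0.002, 0.0):
        print(f"{rate:.3f} -> {tier(rate)}")
```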

All models create the file with the right content, but they reach the same outcome...
