Pros and Cons of the Harbor Framework

What We Learned About The Harbor Framework From More than 5,700 Benchmark Runs.

neurometric’s Substack


The infrastructure you run on matters a lot.

Rob May Jun 10, 2026

We ran a large-scale benchmark study to evaluate AI agent performance across a broad model catalog. What we expected to produce was a leaderboard. What we got instead was a lesson in how evaluation infrastructure can quietly invalidate your results before you ever read them.

The Numbers

5,791 benchmark runs. 15,750 individual graded tasks. That’s a meaningful dataset. But the headline isn’t a model score — it’s that the framework couldn’t reliably finish. 53% of runs errored out. An errored run produces nothing: zero tasks completed, no usable result. More than half our compute returned empty-handed, not because the models failed, but because the harness crashed. This matters beyond the obvious waste. When your error rate crosses 50%, you’re no longer sampling performance — you’re sampling infrastructure luck.

The Grading Problem

The errors weren’t just crashes. They also silently distorted the results that did come back. In 765 cases, the agent produced exactly the correct answer, and Harbor logged the run as an error anyway. That’s one in five of all correct results misclassified as failures. A grading system that can’t recognize a correct answer isn’t grading. It’s noise with a spreadsheet attached. The implication: any model rankings produced under these conditions would systematically understate performance, with the degree of understatement varying arbitrarily across runs. You can’t normalize your way out of that.
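One way to catch this kind of misclassification is to grade each answer independently of the status the harness reports, then compare the two. The sketch below is illustrative only: `RunRecord`, its field names, and the status strings are hypothetical and are not Harbor's actual log schema, and the records are toy data rather than the study's.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    # All field names are hypothetical; adapt them to your own run logs.
    harness_status: str       # e.g. "passed", "failed", or "error"
    expected: str             # reference answer for the task
    actual: Optional[str]     # what the agent produced, None if nothing was captured


def answer_is_correct(rec: RunRecord) -> bool:
    """Grade the answer on its own, ignoring what the harness reported."""
    return rec.actual is not None and rec.actual.strip() == rec.expected.strip()


def audit(records: list[RunRecord]) -> dict:
    """Count correct answers and how many of them the harness logged as errors."""
    correct = [r for r in records if answer_is_correct(r)]
    errored = [r for r in correct if r.harness_status == "error"]
    rate = len(errored) / len(correct) if correct else 0.0
    return {
        "correct": len(correct),
        "correct_but_errored": len(errored),
        "misclassification_rate": rate,
    }


# Toy data: five correct answers (one logged as an error) and one wrong answer.
records = (
    [RunRecord("passed", "42", "42") for _ in range(4)]
    + [RunRecord("error", "42", "42\n")]
    + [RunRecord("failed", "42", "41")]
)
summary = audit(records)
print(summary)
```

A nonzero `correct_but_errored` count means the harness status and the answer quality disagree, which is the signal worth investigating before trusting any ranking.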

Dataset Coverage

Harbor ships with 80 datasets. We got 42 of them to run at all, just over half. Of those 42, 17 never produced a single scored result. That’s 17 datasets that opened, ran, and returned nothing actionable. Effective coverage: 25 of the 80 datasets, roughly 31% of the catalog, produced data you could learn from. You cannot benchmark across a catalog when more than two-thirds of it is structurally inoperable.
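The coverage arithmetic can be reproduced directly from the three figures quoted above. This is a small worked calculation, not part of the original analysis; the variable names are our own.

```python
# Figures quoted in the post.
TOTAL_DATASETS = 80
RAN_AT_ALL = 42
RAN_BUT_NEVER_SCORED = 17

# Datasets that produced at least one scored result.
scored = RAN_AT_ALL - RAN_BUT_NEVER_SCORED

ran_fraction = RAN_AT_ALL / TOTAL_DATASETS        # share that ran at all
effective_coverage = scored / TOTAL_DATASETS      # share that yielded usable data
unusable_fraction = 1 - effective_coverage        # share that yielded nothing

print(f"scored datasets:    {scored}")
print(f"ran at all:         {ran_fraction:.4f}")
print(f"effective coverage: {effective_coverage:.4f}")
print(f"unusable:           {unusable_fraction:.4f}")
```

Effective coverage comes to 25 of 80 datasets, so the unusable share (55 of 80) is a little over two-thirds.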

The Hello-World Test

The clearest evidence is the simplest task. Harbor’s hello-world benchmark asks the agent to create a file containing "Hello, world!" That’s it. No reasoning. No retrieval. No multi-step planning. Just: write a file. We ran it across 1,526 agent+model combinations. 645 of them (42%) never completed it cleanly even once. In one case, GPT-5.4 wrote the file perfectly. The run still errored. The intelligence was never the bottleneck. The harness was.
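A task this simple can be verified outside the harness by inspecting the agent's working directory after the run. The sketch below assumes the file is named `hello.txt` and that trailing whitespace is acceptable; both are our assumptions, not details confirmed by the benchmark's definition.

```python
import tempfile
from pathlib import Path

EXPECTED_TEXT = "Hello, world!"


def hello_world_passed(workdir: Path, filename: str = "hello.txt") -> bool:
    """Check the artifact itself, regardless of what status the harness reported.

    The default filename is hypothetical; use whatever the task actually specifies.
    """
    target = workdir / filename
    if not target.is_file():
        return False
    # Tolerate a trailing newline or surrounding whitespace (an assumption).
    return target.read_text(encoding="utf-8").strip() == EXPECTED_TEXT


# Demonstration in a throwaway directory standing in for an agent workspace.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    before = hello_world_passed(workdir)          # no file written yet
    (workdir / "hello.txt").write_text("Hello, world!\n", encoding="utf-8")
    after = hello_world_passed(workdir)           # file now exists with the right text

print(before, after)
```

Running a check like this against the workspace of an errored run separates "the agent failed the task" from "the harness failed the run".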

What This Means for AI Benchmarking

Benchmark infrastructure is load-bearing. It doesn’t just measure performance — it defines what counts as performance. When the harness fails silently, misclassifies correct answers, and drops the majority of its own datasets, the output isn’t a measurement. It’s a corrupted signal that looks like data. The field has spent enormous energy debating which benchmarks best capture model capability. That’s the right debate to have — once the execution layer can be trusted. Our results suggest the execution layer deserves a lot more scrutiny than it typically gets. A model that writes the correct answer deserves to have that answer counted. That’s table stakes for any evaluation system. When it isn’t met, the benchmark isn’t measuring models. It’s measuring the framework.
