What We Learned About The Harbor Framework From More than 5,700 Benchmark Runs.
neurometric’s Substack
The infrastructure you run on matters a lot.
Rob May<br>Jun 10, 2026
We ran a large-scale benchmark study to evaluate AI agent performance across a broad model catalog. What we expected to produce was a leaderboard. What we got instead was a lesson in how evaluation infrastructure can quietly invalidate your results before you ever read them.
The Numbers
5,791 benchmark runs. 15,750 individual graded tasks.<br>That’s a meaningful dataset. But the headline isn’t a model score — it’s that the framework couldn’t reliably finish.<br>53% of runs errored out. An errored run produces nothing: zero tasks completed, no usable result. More than half our compute returned empty-handed, not because the models failed, but because the harness crashed.<br>This matters beyond the obvious waste. When your error rate crosses 50%, you’re no longer sampling performance — you’re sampling infrastructure luck.
The Grading Problem
The errors weren’t just crashes. They were silent distortions inside the results that did come back.<br>In 765 cases, the agent produced exactly the correct answer, and Harbor logged it as an error anyway. That’s one in five of all correct results misclassified as failures. A grading system that can’t recognize its own correct answers isn’t grading. It’s noise with a spreadsheet attached.<br>The implication: any model rankings produced under these conditions would systematically understate performance, with the degree of understatement varying arbitrarily across runs. You can’t normalize your way out of that.
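One way to surface this kind of misclassification is to re-check each answer independently of the status the harness recorded, then cross-tabulate the two. The sketch below is a minimal illustration, not Harbor's grader and not the exact method behind the 765 figure. The `GradedTask` record, its field names, the status labels, and the exact-match rule are all assumptions made for the example, and the sample data is made up.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedTask:
    # All field names and status labels are hypothetical, not Harbor's schema.
    task_id: str
    harness_status: str  # assumed labels: "pass", "fail", "error"
    agent_answer: str
    expected_answer: str


def reconcile(tasks):
    """Count tasks by (harness_status, independently_correct)."""
    counts = Counter()
    for task in tasks:
        # Assumption: correctness is an exact match after trimming whitespace.
        correct = task.agent_answer.strip() == task.expected_answer.strip()
        counts[(task.harness_status, correct)] += 1
    return counts


def misclassified_share(counts):
    """Share of independently correct answers that the harness logged as errors."""
    total_correct = sum(n for (_, correct), n in counts.items() if correct)
    if total_correct == 0:
        return 0.0
    return counts[("error", True)] / total_correct


# Made-up sample: five correct answers, one of them logged as an error.
sample = [
    GradedTask("t1", "pass", "42", "42"),
    GradedTask("t2", "pass", "ok", "ok"),
    GradedTask("t3", "error", "Hello, world!", "Hello, world!\n"),
    GradedTask("t4", "pass", "7", "7"),
    GradedTask("t5", "pass", "x", "x"),
    GradedTask("t6", "fail", "no", "yes"),
]
counts = reconcile(sample)
share = misclassified_share(counts)
print(f"{share:.0%} of correct answers were logged as errors")
```

The useful cell is `("error", True)`: answers that pass the independent check but were recorded as errors. If that cell is large, the harness status is not a safe proxy for correctness.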
Dataset Coverage
Harbor ships with 80 datasets. We got 42 of them to run at all, just over half. Of those 42, 17 never produced a single scored result. That’s 17 datasets that opened, ran, and returned nothing actionable.<br>Effective coverage: 25 of 80 datasets, roughly 31% of the catalog, produced data you could learn from. You cannot benchmark across a catalog when more than two-thirds of it is structurally inoperable.
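The coverage figure is simple arithmetic on the three counts above, and it is worth writing down so the funnel is explicit. The snippet below only restates the numbers quoted in this post. The function and variable names are illustrative.

```python
# Figures quoted in this post; names are illustrative.
SHIPPED = 80           # datasets Harbor ships with
RAN = 42               # datasets that ran at all
RAN_BUT_UNSCORED = 17  # ran, but never produced a scored result


def coverage_funnel(shipped, ran, ran_but_unscored):
    """Return the dataset funnel as counts and shares of the full catalog."""
    usable = ran - ran_but_unscored
    return {
        "usable": usable,
        "ran_share": ran / shipped,
        "usable_share": usable / shipped,
        "inoperable_share": 1 - usable / shipped,
    }


funnel = coverage_funnel(SHIPPED, RAN, RAN_BUT_UNSCORED)
for name, value in funnel.items():
    print(name, value)
```

With these inputs, 25 datasets are usable, which is 31.25% of the catalog, and the remaining 68.75% is what the post rounds to "two-thirds".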
The Hello-World Test
The clearest evidence is the simplest task.<br>Harbor’s hello-world benchmark asks the agent to create a file containing "Hello, world!" That’s it. No reasoning. No retrieval. No multi-step planning. Just: write a file.<br>We ran it across 1,526 agent+model combinations. 645 of them, or 42%, never completed it cleanly even once.<br>In one case, GPT-5.4 wrote the file perfectly. The run still errored.<br>The intelligence was never the bottleneck. The harness was.
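A case like the GPT-5.4 one only shows up if the artifact is checked separately from the run's exit status. Here is a minimal sketch of that idea, assuming the task leaves a text file on disk. The file name `hello.txt`, the category labels, and the choice to ignore trailing whitespace are assumptions for illustration, not how Harbor grades this task.

```python
import tempfile
from pathlib import Path

EXPECTED = "Hello, world!"


def artifact_is_correct(path: Path, expected: str = EXPECTED) -> bool:
    """True if the file is readable and matches, ignoring trailing whitespace."""
    try:
        return path.read_text(encoding="utf-8").rstrip() == expected
    except (OSError, UnicodeDecodeError):
        # Missing, unreadable, or non-UTF-8 files count as incorrect.
        return False


def classify(run_errored: bool, path: Path) -> str:
    """Combine the harness's run status with an independent artifact check."""
    ok = artifact_is_correct(path)
    if ok and run_errored:
        return "correct_but_run_errored"
    if ok:
        return "correct"
    if run_errored:
        return "errored_without_correct_artifact"
    return "incorrect"


# Usage: simulate a run that wrote the file correctly but was logged as errored.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "hello.txt"  # hypothetical file name
    target.write_text("Hello, world!\n", encoding="utf-8")
    label = classify(run_errored=True, path=target)
    missing = classify(run_errored=True, path=Path(tmp) / "missing.txt")

print(label)
print(missing)
```

Keeping the two signals separate means a harness crash after the work is done lands in its own bucket instead of being counted against the model.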
What This Means for AI Benchmarking
Benchmark infrastructure is load-bearing. It doesn’t just measure performance — it defines what counts as performance. When the harness fails silently, misclassifies correct answers, and drops the majority of its own datasets, the output isn’t a measurement. It’s a corrupted signal that looks like data.<br>The field has spent enormous energy debating which benchmarks best capture model capability. That’s the right debate to have — once the execution layer can be trusted. Our results suggest the execution layer deserves a lot more scrutiny than it typically gets.<br>A model that writes the correct answer deserves to have that answer counted. That’s table stakes for any evaluation system. When it isn’t met, the benchmark isn’t measuring models. It’s measuring the framework.