Kicking the Tyres on Harbor for Agent Evals
09 Apr 2026<br>by<br>· AI, Claude Code, Harbor<br>at https://rmoff.net/2026/04/09/kicking-the-tyres-on-harbor-for-agent-evals/
After cobbling together my own eval for Claude, I was interested to discover Harbor.<br>It's described as:
A framework for evaluating and optimizing agents and models in container environments.
Which sounds kinda cool, right?
It ships with a bunch of pre-created tests and benchmarks, from the mandatory hello-world to more complex, multi-task examples such as terminal-bench.
Harbor's unit of execution is a task, which is basically a prompt for a coding agent (such as Claude Code).<br>Harbor works with multiple coding agents, and multiple models.<br>Which is basically what it says on the tin above, right?
Here's an example task:
Create a file called hello.txt with "Hello, world!" as the content.
Trying it out
Let's try out hello-world:
harbor run --model anthropic/claude-sonnet-4-6 \ (1)<br>--agent claude-code \ (2)<br>--dataset hello-world \ (3)
(1) Use the Sonnet 4.6 model
(2) Run the test using Claude Code
(3) Run the pre-packaged "Hello, World" test
After a short time this completes:
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓<br>┃ Metric              ┃ Value                           ┃<br>┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩<br>│ Agent               │ claude-code (claude-sonnet-4-6) │<br>│ Dataset             │ hello-world                     │<br>│ Trials              │ 1                               │<br>│ Errors              │ 0                               │<br>│                     │                                 │<br>│ Mean                │ 1.000                           │<br>│                     │                                 │<br>│ Reward Distribution │                                 │<br>│ reward = 1.0        │ 1                               │<br>└─────────────────────┴─────────────────────────────────┘
Harbor ships (see what I did there?) with a nice dashboard for exploring test runs.<br>Spin it up by pointing it at the output folder (jobs, in this case):
harbor view jobs
Status and timing breakdown:
Under the covers, Harbor spins up a Docker container, within which Claude runs with --dangerously-skip-permissions so that it can go about its business without any of that pesky permission seeking.<br>It takes the defined task or dataset prompt, and runs it, as we can see here:
Scoring
A task's performance is scored using a verifier that's part of the task definition.<br>For the above "hello world", all we need to do is check that the agent (a) created the file with the correct name and (b) gave it the correct content.<br>Which is exactly what this Python script does:
from pathlib import Path
def test_hello_file_exists():<br>    # Check (a): the agent created the file at the expected path<br>    hello_path = Path("/app/hello.txt")
    assert hello_path.exists(), f"File {hello_path} does not exist"
def test_hello_file_contents():<br>    # Check (b): the file holds the expected text (surrounding whitespace ignored)<br>    hello_path = Path("/app/hello.txt")
    content = hello_path.read_text().strip()<br>    expected_content = "Hello, world!"
    assert content == expected_content, (<br>        f"File content is '{content}', expected '{expected_content}'"<br>    )
Its wrapper script turns the test result into a pass/fail score (either it worked, or it didn't):
if [ $? -eq 0 ]; then<br>echo 1 > /logs/verifier/reward.txt<br>else<br>echo 0 > /logs/verifier/reward.txt<br>fi
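The same pass/fail logic can be sketched in Python. This is only an illustration, not Harbor's own code: the function name `score_hello` and its two path parameters are assumptions made here so that the sketch can run outside a container, whereas the real verifier uses the fixed paths `/app/hello.txt` and `/logs/verifier/reward.txt` shown above.

```python
import tempfile
from pathlib import Path

EXPECTED = "Hello, world!"


def score_hello(app_dir: Path, reward_file: Path) -> int:
    """Write 1 to reward_file if hello.txt exists with the expected content, else 0.

    Hypothetical helper: it folds the two pytest checks and the shell
    wrapper into one function for illustration.
    """
    hello = app_dir / "hello.txt"
    passed = hello.is_file() and hello.read_text().strip() == EXPECTED
    reward = 1 if passed else 0
    reward_file.parent.mkdir(parents=True, exist_ok=True)
    reward_file.write_text(f"{reward}\n")  # same effect as `echo 1 > reward.txt`
    return reward


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        reward_file = root / "logs" / "verifier" / "reward.txt"
        print(score_hello(root, reward_file))  # hello.txt does not exist yet
        (root / "hello.txt").write_text("Hello, world!\n")
        print(score_hello(root, reward_file))  # hello.txt now matches
```

Keeping the reward as a single number in a file is what lets the verifier be written in any language, as long as something ends up in the reward file.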
If you run the test multiple times, you'll get a score (reward) for each trial; unsurprisingly, "hello-world" doesn't pose any challenges or show any variability.
So that's Hello World… what about the Real World?
Using it with dbt
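As a rough illustration of how per-trial rewards roll up into the Mean and Reward Distribution rows of the summary table, here is a minimal sketch. The `summarise` function and the sample reward lists are made up for this example; it is not how Harbor itself computes the figures.

```python
from collections import Counter
from statistics import mean


def summarise(rewards: list[float]) -> dict:
    """Aggregate per-trial rewards into a count, a mean, and a distribution."""
    if not rewards:
        raise ValueError("need at least one trial reward")
    return {
        "trials": len(rewards),
        "mean": mean(rewards),
        # Maps each distinct reward value to the number of trials that got it
        "distribution": dict(Counter(rewards)),
    }


if __name__ == "__main__":
    # Hypothetical runs: one with no variability, one with a single failure
    print(summarise([1.0, 1.0, 1.0]))
    print(summarise([1.0, 0.0, 1.0, 1.0]))
```

With a binary reward like hello-world's, the mean is simply the pass rate across trials.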
What drove me to look at Harbor was curiosity about whether I could have used it in place of my hacky, homebrew, bespoke, and artisanal scripts, and if so, what that would look like.
There is an imbalance here in that I now know more about evals, deterministic testing and LLM-as-judge than I did before creating my custom harness.<br>If I were to write it again, it'd be a lot cleaner, I'm sure.<br>So almost by definition, something like Harbor is probably going to be better.
As you'd expect by now, I didn't write the dbt task myself; I told Claude about my previous work, and had it build a Harbor-compliant task.<br>Its key components are:
Component<br>Description
Dockerfile<br>Installs dbt-duckdb and the dbt-agent-skills in the same container that Claude Code will run in.
instruction.md<br>The same prompt as before:<br>[…]<br>Build a dbt project using DuckDB for this data using idiomatic patterns and good practices.<br>[…]<br>Run dbt build to verify your work. If it fails, fix the errors and re-run until it passes.
tests/test.sh<br>Verifier script, which does several things:<br>Deterministic checks (similar to the ones I used before: is the dbt project there, does it have models defined, etc.)<br>Non-deterministic checks, via LLM-as-judge<br>If the LLM-as-judge succeeds, uses its score for the Harbor reward; if it fails, falls back on the deterministic checks score
tests/llm_judge.py, tests/rubric.md<br>Script to call out to an LLM to judge the work, using the rubric provided (similar to the one used before)
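Here is a minimal sketch of what deterministic checks of this kind might look like. It is not the actual tests/test.sh: the function name, the two checks chosen, and the equal weighting between them are assumptions for illustration. The only things taken from dbt itself are the `dbt_project.yml` file and the default `models/` directory.

```python
import tempfile
from pathlib import Path


def deterministic_score(project_dir: Path) -> float:
    """Return the fraction of structural checks that the project passes."""
    models_dir = project_dir / "models"
    checks = {
        # Is the dbt project there?
        "project_file_exists": (project_dir / "dbt_project.yml").is_file(),
        # Does it have at least one SQL model defined, at any depth?
        "has_sql_models": models_dir.is_dir() and any(models_dir.rglob("*.sql")),
    }
    return sum(checks.values()) / len(checks)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp)
        print(deterministic_score(project))  # empty directory
        (project / "dbt_project.yml").write_text("name: demo\n")
        (project / "models").mkdir()
        (project / "models" / "stg_orders.sql").write_text("select 1 as id\n")
        print(deterministic_score(project))  # both checks now pass
```

Checks like these are cheap and repeatable, which is why they make a sensible fallback when the judge call fails.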
There are plenty of gaps in this, such as the reward using only the non-deterministic score whenever the judge succeeds.<br>The final Harbor reward value should probably be a weighted version of the deterministic and non-deterministic verification.<br>However, this was more about understanding the scope of Harbor than building the perfect test.
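To make the difference concrete, here is a sketch of the fallback behaviour next to one possible weighted alternative. Both function names and the default 50/50 weighting are assumptions for illustration, scores are assumed to lie between 0 and 1, and `None` stands in for a judge call that failed.

```python
from typing import Optional


def fallback_reward(deterministic: float, judge: Optional[float]) -> float:
    """Use the judge's score when there is one, else the deterministic score."""
    return deterministic if judge is None else judge


def weighted_reward(
    deterministic: float,
    judge: Optional[float],
    judge_weight: float = 0.5,  # hypothetical default, not a recommendation
) -> float:
    """Blend both scores; with no judge score, use the deterministic one alone."""
    if not 0.0 <= judge_weight <= 1.0:
        raise ValueError("judge_weight must be between 0 and 1")
    if judge is None:
        return deterministic
    return (1.0 - judge_weight) * deterministic + judge_weight * judge


if __name__ == "__main__":
    # A project that fails half the structural checks but impresses the judge
    print(fallback_reward(0.5, 0.9))
    print(weighted_reward(0.5, 0.9))
```

With the fallback approach a high judge score can hide failed structural checks entirely, whereas the weighted version lets both signals pull on the final reward.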
With this in place I could...