Kicking the Tyres on Harbor for Agent Evals

09 Apr 2026 · Tags: AI, Claude Code, Harbor · https://rmoff.net/2026/04/09/kicking-the-tyres-on-harbor-for-agent-evals/


After cobbling together my own eval for Claude, I was interested to discover Harbor. It’s described as:

A framework for evaluating and optimizing agents and models in container environments.

Which sounds kinda cool, right?

It ships with a bunch of pre-created tests and benchmarks, ranging from the mandatory hello-world to more complex, multi-task examples such as terminal-bench.

Harbor’s unit of execution is a task, which is basically a prompt for a coding agent (such as Claude Code). Harbor works with multiple coding agents, and multiple models. Which is basically what it says on the tin above, right?

Here’s an example task:

Create a file called hello.txt with "Hello, world!" as the content.

Trying it out

Let’s try out hello-world:

harbor run --model anthropic/claude-sonnet-4-6 \
           --agent claude-code \
           --dataset hello-world

--model anthropic/claude-sonnet-4-6: use the Sonnet 4.6 model

--agent claude-code: run the test using Claude Code

--dataset hello-world: run the pre-packaged "Hello, World" test

After a short time this completes:

┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Metric              ┃ Value                           ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Agent               │ claude-code (claude-sonnet-4-6) │
│ Dataset             │ hello-world                     │
│ Trials              │ 1                               │
│ Errors              │ 0                               │
│                     │                                 │
│ Mean                │ 1.000                           │
│                     │                                 │
│ Reward Distribution │                                 │
│   reward = 1.0      │ 1                               │
└─────────────────────┴─────────────────────────────────┘

Harbor ships (see what I did there? 😉) with a nice dashboard for exploring test runs. Spin it up by pointing it at the output folder (jobs, in this case):

harbor view jobs

Status and timing breakdown:

Under the covers, Harbor spins up a Docker container, within which Claude runs with --dangerously-skip-permissions so that it can go about its business without any of that pesky permission seeking. It takes the defined task or dataset prompt, and runs it, as we can see here:

Scoring

A task’s performance is scored using a verifier that’s part of the task definition. For the above "hello world", all we need to do is check that the agent created the file (a) with the correct name and (b) with the correct content. Which is exactly what this Python script does:

from pathlib import Path


def test_hello_file_exists():
    # The agent must have created the file at the expected path
    hello_path = Path("/app/hello.txt")

    assert hello_path.exists(), f"File {hello_path} does not exist"


def test_hello_file_contents():
    # The file must hold the expected text (surrounding whitespace is ignored)
    hello_path = Path("/app/hello.txt")

    content = hello_path.read_text().strip()
    expected_content = "Hello, world!"

    assert content == expected_content, (
        f"File content is '{content}', expected '{expected_content}'"
    )

Its wrapper script gives it a pass/fail score—either it worked, or it didn’t:

if [ $? -eq 0 ]; then
  echo 1 > /logs/verifier/reward.txt
else
  echo 0 > /logs/verifier/reward.txt
fi
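If you'd rather keep the whole verifier in one language, the same pass/fail logic can be sketched in Python. This is just an illustration, not something that ships with Harbor: the function name `write_reward` is made up here, and all it assumes from the task is that the reward ends up as `1` or `0` in a `reward.txt` file.

```python
from pathlib import Path


def write_reward(exit_code: int, reward_path: Path) -> int:
    """Write 1 to reward_path if the test run succeeded (exit code 0), else 0."""
    reward = 1 if exit_code == 0 else 0
    # Make sure the target folder exists before writing the reward file
    reward_path.parent.mkdir(parents=True, exist_ok=True)
    reward_path.write_text(f"{reward}\n")
    return reward
```

For the hello-world task, `reward_path` would be `/logs/verifier/reward.txt` and `exit_code` would be the exit status of the test run.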

If you run the test multiple times, you’ll get a score (reward) for each trial; unsurprisingly, "hello-world" doesn’t pose any challenges or show any variability.
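To make the multi-trial numbers concrete, here is a rough sketch of how per-trial rewards could be rolled up into the "Mean" and "Reward Distribution" figures that the summary table reports. This is not Harbor's own code; `summarise_rewards` and the sample rewards are invented for illustration.

```python
from collections import Counter
from statistics import mean


def summarise_rewards(rewards: list[float]) -> dict:
    """Return the trial count, mean reward, and how many trials got each reward."""
    if not rewards:
        raise ValueError("need at least one trial reward")
    return {
        "trials": len(rewards),
        "mean": mean(rewards),
        "distribution": dict(Counter(rewards)),
    }


# Three hypothetical hello-world trials that all passed
summary = summarise_rewards([1.0, 1.0, 1.0])
print(summary)
```

With a pass/fail verifier the distribution only ever has two buckets, so any variability between trials shows up directly in the mean.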

So that’s Hello World…what about the Real World?

Using it with dbt

The driver for looking at Harbor was my curiosity as to whether I could have used it in place of my hacky, homebrew, bespoke, and artisanal scripts, and if so, what that would look like.

There is an imbalance here in that I now know more about evals, deterministic testing and LLM-as-judge than I did before creating my custom harness. If I were to write it again, it’d be a lot cleaner I’m sure. So almost by definition, something like Harbor is probably going to be better.

As you’d expect by now, I didn’t write the dbt task myself; I told Claude about my previous work, and had it build a Harbor-compliant task. Its key components are these:

Component: Dockerfile
Description: Installs dbt-duckdb and the dbt-agent-skills in the same container in which Claude Code will run.

Component: instruction.md
Description: The same prompt as before:

    […] Build a dbt project using DuckDB for this data using idiomatic patterns and good practices. […] Run dbt build to verify your work. If it fails, fix the errors and re-run until it passes.

Component: tests/test.sh
Description: Verifier script, which does several things:

- Deterministic checks (similar to the ones I used before: is the dbt project there, does it have models defined, etc.)
- Non-deterministic checks, via LLM-as-judge
- If the LLM-as-judge succeeds, uses its score for the Harbor reward; if it fails, falls back on the deterministic checks score

Component: tests/llm_judge.py, tests/rubric.md
Description: Script to call out to an LLM to judge the work, using the rubric provided (similar to the one used before)
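As a flavour of what the deterministic checks might look like, here is a minimal sketch covering the two examples mentioned above (is the dbt project there, and does it have models defined). It is not the actual verifier from this task: the function name `deterministic_score` is hypothetical, and it assumes dbt's default layout of a `dbt_project.yml` file alongside a `models` folder.

```python
from pathlib import Path


def deterministic_score(project_dir: Path) -> float:
    """Fraction of simple structural checks that a dbt project passes."""
    models_dir = project_dir / "models"
    checks = [
        # Is the dbt project there?
        (project_dir / "dbt_project.yml").is_file(),
        # Does it have at least one SQL model defined (at any depth)?
        models_dir.is_dir() and any(models_dir.rglob("*.sql")),
    ]
    return sum(checks) / len(checks)
```

Returning a fraction rather than a hard pass/fail means a half-built project still gets partial credit, which is handy when comparing runs.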

There are plenty of gaps in this, such as using only the non-deterministic score whenever the LLM-as-judge succeeds. The final Harbor reward value should probably be a weighted combination of the deterministic and non-deterministic verification scores. However, this was more about understanding the scope of Harbor than building the perfect test.
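A weighted reward along those lines could look something like the sketch below. It is an assumption-laden illustration rather than what the task currently does: `combined_reward` is a hypothetical name, both scores are assumed to be on a 0 to 1 scale, and the default weight of 0.5 is an arbitrary placeholder.

```python
from typing import Optional


def combined_reward(
    deterministic: float,
    judge: Optional[float],
    judge_weight: float = 0.5,
) -> float:
    """Blend two scores in [0, 1]; use the deterministic score alone if the judge failed."""
    if not 0.0 <= judge_weight <= 1.0:
        raise ValueError("judge_weight must be between 0 and 1")
    if judge is None:
        # The LLM-as-judge call failed: fall back on the deterministic checks
        return deterministic
    return (1.0 - judge_weight) * deterministic + judge_weight * judge
```

Setting `judge_weight=1.0` reproduces the behaviour described above, where the judge's score is used on its own whenever it is available.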

With this in place I could...
