I tested Haiku vs. Sonnet across 3 agent tasks – the cheap model won every time

aimvik071 pts0 comments

GitHub - aimvik07/agent-eval: CLI toolkit for probing LLM agent failures, comparing models on cost vs accuracy, and catching regressions. Tested across classification, sentiment, and RAG agents. · GitHub



agent-eval

CLI toolkit for evaluating LLM agents. Answers three questions:

Where does my agent fail? → agent-eval probe

Which model is best? → agent-eval compare (Phase 2)

Did my change break anything? → agent-eval gate (Phase 3)

Install

pip install agt-eval

Or for development:

git clone https://github.com/aimvik07/agent-eval.git
cd agent-eval
pip install -e ".[dev]"

Quick start

# Run a probe against the demo golden dataset
agent-eval probe examples/simple_eval.py

# See probe history
agent-eval history examples/simple_eval.py

# Review failures interactively
agent-eval golden examples/simple_eval.py --review

# Show golden dataset statistics
agent-eval golden examples/simple_eval.py --stats

Writing a config

# triage_eval.py
from agent_eval import EvalConfig
from my_agent.classifier import classify_issue

async def evaluate(input_text: str, model: str = "claude-haiku-4-5") -> dict:
    result = await classify_issue(input_text, model=model)
    return {"category": result.category, "confidence": result.confidence}

config = EvalConfig(
    name="github-triage",
    fn=evaluate,
    output_field="category",
    valid_values=["bug", "feature_request", "question", "incomplete"],
    models=["claude-haiku-4-5", "claude-sonnet-4-6"],
    golden_path="golden.json",
)
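Before pointing a config at a real agent, it can help to check that the evaluate function returns the expected shape. The sketch below is only an illustration: StubResult and the keyword rule inside the stand-in classify_issue are hypothetical, replace my_agent.classifier, and make no model calls.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubResult:
    """Hypothetical result type with the two attributes evaluate() reads."""
    category: str
    confidence: float


async def classify_issue(input_text: str, model: str) -> StubResult:
    """Hypothetical stand-in for my_agent.classifier.classify_issue.

    Uses a trivial keyword rule instead of calling a model.
    """
    category = "bug" if "Error" in input_text else "question"
    return StubResult(category=category, confidence=0.5)


async def evaluate(input_text: str, model: str = "claude-haiku-4-5") -> dict:
    # Same shape as the evaluate function in triage_eval.py above.
    result = await classify_issue(input_text, model=model)
    return {"category": result.category, "confidence": result.confidence}


output = asyncio.run(evaluate("TypeError in parse_response"))
print(output)  # {'category': 'bug', 'confidence': 0.5}
```

The returned dict carries the key named in output_field ("category"), which is presumably the value compared against each golden case.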

Golden dataset format

[
  {"id": "1", "input": "TypeError in parse_response", "expected": "bug"},
  {
    "id": "2",
    "input": "Could be bug or feature",
    "expected": "bug",
    "ambiguous": true,
    "acceptable_outputs": ["bug", "feature_request"],
    "notes": "Borderline case"
  }
]

Fields: id (required), input (required), expected (required), labels (list, for filtering), ambiguous (bool), acceptable_outputs (required when ambiguous), notes (string).
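The field rules above can be checked before running a probe. This is a minimal sketch that assumes the golden file is a JSON array of case objects; validate_case and validate_golden are hypothetical helpers, not part of agent-eval, and the toolkit's own validation may differ.

```python
import json

REQUIRED_FIELDS = ("id", "input", "expected")


def validate_case(case: dict) -> list[str]:
    """Return the problems found in one golden case (empty list if valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in case:
            problems.append(f"missing required field: {field}")
    if "labels" in case and not isinstance(case["labels"], list):
        problems.append("labels must be a list")
    # acceptable_outputs is only mandatory for cases marked ambiguous.
    if case.get("ambiguous", False) and not case.get("acceptable_outputs"):
        problems.append("acceptable_outputs is required when ambiguous is true")
    return problems


def validate_golden(text: str) -> dict[str, list[str]]:
    """Parse a JSON array of cases and map each invalid case id to its problems."""
    report = {}
    for index, case in enumerate(json.loads(text)):
        problems = validate_case(case)
        if problems:
            report[str(case.get("id", f"index {index}"))] = problems
    return report


sample = """[
  {"id": "1", "input": "TypeError in parse_response", "expected": "bug"},
  {"id": "2", "input": "Could be bug or feature", "expected": "bug", "ambiguous": true}
]"""
print(validate_golden(sample))
# {'2': ['acceptable_outputs is required when ambiguous is true']}
```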

Commands

probe <config_path>

Runs the agent against every case in the golden dataset and prints accuracy, failures, and category distributions.

Options:

--model TEXT    Override the model from config
--labels TEXT   Comma-separated labels to filter cases (e.g. "hard,ambiguous")
--verbose       Show all results, not just failures
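To make the probe output concrete, here is a rough sketch of how accuracy, failures, and the output distribution could be computed from golden cases and agent outputs. It is not the toolkit's implementation: is_correct, filter_by_labels, and summarize are hypothetical names, and matching a case when it carries any one of the requested labels is an assumption about how --labels behaves.

```python
from collections import Counter
from typing import Optional


def is_correct(case: dict, output: str) -> bool:
    """Ambiguous cases accept any listed output; others must match expected."""
    if case.get("ambiguous"):
        return output in case.get("acceptable_outputs", [case["expected"]])
    return output == case["expected"]


def filter_by_labels(cases: list[dict], labels: Optional[str]) -> list[dict]:
    """Keep cases carrying at least one of the comma-separated labels (assumed semantics)."""
    if not labels:
        return cases
    wanted = {label.strip() for label in labels.split(",") if label.strip()}
    return [c for c in cases if wanted & set(c.get("labels", []))]


def summarize(cases: list[dict], outputs: dict[str, str]) -> dict:
    """Compute accuracy, failing case ids, and the distribution of outputs."""
    failures = [c["id"] for c in cases if not is_correct(c, outputs[c["id"]])]
    total = len(cases)
    return {
        "accuracy": (total - len(failures)) / total if total else 0.0,
        "failures": failures,
        "distribution": dict(Counter(outputs[c["id"]] for c in cases)),
    }


cases = [
    {"id": "1", "input": "TypeError in parse_response", "expected": "bug",
     "labels": ["easy"]},
    {"id": "2", "input": "Could be bug or feature", "expected": "bug",
     "ambiguous": True, "acceptable_outputs": ["bug", "feature_request"],
     "labels": ["hard", "ambiguous"]},
    {"id": "3", "input": "How do I configure retries?", "expected": "question",
     "labels": ["hard"]},
]
# Illustrative agent outputs keyed by case id (made up for this example).
outputs = {"1": "bug", "2": "feature_request", "3": "bug"}

summary = summarize(cases, outputs)
print(summary["failures"])      # ['3']
print(summary["distribution"])  # {'bug': 2, 'feature_request': 1}
```

In this made-up run, case 2 counts as correct because "feature_request" is among its acceptable_outputs, so two of the three cases pass.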

golden <config_path>

Options:

--review        Launch interactive review of the latest probe run's failures
--run-id TEXT   Review a specific past run instead of the latest
--stats         Print golden dataset statistics

history <config_path>

Lists recent probe runs with accuracy and duration.

Options:

--limit INT     Number of runs to show (default: 10)

Running tests

pytest tests/
