I tested Haiku vs. Sonnet across 3 agent tasks – the cheap model won every time

aimvik07/agent-eval: CLI toolkit for probing LLM agent failures, comparing models on cost vs accuracy, and catching regressions. Tested across classification, sentiment, and RAG agents.


agent-eval

CLI toolkit for evaluating LLM agents. Answers three questions:

Where does my agent fail? → agent-eval probe

Which model is best? → agent-eval compare (Phase 2)

Did my change break anything? → agent-eval gate (Phase 3)

Install

pip install agt-eval

Or for development:

git clone https://github.com/aimvik07/agent-eval.git
cd agent-eval
pip install -e ".[dev]"

Quick start

# Run a probe against the demo golden dataset
agent-eval probe examples/simple_eval.py

# See probe history
agent-eval history examples/simple_eval.py

# Review failures interactively
agent-eval golden examples/simple_eval.py --review

# Show golden dataset statistics
agent-eval golden examples/simple_eval.py --stats

Writing a config

# triage_eval.py
from agent_eval import EvalConfig
from my_agent.classifier import classify_issue


async def evaluate(input_text: str, model: str = "claude-haiku-4-5") -> dict:
    result = await classify_issue(input_text, model=model)
    return {"category": result.category, "confidence": result.confidence}


config = EvalConfig(
    name="github-triage",
    fn=evaluate,
    output_field="category",
    valid_values=["bug", "feature_request", "question", "incomplete"],
    models=["claude-haiku-4-5", "claude-sonnet-4-6"],
    golden_path="golden.json",
)
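To try the evaluate wrapper without calling a real model, one option is to stub the classifier. The sketch below is only an illustration: `classify_issue` here is a hypothetical keyword-based stand-in for `my_agent.classifier.classify_issue`, and `Classification` is an invented result type carrying just the two attributes the wrapper reads.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Classification:
    # Hypothetical result type: only the two attributes evaluate() reads.
    category: str
    confidence: float


async def classify_issue(input_text: str, model: str) -> Classification:
    # Hypothetical stand-in for my_agent.classifier.classify_issue.
    # A keyword rule replaces the model call so the sketch runs offline.
    text = input_text.lower()
    if "error" in text or "traceback" in text:
        return Classification("bug", 0.9)
    if text.rstrip().endswith("?"):
        return Classification("question", 0.7)
    return Classification("incomplete", 0.5)


async def evaluate(input_text: str, model: str = "claude-haiku-4-5") -> dict:
    # Same shape as the wrapper in the config: call the agent, return a dict.
    result = await classify_issue(input_text, model=model)
    return {"category": result.category, "confidence": result.confidence}


if __name__ == "__main__":
    print(asyncio.run(evaluate("TypeError in parse_response")))
```

Because the returned dict contains a `category` key, it lines up with `output_field="category"` in the config; swapping the stub for the real classifier should not change the wrapper.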

Golden dataset format

[
  {"id": "1", "input": "TypeError in parse_response", "expected": "bug"},
  {
    "id": "2",
    "input": "Could be bug or feature",
    "expected": "bug",
    "ambiguous": true,
    "acceptable_outputs": ["bug", "feature_request"],
    "notes": "Borderline case"
  }
]

Fields: id (required), input (required), expected (required), labels (list, for filtering), ambiguous (bool), acceptable_outputs (required when ambiguous), notes (string).
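The field rules above can be checked before a run. The following is a minimal sketch of such a check, not part of agent-eval itself: `validate_case` and `REQUIRED` are hypothetical names, and the type checks simply mirror the field list as written.

```python
REQUIRED = ("id", "input", "expected")


def validate_case(case: dict) -> list[str]:
    """Return a list of problems found in one golden case (empty if valid)."""
    problems = [
        f"missing required field: {name}" for name in REQUIRED if name not in case
    ]
    if "labels" in case and not isinstance(case["labels"], list):
        problems.append("labels must be a list")
    if "ambiguous" in case and not isinstance(case["ambiguous"], bool):
        problems.append("ambiguous must be a bool")
    # acceptable_outputs only becomes mandatory once a case is marked ambiguous.
    if case.get("ambiguous") is True and not case.get("acceptable_outputs"):
        problems.append("acceptable_outputs is required when ambiguous is true")
    if "notes" in case and not isinstance(case["notes"], str):
        problems.append("notes must be a string")
    return problems
```

Running it over every entry loaded from `golden.json` and printing the non-empty results is usually enough to catch a mislabeled ambiguous case before it skews accuracy.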

Commands

probe <config_path>

Runs the agent against every case in the golden dataset and prints accuracy, failures, and category distributions.

Options:
  --model TEXT    Override the model from config
  --labels TEXT   Comma-separated labels to filter cases (e.g. "hard,ambiguous")
  --verbose       Show all results, not just failures
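To make the probe output concrete, here is a rough sketch of how accuracy, failures, and the category distribution could be computed from a golden dataset. It is not the toolkit's implementation: `filter_by_labels` and `score` are hypothetical helpers, and two behaviors are assumptions, namely that a case matches `--labels` if it carries any of the requested labels, and that an ambiguous case passes on any of its `acceptable_outputs`.

```python
from collections import Counter


def filter_by_labels(cases: list[dict], labels: str) -> list[dict]:
    # Assumption: a case is kept if it carries at least one requested label.
    wanted = {label.strip() for label in labels.split(",") if label.strip()}
    return [c for c in cases if wanted & set(c.get("labels", []))]


def score(cases: list[dict], outputs: dict[str, str]) -> dict:
    # outputs maps case id -> the category the agent returned for that case.
    failures = []
    for case in cases:
        accepted = {case["expected"]}
        if case.get("ambiguous"):
            # Assumption: any listed acceptable output counts as correct.
            accepted |= set(case.get("acceptable_outputs", []))
        if outputs[case["id"]] not in accepted:
            failures.append(case["id"])
    total = len(cases)
    return {
        "accuracy": (total - len(failures)) / total if total else 0.0,
        "failures": failures,
        "distribution": Counter(outputs[c["id"]] for c in cases),
    }
```

A skewed distribution alongside a decent accuracy figure is often the more useful signal, since it can reveal an agent that defaults to one category.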

golden <config_path>

Options:
  --review        Launch interactive review of the latest probe run's failures
  --run-id TEXT   Review a specific past run instead of the latest
  --stats         Print golden dataset statistics

history <config_path>

Lists recent probe runs with accuracy and duration.

Options:
  --limit INT     Number of runs to show (default: 10)

Running tests

pytest tests/
