GitHub - aimvik07/agent-eval: CLI toolkit for probing LLM agent failures, comparing models on cost vs accuracy, and catching regressions. Tested across classification, sentiment, and RAG agents. · GitHub
agent-eval
CLI toolkit for evaluating LLM agents. Answers three questions:
Where does my agent fail? → agent-eval probe
Which model is best? → agent-eval compare (Phase 2)
Did my change break anything? → agent-eval gate (Phase 3)
Install
pip install agt-eval
Or for development:
git clone https://github.com/aimvik07/agent-eval.git
cd agent-eval
pip install -e ".[dev]"
Quick start
# Run a probe against the demo golden dataset
agent-eval probe examples/simple_eval.py

# See probe history
agent-eval history examples/simple_eval.py

# Review failures interactively
agent-eval golden examples/simple_eval.py --review

# Show golden dataset statistics
agent-eval golden examples/simple_eval.py --stats
Writing a config
# triage_eval.py
from agent_eval import EvalConfig
from my_agent.classifier import classify_issue

async def evaluate(input_text: str, model: str = "claude-haiku-4-5") -> dict:
    result = await classify_issue(input_text, model=model)
    return {"category": result.category, "confidence": result.confidence}

config = EvalConfig(
    name="github-triage",
    fn=evaluate,
    output_field="category",
    valid_values=["bug", "feature_request", "question", "incomplete"],
    models=["claude-haiku-4-5", "claude-sonnet-4-6"],
    golden_path="golden.json",
)
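Before wiring an `evaluate` function into a config, it can help to call it once by hand and confirm that it returns a dict whose `output_field` holds one of the `valid_values`. The sketch below does that without `agent_eval` or a real model. `classify_issue_stub` and `check_output` are hypothetical helpers invented for this illustration; they are not part of the toolkit.

```python
import asyncio

VALID_VALUES = ["bug", "feature_request", "question", "incomplete"]


async def classify_issue_stub(input_text: str, model: str = "stub") -> dict:
    """Hypothetical stand-in for a real classifier: keyword rules only."""
    text = input_text.lower()
    if "error" in text or "traceback" in text:
        category = "bug"
    elif text.endswith("?"):
        category = "question"
    elif len(text.split()) < 3:
        category = "incomplete"
    else:
        category = "feature_request"
    return {"category": category, "confidence": 0.5}


async def evaluate(input_text: str, model: str = "stub") -> dict:
    # Same shape as the evaluate() in the config: a dict containing output_field.
    return await classify_issue_stub(input_text, model=model)


def check_output(output: dict, output_field: str, valid_values: list[str]) -> None:
    """Raise ValueError if the field is missing or holds an unknown value."""
    if output_field not in output:
        raise ValueError(f"missing field {output_field!r}")
    if output[output_field] not in valid_values:
        raise ValueError(f"unexpected value {output[output_field]!r}")


result = asyncio.run(evaluate("TypeError in parse_response"))
check_output(result, "category", VALID_VALUES)
print(result)
```

Swapping the stub for the real classifier keeps the same check in place, so a typo in `output_field` or a category missing from `valid_values` shows up before a full probe run.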
Golden dataset format
[
  {"id": "1", "input": "TypeError in parse_response", "expected": "bug"},
  {
    "id": "2",
    "input": "Could be bug or feature",
    "expected": "bug",
    "ambiguous": true,
    "acceptable_outputs": ["bug", "feature_request"],
    "notes": "Borderline case"
  }
]
Fields: id (required), input (required), expected (required), labels (list, for filtering), ambiguous (bool), acceptable_outputs (required when ambiguous), notes (string).
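The field rules above can be checked mechanically before a probe run. The following is a minimal sketch of such a check, not the toolkit's own validation. `validate_case` and `validate_golden` are hypothetical names, and the sample data is made up to trigger each rule.

```python
import json

REQUIRED_FIELDS = ("id", "input", "expected")


def validate_case(case: dict) -> list[str]:
    """Return a list of problems found in one golden case (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in case:
            problems.append(f"missing required field '{field}'")
    # acceptable_outputs is required only when the case is marked ambiguous.
    if case.get("ambiguous") and not case.get("acceptable_outputs"):
        problems.append("ambiguous case needs a non-empty 'acceptable_outputs' list")
    if "labels" in case and not isinstance(case["labels"], list):
        problems.append("'labels' must be a list")
    return problems


def validate_golden(cases: list[dict]) -> dict[str, list[str]]:
    """Map case id (or position, when id is missing) to its problems."""
    report = {}
    for index, case in enumerate(cases):
        problems = validate_case(case)
        if problems:
            report[str(case.get("id", f"index {index}"))] = problems
    return report


golden = json.loads("""
[
  {"id": "1", "input": "TypeError in parse_response", "expected": "bug"},
  {"id": "2", "input": "Could be bug or feature", "expected": "bug", "ambiguous": true},
  {"id": "3", "input": "Add dark mode"}
]
""")

report = validate_golden(golden)
for case_id, problems in report.items():
    print(case_id, problems)
```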
Commands
probe <config_path>
Runs the agent against every case in the golden dataset and prints accuracy, failures, and category distributions.
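To make the reported numbers concrete, here is a rough sketch of how accuracy, failures, and category distributions could be computed from a golden dataset and a set of predictions. It assumes that an ambiguous case counts as correct when the prediction appears in `acceptable_outputs`; the README does not spell out the scoring rule, so treat that as an assumption. `is_correct` and `summarize` are hypothetical names, not the toolkit's API.

```python
from collections import Counter


def is_correct(case: dict, predicted: str) -> bool:
    """Assumed rule: ambiguous cases accept any value in acceptable_outputs."""
    if case.get("ambiguous"):
        return predicted in case.get("acceptable_outputs", [case["expected"]])
    return predicted == case["expected"]


def summarize(cases: list[dict], predictions: dict[str, str]) -> dict:
    """Compute accuracy, the failing cases, and category distributions."""
    failures = []
    for case in cases:
        predicted = predictions[case["id"]]
        if not is_correct(case, predicted):
            failures.append(
                {"id": case["id"], "expected": case["expected"], "predicted": predicted}
            )
    total = len(cases)
    return {
        "accuracy": (total - len(failures)) / total if total else 0.0,
        "failures": failures,
        "expected_distribution": Counter(c["expected"] for c in cases),
        "predicted_distribution": Counter(predictions[c["id"]] for c in cases),
    }


cases = [
    {"id": "1", "input": "TypeError in parse_response", "expected": "bug"},
    {
        "id": "2",
        "input": "Could be bug or feature",
        "expected": "bug",
        "ambiguous": True,
        "acceptable_outputs": ["bug", "feature_request"],
    },
    {"id": "3", "input": "How do I install this?", "expected": "question"},
]
predictions = {"1": "bug", "2": "feature_request", "3": "bug"}

summary = summarize(cases, predictions)
print(f"accuracy: {summary['accuracy']:.2f}")
print("failures:", summary["failures"])
print("predicted:", dict(summary["predicted_distribution"]))
```

Comparing the expected and predicted distributions is one quick way to spot a model that over-predicts a single category.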
Options:
  --model TEXT    Override the model from config
  --labels TEXT   Comma-separated labels to filter cases (e.g. "hard,ambiguous")
  --verbose       Show all results, not just failures
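As an illustration of what label filtering might look like, the sketch below parses a comma-separated string and keeps cases that carry at least one of the requested labels. Whether `--labels` matches any label or requires all of them is not stated in the README, so the any-match rule here is an assumption. `parse_labels` and `filter_cases` are hypothetical names.

```python
def parse_labels(raw: str) -> set[str]:
    """Split a comma-separated string into a set of non-empty labels."""
    return {part.strip() for part in raw.split(",") if part.strip()}


def filter_cases(cases: list[dict], raw_labels: str) -> list[dict]:
    """Keep cases sharing at least one label with raw_labels (assumed any-match)."""
    wanted = parse_labels(raw_labels)
    if not wanted:
        # No usable labels given: leave the dataset unfiltered.
        return list(cases)
    return [case for case in cases if wanted & set(case.get("labels", []))]


cases = [
    {"id": "1", "input": "TypeError in parse_response", "expected": "bug", "labels": ["easy"]},
    {"id": "2", "input": "Could be bug or feature", "expected": "bug", "labels": ["hard", "ambiguous"]},
    {"id": "3", "input": "How do I install this?", "expected": "question"},
]

selected = filter_cases(cases, "hard, ambiguous")
print([case["id"] for case in selected])
```

Changing `wanted & set(...)` to `wanted <= set(...)` would switch the sketch to all-match semantics.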
golden <config_path>
Options:
  --review        Launch interactive review of the latest probe run's failures
  --run-id TEXT   Review a specific past run instead of the latest
  --stats         Print golden dataset statistics
history <config_path>
Lists recent probe runs with accuracy and duration.
Options:
  --limit INT     Number of runs to show (default: 10)
Running tests
pytest tests/