HermesBench
Hermes Agent runtime evaluation
Benchmark the whole personal agent, not just the model.
HermesBench evaluates complete Hermes configurations: prompt, model/provider, tools, AgentSkills, memory, gateway behavior, delegation, safety, latency, and stability. The current public baseline scores 78.2 across 27 personal-agent recipes with redacted traces you can inspect.
Inspect the baseline<br>Run one recipe<br>Star on GitHub<br>Give feedback
78.2<br>current public baseline
27<br>workflow recipes
scored suites
Why trust it
Evidence first, with visible limits.
Every published result links back to scenario definitions, public score axes, driver closure decisions, deterministic checks, and redacted trace timelines. The site is deliberately clear that this is one early baseline, not a base-model leaderboard.
Public recipes<br>See the prompts<br>27 user-like personal-agent jobs with criteria and side-effect boundaries.
Redacted traces<br>Inspect what happened<br>Tool timelines, assistant replies, checks, and judge summaries without raw private payloads.
Methodology<br>Understand the score<br>Capability, reliability, and UX axes with documented limitations.
Site map
Three tabs for the current evidence shape.
With only one baseline published, a leaderboard is premature. The site instead starts from the content people need to navigate: recipes, profiles, and traces.
Recipes<br>What was tested<br>Search by category, prompt, goal, and criteria.
Profiles<br>What setup ran<br>Review profile units, roles, and observed tools.
Traces<br>What happened<br>Open redacted transcripts, tool timelines, checks, and judge reasoning.
Agent-driven quick start
Run it through a coding agent.
The public user pathway is intentionally simple: copy the prompt to Codex, Claude, or another coding agent. The agent loads the HermesBench skill and drives one scenario recipe first. Full bundle runs are opt-in because they take longer and cost more.
Prompt to copy into Codex or Claude
Use the HermesBench skill and run one default scenario recipe for my current Hermes configuration.
Skill: https://github.com/verkyyi/hermesbench/blob/main/agent-skills/hermesbench/SKILL.md
Follow the skill's "Run Current Hermes Configuration" workflow. Use the Python API default single-recipe path, save artifacts, and summarize the score and main findings. Do not run the full bundle unless I explicitly ask.
Alpha feedback
The best next action is concrete feedback.
HermesBench needs early feedback on setup friction, scoring surprises, recipe realism, profile evidence, and redaction trust. Star the repo if the benchmark shape is useful; open an issue if one recipe, trace, or score axis feels wrong.
Open feedback issue<br>Read feedback guide<br>Submission contract
Coverage model
Workflow recipes, broad personal-agent coverage.
HermesBench starts with one valuable workflow recipe, then lets you opt into broader suites when you need more confidence. The bundled catalog covers everyday personal-agent work: context, calendar, web, reports, communication, location, travel, finance, safety, and power-user integrations.
Browse recipes
Personal core<br>Communications<br>Ambient and travel<br>Private sensitive<br>Power-user optional
Scoring philosophy
Good agents finish the right thing safely.
Outcome reached<br>Evidence / truthfulness<br>Runtime / scope safety<br>Responsiveness<br>Task fulfillment<br>Communication quality
HermesBench is reliability-first, but not capability-blind. A good configuration should do useful work, tell the truth about what it knows, avoid unsafe side effects, stay stable, respond promptly, and communicate clearly. Lopsided scores are penalized because a personal agent that is capable but unsafe, safe but unhelpful, or correct but unusably slow is not actually good.
Detailed formulas and implementation mechanics live in the methodology document; the website keeps the scoring model readable for users and LLM agents.
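To make the idea of penalizing lopsided scores concrete, here is a minimal sketch. It is not the HermesBench formula, which lives in the methodology document. It assumes each of the six axes above is scored from 0 to 100 and uses a harmonic mean as a stand-in aggregate, because a harmonic mean is pulled toward the weakest axis. The axis keys and function names are hypothetical.

```python
from statistics import fmean, harmonic_mean

# Hypothetical keys for the six axes listed above; the 0-100 scale is an assumption.
AXES = (
    "outcome_reached",
    "evidence_truthfulness",
    "runtime_scope_safety",
    "responsiveness",
    "task_fulfillment",
    "communication_quality",
)


def plain_average(axis_scores):
    """Arithmetic mean: one weak axis is hidden by five strong ones."""
    return fmean([axis_scores[axis] for axis in AXES])


def balanced_score(axis_scores):
    """Harmonic mean: a stand-in aggregate that is dragged toward the weakest axis.

    Scores must be non-negative; a zero on any axis yields an aggregate of zero.
    """
    return harmonic_mean([axis_scores[axis] for axis in AXES])


# A capable but unsafe configuration: strong everywhere except scope safety.
lopsided = {axis: 95 for axis in AXES}
lopsided["runtime_scope_safety"] = 20

print(round(plain_average(lopsided), 1))
print(round(balanced_score(lopsided), 1))
```

For this lopsided example the plain average is 82.5 while the harmonic mean is roughly 58.5, so the single weak safety axis dominates the result. An evenly scored configuration gets the same value from both functions.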
Use and contribute
Turn good results into reusable recipes.
HermesBench is useful as a quick benchmark, but it is also a way to publish what worked. Share a redacted profile/config package when a setup improves a recipe, or submit a generic recipe when an important personal-agent use case is missing.
Profile submission prompt
Use the HermesBench skill to prepare my current Hermes profile/config as a public profile submission.
Skill: https://github.com/verkyyi/hermesbench/blob/main/agent-skills/hermesbench/SKILL.md
Run one representative recipe first, package the redacted profile snapshot and score evidence, and tell me what must be reviewed before opening a pull request.
Recipe submission prompt
Use the HermesBench skill to propose a new generic personal-agent recipe for HermesBench.
Skill: https://github.com/verkyyi/hermesbench/blob/main/agent-skills/hermesbench/SKILL.md
Make the use case privacy-safe, driver/target agnostic, fixture-backed where possible, and include deterministic checks before preparing a pull request.
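As an illustration of what a deterministic check in a recipe might look like, the sketch below verifies a side-effect boundary: the agent may only call tools on an allow-list. Everything here is hypothetical, including the `ToolCall` type, the tool names, and the function name. It does not reflect the actual HermesBench check format, which is defined by the skill and the submission contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One entry in a hypothetical redacted tool timeline."""
    name: str


def check_allowed_tools(timeline, allowed):
    """Return (passed, violations) for a side-effect boundary.

    The check is deterministic: it reads only the recorded tool names,
    so the same timeline always produces the same verdict.
    """
    violations = sorted({call.name for call in timeline if call.name not in allowed})
    return (not violations, violations)


# Hypothetical recipe: the agent may read the calendar but must not send email.
timeline = [ToolCall("calendar.read"), ToolCall("email.send")]
passed, violations = check_allowed_tools(timeline, allowed={"calendar.read"})
print(passed, violations)
```

Because the check never calls a model or reads a clock, it can run alongside judge summaries without adding variance of its own.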