Show HN: HermesBench – workflow reliability evals for personal AI agents

verkyyi261 pts0 comments

HermesBench

Hermes Agent runtime evaluation

Benchmark the whole personal agent, not just the model.

HermesBench evaluates complete Hermes configurations: prompt, model/provider, tools, AgentSkills, memory, gateway behavior, delegation, safety, latency, and stability. The current public baseline scores 78.2 across 27 personal-agent recipes with redacted traces you can inspect.


78.2 current public baseline

27 workflow recipes

scored suites

Why trust it

Evidence first, with visible limits.

Every published result links back to scenario definitions, public score axes, driver closure decisions, deterministic checks, and redacted trace timelines. The site is deliberately clear that this is one early baseline, not a base-model leaderboard.

Public recipes
See the prompts
27 user-like personal-agent jobs with criteria and side-effect boundaries.

Redacted traces
Inspect what happened
Tool timelines, assistant replies, checks, and judge summaries without raw private payloads.

Methodology
Understand the score
Capability, reliability, and UX axes with documented limitations.

Site map

Three tabs for the current evidence shape.

With one baseline published, a leaderboard is premature. The site now starts from the content people need to navigate: recipes, profiles, and traces.

Recipes
What was tested
Search by category, prompt, goal, and criteria.

Profiles
What setup ran
Review profile units, roles, and observed tools.

Traces
What happened
Open redacted transcripts, tool timelines, checks, and judge reasoning.
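
To make "redacted" concrete, here is a minimal sketch of one way a trace event could be stripped of private payloads before publication. It is an illustration only, not HermesBench's actual redaction code: the field names (`tool`, `status`, `duration_ms`, `args`, `result`) and the allowlist approach are assumptions.

```python
import json

# Hypothetical allowlist: fields assumed safe to publish unchanged.
PUBLIC_FIELDS = {"tool", "status", "duration_ms"}


def redact_event(event: dict) -> dict:
    """Keep allowlisted fields; replace every other value with a size-only stub."""
    redacted = {}
    for key, value in event.items():
        if key in PUBLIC_FIELDS:
            redacted[key] = value
        else:
            # Serialize only to measure the payload; the content itself is dropped.
            raw = json.dumps(value, sort_keys=True, default=str).encode("utf-8")
            redacted[key] = {"redacted": True, "bytes": len(raw)}
    return redacted


# Hypothetical tool-call event from a private calendar lookup.
event = {
    "tool": "calendar.search",
    "status": "ok",
    "duration_ms": 412,
    "args": {"query": "dentist appointment"},
    "result": [{"title": "Dentist", "start": "2025-03-04T09:00"}],
}
print(json.dumps(redact_event(event), indent=2))
```

An allowlist fails closed: a new field added to the trace format stays hidden until someone decides it is safe to publish.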

Agent-driven quick start

Run it through a coding agent.

The public user pathway is intentionally simple: copy the prompt to Codex, Claude, or another coding agent. The agent loads the HermesBench skill and drives one scenario recipe first. Full bundle runs are opt-in because they take longer and cost more.

Prompt to copy into Codex or Claude

Use the HermesBench skill and run one default scenario recipe for my current Hermes configuration.

Skill: https://github.com/verkyyi/hermesbench/blob/main/agent-skills/hermesbench/SKILL.md

Follow the skill's "Run Current Hermes Configuration" workflow. Use the Python API default single-recipe path, save artifacts, and summarize the score and main findings. Do not run the full bundle unless I explicitly ask.

Alpha feedback

The best next action is concrete feedback.

HermesBench needs early feedback on setup friction, scoring surprises, recipe realism, profile evidence, and redaction trust. Star the repo if the benchmark shape is useful; open an issue if one recipe, trace, or score axis feels wrong.


Coverage model

Workflow recipes, broad personal-agent coverage.

HermesBench starts with one valuable workflow recipe, then lets you opt into broader suites when you need more confidence. The bundled catalog covers everyday personal-agent work: context, calendar, web, reports, communication, location, travel, finance, safety, and power-user integrations.


Personal core
Communications
Ambient and travel
Private sensitive
Power-user optional

Scoring philosophy

Good agents finish the right thing safely.

Outcome reached
Evidence / truthfulness
Runtime / scope safety
Responsiveness
Task fulfillment
Communication quality

HermesBench is reliability-first, but not capability-blind. A good configuration should do useful work, tell the truth about what it knows, avoid unsafe side effects, stay stable, respond promptly, and communicate clearly. Lopsided scores are penalized because a personal agent that is capable but unsafe, safe but unhelpful, or correct but unusably slow is not actually good.

Detailed formulas and implementation mechanics live in the methodology document; the website keeps the scoring model readable for users and LLM agents.
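
As a rough illustration of what "lopsided scores are penalized" can mean, the sketch below compares a plain average with a harmonic mean over six axis scores. This is not the HermesBench formula, which lives in the methodology document. The harmonic mean is an assumed stand-in for any balance-sensitive aggregate, and the 0-100 scale and the example numbers are made up.

```python
from statistics import fmean, harmonic_mean

AXES = [
    "outcome_reached",
    "evidence_truthfulness",
    "runtime_scope_safety",
    "responsiveness",
    "task_fulfillment",
    "communication_quality",
]


def plain_average(scores: dict[str, float]) -> float:
    """Arithmetic mean: a strong axis can fully offset a weak one."""
    return fmean([scores[axis] for axis in AXES])


def balanced_score(scores: dict[str, float]) -> float:
    """Harmonic mean (assumed aggregate): a weak axis drags the total down."""
    return harmonic_mean([scores[axis] for axis in AXES])


# Hypothetical configurations with the same plain average.
even = dict.fromkeys(AXES, 78.0)
lopsided = dict(zip(AXES, [100.0, 100.0, 30.0, 100.0, 100.0, 38.0]))

for name, scores in (("even", even), ("lopsided", lopsided)):
    print(name, round(plain_average(scores), 1), round(balanced_score(scores), 1))
```

Both configurations have the same plain average, but the balance-sensitive aggregate ranks the one with weak safety and communication scores clearly lower.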

Use and contribute

Turn good results into reusable recipes.

HermesBench is useful as a quick benchmark, but it is also a way to publish what worked. Share a redacted profile/config package when a setup improves a recipe, or submit a generic recipe when an important personal-agent use case is missing.

Profile submission prompt

Use the HermesBench skill to prepare my current Hermes profile/config as a public profile submission.

Skill: https://github.com/verkyyi/hermesbench/blob/main/agent-skills/hermesbench/SKILL.md

Run one representative recipe first, package the redacted profile snapshot and score evidence, and tell me what must be reviewed before opening a pull request.

Recipe submission prompt

Use the HermesBench skill to propose a new generic personal-agent recipe for HermesBench.

Skill: https://github.com/verkyyi/hermesbench/blob/main/agent-skills/hermesbench/SKILL.md

Make the use case privacy-safe, driver/target agnostic, fixture-backed where possible, and include deterministic checks before preparing a pull request.
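
For contributors wondering what "fixture-backed" and "deterministic checks" might look like in practice, here is a minimal sketch. The fixture keys, check names, and tool names are all hypothetical and do not reflect the HermesBench recipe schema. The idea is that each check is a pure function of the agent's reply, the observed tool calls, and a fixed fixture, so repeated runs give the same verdict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool


def run_checks(reply: str, tool_calls: list[str], fixture: dict) -> list[CheckResult]:
    """Evaluate a reply and its tool calls against a fixed fixture (no model involved)."""
    expected = fixture["expected_title"]
    forbidden = set(fixture["forbidden_tools"])
    return [
        # The reply must mention the event that the fixture says exists.
        CheckResult("mentions_expected_event", expected.lower() in reply.lower()),
        # Side-effect boundary: none of the forbidden tools may have been called.
        CheckResult("no_forbidden_side_effects", forbidden.isdisjoint(tool_calls)),
    ]


# Hypothetical read-only calendar recipe.
fixture = {
    "expected_title": "Dentist",
    "forbidden_tools": ["calendar.create", "calendar.delete", "email.send"],
}
results = run_checks(
    reply="Your next appointment is Dentist on Tuesday at 9:00.",
    tool_calls=["calendar.search"],
    fixture=fixture,
)
for result in results:
    print(result.name, "PASS" if result.passed else "FAIL")
```

Checks like these complement judge-based scoring rather than replace it: they can confirm facts and side-effect boundaries, but not whether the reply was well written.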
