Show HN: Przm, a multi-agent AI reliability leaderboard with signed receipts

onenomad1 pts0 comments

przm (przm.sh) is an open-source multi-axis leaderboard for AI failure modes that don t have vendor-neutral standards yet.The v0.1 benchmark is multi-agent convergence and sycophancy. Setup: N agents debate a question with a known correct answer. One agent is a confederate that s been pre-assigned a wrong answer plus a confident-sounding rationale to defend it. We measure:- correct_final_answer_rate: did the group land on the right answer? - collapse_rate: how often did the group converge to a single answer (right or wrong) without surfacing the reasoning that should have stopped it? - sycophancy_ratio: how often did an agent that started correct end on the confederate s wrong answer? - tokens_per_correct_answer: compute spent per correct outcome - position_flips_per_agent_per_round: round-over-round answer changes (descriptive)Scoring is deterministic. Pure-function math on recorded state, no LLM judge anywhere. Every result is an Ed25519-signed JSON receipt with adapter version, LLM version, fixture SHA-256, and full per-round transcripts pinned. Anyone can re-run and verify.The most interesting v0.1 finding:Holding the model constant (gpt-4o-mini), the orchestration framework drives collapse rate. On the 30-fixture combined run: hand-rolled synchronous baseline 73%, sequential-reveal baseline 87%, AutoGen RoundRobinGroupChat 10%. On the sealed 6-fixture holdout: baseline 83%, sequential 83%, AutoGen 0 collapses out of 6. The gap survives even when we control for reveal protocol. The sequential baseline uses AutoGen s same in-round visibility pattern and still collapses 8-9x more often than AutoGen itself. The framework is doing real work beyond just letting agents see each other within a round.Other findings: - Claude Haiku 4.5 (96.7% correct) and gpt-5-mini (96.7%) both held against confederate pressure on most fixtures. gpt-5-mini uses about 4x more tokens per correct answer than Haiku for the same correctness. - gpt-4o-mini baseline at 77% correct, AutoGen at 93% correct. AutoGen wins on both axes here, not just collapse rate. - gpt-5-mini collapsed on 100% of scenarios it got right. Smarter model, same convergence pathology when one agent is confidently wrong.What s in v0.1: - 30 hand-curated fixtures across 5 categories (factual-math, code-correctness, factual-history, temporal-ordering, boolean-trap). All correct answers verified against authoritative sources before commit. - 6 adapter configurations: baseline-Anthropic Haiku (sync + sequential), baseline-Azure gpt-5-mini and gpt-4o-mini (sync), gpt-4o-mini sequential, AutoGen gpt-4o-mini. - 12 Ed25519-signed receipts on the leaderboard (each adapter x combined + holdout), full per-round transcripts pinned. - A signature verifier that runs in your browser via SubtleCrypto.What s not in v0.1: CrewAI, LangGraph, OpenAI Agents SDK adapters land in v0.2 once a CrewAI/litellm interop quirk is resolved. 20% holdout split once we hit =50 fixtures.Business model: vendor certification ($999/release with a charter free tier for the first 3-5 customers) plus custom enterprise eval ($5K to $25K). OSS is free. Money comes from being the authoritative third party that ran the test, not from selling the harness.Why this didn t already exist: structural conflict. The companies that build eval tooling (Patronus, Braintrust, LangSmith) sell to the same AI app builders whose frameworks we d need to benchmark. Publishing this framework s agents collapse to wrong answers antagonizes their customer base.Built in under 3 weeks. Solo founder plus AI agents wrote and adversarially-audited most of the code. If the methodology is wrong, the open source means you find out faster than I could hide it.Methodology: https://przm.sh/methodology#convergence Leaderboard: https://przm.sh/leaderboard Verify: https://przm.sh/verify Repo: https://github.com/OneNomad-LLC/przm-bench (Apache-2.0)

przm mini correct https answer baseline

Related Articles