Confidence estimation is a better metric than agreement for LLM judges

[2604.20972] Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI


arXiv:2604.20972 (cs)

[Submitted on 22 Apr 2026]

Title: Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

Authors: Michael O'Herlihy, Rosa Català

Abstract: Content moderation systems are typically evaluated by measuring agreement with human labels. In rule-governed environments this assumption fails: multiple decisions may be logically consistent with the governing policy, and agreement metrics penalize valid decisions while mischaracterizing ambiguity as error -- a failure mode we term the Agreement Trap. We formalize evaluation as policy-grounded correctness and introduce the Defensibility Index (DI) and Ambiguity Index (AI). To estimate reasoning stability without additional audit passes, we introduce the Probabilistic Defensibility Signal (PDS), derived from audit-model token logprobs. We harness LLM reasoning traces as a governance signal rather than a classification output by deploying the audit model not to decide whether content violates policy, but to verify whether a proposed decision is logically derivable from the governing rule hierarchy. We validate the framework on 193,000+ Reddit moderation decisions across multiple communities and evaluation cohorts, finding a 33-46.6 percentage-point gap between agreement-based and policy-grounded metrics, with 79.8-80.6% of the model's false negatives corresponding to policy-grounded decisions rather than true errors. We further show that measured ambiguity is driven by rule specificity: auditing 37,286 identical decisions under three tiers of the same community rules reduces AI by 10.8 pp while DI remains stable. Repeated-sampling analysis attributes PDS variance primarily to governance ambiguity rather than decoding noise. A Governance Gate built on these signals achieves 78.6% automation coverage with 64.9% risk reduction. Together, these results show that evaluation in rule-governed environments should shift from agreement with historical labels to reasoning-grounded validity under explicit rules.
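
The abstract says PDS is derived from audit-model token logprobs but does not give the formula, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes the audit model emits one of two verdict tokens (defensible or not defensible) and that the logprob of each is available; the function names `defensibility_signal` and `gate` and the 0.9 threshold are hypothetical.

```python
import math


def defensibility_signal(logprob_yes: float, logprob_no: float) -> float:
    """Renormalize two verdict-token logprobs into P(defensible).

    Assumption: the audit model's verdict is a single token, and the
    logprobs of the "defensible" and "not defensible" tokens are known.
    """
    # Two-way softmax, shifted by the max for numerical stability.
    m = max(logprob_yes, logprob_no)
    p_yes = math.exp(logprob_yes - m)
    p_no = math.exp(logprob_no - m)
    return p_yes / (p_yes + p_no)


def gate(signal: float, threshold: float = 0.9) -> str:
    """Hypothetical routing rule: automate only when the signal is decisive."""
    if signal >= threshold:
        return "automate"
    return "human_review"


if __name__ == "__main__":
    # Illustrative logprobs only, not data from the paper.
    s = defensibility_signal(-0.05, -3.0)
    print(round(s, 3), gate(s))
```

A signal near 0.5 means the audit model is split on whether the decision follows from the rules, which in this sketch is routed to human review rather than treated as an error.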

Comments: 22 pages, 10 figures, preprint. Research on Defensibility Index (DI), Ambiguity Index (AI), and Probabilistic Defensibility Signal (PDS) for policy-grounded evaluation of rule-governed AI in content moderation (Reddit production data)

Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Cite as: arXiv:2604.20972 [cs.AI]

(or arXiv:2604.20972v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2604.20972


arXiv-issued DOI via DataCite

Submission history: From Michael O'Herlihy. [v1] Wed, 22 Apr 2026 18:05:29 UTC (2,100 KB)



