Confidence estimation is a better metric than agreement for LLM judges

rapiddev | 2 pts | 0 comments

[2604.20972] Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

arXiv:2604.20972 (cs)

[Submitted on 22 Apr 2026]

Title: Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

Authors: Michael O'Herlihy, Rosa Català

Abstract: Content moderation systems are typically evaluated by measuring agreement with human labels. In rule-governed environments this assumption fails: multiple decisions may be logically consistent with the governing policy, and agreement metrics penalize valid decisions while mischaracterizing ambiguity as error -- a failure mode we term the Agreement Trap. We formalize evaluation as policy-grounded correctness and introduce the Defensibility Index (DI) and Ambiguity Index (AI). To estimate reasoning stability without additional audit passes, we introduce the Probabilistic Defensibility Signal (PDS), derived from audit-model token logprobs. We harness LLM reasoning traces as a governance signal rather than a classification output by deploying the audit model not to decide whether content violates policy, but to verify whether a proposed decision is logically derivable from the governing rule hierarchy. We validate the framework on 193,000+ Reddit moderation decisions across multiple communities and evaluation cohorts, finding a 33-46.6 percentage-point gap between agreement-based and policy-grounded metrics, with 79.8-80.6% of the model's false negatives corresponding to policy-grounded decisions rather than true errors. We further show that measured ambiguity is driven by rule specificity: auditing 37,286 identical decisions under three tiers of the same community rules reduces AI by 10.8 pp while DI remains stable. Repeated-sampling analysis attributes PDS variance primarily to governance ambiguity rather than decoding noise. A Governance Gate built on these signals achieves 78.6% automation coverage with 64.9% risk reduction. Together, these results show that evaluation in rule-governed environments should shift from agreement with historical labels to reasoning-grounded validity under explicit rules.
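The distinction the abstract draws between agreement with labels and policy-grounded validity can be illustrated with a small Python sketch. Everything below is an assumption made for illustration: the abstract does not give formulas for DI, AI, or PDS, so `AuditedDecision`, `defensible_rate`, and `verdict_confidence` are hypothetical names, the data is invented, and the two-way softmax is only one plausible way to turn verdict-token logprobs into a confidence value. It is not the authors' method.

```python
import math
from dataclasses import dataclass


@dataclass
class AuditedDecision:
    """One moderation decision plus a hypothetical audit result (illustrative schema)."""
    model_decision: str  # e.g. "remove" or "approve"
    human_label: str     # historical human moderator label
    defensible: bool     # assumed audit verdict: is the decision derivable from the rules?


def agreement_rate(decisions):
    """Fraction of decisions that match the historical human label."""
    return sum(d.model_decision == d.human_label for d in decisions) / len(decisions)


def defensible_rate(decisions):
    """Fraction of decisions the audit judged derivable from the rules.

    A stand-in for a policy-grounded metric; NOT the paper's DI formula.
    """
    return sum(d.defensible for d in decisions) / len(decisions)


def verdict_confidence(logprob_yes, logprob_no):
    """Two-way softmax over the audit model's verdict-token logprobs.

    An assumed way to map logprobs to a confidence in [0, 1];
    the paper's PDS may be defined differently.
    """
    top = max(logprob_yes, logprob_no)  # subtract the max for numerical stability
    e_yes = math.exp(logprob_yes - top)
    e_no = math.exp(logprob_no - top)
    return e_yes / (e_yes + e_no)


# Toy data, invented for illustration only.
toy = [
    AuditedDecision("remove", "remove", True),
    AuditedDecision("approve", "remove", True),   # disagrees with the label, yet defensible
    AuditedDecision("approve", "approve", True),
    AuditedDecision("remove", "approve", False),  # disagrees and is not defensible
]

gap_pp = (defensible_rate(toy) - agreement_rate(toy)) * 100
print(f"agreement={agreement_rate(toy):.2f} defensible={defensible_rate(toy):.2f} gap={gap_pp:.1f} pp")
print(f"confidence={verdict_confidence(math.log(0.9), math.log(0.1)):.2f}")
```

The second toy row is the case the abstract highlights: a decision that an agreement metric would count as a false negative, but that an audit could still judge consistent with the rules.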

Comments: 22 pages, 10 figures, preprint. Research on Defensibility Index (DI), Ambiguity Index (AI), and Probabilistic Defensibility Signal (PDS) for policy-grounded evaluation of rule-governed AI in content moderation (Reddit production data)

Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Cite as: arXiv:2604.20972 [cs.AI] (or arXiv:2604.20972v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2604.20972

arXiv-issued DOI via DataCite

Submission history
From: Michael O'Herlihy
[v1] Wed, 22 Apr 2026 18:05:29 UTC (2,100 KB)
