When AI Is Your Pastor | Fide AI Research
fideai.org/research/fmg-bench<br>v1 · standalone benchmark release<br>When AI Is Your Pastor: A Benchmark for LLM Theological Triage and Pastoral Guidance<br>Introducing FMG-Bench, the Faith & Moral Guidance Benchmark, for evaluating large language model behavior in theological triage, moral guidance, and pastoral-adjacent contexts.<br>Alex Chao · Fide AI · 2026<br>Release status: the research companion page remains on fideai.org, while benchmark code, dataset files, result summaries, and paper artifacts are maintained in the standalone FMG-Bench repository and dataset page.
Current status<br>Public benchmark package<br>FMG-Bench v1 is maintained as a standalone benchmark repository with code, dataset files, result summaries, paper artifacts, release caveats, and citation metadata.
Dataset<br>Open dataset benchmark<br>The Hugging Face dataset contains the open v1 benchmark corpus: 120 base scenarios with 37 perturbation variants for lightweight inspection and reuse.
Repository boundary<br>Fide AI site, external benchmark repo<br>This page explains the research. The standalone FMG-Bench repo is the source of truth for implementation, data, reproducibility instructions, and paper source.
Evaluation artifact<br>Inspectable public release<br>The public package separates research claims, benchmark data, scoring code, result summaries, reproduction notes, and interpretation limits so readers can inspect what was tested and what should not be inferred.
Abstract<br>People increasingly ask large language models for counsel on questions of faith, doctrine, and pastoral care. These questions are not ordinary information requests: some ask about core Christian beliefs, some ask about real disagreement among faithful traditions, some require humility, and some are pastoral situations where safety and human referral matter more than theological completeness. We introduce FMG-Bench, the Faith & Moral Guidance Benchmark, a 120-scenario benchmark for theological triage and pastoral guidance in English-language Christian contexts.<br>FMG-Bench v1 evaluates 14 advanced models across 8,792 scored responses, comparing raw model behavior with three guided instruction settings. Placing models inside a structured harness improves over raw model behavior by +3.96 points on average, with all 14 models improving.<br>The largest domain gain is pastoral application (+6.62), and the most safety-critical gain is escalation appropriateness (+10.8), measuring whether systems recognize when pastoral, clinical, legal, emergency, or community support is needed. The guided settings also improve robustness (92.88 → 98.02 stability). Perspective comparison helps secondary doctrine but can be counterproductive when applied to primary doctrine or urgent pastoral situations.<br>The benchmark is a measurement tool, not an endorsement of AI systems as pastoral authorities.
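The response count in the abstract can be sanity-checked against the other figures on this page. The sketch below assumes a full crossing of models, conditions, and prompts: each of the 14 models answers all 120 base scenarios and all 37 perturbation variants under the raw setting and under each of the three guided settings. The page does not state this design explicitly, and the function and constant names are illustrative, so treat it as a plausibility check, not a description of the actual pipeline.

```python
# Hypothetical sanity check; names are illustrative and not taken from the FMG-Bench repo.
MODELS = 14
CONDITIONS = 1 + 3            # raw model + three guided instruction settings
BASE_SCENARIOS = 120
PERTURBATION_VARIANTS = 37


def expected_responses(models: int, conditions: int, base: int, variants: int) -> int:
    """Scored responses under an ASSUMED full crossing of models x conditions x prompts."""
    return models * conditions * (base + variants)


total = expected_responses(MODELS, CONDITIONS, BASE_SCENARIOS, PERTURBATION_VARIANTS)
print(total)  # 8792
```

Under that assumption the product matches the 8,792 scored responses reported above.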
Key findings<br>System layers make a measurable difference.
+3.96 pts<br>Average improvement<br>Guided default vs. raw model across all 14 models. Every model improved.
+6.62 pts<br>Pastoral application<br>Largest gains where safety, referral, and care boundaries matter most.
+7.36 pts<br>Embodiment / escalation<br>Guided system dramatically improves appropriate pastoral escalation behavior.
98.02%<br>Robustness stability<br>Up from 92.88% raw. Guidance dramatically reduces variance under prompt perturbation.
Guided improvement by triage level

| Category | Raw | Guided | Pref | Compare | Delta |
| --- | --- | --- | --- | --- | --- |
| Primary Doctrine (creedal and gospel-boundary faithfulness) | 84.5 | 88.0 | 88.1 | 84.8 | +3.51 |
| Secondary Doctrine (tradition-specific claims and honest disagreement) | 88.7 | 91.3 | 91.8 | 90.9 | +2.64 |
| Tertiary Doctrine (prudential questions and epistemic humility) | 90.1 | 91.7 | 91.0 | 91.0 | +1.62 |
| Pastoral Application (safety, referral, and pastoral boundary judgment) | 85.7 | 92.3 | 91.5 | 88.6 | +6.62 |

Pref = Preference Configured; Compare = Perspective Compare. Delta is the Guided score minus the Raw score.
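The per-level gains can be recomputed from the triage-level scores. The sketch below transcribes the published one-decimal scores and subtracts the raw score from each condition; the dictionary layout and the `delta` helper are illustrative assumptions, not part of the benchmark code. Because the inputs are already rounded, the recomputed values agree with the published deltas only to one decimal place.

```python
# Scores transcribed from the triage-level figures on this page (0-100 scale).
# The data layout and helper name are illustrative, not from the FMG-Bench repo.
SCORES = {
    "Primary Doctrine":     {"raw": 84.5, "guided": 88.0, "pref": 88.1, "compare": 84.8},
    "Secondary Doctrine":   {"raw": 88.7, "guided": 91.3, "pref": 91.8, "compare": 90.9},
    "Tertiary Doctrine":    {"raw": 90.1, "guided": 91.7, "pref": 91.0, "compare": 91.0},
    "Pastoral Application": {"raw": 85.7, "guided": 92.3, "pref": 91.5, "compare": 88.6},
}


def delta(level: str, condition: str = "guided") -> float:
    """Gain of a condition over the raw model, rounded to one decimal place."""
    row = SCORES[level]
    return round(row[condition] - row["raw"], 1)


guided_deltas = {level: delta(level) for level in SCORES}
compare_deltas = {level: delta(level, "compare") for level in SCORES}
largest_guided_gain = max(guided_deltas, key=guided_deltas.get)

for level in SCORES:
    print(f"{level}: guided +{guided_deltas[level]}, compare +{compare_deltas[level]}")
```

On these figures the guided default gains most on Pastoral Application, while Perspective Compare gains least on Primary Doctrine, which is in line with the abstract's caution about applying perspective comparison to primary doctrine.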
Model explorer<br>14 frontier models across 4 system conditions<br>Toggle conditions on and off to see how system layers change model behavior. Every model improved under the guided default condition.
Conditions: Guided Default · Raw Model · Preference Configured · Perspective Compare
Scores are averaged across all scenarios and triage levels. Human calibration remains an active validation step. Higher is better (0–100 scale).
Triage framework<br>Four levels of theological question require four different postures.<br>The central question is not “did the model answer correctly?” but “did the model respond in the right kind of way for the kind of issue at stake?”
Triage Levels
Primary Doctrine<br>Level 1 · 25 scenarios
Secondary Doctrine<br>Level 2 · 35 scenarios
Tertiary Doctrine<br>Level 3 · 30 scenarios
Pastoral Application<br>Level 4 · 30 scenarios
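One way to represent the four triage levels in analysis code is a small enumeration paired with the per-level scenario counts listed above. This is a minimal sketch: the `TriageLevel` enum and `SCENARIO_COUNTS` mapping are hypothetical names chosen for illustration and are not taken from the FMG-Bench repository.

```python
from enum import Enum


class TriageLevel(Enum):
    """Hypothetical encoding of the four triage levels described on this page."""
    PRIMARY_DOCTRINE = 1
    SECONDARY_DOCTRINE = 2
    TERTIARY_DOCTRINE = 3
    PASTORAL_APPLICATION = 4


# Base scenario counts per level, as listed on this page.
SCENARIO_COUNTS = {
    TriageLevel.PRIMARY_DOCTRINE: 25,
    TriageLevel.SECONDARY_DOCTRINE: 35,
    TriageLevel.TERTIARY_DOCTRINE: 30,
    TriageLevel.PASTORAL_APPLICATION: 30,
}

total_scenarios = sum(SCENARIO_COUNTS.values())
print(total_scenarios)  # 120
```

The counts sum to the 120 base scenarios reported for the v1 corpus.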
Level 1 · 25 base scenarios<br>Primary Doctrine<br>Core creedal commitments of historic Christianity. These are not matters of opinion; they define orthodoxy. A response that treats a primary doctrine as...