When AI Is Your Pastor: Benchmark for Theological Triage and Pastoral Guidance


fideai.org/research/fmg-bench
v1 · standalone benchmark release

When AI Is Your Pastor: A Benchmark for LLM Theological Triage and Pastoral Guidance

Introducing FMG-Bench, the Faith & Moral Guidance Benchmark, for evaluating large language model behavior in theological triage, moral guidance, and pastoral-adjacent contexts.

Alex Chao · Fide AI · 2026

Release status: the research companion page remains on fideai.org, while benchmark code, dataset files, result summaries, and paper artifacts are maintained in the standalone FMG-Bench repository and dataset page.

Current status: Public benchmark package
FMG-Bench v1 is maintained as a standalone benchmark repository with code, dataset files, result summaries, paper artifacts, release caveats, and citation metadata.

Dataset: Open dataset benchmark
The Hugging Face dataset contains the open v1 benchmark corpus: 120 base scenarios with 37 perturbation variants for lightweight inspection and reuse.

Repository boundary: Fide AI site, external benchmark repo
This page explains the research. The standalone FMG-Bench repo is the source of truth for implementation, data, reproducibility instructions, and paper source.

Evaluation artifact: Inspectable public release
The public package separates research claims, benchmark data, scoring code, result summaries, reproduction notes, and interpretation limits so readers can inspect what was tested and what should not be inferred.

Abstract

People increasingly ask large language models for counsel on questions of faith, doctrine, and pastoral care. These questions are not ordinary information requests: some ask about core Christian beliefs, some ask about real disagreement among faithful traditions, some require humility, and some are pastoral situations where safety and human referral matter more than theological completeness. We introduce FMG-Bench, the Faith & Moral Guidance Benchmark, a 120-scenario benchmark for theological triage and pastoral guidance in English-language Christian contexts. FMG-Bench v1 evaluates 14 advanced models across 8,792 scored responses, comparing raw model behavior with three guided instruction settings. Placing models inside a structured harness improves over raw model behavior by +3.96 points on average, with all 14 models improving. The largest domain gain is pastoral application (+6.62), and the most safety-critical gain is escalation appropriateness (+10.8), measuring whether systems recognize when pastoral, clinical, legal, emergency, or community support is needed. The guided settings also improve robustness (92.88 → 98.02 stability). Perspective comparison helps secondary doctrine but can be counterproductive when applied to primary doctrine or urgent pastoral situations. The benchmark is a measurement tool, not an endorsement of AI systems as pastoral authorities.

Key findings

System layers make a measurable difference.

+3.96 pts · Average improvement: Guided default vs. raw model across all 14 models. Every model improved.

+6.62 pts · Pastoral application: Largest gains where safety, referral, and care boundaries matter most.

+7.36 pts · Embodiment / escalation: Guided system dramatically improves appropriate pastoral escalation behavior.

98.02% · Robustness stability: Up from 92.88% raw. Guidance dramatically reduces variance under prompt perturbation.

Guided improvement by triage level

| Category | Focus | Raw | Guided | Pref | Compare | Delta |
| --- | --- | --- | --- | --- | --- | --- |
| Primary Doctrine | Creedal and gospel-boundary faithfulness | 84.5 | 88.0 | 88.1 | 84.8 | +3.51 |
| Secondary Doctrine | Tradition-specific claims and honest disagreement | 88.7 | 91.3 | 91.8 | 90.9 | +2.64 |
| Tertiary Doctrine | Prudential questions and epistemic humility | 90.1 | 91.7 | 91.0 | 91.0 | +1.62 |
| Pastoral Application | Safety, referral, and pastoral boundary judgment | 85.7 | 92.3 | 91.5 | 88.6 | +6.62 |
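The Delta values in the table above appear to be the guided score minus the raw score for each triage level. The following is a minimal sketch of that arithmetic, not the benchmark's own scoring code: the dictionary layout, the `guided_delta` helper, and the 0.05 tolerance are assumptions made for illustration. The displayed scores are rounded to one decimal place, so recomputed deltas are only expected to agree with the reported ones to within rounding.

```python
# Scores transcribed from the "Guided improvement by triage level" table (0-100 scale).
# The structure and names below are illustrative, not taken from the FMG-Bench repo.
SCORES = {
    "Primary Doctrine": {"raw": 84.5, "guided": 88.0, "reported_delta": 3.51},
    "Secondary Doctrine": {"raw": 88.7, "guided": 91.3, "reported_delta": 2.64},
    "Tertiary Doctrine": {"raw": 90.1, "guided": 91.7, "reported_delta": 1.62},
    "Pastoral Application": {"raw": 85.7, "guided": 92.3, "reported_delta": 6.62},
}

# Assumed tolerance: each displayed score is rounded to 0.1, so a difference of
# two displayed scores can be off from the unrounded delta by a few hundredths.
TOLERANCE = 0.05


def guided_delta(row: dict) -> float:
    """Return guided minus raw, rounded to two decimals."""
    return round(row["guided"] - row["raw"], 2)


def largest_gain(scores: dict) -> str:
    """Return the triage level with the largest recomputed guided-vs-raw gain."""
    return max(scores, key=lambda level: guided_delta(scores[level]))


if __name__ == "__main__":
    for level, row in SCORES.items():
        delta = guided_delta(row)
        consistent = abs(delta - row["reported_delta"]) <= TOLERANCE
        print(f"{level}: recomputed {delta:+.2f}, reported {row['reported_delta']:+.2f}, consistent={consistent}")
    print("Largest gain:", largest_gain(SCORES))
```

Under these assumptions, every recomputed delta falls within the tolerance of the reported value, and Pastoral Application is the level with the largest gain, matching the abstract.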

Model explorer

14 frontier models across 4 system conditions: Guided Default, Raw Model, Preference Configured, and Perspective Compare. Every model improved under the guided default condition.

Scores are averaged across all scenarios and triage levels. Human calibration remains an active validation step. Higher is better (0–100 scale).

Triage framework

Four levels of theological question require four different postures. The central question is not “did the model answer correctly?” but “did the model respond in the right kind of way for the kind of issue at stake?”

Triage Levels

Primary Doctrine Level 1 · 25 scenarios

Secondary Doctrine Level 2 · 35 scenarios

Tertiary Doctrine Level 3 · 30 scenarios

Pastoral Application Level 4 · 30 scenarios
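The four triage levels and their scenario counts can be represented as a small lookup structure. This is a hedged sketch for readers who want to work with the level breakdown programmatically: the `TriageLevel` class and its field names are hypothetical and are not taken from the FMG-Bench repository or dataset schema. It also checks that the per-level counts add up to the 120 base scenarios stated in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageLevel:
    """Hypothetical record for one triage level; field names are illustrative."""
    level: int
    name: str
    base_scenarios: int


# Counts as listed on this page.
TRIAGE_LEVELS = (
    TriageLevel(1, "Primary Doctrine", 25),
    TriageLevel(2, "Secondary Doctrine", 35),
    TriageLevel(3, "Tertiary Doctrine", 30),
    TriageLevel(4, "Pastoral Application", 30),
)


def total_base_scenarios(levels) -> int:
    """Sum the base scenario counts across all triage levels."""
    return sum(lvl.base_scenarios for lvl in levels)


def share_of_corpus(lvl: TriageLevel, levels) -> float:
    """Fraction of base scenarios that belong to the given level."""
    return lvl.base_scenarios / total_base_scenarios(levels)


if __name__ == "__main__":
    total = total_base_scenarios(TRIAGE_LEVELS)
    print("Total base scenarios:", total)
    for lvl in TRIAGE_LEVELS:
        print(f"Level {lvl.level} ({lvl.name}): {share_of_corpus(lvl, TRIAGE_LEVELS):.1%}")
```

The levels are not equally sized, so a plain average over the four level scores would not generally equal an average over all scenarios.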

Primary Doctrine (Level 1 · 25 base scenarios)

Core creedal commitments of historic Christianity. These are not matters of opinion; they define orthodoxy. A response that treats a primary doctrine as...
