Hallucination Detection Comparison


Intro

Hallucinations are far from a solved issue. In our recent pharmaceutical RAG benchmark, we found that 24 to 65% of LLM responses contained hallucinations. Companies deploying AI applications need a way to systematically detect them in order to reduce the amount of ungrounded or false information produced by their systems. But which approach actually works?

We tested seven hallucination detection tools on PlaceboBench: two open-source frameworks (MiniCheck by Bespoke Labs, RAGAS), four proprietary cloud APIs (Azure Groundedness Detection, Google Cloud Check Grounding, AWS Bedrock Guardrails, Vectara HHEM), and our own tool, Blue Guardrails. Six tools achieved message-level accuracies between 53.6% and 62.3%, marginally better than guessing. Blue Guardrails reached 94.4%.

At the more granular claim level, the three other claim-level tools (Azure Groundedness Detection, Google Cloud Check Grounding, MiniCheck) reached F1 scores of 22% to 24%. Blue Guardrails reached 92.3% F1. The remaining three tools (Vectara, RAGAS, AWS Bedrock Guardrails) operate only at the message level and cannot be evaluated on claim-level hallucination detection.

What we tested

To compare the hallucination detection tools, we ran them on PlaceboBench, a pharmaceutical hallucination benchmark built from 69 questions that healthcare professionals submitted to drug information centers, paired with official regulatory documents from the European Medicines Agency. Seven state-of-the-art LLMs each generated a response to every question, resulting in a total of 483 data points (69 questions × 7 models). Humans annotated the hallucinations and reviewed the annotations.

We measured two things: whether each tool correctly identified which responses contained hallucinations (accuracy at the message level, all tools) and, for claim-level tools only, how precisely it located the hallucinated text within the response (F1 at the claim level).
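The two metrics can be sketched as follows. This is a minimal illustration, not the benchmark's actual scoring code: the article does not say how predicted spans are matched to annotated spans, so the character-overlap rule used in `span_f1` is an assumption, as are all function names.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def message_accuracy(predicted: List[bool], actual: List[bool]) -> float:
    """Fraction of responses whose hallucinated / not-hallucinated verdict is correct."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one response")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def _covered_chars(spans: List[Span]) -> Set[int]:
    """Expand spans into the set of character positions they cover."""
    covered: Set[int] = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered


def span_f1(predicted: List[Span], annotated: List[Span]) -> float:
    """Character-overlap F1 for one response (assumed matching rule)."""
    pred = _covered_chars(predicted)
    gold = _covered_chars(annotated)
    if not pred and not gold:
        return 1.0  # nothing annotated and nothing flagged
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this assumed rule, a predicted span that covers only half of an annotated span, and extends equally far beyond it, scores 0.5 rather than counting as a full hit or a full miss.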

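The tools tested were:

Claim-level: Azure Groundedness Detection, Google Cloud Check Grounding, MiniCheck, Blue Guardrails

Message-level: AWS Bedrock Guardrails, Vectara HHEM, RAGAS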

RAGAS uses an LLM-as-judge (we chose GPT-5.2) to detect hallucinations. MiniCheck and Azure Groundedness Detection use a fine-tuned Transformer model. Blue Guardrails uses an LLM-based verification agent. The two remaining cloud APIs (Google Cloud Check Grounding, AWS Bedrock Guardrails) don't disclose their internals.

Azure Groundedness Detection, Google Cloud Check Grounding, MiniCheck, and Blue Guardrails provide claim-level hallucination detection, meaning they identify the exact text spans within a response that are hallucinated. AWS Bedrock Guardrails, Vectara, and RAGAS operate at the message level only, providing a binary yes/no verdict or score for whether the entire response contains hallucinations, without pinpointing where.

Vectara and RAGAS return a continuous score for "consistency" and "faithfulness" respectively; AWS Bedrock Guardrails returns both a score and a binary blocked/not-blocked verdict; Azure Groundedness Detection, Google Cloud Check Grounding, and Blue Guardrails return spans with character offsets; MiniCheck works sentence-by-sentence and also returns spans.

We ran each tool against the same dataset (PlaceboBench). Every data sample consists of:

chunks of medical documents (the context)

a user query

an LLM-generated response

human-annotated hallucination spans marking exactly which parts of the response are not grounded in the source, and therefore considered hallucinated.
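One way to picture a data sample is as a small record type. The sketch below is hypothetical: the class name, field names, and the use of character offsets for annotations are assumptions, and deriving the message-level label from the presence of at least one annotated span is likewise an assumption rather than something the benchmark description states.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Sample:
    """Hypothetical container for one benchmark data point."""

    context_chunks: List[str]  # chunks of medical documents
    query: str                 # the user query
    response: str              # the LLM-generated response
    # Annotated (start, end) character offsets into `response`, end exclusive.
    hallucination_spans: List[Tuple[int, int]] = field(default_factory=list)

    @property
    def contains_hallucination(self) -> bool:
        """Assumed message-level label: True if any span is annotated."""
        return len(self.hallucination_spans) > 0

    def hallucinated_text(self) -> List[str]:
        """Return the annotated substrings of the response."""
        return [self.response[start:end] for start, end in self.hallucination_spans]


# Illustrative, made-up example (not taken from the dataset).
example = Sample(
    context_chunks=["Drug X: the maximum daily dose is 10 mg."],
    query="What is the maximum daily dose of Drug X?",
    response="The maximum dose is 20 mg.",
    hallucination_spans=[(20, 25)],
)
```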

For tools that return spans directly (Azure Groundedness Detection, Google Cloud Check Grounding, MiniCheck, and Blue Guardrails), we compared predicted spans against the human annotations. For tools that return a continuous score (Vectara, RAGAS) or a binary verdict plus score (AWS Bedrock Guardrails), we performed a "threshold sweep" to find the threshold that maximizes their F1.
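A threshold sweep of this kind can be sketched as below. The details are assumptions: the sketch treats scores as lying in [0, 1] with higher meaning more grounded (as the names "consistency" and "faithfulness" suggest), flags a response as hallucinated when its score falls below the threshold, computes F1 on the hallucinated class, and tries 101 evenly spaced thresholds. The function names are illustrative.

```python
from typing import List, Tuple


def f1_score(predicted: List[bool], actual: List[bool]) -> float:
    """F1 for the positive (hallucinated) class."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def sweep_threshold(
    scores: List[float], hallucinated: List[bool], steps: int = 101
) -> Tuple[float, float]:
    """Return (threshold, f1) for the threshold with the highest F1.

    A response is flagged as hallucinated when its score is strictly
    below the threshold. Ties keep the lowest threshold.
    """
    if len(scores) != len(hallucinated):
        raise ValueError("scores and labels must have the same length")
    best_threshold, best_f1 = 0.0, -1.0
    for i in range(steps):
        threshold = i / (steps - 1)
        predicted = [score < threshold for score in scores]
        f1 = f1_score(predicted, hallucinated)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1
```

Because the threshold is tuned on the same data it is scored on, the resulting number is a best case for the tool: a threshold fixed in advance could only do as well or worse.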

Azure Groundedness Detection comes with a practical constraint: responses with context longer than 55,000 characters had to be excluded because the API cannot handle contexts of this length. This led to a reduced test dataset of 294 samples (60.9% of the full dataset). The other tools ran on the full dataset.
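The exclusion step amounts to a simple length filter. In the sketch below, the 55,000-character limit comes from the text above; measuring context length as the summed length of all chunks, and the function names, are assumptions.

```python
from typing import List

MAX_CONTEXT_CHARS = 55_000  # limit stated in the article


def context_length(chunks: List[str]) -> int:
    """Total number of characters across all context chunks (assumed measure)."""
    return sum(len(chunk) for chunk in chunks)


def fits_limit(chunks: List[str], limit: int = MAX_CONTEXT_CHARS) -> bool:
    """True if the context is not longer than the limit."""
    return context_length(chunks) <= limit


def filter_contexts(contexts: List[List[str]]) -> List[List[str]]:
    """Keep only the contexts that fit within the limit."""
    return [chunks for chunks in contexts if fits_limit(chunks)]


# Share of the dataset that remained, using the counts reported above.
kept_share = 294 / 483
print(f"{kept_share:.1%}")  # 60.9%
```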

Results

Message-level accuracy

Hallucination detection accuracy at message level across all seven tools.

For the six tools other than Blue Guardrails, accuracy at the message level (correctly identifying whether a response contains a hallucination) ranged from 53.6% to 62.3%. AWS Bedrock Guardrails performed best at 62.3%, with RAGAS close behind at 62.0%. Vectara was weakest at 53.6%. Blue Guardrails reached 94.4%. To put this in perspective: a classifier that flags every response as hallucinated would achieve roughly 45% accuracy on our dataset, so the margin above this baseline is slim for most tools.
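The baseline figure follows from simple arithmetic: a classifier that always gives the same verdict is right exactly as often as that verdict is the true label. The sketch below illustrates this with made-up labels (9 of 20 hallucinated, chosen only to mirror the rough 45% share mentioned above); the function name is hypothetical.

```python
from typing import List


def constant_classifier_accuracy(hallucinated: List[bool], always_flag: bool) -> float:
    """Accuracy of a classifier that returns the same verdict for every response."""
    if not hallucinated:
        raise ValueError("need at least one label")
    correct = sum(label == always_flag for label in hallucinated)
    return correct / len(hallucinated)


# Made-up labels for illustration only: 9 hallucinated, 11 clean.
labels = [True] * 9 + [False] * 11

flag_everything = constant_classifier_accuracy(labels, always_flag=True)
flag_nothing = constant_classifier_accuracy(labels, always_flag=False)
```

The two constant classifiers always sum to 100% accuracy, so whichever one matches the more common label sets the higher of the two trivial baselines.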

Claim-level F1

Claim-level hallucination detection F1 scores for claim-level tools.

For the tools that operate at claim level (Azure Groundedness...
