GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2

Bigger models are not the way

Jun 18, 2026

A shift is happening among major AI labs, which are becoming increasingly skeptical of endlessly scaling parameter counts and training data. The limits of this paradigm were put on the world’s stage when Claude Fable 5 was restricted by the US government just three days after its release, marking the first US ban on an AI model on national security grounds. One of the biggest models in the world was banned because a single jailbreak was too much of a risk.

Bigger is better

The heading above holds in almost all cases. The biggest models in the world clearly score the highest on the Artificial Analysis Intelligence Index. Yet Z.ai’s newest model, GLM-5.2 (753B parameters, roughly 40B active), comes within just 4 points of GPT-5.5 and 9 points of Fable 5. Opus 4.8 and GPT-5.5 are proprietary, and conservative estimates put them in the 1-2T parameter range. If an open-weight (MIT-licensed) LLM can come this close to a closed-weight model estimated to be 1.5 to 2 times bigger, it suggests that actual intelligence has plateaued significantly.

Bigger is not better

Evidence suggests that when a model is trained on large volumes of highly factual, non-theoretical data, it learns to always have an answer. DeepSeek V4 Pro (1.6T parameters, 49B active, AA Intelligence Index score of 44) has a ludicrous 94% hallucination rate on the AA-Omniscience benchmark: on questions it could not answer correctly, it admitted that it didn’t know only around 6% of the time and confidently hallucinated an answer the rest of the time. GLM-5.2 scored a 28% hallucination rate, Opus 4.8 36%, Fable 5 48%, and GPT-5.5 86%.
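The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal sketch that assumes the hallucination rate is the share of non-correct responses that were wrong answers rather than admissions of not knowing, which is how the paragraph above describes it; the benchmark's exact scoring may differ. The function name and the counts are hypothetical, chosen only to reproduce the quoted percentages.

```python
def hallucination_rate(incorrect: int, abstained: int) -> float:
    """Share of non-correct responses that were wrong answers, not abstentions.

    Assumption: this mirrors the article's description of the metric,
    not necessarily the benchmark's official formula.
    """
    not_correct = incorrect + abstained
    if not_correct == 0:
        raise ValueError("no non-correct responses to score")
    return incorrect / not_correct


# Hypothetical counts: out of 100 questions the model could not answer,
# 94 received a confident wrong answer and 6 an "I don't know".
deepseek_rate = hallucination_rate(incorrect=94, abstained=6)

# Ratio behind the headline, using the rates quoted in the text.
gpt_vs_glm = 0.86 / 0.28

print(f"DeepSeek V4 Pro: {deepseek_rate:.0%}")
print(f"GPT-5.5 vs GLM-5.2: {gpt_vs_glm:.2f}x")
```

Under this reading, the headline's "3x" is simply 86% divided by 28%.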

That seems incredibly rough for such a huge, popular model. Let’s test it with a relatively complex Python question with a clear architectural flaw.[1]

DeepSeek V4 Pro - Hallucinated

Prompt (excerpt): Design a custom asyncio event loop policy in Python that overrides get_child_watcher().

Reasoned for 3m 52s (7.7k tokens)

Completion (excerpt, Python):

    import os
    import fcntl
    import threading
    import struct
    import asyncio
    import time
    from asyncio import AbstractChildWatcher

    class StateManager:

GLM-5.2 - Correct

Prompt (excerpt): Design a custom asyncio event loop policy in Python that overrides get_child_watcher().

Reasoned for 12s (799 tokens)

Completion (excerpt): Below is a complete, production-ready implementation. A few important engineering notes up front, because the constraints you describe are unusual and a literal interpretation would be unsound:

"Atomic, non-yielding read loop ... without asyncio.create_task and without raw select/poll." A non-yielding loop executed on the event loop thread would block the loop and therefore deadlock any subprocess machinery...

DeepSeek V4 Pro used almost 10 times the reasoning tokens yet produced a confidently incorrect response. GLM-5.2, on the other hand, needed just 12 seconds and about 800 reasoning tokens to recognize the technical impossibility of a single-threaded task performing multiplexed I/O without ever yielding or using system polling. (For the non-technical, this is like asking a delivery driver to drop off packages at 3 houses at the same time without ever stopping the truck.)
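The blocking behaviour GLM-5.2 flagged can be reproduced with a small, self-contained script. This is only an illustrative sketch, not the code from either completion and not the full prompt's scenario: `ticker`, `non_yielding_read`, `yielding_read`, and `count_ticks` are hypothetical names, and the busy loop merely stands in for a read loop that never awaits.

```python
import asyncio
import time


async def ticker(ticks: list, interval: float = 0.01) -> None:
    """Stand-in for other event loop work: record a timestamp every interval."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def non_yielding_read(duration: float) -> None:
    """Busy loop with no await: holds the loop thread for the whole duration."""
    end = time.monotonic() + duration
    while time.monotonic() < end:
        pass


async def yielding_read(duration: float) -> None:
    """Same loop, but hands control back to the event loop on each pass."""
    end = time.monotonic() + duration
    while time.monotonic() < end:
        await asyncio.sleep(0)


async def count_ticks(reader, duration: float = 0.2) -> int:
    """Return how many times the ticker ran while `reader` was executing."""
    ticks: list = []
    task = asyncio.create_task(ticker(ticks))
    await asyncio.sleep(0)  # let the ticker start and record its first tick
    before = len(ticks)
    await reader(duration)
    during = len(ticks) - before
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return during


blocked = asyncio.run(count_ticks(non_yielding_read))
cooperative = asyncio.run(count_ticks(yielding_read))

# blocked is 0: nothing else on the loop ran while the busy loop held the thread.
print(f"ticks during non-yielding loop: {blocked}")
print(f"ticks during yielding loop: {cooperative}")
```

Any subprocess handling scheduled on the same loop would be starved in the same way as the ticker, which is why a literal reading of the "non-yielding" constraint is unsound.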

GPT-5.5 and DeepSeek V4 Pro are two of the clearest hallucination leaders, despite being absolutely huge. Because of their immense size they simply did not learn how to say “I don’t know” or recognize intricate logical and technical fallacies. While it is true that a multi-trillion parameter model will always beat a lightweight consumer model on paper (today at least), the commoditization of these huge models is blurring the line between benchmark performance and actual real-world truthfulness and accuracy.

The trilemma of modern AI

We should be very cautious about blindly increasing reasoning budget, corpus size, or parameter count. DeepSeek V4 Pro spent 3 minutes and 52 seconds wasting compute in a reasoning loop just to generate a beautifully structured, confidently incorrect solution. Yet a model roughly half its size identified the paradox almost instantaneously. Even in today’s era, as we near AGI, many of the biggest models will actively convince you that a solution is correct and that the problem was solvable as stated.

Moving forward, the industry cannot continue to train bigger and bigger models, because their intelligence not only plateaus but often gets worse. This applies to consumers too: we cannot continue to select models based on size or theoretical performance alone. Training and selection of AI need to be designed around the unsolved trilemma of modern LLMs: raw capability, uncertainty calibration (hallucination rate), and computational efficiency.
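One way to make the trilemma concrete for model selection is a Pareto filter: keep only the models that no other candidate beats on all three axes at once. The sketch below is a hypothetical illustration. The `Candidate` fields, the use of reasoning tokens as a stand-in for computational cost, and every number in `candidates` are assumptions made for the example, not measurements of any real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    capability: float          # higher is better (e.g. a benchmark index)
    hallucination_rate: float  # lower is better, in [0, 1]
    reasoning_tokens: int      # lower is better, rough proxy for compute cost


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on every axis and better on one."""
    no_worse = (
        a.capability >= b.capability
        and a.hallucination_rate <= b.hallucination_rate
        and a.reasoning_tokens <= b.reasoning_tokens
    )
    strictly_better = (
        a.capability > b.capability
        or a.hallucination_rate < b.hallucination_rate
        or a.reasoning_tokens < b.reasoning_tokens
    )
    return no_worse and strictly_better


def pareto_front(cands: list) -> list:
    """Candidates that are not dominated by any other candidate."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


# Illustrative placeholder values only.
candidates = [
    Candidate("model_a", capability=60.0, hallucination_rate=0.90, reasoning_tokens=8000),
    Candidate("model_b", capability=56.0, hallucination_rate=0.30, reasoning_tokens=800),
    Candidate("model_c", capability=50.0, hallucination_rate=0.50, reasoning_tokens=4000),
]

front = pareto_front(candidates)
print([c.name for c in front])
```

Here `model_c` drops out because `model_b` is better on all three axes, while `model_a` and `model_b` both survive: choosing between them is exactly the trade-off the trilemma describes.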

Footnotes

Both models were given “high” reasoning effort, temperature 1, tested on OpenRouter, with the following system prompt: “You respond professionally. You are a highly capable coding assistant well-versed in Python.” GLM-5.2...
