A Plea to the Labs: Let the Models Diagnose. – Tangent
10 Jun, 2026
When Anthropic released Fable 5, I excitedly went and tried a recent ECG case I had found that sits nicely at the edge of medical LLM capabilities. GPT-5.5 and Opus 4.8 can solve it consistently, but only in their respective harnesses, as they need to zoom in on the relevant parts of the ECG, and even then they only get it right if you include the patient history. This is interesting because it is possible (but very hard) to solve the case from the ECG alone; in fact, one of the most senior cardiologists in my hospital managed to do so when I curbsided him with it. In summary, it is a hard case that nicely demonstrates the interplay between LLM vision capabilities, harnesses, and medical reasoning.
So it is immensely disappointing to find that it is outright impossible to get Fable to return a response, not just on ECG cases, but on any medical case where you ask it to make a diagnosis. If you pretend to be a patient looking for general medical advice it does actually make it through the guardrails, but even then the guardrails are on a hair-trigger. The moment there is any hint that you are explicitly looking for a diagnosis, or that the model gets the information it needs to actually make one, you are talking to Opus 4.8 instead.
This is not an isolated event, but the ultimate endpoint of a long and, in my opinion, misguided paradigm which the labs have been pushed into by the press, by liability worries, and by some medical professionals. Basically, the paradigm is this: "LLMs shouldn't make diagnoses, because they might be wrong." On the surface this might sound reasonable, but it is not. I would argue that it is profoundly unethical, and represents the worst of a paternalistic attitude that the medical profession has tried to move away from for decades. The Fable guardrails highlight exactly how misguided it is. "Oh, you want Fable to make a potentially lifesaving diagnosis? I'm afraid we can't have any of that, Fable just might do too good a job, so here is an attempt by a stupider model instead!"
The reality of LLMs for medical diagnosis is this: the models are good. They are really, really good. Unfortunately, the literature hasn't caught up with how good they are yet. Most publications and news articles claiming they aren't good enough test what are by now ancient models. Take this one that got some press attention: it used GPT-4o, Llama 3, and Command R+ (for some reason). And even then, the bad results they get are usually the product of bad prompting. The cited study is itself an example of that: even those models correctly identified the relevant conditions in 94.9% of cases, they just didn't do so robustly when prompted badly. There are unfortunately no truly up-to-date benchmarks of medical performance beyond OpenAI's HealthBench (on which Fable seemed to score no better than Opus 4.8; I suspect it's saturated due to bad design), which only measures what a bunch of doctors thought the "right" way to think about a case was.
But in medicine, the only thing that truly matters is making the correct diagnosis, or, failing that, taking the correct steps to unveil it. There is no opinion to it; reality is the way it is and that's that, figure it out or die. And when you test LLMs on actual cases, models have long outperformed MDs. The benchmark I linked used contemporary cases to try to limit data contamination, and as far as I can tell the models are just as good on genuinely fresh case reports. Just to illustrate my point, this benchmark is also the most up-to-date case-based benchmark I am aware of testing this issue, and the most recent frontier model it tested was o3 (alongside Gemini 2.5 Pro).
All of this does not mean that LLMs outperform an entire hospital system, which is what models are truly competing with, but it does mean that they are at least good enough to have an opinion on the matter. And to be fair, there are particular areas of medicine that seem to be slightly out of distribution. The example I used to use was a private image of a pathological blood smear, taken with a smartphone through a microscope and shared in the department as a puzzle case. It used to stump GPT-5.2 (which consistently misinterpreted it as CML when it was AML, an extremely consequential difference) but was solved by Gemini and Claude, which I thought was neat. These days, however, all the SOTA models (besides Fable, which of course refuses to answer) present very reasonable top-3 differentials when asked from their respective harnesses. So honestly, I don't even know what's out of distribution anymore, though surely there has to be something.
So, the models are good. They are so good that finding cases they can't solve correctly is starting to become difficult. And besides the potential consequences for my future job security, this is an amazingly positive development! Medicine is not about employing...