The Model Is No Longer the Bottleneck | K-Dense Web<br>Skip to main content
This week Anthropic published a quiet result that should change how scientists think about AI. In Making Claude a chemist, they put a general-purpose model with no chemistry fine-tuning up against the dedicated software that chemists have relied on for decades, and it held its own, beating that software outright in several places.
Most people will take that as the headline, but we think it points somewhere more interesting. The real story is not that an AI can do chemistry, it is that the hardest part of the problem has moved. For years the limiting factor in scientific AI was the raw capability of the model itself, and that era is now ending, because what stands between today's models and real scientific work is no longer intelligence but the workflow built around it.
What the chemistry result actually showed
Figure 1. Forward prediction goes from structure to spectrum, the task dedicated software was built for. Inverse elucidation runs it backward, from spectrum to structure. A general model is now competitive at the first and capable at the second.
Anthropic tested three models against ChemDraw and MestReNova on NMR, one of the most common and most time-consuming analytical tasks in synthetic chemistry, and the results are worth stating plainly.
On forward prediction, taking a known structure and predicting where every hydrogen and carbon peak will fall, the best model (Opus 4.7) was the most accurate tool tested on hydrogen, with an average error of ±0.079 ppm, and was effectively tied with MestReNova on carbon. On the shape of the peaks, the splitting patterns and sub-peak spacing that carry structural information, all three models predicted the spacing to within half a hertz roughly 80% of the time, against 26 to 35% for the classical tools.
Then they ran the problem backwards. Given only a molecular formula and a 1D spectrum, could the model propose the structure that produced it? This is the inverse problem, the one existing software largely leaves to the human. The model recovered all eight of the simpler targets on every attempt from spectra and formula alone, and solved most of the harder fused-ring and spirocycle targets when given the starting material as a hint.
The point that matters here is the one Anthropic makes themselves: this is a general model with no chemistry-specific training, and it does from a pasted spectrum what used to require licensed, specialized, single-purpose tools.
The team is also refreshingly honest about the limits. The evaluation was small, with 20 compounds for the forward task and 15 for the inverse, two-dimensional experiments and stereochemistry were out of scope, and solvent coverage was narrow. On the densest inverse problems, without the starting material as a clue, the model would sometimes loop through its reasoning without ever committing to a final answer. These are real caveats, but that last one is worth holding onto, because it is not a knowledge problem but a workflow one, and we will come back to it.
This is not a chemistry story
It is tempting to file this under chemistry and move on. That would be a mistake, because the same pattern is showing up across the sciences.
We saw it in biology. On BixBench-Verified-50, a cleaned benchmark of real bioinformatics tasks, a generalist system scored 90%, ahead of specialized agents, without being tuned for the benchmark. The chemistry result is the same shape in a different field. A general model, asked to do work that a domain expert assumed required a dedicated tool, turns out to be competitive or better.
When the same surprising result appears in chemistry, in biology, and in domain after domain, it stops being a surprise and becomes a trend, which means the capability is general and it is already here.
So the interesting question is no longer whether the model can do it, because increasingly the answer is yes, but what has to be true for the answer to be trustworthy, complete, and reproducible, and that turns out to be a very different question with a very different answer.
The gap between an answer and a result
Figure 2. A chatbot stops at a plausible answer. A research result requires the whole chain of evidence behind it, and the model is only one link in that chain.
A chat response is not a research result. A research result is a chain of evidence, made up of the right data pulled from the right sources, the right method chosen and run, the output checked against what is already known, and a final answer you can defend and reproduce, and the model is only one link in that chain, however crucial.
Look again at the failure Anthropic reported, the model looping on the hardest structures without committing. That is not a gap in chemical knowledge but the absence of a system around the model that forces a decision, tests the candidates against the spectrum, and rules options out until one survives. Give a capable model...