Giving a domain a hill to climb: benchmarking as data activation

benchmarking is the new data activation · Sparse Thoughts

benchmarking is the new data activation

03 Jul 2026

health, evaluation

give me an optimizable metric and i’ll move the world.

benchmarking, i.e., the act of building and applying benchmarks, is a new(ish) form of data activation: a way of turning domain data into something models can be measured against, ranked by, and eventually trained on. i wrote about data activation a while back, and i want to come back to it, because i think benchmarking is one of the cleaner examples of the thing i was reaching for then.

start with where models actually get good. the places where large language models have improved fastest, and keep improving, are the ones with a verifiable hill to climb: coding first (probably the most verifiable task there is), then math, then optimization problems with a clear target. the common thread is a metric that tracks the target well enough for optimization to mean something. Archimedes wanted a place to stand; a benchmark is one example of such a place.

most complex domains don’t come with that hill. medicine and biology are the ones i care about, and those don’t natively have one: the “substrate” is messy, longitudinal, mostly not sitting in the model’s input space the way code is. so the interesting question, before any of the RL machinery, is whether you can give the domain a hill at all.

the first insight, and the most basic, is that just being able to measure is already a form of activation. if you take health data (the structured records, the workflows, the information already sitting around) and turn it into something you can score a model against (what does this system actually know, and where does it fail), you have pulled value out of that data even though nothing inside the model has changed. this is a looser sense of “activation” than the one in my earlier post, where it meant getting data into the weights as supervision. here, the data activates by becoming the surface that models are measured, selected, and eventually trained against. right now, for health, we mostly cannot do even this first part well. we do not have a good answer to “what do these models know”, and being able to answer it is worth something on its own. the cliché holds (more than i’d like): you can’t improve what you can’t measure.[1]
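a minimal sketch of that first move, under loud assumptions: the record fields (`category`, `question`, `documented_answer`), the exact-match scoring rule, and the `always_yes` stand-in model are all hypothetical, picked only for illustration. the point is that the model is never touched, and the data still produces a report of what it knows and where it fails.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Item:
    category: str
    prompt: str
    expected: str


def build_items(records: Iterable[dict]) -> List[Item]:
    """turn records that already exist into scoreable items (no model involved)."""
    # field names here are hypothetical, not a real schema
    return [
        Item(r["category"], r["question"], r["documented_answer"].strip().lower())
        for r in records
    ]


def score(model: Callable[[str], str], items: List[Item]) -> Dict[str, float]:
    """per-category exact-match accuracy: what the system knows, where it fails."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item in items:
        totals[item.category] += 1
        if model(item.prompt).strip().lower() == item.expected:
            hits[item.category] += 1
    return {category: hits[category] / totals[category] for category in totals}


# toy records, standing in for data that is "sitting around already"
records = [
    {"category": "labs", "question": "toy question A", "documented_answer": "yes"},
    {"category": "labs", "question": "toy question B", "documented_answer": "no"},
    {"category": "workflow", "question": "toy question C", "documented_answer": "yes"},
]


def always_yes(prompt: str) -> str:
    """a deliberately dumb stand-in for a model."""
    return "Yes"


report = score(always_yes, build_items(records))
print(report)
```

exact match is the crudest possible scorer, and real clinical data would need something far more careful. the per-category breakdown is the part worth keeping: a single number says how good, the breakdown says where.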

verifiers are the same move taken one step further, to where it folds in on itself. when the benchmark is also an environment you can run reinforcement learning against,[2] the score is no longer just an after-the-fact report. it becomes the reward. measuring the model and improving it stop being two separate things you do in sequence; they share the same substrate. that coupling is the power and the danger: the better the benchmark, the more useful the training signal; the worse the benchmark, the more faithfully you optimize the wrong thing.
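to make the coupling concrete, here is a toy sketch, assuming a two-answer softmax policy and a plain REINFORCE update. none of this is anyone's actual training setup; `verifier`, `evaluate`, and `train` are hypothetical names. what it shows is one scoring function being called from both the measurement path and the improvement path.

```python
import math
import random

ACTIONS = ["yes", "no"]


def verifier(answer: str, expected: str) -> float:
    """the benchmark's scoring rule: 1.0 for a match, else 0.0."""
    return 1.0 if answer == expected else 0.0


def probs(logits):
    """numerically stable softmax over the two answers."""
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def evaluate(logits, expected):
    """measurement: expected verifier score under the current policy."""
    return sum(p * verifier(a, expected) for p, a in zip(probs(logits), ACTIONS))


def train(logits, expected, steps=500, lr=0.5, seed=0):
    """improvement: REINFORCE, with the same verifier supplying the reward."""
    rng = random.Random(seed)
    logits = list(logits)
    for _ in range(steps):
        p = probs(logits)
        i = rng.choices(range(len(ACTIONS)), weights=p)[0]
        reward = verifier(ACTIONS[i], expected)
        # d log pi(i) / d logit_j = 1[j == i] - p[j]
        for j in range(len(logits)):
            logits[j] += lr * reward * ((1.0 if j == i else 0.0) - p[j])
    return logits


before = evaluate([0.0, 0.0], "no")
after = evaluate(train([0.0, 0.0], "no"), "no")
print(before, after)
```

the danger is visible in the same code: pass a wrong `expected` label and the loop pushes the policy toward the wrong answer just as reliably, because nothing in it can tell a good verifier from a bad one.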

i don’t think any of this is really an argument with the bitter lesson. it sits orthogonal to it. the bitter lesson says general methods win once scale is available. the question underneath is the one that comes first: is the domain even in a shape where scale has “somewhere to go”? code and math are. medicine, mostly, is not. so part of the work is converting messy domain material into tasks with checkable outcomes, and benchmarking is one way to do that conversion. i put it this way recently and i still like it: sometimes the domain is already in a shape where scale can eat it, sometimes the work is turning the domain into a shape where scale has traction. what’s actually new,[3] then, isn’t reinforcement learning; it’s RL-over-verifiers arriving in domains that were never natively language or code.

there’s more than one way to build that kind of scoring surface, and the ways are not interchangeable. i find it useful to lay them on a few axes: where the ground truth lives, what it costs to build, and how far it sits from the thing you actually care about.

at one end is the latchbio approach (SpatialBench-Long is the clearest recent example i’ve read). here the ground truth is rebuilt from the raw data itself: you hand the agent raw or near-raw data plus enough calibrated context to approximate what a scientist would know at the start, and you grade both the conclusion and the path to it. the special sauce is that claims from the original papers are candidates, rechecked against the data before they are allowed to be answers. it’s expensive, and it’s the closest to the real thing (how exactly to do this in medicine is kind of an open question, and how to do it at scale is a whole new question altogether).
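the "claims are candidates" idea can be sketched in a few lines. this is my own illustration, not how SpatialBench-Long is implemented: `CandidateClaim`, `admit_answers`, and the toy data are all hypothetical. each claim carries a check that recomputes it from the raw data, and only claims that pass are admitted as answers.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

RawData = Dict[str, List[float]]


@dataclass(frozen=True)
class CandidateClaim:
    text: str
    check: Callable[[RawData], bool]  # recomputes the claim from the raw data


def admit_answers(claims: List[CandidateClaim], data: RawData) -> List[str]:
    """keep only the claims that the raw data still supports."""
    return [claim.text for claim in claims if claim.check(data)]


# toy raw data and toy claims, standing in for a paper's reported findings
data = {"group_a": [1.0, 2.0, 3.0], "group_b": [4.0, 5.0, 6.0]}
claims = [
    CandidateClaim(
        "mean of group_b exceeds mean of group_a",
        lambda d: mean(d["group_b"]) > mean(d["group_a"]),
    ),
    CandidateClaim(
        "group_a has more samples than group_b",
        lambda d: len(d["group_a"]) > len(d["group_b"]),
    ),
]

answers = admit_answers(claims, data)
print(answers)
```

the expensive part is hidden inside `check`: in a real setting, writing a recomputation for each claim is where most of the cost of this approach would go.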

in the middle, the HealthBench approach: rate conversations against rubrics written by physicians. the unit of work is clinical answer quality: did the response catch the relevant issue, communicate it safely, avoid harm, avoid...
