Show HN: JazzBench, an LLM reasoning benchmark using jazz improvisation

mikerubini | 1 pts | 0 comments

I built my own Claude eval - FlatNine Blog

At AI week, Gian Segato from Anthropic said something offhand that I have not been able to put down. He mentioned that a lot of people inside Anthropic keep their own personal eval for Claude. Not the big public benchmarks. A small, private test, tuned to something they personally care about, that they trust more than any leaderboard to tell them whether a new model is actually better.

That one sentence reorganized how I think about model quality. So I went and built mine.

Why a personal eval beats vibes

Most of us judge models by feel. A new release drops, you throw a few of your favorite hard prompts at it, you get a vibe, you move on. The problem is that vibes do not scale and vibes drift. You cannot remember how the last model handled the exact same prompt three months ago, so "it feels smarter" is doing a lot of unearned work.

Public benchmarks have the opposite problem. They are rigorous, but they are not yours. They measure something generic, they leak into training sets, and they get gamed. A score going up on a public leaderboard tells you something, but rarely the thing you actually care about.

A personal eval splits the difference. It is small enough that you own every example. It is opinionated enough that the number means something to you specifically. And it is private enough that no model has been trained on it. When the score moves, you learn something real.

So I built mine, out of jazz

I decided to make mine out of the one domain where I trust my own judgment more than almost anyone's: jazz. I called it JazzBench, and the ground truth is Charlie Parker.

The task, in one sentence: given the first few chords of one of Parker's solos, plus the chord changes coming next, predict the actual notes Parker played over each of those next chords. Then score that prediction against what he really did on the record.

It is a strange thing to ask a language model to do, which is exactly why I like it.
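To make the task shape concrete, here is a minimal sketch of what one benchmark item and its prompt could look like. Everything in it is an assumption for illustration: the `Segment` and `Item` types, the `build_prompt` helper, the prompt wording, and the choice to encode notes as pitch classes 0-11 (C = 0) are hypothetical and not taken from JazzBench itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    chord: str                 # chord symbol, e.g. "Bb7"
    pitch_classes: frozenset   # notes played over that chord, as 0-11 (C = 0)


@dataclass(frozen=True)
class Item:
    context: tuple             # opening Segments shown to the model, notes included
    upcoming_chords: tuple     # chord symbols the model must predict notes for
    target: tuple              # Segments with the recorded notes, hidden from the model


def build_prompt(item):
    """Render one item as plain text (hypothetical wording)."""
    shown = "; ".join(
        f"{seg.chord}: {sorted(seg.pitch_classes)}" for seg in item.context
    )
    ahead = ", ".join(item.upcoming_chords)
    return (
        f"Solo so far (chord: pitch classes): {shown}\n"
        f"Upcoming chords: {ahead}\n"
        "For each upcoming chord, list the pitch classes (0-11) you expect."
    )
```

The point of keeping `target` separate from `context` is that the scorer sees it and the model never does.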

Why jazz is a good test

Almost every eval out there tests verbal, mathematical, or coding reasoning. Those are hard-edged problems: there is a correct answer, and you mostly reach it by being careful and not making mistakes. Improvisation is a different kind of cognition. It is:

Bounded. The chord changes, the key, and the time are all fixed. You cannot just play anything.

Judgeable. We have Parker's actual solo as ground truth, plus formal music-theory methods to score how close a guess is.

Cognitively rich. It is constraint satisfaction, style, and creativity all at once, in real time, with no single right answer but plenty of obviously wrong ones.

That combination is rare. It is the kind of soft, multi-constraint judgment that human experts make intuitively and that almost no benchmark even tries to measure. If I want to know whether a model has taste under pressure, this is a far better probe than another word problem.

How it gets scored

Because there is no single correct answer, you cannot just check for an exact match. So each prediction is scored against what Parker played using five music-theoretic metrics:

PC Jaccard: the pitch-class overlap between the notes the model predicted and the notes Parker actually played, measured as shared notes divided by all notes in either set.

Interval-vector distance: how far apart those two note sets are in interval space, not just which exact notes they share.

Complexity delta and dissonance delta: how far the prediction's complexity and dissonance scores land from Parker's own, so a guess can be "wrong notes, right texture".

Forte-class match: whether the predicted set has the same abstract shape as Parker's, regardless of transposition.
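Three of these metrics are standard enough to sketch in a few lines. This is an illustration under stated assumptions, not the JazzBench scorer: the function names are hypothetical, the L1 distance between interval vectors is my choice of distance, and `set_class` returns a canonical representative of the set class rather than Forte's own label or prime form. Forte classes conventionally treat inversions as equivalent as well as transpositions, and the sketch follows that convention. The complexity and dissonance measures are left out because the post does not define them.

```python
from itertools import combinations


def pc_jaccard(pred, truth):
    """Jaccard overlap of two pitch-class sets: |A & B| / |A | B|."""
    pred = {p % 12 for p in pred}
    truth = {p % 12 for p in truth}
    union = pred | truth
    if not union:
        return 1.0  # two empty sets are treated as identical
    return len(pred & truth) / len(union)


def interval_vector(pcs):
    """Count interval classes 1-6 over every pair of distinct pitch classes."""
    vec = [0] * 6
    for a, b in combinations(sorted({p % 12 for p in pcs}), 2):
        d = (b - a) % 12
        vec[min(d, 12 - d) - 1] += 1
    return vec


def interval_vector_distance(pred, truth):
    """L1 distance between interval vectors (the choice of L1 is an assumption)."""
    return sum(
        abs(x - y) for x, y in zip(interval_vector(pred), interval_vector(truth))
    )


def set_class(pcs):
    """Smallest sorted tuple over all 12 transpositions of the set and of its
    inversion. Equal results mean the two sets belong to the same set class."""
    pcs = {p % 12 for p in pcs}
    forms = []
    for source in (pcs, {(-p) % 12 for p in pcs}):
        for t in range(12):
            forms.append(tuple(sorted((p + t) % 12 for p in source)))
    return min(forms)


def same_set_class(pred, truth):
    return set_class(pred) == set_class(truth)
```

A C major triad `{0, 4, 7}` against a G major triad `{2, 7, 11}` shows why the metrics are complementary: the two share only one note, so the Jaccard overlap is low, yet their interval vectors are identical and they fall in the same set class.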

And it ships with three baselines to beat: sampling randomly from the notes Parker tended to use over that chord, always playing the single most common set for that chord, and a first-order Markov model over the previous segment. The bar is simple. If a frontier model cannot beat "just play the most common thing", it is not really improvising, it is averaging.
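The three baselines are simple enough to sketch. This is one reading of the descriptions above, not the shipped code: I am assuming that "sampling randomly" means drawing a whole pitch-class set in proportion to how often it appeared over that chord, that the Markov model conditions on the previous segment's pitch-class set, and that it falls back to the most-frequent set when it has never seen that previous set. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit(segments):
    """segments: list of (chord, frozenset_of_pitch_classes) in solo order."""
    by_chord = defaultdict(Counter)      # chord -> counts of pitch-class sets
    transitions = defaultdict(Counter)   # previous set -> counts of next set
    prev = None
    for chord, pcs in segments:
        by_chord[chord][pcs] += 1
        if prev is not None:
            transitions[prev][pcs] += 1
        prev = pcs
    return by_chord, transitions


def random_baseline(by_chord, chord, rng):
    """Sample a set in proportion to its frequency over this chord.
    Assumes the chord appeared at least once in the fitted data."""
    sets, weights = zip(*by_chord[chord].items())
    return rng.choices(sets, weights=weights, k=1)[0]


def most_frequent_baseline(by_chord, chord):
    """Always return the single most common set for this chord."""
    return by_chord[chord].most_common(1)[0][0]


def markov_baseline(transitions, by_chord, prev_pcs, chord):
    """First-order Markov: most likely set given the previous segment's set,
    falling back to the most-frequent baseline for unseen contexts."""
    if prev_pcs in transitions and transitions[prev_pcs]:
        return transitions[prev_pcs].most_common(1)[0][0]
    return most_frequent_baseline(by_chord, chord)
```

None of these baselines looks at more than one previous segment, which is what makes them a useful floor: a model that conditions on the whole opening of the solo should be able to do better.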

I ran Haiku 4.5, Sonnet 4.6, and Opus 4.7 through it. Watching where they land relative to those baselines tells me more about a model than most of what I read on launch day.

What the first batch showed

The one-line version: every Claude tier (Haiku 0.370, Opus 0.400, Sonnet 0.402) beat every baseline on pitch-class overlap with Parker, but Sonnet and Opus are statistically tied, and none of the three matched the simple most-frequent (modal PC-set) baseline on interval texture or dissonance proximity. Claude has learned Parker's note vocabulary, but not his characteristic harmonic restraint.

The headline, on note overlap (Jaccard, higher is better): Sonnet 0.402, Opus 0.400, Haiku 0.370, against most-frequent 0.355, Markov 0.327, and random 0.319. Every Claude tier beats every baseline at picking Parker's actual notes, with zero parse errors across 399 agent calls. But the interesting stuff is in the texture, and there are five findings I did not expect (the full writeup is in the paper, section 8):

Every Claude tier...

