How to do evals for finance | Primer
They say you’re the average of the five people you spend the most time with. If that’s true, we’re all slowly becoming a weighted average of our AI agents.
I definitely feel it, having started saying “canonical” in social settings.
Since leaving the hedge fund desk three years ago, I’ve spent most of my waking hours prompting LLMs, testing AI agents, and evaluating stock research: from promising $1,000 tips to GPT-3.5, all the way to working with today’s agents, harnesses, and tools.
It’s taken a toll. But it’s also forced me to go deep on both finance and AI, and to form views on what “good” equity research actually looks like — and how to evaluate it.
The challenge is that most publicly available “finance AI” benchmarks fail at one key thing: capturing nuance. And when it comes to investing, nuance matters.
So I built my own internal evaluations. There’s still a lot of work to do, but these are the core lessons so far:
Absolute scoring fails past a certain quality threshold - the “max out” problem
Use relative scoring to capture nuance and compare agents - but do it right
Use the strongest frontier models as your judges - forget about costs
Give the judges access to raw data - they can’t take things at face value
Variance applies to both contenders and judges - act accordingly
Outcomes matter, but how you get there is also key - evaluate the agent’s path
What’s next? Live earnings coverage - the beginning of truly autonomous research
Why Evaluating Equity Research Is Hard
Deep equity research is fundamentally different from most tasks we evaluate LLMs on.
No single answer
There is no deterministic verification. You cannot just check whether the note is “correct”, because in many cases there is no single correct answer. There are judgment calls, different valid framings, and legitimate disagreements between smart analysts.
One analyst might treat margin pressure as temporary overinvestment. Another might see it as evidence of structural competition. Both can be financially literate.
This is why scoring against a rubric hits a ceiling very quickly. Once an agent is basically competent (it applies the right methodology, does the math correctly, and presents a financially defensible case), absolute scores stop differentiating well. Two reports can both tick every box and still differ in quality.
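A minimal sketch of that ceiling, assuming a hypothetical three-item rubric built from the checks listed above. The names `RUBRIC`, `Report`, and `rubric_score` are illustrative only and do not come from any real evaluation harness.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical rubric: the three competence checks named in the text.
RUBRIC = ("right_methodology", "math_correct", "defensible_case")


@dataclass
class Report:
    name: str
    checks: Dict[str, bool]  # criterion -> whether the report passed it


def rubric_score(report: Report) -> float:
    """Fraction of rubric criteria the report satisfies (0.0 to 1.0)."""
    passed = sum(1 for criterion in RUBRIC if report.checks.get(criterion, False))
    return passed / len(RUBRIC)


# Two competent reports: both tick every box, whatever their real quality gap.
report_a = Report("A", {criterion: True for criterion in RUBRIC})
report_b = Report("B", {criterion: True for criterion in RUBRIC})

score_a = rubric_score(report_a)
score_b = rubric_score(report_b)
print(score_a, score_b)  # identical scores: the rubric cannot rank A against B
```

Adding more criteria only moves the ceiling. Any two reports that pass all of them still tie.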
Judges need competition
What matters is not whether multiple reports collapse the uncertainty into the same answer. It’s whether the research improves your map of the possible outcomes: surfacing scenarios most others missed, assigning sensible probability weights, and identifying what can move the odds.
Imagine you’re judging 2 reports. You read report A, it’s really good, you score it 9/10. Then someone erases your memory, you read report B and also score it 9/10. But when reading them side by side, you realise that B is better than A. Same absolute score, different quality.
LLM judges (which simply means AI models that act as graders for the work of other AIs) face the same problem. When asked to score each report in isolation, they struggle to grade quality past “good”.
The distinction between great and good is exactly where equity research lives.
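The side-by-side comparison described above can be sketched as a pairwise judging loop. This is a toy illustration under stated assumptions, not a description of any production setup: `toy_judge` is a hypothetical stand-in for an LLM judge, and judging each pair in both presentation orders is an added precaution against a judge favouring whichever report it reads first.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict

# A judge reads two reports side by side and returns "first" or "second".
Judge = Callable[[str, str], str]


def pairwise_wins(reports: Dict[str, str], judge: Judge) -> Counter:
    """Count wins per report, judging every pair in both presentation orders."""
    wins: Counter = Counter({name: 0 for name in reports})
    for a, b in combinations(reports, 2):
        # Judge each pair twice with the order swapped, so a preference for
        # the first-read report cannot decide the outcome on its own.
        for first, second in ((a, b), (b, a)):
            verdict = judge(reports[first], reports[second])
            wins[first if verdict == "first" else second] += 1
    return wins


def toy_judge(first: str, second: str) -> str:
    """Hypothetical stand-in for an LLM judge: prefers broader scenario coverage."""
    if first.count("scenario") >= second.count("scenario"):
        return "first"
    return "second"


reports = {
    "A": "Base case scenario only.",
    "B": "Base case scenario, bear scenario, and bull scenario.",
}
wins = pairwise_wins(reports, toy_judge)
print(dict(wins))
```

Both reports might earn the same absolute score in isolation, but the pairwise tally separates them because the judge is forced to pick one.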
The Problem with Financial Benchmarks
Many public finance benchmarks, especially the ones that scale cleanly, gravitate toward tasks with deterministic answers.
Retrieval ≠ finance
Some are basically retrieval tests in disguise. “What dividend did Company X declare in 2023?” “What was revenue in the latest quarter?”
These are useful tests. They measure whether an agent can find the right fact and cite the right source. But that is very different from measuring investment judgment.
At Primer, we’ve maxed out FinRetrieval, scoring 100%. That was a real achievement, and it showed that our agent could retrieve financial facts reliably. But once scores sit at the ceiling, the benchmark stops separating models: you cannot really observe GPT-5.5 performance versus GPT-5.4 with it.
Unit tests ≠ judgment
Another popular category is model-building. A lot of these benchmarks are closer to Excel formula verification: can the agent link cells correctly, build a three-statement model, refresh a model with newly disclosed numbers?
Again, useful. But mostly as a “unit test”. They do not test whether the agent can actually model a company in the investment sense: make reasoned assumptions, understand drivers, and forecast sensibly.
There are also benchmarks around junior investment banking tasks: presentations, decks, market summaries, and similar outputs. Those can be valuable, but they still do not answer the core question: how do you measure nuance inside actual investment research?
Why Internal Benchmarks
For a long time, I did what many people do: I eyeballed outputs. Too much, honestly.
Having domain knowledge can be a curse here. It works. You can read a report for a company you know inside-out, and see whether it surprises you. But it is not scalable,...