Lessons from Building Evals for Financial AI Agents

How to do evals for finance | Primer

They say you’re the average of the five people you spend the most time with. If that’s true, we’re all slowly becoming a weighted average of our AI agents.

I definitely feel it, having started saying “canonical” in social settings.

After leaving the hedge fund desk three years ago, I’ve spent most of my waking hours prompting LLMs, testing AI agents, and evaluating stock research. I’ve gone from promising $1,000 tips to GPT-3.5 all the way to working with today’s agents, harnesses, and tools.

It’s taken a toll. But it’s also forced me to go deep on both finance and AI, and to form views on what “good” equity research actually looks like — and how to evaluate it.

The challenge is that most publicly available “finance AI” benchmarks fail at one key thing: capturing nuance. And when it comes to investing, nuance matters.

So I built my own internal evaluations. There’s still a lot of work to do, but these are the core lessons so far:

Absolute scoring fails past a certain quality threshold - the “max out” problem

Use relative scoring to capture nuance and compare agents - but do it right

Use the strongest frontier models as your judges - forget about costs

Give the judges access to raw data - they can’t take things at face value

Variance applies to both contenders and judges - act accordingly

Outcomes matter, but how you get there is also key - evaluate the agent’s path

What’s next? Live earnings coverage - the beginning of truly autonomous research

Why Evaluating Equity Research Is Hard

Deep equity research is fundamentally different from most tasks we evaluate LLMs on.

No single answer

There is no deterministic verification. You cannot just check whether the note is “correct”, because in many cases there is no single correct answer. There are judgment calls, different valid framings, and legitimate disagreements between smart analysts.

One analyst might treat margin pressure as temporary overinvestment. Another might see it as evidence of structural competition. Both can be financially literate.

This is why scoring against a rubric hits a ceiling very quickly. Once an agent is basically competent (it applies the right methodology, does the math correctly, and presents a financially defensible case), absolute scores stop differentiating well. Two reports can both tick every box and still differ in quality.
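A minimal sketch of that ceiling, assuming a hypothetical three-item pass/fail rubric built from the criteria named above. The rubric keys, the Report class, and the sample reports are illustrative assumptions, not the actual evaluation setup.

```python
from dataclasses import dataclass

# Hypothetical rubric: each criterion is a pass/fail check worth one point.
RUBRIC = ("right_methodology", "math_correct", "defensible_case")


@dataclass
class Report:
    name: str
    checks: dict  # criterion name -> bool, as decided by a grader


def absolute_score(report: Report) -> float:
    """Fraction of rubric criteria the report satisfies (0.0 to 1.0)."""
    passed = sum(1 for criterion in RUBRIC if report.checks.get(criterion, False))
    return passed / len(RUBRIC)


# Two competent reports tick every box, so the rubric cannot separate them.
report_a = Report("A", {criterion: True for criterion in RUBRIC})
report_b = Report("B", {criterion: True for criterion in RUBRIC})
# A weaker report that gets the math wrong still scores visibly lower.
report_c = Report(
    "C",
    {"right_methodology": True, "math_correct": False, "defensible_case": True},
)

scores = {r.name: absolute_score(r) for r in (report_a, report_b, report_c)}
print(scores)
```

The rubric still separates the weak report from the competent ones. What it cannot do is rank A against B, which is the comparison that matters once every contender is competent.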

Judges need competition

What matters is not whether multiple reports collapse the uncertainty into the same answer. It’s whether the research improves your map of the possible outcomes: surfacing scenarios most people missed, assigning sensible probability weights, and identifying what can move the odds.

Imagine you’re judging two reports. You read report A, it’s really good, and you score it 9/10. Then someone erases your memory, and you read report B and also score it 9/10. But when you read them side by side, you realise that B is better than A. Same absolute score, different quality.

LLM judges (which simply means AI models that act as graders for the work of other AIs) face the same problem. When asked to score a single piece of research in isolation, they struggle to grade quality past “good”.

The distinction between great and good is exactly where equity research lives.
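One way to sketch the side-by-side alternative is a pairwise comparison in which the judge sees each pair twice, with the order swapped, and a verdict only counts if it survives the swap. Everything here is an assumption for illustration: toy_judge is a deterministic stand-in for a real LLM judge call, and the report texts are made up.

```python
from collections import Counter
from typing import Callable

# A judge takes (first_report, second_report) and returns "first" or "second".
Judge = Callable[[str, str], str]


def compare(judge: Judge, a: str, b: str) -> str:
    """Run the judge in both presentation orders; return "a", "b", or "tie".

    A verdict that flips when the order is swapped is treated as a tie,
    which guards against a judge that simply favours one position.
    """
    first_pass = "a" if judge(a, b) == "first" else "b"
    second_pass = "b" if judge(b, a) == "first" else "a"
    return first_pass if first_pass == second_pass else "tie"


def win_counts(judge: Judge, reports: dict) -> Counter:
    """Compare every pair of reports once and count order-robust wins."""
    names = sorted(reports)
    wins = Counter({name: 0 for name in names})
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            verdict = compare(judge, reports[x], reports[y])
            if verdict == "a":
                wins[x] += 1
            elif verdict == "b":
                wins[y] += 1
    return wins


def toy_judge(first: str, second: str) -> str:
    """Hypothetical stand-in for an LLM judge: prefers the report that
    mentions more scenarios, and favours the first position on equal counts."""
    if first.count("scenario") >= second.count("scenario"):
        return "first"
    return "second"


reports = {
    "A": "Base case only.",
    "B": "Bull scenario, bear scenario, and what moves the odds.",
    "C": "One scenario with a price target.",
}
wins = win_counts(toy_judge, reports)
print(wins)
```

In practice the judge call would be a prompt to a frontier model that sees both reports, and each comparison would likely be repeated, since judge verdicts vary from run to run.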

The Problem with Financial Benchmarks

Many public finance benchmarks, especially the ones that scale cleanly, gravitate toward tasks with deterministic answers.

Retrieval ≠ finance

Some are basically retrieval tests in disguise. “What dividend did Company X declare in 2023?” “What was revenue in the latest quarter?”

These are useful tests. They measure whether an agent can find the right fact and cite the right source. But that is very different from measuring investment judgment.

At Primer, we’ve maxed out FinRetrieval, scoring 100%. That was a real achievement, and it showed that our agent could retrieve financial facts reliably. But once a benchmark is saturated, it can no longer show you how GPT-5.5 performs relative to GPT-5.4.
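To make the saturation point concrete, here is a small sketch that counts the questions on which two agents disagree. The per-question results are invented for illustration and are not real benchmark output. Once both agents answer everything correctly, there are no disagreements left to learn from.

```python
def accuracy(results: list) -> float:
    """Share of questions answered correctly."""
    return sum(results) / len(results)


def discordant(results_a: list, results_b: list) -> int:
    """Count questions where exactly one of the two agents is correct."""
    if len(results_a) != len(results_b):
        raise ValueError("both agents must answer the same questions")
    return sum(1 for a, b in zip(results_a, results_b) if a != b)


# Hypothetical per-question pass/fail results on a saturated benchmark.
saturated_old = [True, True, True, True, True]
saturated_new = [True, True, True, True, True]

# Hypothetical results on a harder benchmark with headroom left.
harder_old = [True, False, True, False, True]
harder_new = [True, True, True, False, True]

saturated_gap = discordant(saturated_old, saturated_new)
harder_gap = discordant(harder_old, harder_new)
print(accuracy(saturated_old), accuracy(saturated_new), saturated_gap)
print(accuracy(harder_old), accuracy(harder_new), harder_gap)
```

With zero disagreements, the saturated benchmark carries no signal about which model is better, whatever the true gap between them may be.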

Unit tests ≠ judgment

Another popular category is model-building. A lot of these benchmarks are closer to Excel formula verification: can the agent link cells correctly, build a three-statement model, refresh a model with newly disclosed numbers?

Again, useful. But mostly as a “unit test”. They do not test whether the agent can actually model a company in the investment sense: make reasoned assumptions, understand drivers, and forecast sensibly.

There are also benchmarks around junior investment banking tasks: presentations, decks, market summaries, and similar outputs. Those can be valuable, but they still do not answer the core question: how do you measure nuance inside actual investment research?

Why Internal Benchmarks

For a long time, I did what many people do: I eyeballed outputs. Too much, honestly.

Having domain knowledge can be a curse here. It works. You can read a report for a company you know inside-out, and see whether it surprises you. But it is not scalable,...
