Learning to Replicate Expert Judgment in Financial Tasks

Sarah Su, Kevin Zhu, Emily Xiao, Rohan Alur, Daniel Kang (Bridgewater AIA Labs) in collaboration with Thinking Machines

Jun 30, 2026

Judging information

Outperforming the market is hard. When every investor has access to the same sources of public information, alpha must come from unique insight built on taste and judgment. A strong investor’s judgment is difficult to articulate and teach directly to others, whether human or AI. It comes from experience.

Even when we decompose an investor’s job into its simplest constituent tasks, those tasks turn out to be surprisingly difficult for LLMs. In this post, we consider a simple special case: filtering and processing financial documents to surface information relevant to investment decisions.

Investors are bombarded with information every day: news articles, research reports, company documents, emails, internal write-ups, and more. Reading is the easy part. The real work is the small, repeated judgments carried over it — filtering, interpreting, segmenting, and identifying where the useful signal lies. These judgments are embedded throughout an investor’s daily workflow and consume substantial time.

We wanted to see if we could automate the information triage task: identifying what is relevant and interesting to read. This alone could greatly augment investors’ productivity, letting them spend their freed-up attention on higher-level synthesis and decision making.

Given that LLMs perform poorly on simple financial tasks, we asked: is it possible to teach LLMs financial judgment? We find that with high-quality human annotations, we can teach LLMs to interpret text with expert-level taste and judgment. Our proprietary model outperforms all frontier models we tested on information accuracy and recall, at a fraction of their cost.

We describe our training process and results on a subset of data cleared for public release. Based on our results, we further describe the seeds of a vision of differentiated intelligence, with models tuned for specific organizational needs.

Frontier model performance

We evaluated models on six information filtering tasks drawn from investors’ daily workflows. Many of our other internal tasks show the same pattern as these six: the frontier models we tested underperform our internally trained models.

We measured accuracy: the percentage of documents that were labeled correctly, taking our investors’ labels as ground truth. For classification tasks, we also calculated the F1 score.
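As a minimal sketch of these two metrics, the snippet below computes accuracy and binary F1 from paired label lists. The function names, the 0/1 label encoding, and the toy labels are illustrative assumptions, not the evaluation code used for this post.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of documents whose predicted label matches the expert label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must be of equal length")
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    # Convention (an assumption here): F1 is 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


# Toy labels (hypothetical): 1 = relevant, 0 = not relevant.
expert_labels = [1, 0, 1, 1, 0, 0]
model_labels = [1, 0, 0, 1, 1, 0]

print(f"accuracy = {accuracy(expert_labels, model_labels):.3f}")
print(f"F1       = {f1_score(expert_labels, model_labels):.3f}")
```

Reporting F1 alongside accuracy matters when relevant documents are rare: a model that labels everything "not relevant" can score high accuracy while its F1 is zero.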

01 Financial Article Relevancy

Given a financial article, classify whether it is relevant to a C-suite investment professional.

EVAL METRICS

F1 score, Accuracy

02 Central Bank Document Relevancy

Given a central bank document, classify whether it signals the direction of future interest rate changes.

EVAL METRICS

F1 score, Accuracy

03 Generic Document Relevancy

Given an investor's question and a research document, classify whether the document helps answer it.

EVAL METRICS

F1 score, Accuracy

04 Ad Hoc Content Labeling

Research documents are either recurring (repeated boilerplate) or mixed (boilerplate plus one-off, issue-specific analysis). Classify which, and find the last page of issue-specific content.

EVAL METRICS

Accuracy

05 Document Truncation

Identify where boilerplate content begins in a document.

EVAL METRICS

Exact Match Accuracy

06 Email Truncation

Identify where boilerplate content begins in an email.

EVAL METRICS

Exact Match Accuracy

The six financial tasks we evaluate in this blog post, each drawn from the routine work of an investor.
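For the two truncation tasks, exact match accuracy can be sketched as follows. We assume here, purely for illustration, that each boilerplate boundary is represented as an integer position (for example a line index) and that `None` means the document has no boilerplate; the post does not specify the actual representation.

```python
from typing import Optional, Sequence


def exact_match_accuracy(
    gold: Sequence[Optional[int]], pred: Sequence[Optional[int]]
) -> float:
    """Fraction of documents where the predicted boundary equals the expert's exactly.

    A near miss (off by one position) counts the same as a large miss.
    """
    if len(gold) != len(pred) or len(gold) == 0:
        raise ValueError("boundary lists must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Hypothetical boundaries: position where boilerplate begins, None if there is none.
expert_boundaries = [12, 40, None, 7]
model_boundaries = [12, 38, None, 7]

print(exact_match_accuracy(expert_boundaries, model_boundaries))
```

Exact match is a strict criterion: a prediction that cuts two lines early still counts as wrong, which fits a task where truncating real content or leaving boilerplate in both degrade what the investor reads.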

These tasks are trivial for investors, yet investors get stuck when asked to articulate their decision process. Consider the following example of classifying whether a news article is relevant to an investment professional:

Not relevant

ft.com Trump insists Greenland is his

Relevant

ft.com US stocks close sharply lower after Trump threatens new China tariffs

Example of judging the relevance of a financial article to US markets. Source: Financial Times.

The Greenland example is unlikely to be taken seriously given the context of the article, while the China tariffs are highly relevant. Yet both examples touch on geopolitics and finance.

In contrast to our investors, the frontier models we tested perform surprisingly poorly. Variants of Gemini, Claude, and GPT averaged a mere ~50% accuracy when given a prompt that simply states each of the six tasks.

We first tried to improve LLM performance with stronger prompting. Our experts wrote instructions based on real task descriptions, and also suggested reframing certain tasks. For example, while an article about a small IPO is clearly...
