Learning to Replicate Expert Judgment in Financial Tasks - Thinking Machines Lab
Sarah Su, Kevin Zhu, Emily Xiao, Rohan Alur, Daniel Kang (Bridgewater AIA Labs) in collaboration with Thinking Machines
Jun 30, 2026
Judging information
Outperforming the market is hard. When every investor has access to the same sources of public information, alpha must come from unique insight built on taste and judgment. A strong investor’s judgment is difficult to articulate and teach directly to others, whether human or AI. It comes from experience.
Even when we decompose an investor’s job into its simplest constituent tasks, those tasks turn out to be surprisingly difficult for LLMs. In this post, we consider a simple special case: filtering and processing financial documents to surface information relevant to investment decisions.
Investors are bombarded with information every day: news articles, research reports, company documents, emails, internal write-ups, and more. Reading is the easy part. The real work is the small, repeated judgments made while reading: filtering, interpreting, segmenting, and identifying where the useful signal lies. These judgments are embedded throughout an investor’s daily workflow and consume substantial time.
We wanted to see if we could automate the information triage task: identifying what is relevant and interesting to read. This alone could greatly augment investors’ productivity, letting them spend their freed-up attention on higher-level synthesis and decision-making.
Given that LLMs perform poorly on simple financial tasks, we asked: is it possible to teach LLMs financial judgment? We find that with high-quality human annotations, we can teach LLMs to interpret text with expert-level taste and judgment. Our proprietary model outperforms every frontier model we tested on information accuracy and recall, at a fraction of their cost.
We describe our training process and results on a subset of data cleared for public release. Based on our results, we further describe the seeds of a vision of differentiated intelligence, with models tuned for specific organizational needs.
Frontier model performance
We evaluated models on six information filtering tasks drawn from investors’ daily workflows. Many of our other internal tasks show the same pattern as these six: the frontier models we tested underperform our internally trained models.
We measured accuracy: the percentage of documents that were correctly labeled according to our investors. For classification tasks, we also calculated the F1 score (see “F-score”, Wikipedia).
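The two metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used for the post: it assumes binary labels where 1 means "relevant" and 0 means "not relevant", and the function names and sample labels are hypothetical.

```python
from typing import Sequence


def accuracy(expert: Sequence[int], model: Sequence[int]) -> float:
    """Fraction of documents where the model label matches the expert label."""
    if len(expert) != len(model) or not expert:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(e == m for e, m in zip(expert, model)) / len(expert)


def f1_score(expert: Sequence[int], model: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    if len(expert) != len(model) or not expert:
        raise ValueError("label lists must be non-empty and the same length")
    tp = sum(e == positive and m == positive for e, m in zip(expert, model))
    fp = sum(e != positive and m == positive for e, m in zip(expert, model))
    fn = sum(e == positive and m != positive for e, m in zip(expert, model))
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for six documents (assumption: 1 = relevant, 0 = not relevant).
expert_labels = [1, 0, 1, 1, 0, 0]
model_labels = [1, 0, 0, 1, 1, 0]

acc = accuracy(expert_labels, model_labels)
f1 = f1_score(expert_labels, model_labels)
```

Reporting both is useful when relevant documents are rare: a model that labels everything "not relevant" can still score high accuracy, but its F1 score would be zero.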
01. Financial Article Relevancy
Given a financial article, classify whether it is relevant to a C-suite investment professional.
Eval metrics: F1 score, Accuracy
02. Central Bank Document Relevancy
Given a central bank document, classify whether it signals the direction of future interest rate changes.
Eval metrics: F1 score, Accuracy
03. Generic Document Relevancy
Given an investor's question and a research document, classify whether the document helps answer it.
Eval metrics: F1 score, Accuracy
04. Ad Hoc Content Labeling
Research documents are either recurring (repeated boilerplate) or mixed (boilerplate plus one-off, issue-specific analysis). Classify which, and find the last page of issue-specific content.
Eval metrics: Accuracy
05. Document Truncation
Identify where boilerplate content begins in a document.
Eval metrics: Exact Match Accuracy
06. Email Truncation
Identify where boilerplate content begins in an email.
Eval metrics: Exact Match Accuracy
The six financial tasks we evaluate in this blog post, each drawn from the routine work of an investor.
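For the two truncation tasks, exact match accuracy can be sketched as follows. This is an illustration under stated assumptions, not the post's actual scoring code: it assumes each answer is the index of the first boilerplate line, with None meaning the document has no boilerplate, and the helper names and sample values are hypothetical.

```python
from typing import List, Optional, Sequence


def exact_match_accuracy(
    expert_cuts: Sequence[Optional[int]],
    model_cuts: Sequence[Optional[int]],
) -> float:
    """Fraction of documents where the predicted cut equals the expert cut exactly."""
    if len(expert_cuts) != len(model_cuts) or not expert_cuts:
        raise ValueError("cut lists must be non-empty and the same length")
    return sum(e == m for e, m in zip(expert_cuts, model_cuts)) / len(expert_cuts)


def truncate(lines: Sequence[str], cut: Optional[int]) -> List[str]:
    """Keep only the lines before the boilerplate start; keep everything if cut is None."""
    return list(lines) if cut is None else list(lines[:cut])


# Made-up cut points for four documents (assumption: index of first boilerplate line).
expert = [12, None, 40, 7]
model = [12, None, 41, 7]  # the third prediction is off by one line

score = exact_match_accuracy(expert, model)

email = ["Rates outlook changed.", "See attached note.", "This email is confidential."]
kept = truncate(email, 2)
```

Under this metric an off-by-one prediction counts as a full miss, which is what makes it stricter than a tolerance-based score.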
These tasks are trivial for investors, yet investors get stuck when asked to articulate their decision process. Consider the following example of classifying a news article as relevant or not relevant to an investment professional:
Not relevant: “Trump insists Greenland is his” (ft.com, © Jeremy Banx)
Relevant: “US stocks close sharply lower after Trump threatens new China tariffs”, subtitled “Biggest one-day drop in S&P 500 since April brings weeks long rally to a halt” (ft.com, © AFP/Getty Images)
Example of judging the relevance of a financial article to US markets. Source: Financial Times.
The Greenland example is unlikely to be taken seriously given the context of the article, while the China tariffs are highly relevant. Yet both examples touch on geopolitics and finance.
In contrast to our investors, the frontier models we tested perform surprisingly poorly. Variants of Gemini, Claude, and GPT averaged only ~50% accuracy when given a prompt that simply states each of the six tasks to perform.
We first tried to improve LLM performance with stronger prompting. Our experts wrote instructions based on real task descriptions, and also suggested reframing certain tasks. For example, while an article about a small IPO is clearly...