Reading of OpenAI's Self-Improving Tax Agents

Olshansky


When I first saw the title of OpenAI's blog, I wasn't too excited. Here it is: Building self-improving tax agents with Codex.

Tax agents don't sound exciting. Everything is self-improving. I got it. But this short post is a masterclass in practical, hands-on learnings about the latest tools, workflows, and approaches to running evals for real-world use cases. It's not just technical; it builds intuition by relying on real-world experience.

This is the post I wish I had when I first started my job at Waymo.

Finding the Right Hill to Climb

Any ML/AI practitioner would appreciate this strong open:

"Real-world systems behave differently in production than they do in a lab, breaking in ways that are hard to anticipate before deployment."

To get things started: the problem wasn't "Can we use AI for taxes?", it was:

"They pointed us to tax preparation as a significant bottleneck during the busiest stretch of tax season."

Without getting into the weeds, the core problem involves uploading a bunch of documents, then accurately parsing and extracting data from PDFs, and measuring accuracy by looking at data completion % along with returns flagged for manual review.

Like any real-world eval problem, there are three key pillars to the feedback loop:

- Feedback: To make sure you're solving the right problem(s), in the right order, in a domain you might not be an expert in, you need to work as closely as possible with the end customer. This drives the data you'll be looking at.
- Eval: Production traces, hold-out sets, golden test set corpora, and much more. This requires trial-and-error along with experience. Feedback will drive what you should evaluate.
- Self-improvement: You need the model to actually improve. This might be model routing, prompt tuning, post-training (e.g. fine-tuning), setting thresholds based on the evals, etc. This is what actually drives the results.

I really like this quote at the end of the problem statement:

"We did not have the signal to identify the right hill to climb."

That gap is what drove their methodology, one applicable to any field.

This is the playbook for Forward Deployed Engineers in the era of AI:

- Stay close to practitioners
- Build the product so production creates evidence
- Create a Codex-driven improvement loop

From Production Traces to Evals

The post goes into a concrete example on managing rental properties where the financial details may be spread across multiple documents: spreadsheets, PDFs, hand-written notes, etc.

The interesting part about capturing your precision and recall is that negative samples are not always objective; they may be practitioner preferences.

This is a very simple but very real framing. It's why we'll need fine-tuning per organization. Maybe, one day, per team? Per person? People laughed at the idea of "personal computers" too. I'll share more of my thoughts here in another post.

OpenAI's note on turning production traces into evals is split into three steps:

1. Capture the difference between the filed return (by the human) and Tax AI's output. This is great: the final return is the best sample we can get for our ground truth corpus.
2. Group related failures to separate recurring errors from workflow noise. This helps with understanding the data, but also avoids overtraining/overfocusing on a specific vertical of failures.
3. Turn repeated patterns into eval targets after review and measurement.
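To make the three steps above concrete, here is a minimal Python sketch of how I picture them. It is not OpenAI's implementation: the field names, the `Correction` type, the toy traces, and the `min_returns` threshold are all assumptions I made up for illustration, and a real pipeline would still put a human review between the grouping and the eval target.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    """One field where the filed return differs from the agent's output (hypothetical type)."""
    return_id: str
    field: str
    kind: str  # "missing", "extra", or "mismatch"


def diff_return(return_id, agent_output, filed_return):
    """Step 1: capture field-level differences, treating the filed return as ground truth."""
    corrections = []
    for field in sorted(set(agent_output) | set(filed_return)):
        predicted = agent_output.get(field)
        actual = filed_return.get(field)
        if predicted == actual:
            continue
        if predicted is None:
            kind = "missing"   # the practitioner added a value the agent did not produce
        elif actual is None:
            kind = "extra"     # the practitioner removed a value the agent produced
        else:
            kind = "mismatch"  # both have a value, but they disagree
        corrections.append(Correction(return_id, field, kind))
    return corrections


def group_failures(corrections):
    """Step 2: group corrections by (field, kind), tracking which returns were affected."""
    groups = defaultdict(set)
    for c in corrections:
        groups[(c.field, c.kind)].add(c.return_id)
    return groups


def eval_candidates(groups, min_returns=2):
    """Step 3: keep patterns that recur across returns; one-offs are treated as noise."""
    recurring = [(key, len(ids)) for key, ids in groups.items() if len(ids) >= min_returns]
    return sorted(recurring, key=lambda item: (-item[1], item[0]))


# Toy traces: (return id, agent output, filed return). All values are made up.
traces = [
    ("r1", {"rental_income": 12000, "depreciation": 3000}, {"rental_income": 12000, "depreciation": 3500}),
    ("r2", {"rental_income": 9000}, {"rental_income": 9000, "depreciation": 2100}),
    ("r3", {"rental_income": 7000, "depreciation": 1800}, {"rental_income": 7000, "depreciation": 1750}),
    ("r4", {"rental_income": 5000, "repairs": 400}, {"rental_income": 5000}),
]
corrections = [c for rid, out, filed in traces for c in diff_return(rid, out, filed)]
candidates = eval_candidates(group_failures(corrections))
print(candidates)
```

In this toy data only the depreciation mismatch shows up on more than one return, so it is the only pattern that survives as a candidate. Counting distinct returns rather than raw corrections is a deliberate choice here: one messy return should not be able to create a hill on its own.
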
Once you've found the hill, and you have the tools, and you have the baseline skills, climbing is quite fun.

As I finished writing the note above, and moved on to the next sentence, I saw: "Corrections become hill candidates." :) OpenAI provided a really cool graphic that I'll simply link to here.

The next step is where most of the engineering work and time lies:

- Investigate the pipeline: packages, schemas, code paths, bugs, traces, logs, deployment environments, offline/online skew, etc.
- Implement the fixes: Update source selection, parsing, tax-engine, etc.
- Validate & propose: Re-run targeted evals, measure, run regression suites, consider deploying with an A/B test, shadow traffic, re-evaluate.
- Close the loop: Automate as much (not everything) as possible to have new metrics, test sets, dashboards, and visibility into the deployed changes.

This next diagram I liked enough to drop inline:

Why Forward Deployed Engineers Aren't Going Anywhere

Like I said earlier on, this isn't a Codex-specific blog. It is a blog that provides a view into what an eval feedback loop looks like in the real world. The team says:

"The rental property example is emblematic of a broader reusable pattern: using production artifacts and traces to improve an agent's capabilities."

And if anyone ever asks why Forward Deployed Engineers aren't...
