How we built a lab to evaluate data agents
Inside Hex's eval architecture and the synthetic business it runs on.<br>Izzy Miller<br>Engineering<br>May 22, 2026
Data analytics is a uniquely cursed domain for agents to operate in.<br>Easy questions look hard. Hard questions look easy. Many questions are impossible to answer; to even try is to fail. Bugs are usually silent and subtle. Innocuous assumptions (an LLM’s favorite!) make or break analyses. There are no linters, no test suites, no formalization language. There is almost no realistic public data to train on or build environments from, and there is a surplus of unrealistic tutorial-slop jamming up the pretraining corpus. Everyone’s data warehouse is out of distribution. For every right answer, there are ten plausible but subtly wrong ones, and no way to verify or validate the result.<br>This is the complicated and exciting world that Hex lives in!<br>In this particularly challenging domain, good evals are crucial for knowing whether we’re improving our agents and delivering a good experience to users. Increasingly, we are realizing that the performance of agents in Hex is more a function of the rich context stores they access than of their system prompts or the models powering them, so evaluating “an agent” winds up looking more like evaluating the entire Hex system and context flywheel against a user’s data warehouse.<br>To support these kinds of complex evals without sacrificing flexibility and speed, we’ve had to build a lot of custom infrastructure and even create an entire fake business! Here’s what our setup looks like.<br>The Shoebox - Hex’s eval infrastructure and lab bench<br>Last year, just before we launched the Notebook Agent as an internal-only prototype, I built a very hacky tool to let us view agent traces. 
I called it “The Shoebox”, in honor of the boxes full of messy receipts we all have in our closets, and in hopes that it would just be a temporary stopgap solution until we purchased some real and presumably glorious product that would solve all our problems.<br>In the year since, rather than replace it, we’ve evolved Shoebox into a full-fledged lab bench for agent observability and evaluation. It powers ad-hoc and scheduled evaluations for all our agents, supports all the experimental treatments and pairwise comparisons you could hope for, and even exposes a set of agent skills that let coding agents experiment against our evals in an autoresearch-like loop.<br>Everything about Shoebox is designed to help users think about evaluations as pairwise experiments with a “candidate” and a “baseline” run rather than standalone tests.<br>It runs as part of the local Hex dev stack for maximum flexibility, but connects to our shared internal Hex workspace, where eval sets run daily to establish shared “production baselines” accessible to everyone.<br>This lets most people use an efficient hybrid workflow where they compare locally executed candidate runs against remotely executed production baselines. Even if 10 people are running 25 experiments between them, the baseline holds (relatively) static across the entire population, and we can talk objectively about them all. We take great pains to ensure that everyone’s local environments are synced with one another and with the remote environment, so we are always comparing apples to apples — this, in particular, is a nontrivial problem that we’re still working to perfect.<br>The upshot of this model is that it’s easy for anyone to spin up a new branch, make some code changes, and then run evals against a shared, consistent remote baseline. 
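To make the candidate-versus-baseline framing concrete, here is a minimal Python sketch of a pairwise comparison. It is an illustration only, not Shoebox's actual code: `EvalResult`, `compare_runs`, the 0-to-1 score scale, the `tolerance` threshold, and the eval names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    eval_id: str   # hypothetical identifier for one eval case
    score: float   # hypothetical rubric score, assumed to lie in [0, 1]


def compare_runs(candidate, baseline, tolerance=0.05):
    """Label each eval present in both runs as a win, loss, or tie for the candidate."""
    baseline_scores = {r.eval_id: r.score for r in baseline}
    summary = {"win": [], "loss": [], "tie": []}
    for result in candidate:
        if result.eval_id not in baseline_scores:
            continue  # only evals present in both runs can be compared
        delta = result.score - baseline_scores[result.eval_id]
        if delta > tolerance:
            summary["win"].append(result.eval_id)
        elif delta < -tolerance:
            summary["loss"].append(result.eval_id)
        else:
            summary["tie"].append(result.eval_id)  # within the noise threshold
    return summary


# Assumed example data: a shared baseline run and one locally executed candidate run.
baseline = [
    EvalResult("revenue_by_region", 0.60),
    EvalResult("churn_cohorts", 0.90),
    EvalResult("impossible_question", 0.50),
]
candidate = [
    EvalResult("revenue_by_region", 0.80),
    EvalResult("churn_cohorts", 0.70),
    EvalResult("impossible_question", 0.52),
]
summary = compare_runs(candidate, baseline)
print(summary)
```

Because every candidate is scored against the same baseline, two people running different experiments can compare their win and loss lists directly. The tolerance is there so that small run-to-run differences are not read as improvements.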
It’s also just as simple to test a new model configuration or system prompt change as it is to test a change to workspace context, some new memory system, or search infrastructure.<br>… or for our PMs to peruse evals before bed<br>Our eval sets are also relatively small compared to public benchmarks. Good evals are very hard to write, so we prefer to artisanally craft strong, broadly applicable evals that are rich enough for people to extract new signals from by layering new rubrics on top, rather than regularly writing net-new evals.<br>To facilitate this, our core eval sets ship with a preconfigured set of rubrics and ground truths, but anyone can configure their own deterministic, LLM-judged, or hybrid rubrics, or, if they must, create new evals and eval sets. Most of our eval sets now run with lots of additional rubrics like ToolEfficiency, SemanticLayerUsage, and WorkspaceGuideAdherence, along with others that interest a particular team or product area and cost almost nothing to tack onto the existing corpus of evals.<br>To further minimize unnecessary eval and rubric sprawl, users can even create flexible run-scoped “hypothesis objective” rubrics that allow for more targeted pairwise evaluation scoped to a particular experiment. These LLM-judged rubrics consider a candidate...