"It's Hard to Eval" Is a Product Smell

call-me-al2 pts0 comments

“It’s Hard to Eval” Is a Product Smell – Hamel's Blog

Subscribe To My Newsletter

For the past 3 years, AI evals have been my professional focus.1 The most common objection I hear to evals is “our product is hard to eval”.

This objection is a product smell. Artifacts that are hard for you to verify are often hard for users too. In the worst case, users have to redo the work from scratch to verify the output. More importantly, designing your product for ease of verification should come before building evals.

In this post, I’ll walk through three products I advised on that faced this issue. I’ll also show before and after sketches to demonstrate design principles. After these examples, I’ll discuss how to apply this general pattern to your product.

Example 1: the AI data agent

Almost every company I’ve worked with builds an internal AI data agent. You ask it a business question, like what was net revenue for Product A last quarter, and it finds relevant data sources, runs the queries, and provides an answer. The goal of this agent is to reduce dependency on data analysts.

A common mistake when building AI data agents is to make the answer the only output, as illustrated below.

Data Agent

What was net revenue for Product A last quarter?

Net revenue for Product A last quarter was $4.21M .

Ask anything about your business…➤

Since the only output is the answer, there is nothing here to check.

In the sketch above, the user has no way to verify the answer beyond redoing work.2 A better design is to provide the user with checkable artifacts, informed by how a domain expert might validate the output. Here are techniques I use to validate metrics as a data scientist:

Compare the quantity and any intermediate calculations against a trusted source, like a vetted dashboard or report, or a similar analysis a colleague has already vetted.3

Confirm the metric definition precisely. A number like net revenue can include or exclude things like returns and discounts.

Sanity-check a related quantity. If I can’t verify the number directly, I pull a related number that should move with it, like units sold or unique customers, and check if the combination is plausible.

Look at what is beneath the aggregate. A total can hide problems, so I break it down by dimensions like region or time period and sanity-check the distribution.

Read the query. For an important number I look at the SQL to confirm it does what I think, and I tweak it and rerun to test my assumptions.

Note anything I could not verify. If a step has no trusted reference to check it against, I flag it instead of presenting it as settled.

Here’s what a better interface might look like. The two tabs below show the same answer at two levels of detail. The chat reply surfaces the details worth seeing up front, and the notebook holds the full analysis behind the answer. Use the tabs to switch between them.

Chat<br>Notebook

Data Agent

What was net revenue for Product A last quarter?

Net revenue for Product A last quarter was $4.21M .ⓘ

Details

Adapted from “Quarterly revenue review”authored by Priya Nair · Jun 28, 2026<br>Open ↗

Assumptions

Metric definitiongross − returns − discounts<br>matches governed definition

Intermediate calculations

Returns netted out−$0.7M<br>no trusted sourceⓘ

Unique customers12,480<br>183 unmatchedⓘ, CRM ↔ Billing

Open as notebook

The agent surfaces the important details behind the answer. Select the Notebook tab above to see the full analysis generated by the agent.

This is the notebook the agent worked in while producing its answer. Scroll within the figure to see all the cells. The sidebar has a Contents tab for jumping between sections and an Assistant tab for asking follow-ups.

net_revenue_product_a.ipynb

▶ Run all<br>↑ Publish

Adapted from Quarterly revenue review ↗, authored by Priya Nair · Jun 28, 2026

MarkdownBiH🔗☰

Net revenue — Product A, Q4 FY25

“What was net revenue for Product A last quarter?”

Short answer: $4.21M . This notebook shows how that number was built and what was checked against a trusted source.

1. Metric definition

Net revenue is gross − returns − discounts. Read the definition from the governed metrics layer so this matches what finance reports.

[1]

Python▶Code collapsed✦ Ask AI ⌘K<br>import yaml

defn = yaml.safe_load(open("metrics/net_revenue.yml"))<br>defn["expr"], defn["source"]

('gross - returns - discounts', 'finance.order_lines')

2. Net revenue for Product A

Pull net revenue for Product A for the quarter, straight from the order lines.

[2]

SQL▶Code collapsed✦ Ask AI ⌘K<br>SELECT SUM(gross - returns - discounts) AS net_revenue<br>FROM finance.order_lines<br>WHERE product = 'Product A'<br>AND fiscal_quarter = 'Q4-FY25';

net_revenue

4,210,442

That's the $4.21M the chat answer reported.

3. Sanity-check the distribution

Break Product A down by region, then plot it. Nothing should look out of place against last quarter's mix.

[3]

SQL▶Code collapsed✦ Ask AI ⌘K<br>SELECT region,<br>SUM(gross - returns -...

product revenue answer data agent quarter

Related Articles