Building Blocks of GenAI Product Evaluation

Yinghong Lan


The offline stack - rubric, guideline, judge, annotator, benchmark, all resting on a well-built eval set - and the online A/B tests that keep it honest, illustrated with image generation.

Yinghong Lan Jun 23, 2026

I’ve previously written an introduction to multimodal LLM-as-a-judge [LINK], and a deep dive into long-form video understanding - the algorithms [LINK] and the benchmarks [LINK]. This post zooms out to the big picture of GenAI evaluation:

Why it’s the foundation of product and technical strategy, not a report card (Section 1).

How the building blocks compose into a single evaluation system (Section 2):

A rubric and a guideline define what you measure and how to apply it consistently. (A checklist is a special case of a rubric - see below.)

An LLM/MLLM judge and a human annotator are the two kinds of rater who apply them. (Here the annotator is critiquing or scoring model output, as distinct from the annotator who produces ground-truth labels.)

A benchmark works by freezing the whole package: the data, the criteria, and the scoring mechanism - whether an auto-rater, stored annotator labels (the industry's "golden set"), or both - are locked together so results can be compared over time. That reusability is useful for short-term iteration, but it also limits the value of a benchmark over a longer product lifecycle, where everything else evolves.

These five building blocks are all offline - they score quality on an eval set. The online half is the A/B test on real users and creators, where product impact is estimated with causal inference (Section 3).

Choosing which block to reach for means answering two questions at once (Section 4):

A measurement question - does this instrument measure the right thing (validity), and measure it consistently (reliability)?

An economics question - what does each evaluation cost in money, turnaround time, and updating effort?

What this looks like in practice, using image generation as the worked example throughout - chosen to offer a different angle (generation) from long-form video understanding.
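As a small illustration of the reliability half of the measurement question, here is a minimal sketch that checks how consistently an LLM judge and a human annotator apply the same pass/fail rubric to the same outputs. It uses Cohen's kappa, a standard chance-corrected agreement statistic; the choice of statistic, the function name `cohens_kappa`, and the verdict lists are my assumptions for illustration, not something this post prescribes.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Share of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical pass/fail verdicts on the same eight generated images.
judge_verdicts = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
annotator_verdicts = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

kappa = cohens_kappa(judge_verdicts, annotator_verdicts)
print(f"judge vs. annotator kappa: {kappa:.3f}")
```

In this made-up sample the raters agree on 6 of 8 items, yet kappa is well below 0.75 because both say "pass" most of the time, so a fair amount of that agreement would occur by chance. High agreement alone says nothing about validity: two raters can agree consistently while both measuring the wrong thing.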

One scope note before we go further: this post is about quality evaluation - is the output good, and does “good” actually move the product metric we care about? I’m deliberately not covering safety or policy evaluation here. Not because safety is secondary - it isn’t - but because it is a different problem with a different risk, impact, and cost profile. It deserves its own treatment, and folding it into this post would make the argument longer and less clear.

Evaluation is the foundation for product and technical strategy

In generative AI products, evaluation is not the scoreboard after the game. It is the mechanism that decides what game the team is actually playing. That statement is not an exaggeration - whatever your evaluation rewards becomes the product spec in practice. For example, if your image-generation eval scores prompt adherence but never aesthetic coherence, you will ship a model that follows instructions and looks wrong. The evaluation system operationalizes the project objectives - design it carelessly and you optimize for the wrong thing without knowing it.

GenAI evaluation is much more than measuring models - it needs to drive (at least) four different decisions:

Capability measurement and offline model selection → technical feasibility and design: is this even buildable, and which approach do we bet on?

Launch-readiness evaluation → the ship/no-ship decision: have we closed enough of the gap to launch?

Online behavior and feedback → which product gaps to close next: which user and creator needs are we still not meeting today?

Failure-mode diagnosis and root-cause analysis → R&D prioritization: what do we fix or build next?

These four are not a coincidence - they are the product lifecycle reflected in the evaluation system. And they all draw on two different regimes of measurement.

Offline: eval set, rubric, guideline, judge, annotator, benchmark. Everything offline rests on the eval set: its coverage, difficulty, and match to production traffic decide whether the whole stack measures anything real.
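One way to make "match to production traffic" checkable is to compare the mix of prompt categories in the eval set against a sample of production prompts. The sketch below is a minimal version of that idea, assuming every prompt has already been tagged with a category. The category names, the counts, and the use of total variation distance as the mismatch score are all assumptions for illustration.

```python
from collections import Counter


def category_shares(categories):
    """Turn a list of category tags into a {category: share} mapping."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {c: k / total for c, k in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two share mappings (0 = same mix, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical category tags for eval-set prompts and sampled production prompts.
eval_set = ["portrait"] * 5 + ["landscape"] * 4 + ["text_in_image"] * 1
production = (
    ["portrait"] * 3 + ["landscape"] * 2 + ["text_in_image"] * 4 + ["product_photo"] * 1
)

eval_shares = category_shares(eval_set)
prod_shares = category_shares(production)

tvd = total_variation_distance(eval_shares, prod_shares)
# Categories users actually send that the eval set never covers.
missing = sorted(set(prod_shares) - set(eval_shares))

print(f"mix mismatch (TVD): {tvd:.2f}")
print(f"production categories missing from eval set: {missing}")
```

A single distance only flags that the mix has drifted. The per-category gaps and the list of missing categories are what tell you which prompts to add. This covers match only; coverage of rare cases and difficulty need their own checks.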

Online, you measure impact on real users and creators. Online has two modes: the A/B test asks "is the new system better?" (a controlled, causal comparison), and continuous monitoring asks "is the live system still good?" (drift, regressions, failures that appear only in production). This article focuses on the A/B test; continuous monitoring is its own topic and I won't go into it here. The offline metrics are proxies for these online outcomes - and the online metrics are themselves proxies for long-term user and creator value. Engagement can rise while value falls (e.g., through a novelty effect).
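To make "a controlled, causal comparison" concrete: when users are randomly assigned to the old or the new system, the difference in mean outcome between the two groups estimates the average effect of the change. The sketch below computes that difference with a normal-approximation 95% confidence interval. The metric (a per-user rate of saving a generated image), the simulated data, and the function name `diff_in_means` are hypothetical; a real analysis would also need to handle things like multiple metrics, peeking, and novelty effects.

```python
import math
import random
from statistics import mean, variance


def diff_in_means(control, treatment, z=1.96):
    """Estimated treatment effect and a normal-approximation confidence interval."""
    effect = mean(treatment) - mean(control)
    # Standard error with unequal variances allowed (Welch-style).
    se = math.sqrt(
        variance(treatment) / len(treatment) + variance(control) / len(control)
    )
    return effect, (effect - z * se, effect + z * se)


# Simulated per-user outcomes; in a real test these come from randomized users.
rng = random.Random(0)
control = [rng.gauss(0.50, 0.10) for _ in range(2000)]
treatment = [rng.gauss(0.52, 0.10) for _ in range(2000)]

effect, (low, high) = diff_in_means(control, treatment)
print(f"estimated lift: {effect:.4f} (95% CI {low:.4f} to {high:.4f})")
```

The causal reading comes from the randomization, not from the arithmetic. The same subtraction on two self-selected groups would only describe a correlation.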

These two regimes hand off across the lifecycle rather than running in...
