Realistic Evals, or You're Blind


Evals or You're Blind — Pydantic AI × Pydantic Evals × crabbox


Realistic evals, or you're blind.

A runnable demo built from the seven anti-lessons of "7 Anti-Lessons from Building a Pydantic AI Agent" (PyCon DE): one agent, two tools, a markdown workflow, and an eval suite of real user journeys that catches the bug manual testing never will. Then the whole thing runs on a disposable Linux box in 6.2 seconds.

uv run python -m evals.run_evals

"Unlike unit tests, evals are an emerging art/science. Anyone who claims to know exactly how your evals should be defined can safely be ignored."

— Pydantic Evals documentation, quoted on stage at PyCon DE & PyData 2026

01 · the architecture

Seven myths, deleted.

Every anti-lesson from the talk maps to a concrete artifact in this repo. The architecture is the absence of architecture.

"We need a multi-agent system"

One agent. Built one, deleted it.

→ agent/triage_agent.py — exactly one Agent

"Agents need sophisticated planning"

A numbered list beat the workflow engine.

→ the workflow is six markdown bullet points

"Give the agent lots of specific tools"

Two high-level tools replaced dozens.

→ search_runbooks + get_build_context

"Encode workflows in code"

Markdown the agent reads at runtime won.

→ agent/workflows/triage.md

"It works when I test it"

Simple tests ≠ real user journeys.

→ evals/dataset.py — vague, angry, ambiguous cases

"Automate everything"

Human in the driver's seat, not the trunk.

→ EscalationPolicy evaluator, asserted in CI

"Apply what made you successful before"

Deterministic checks first. LLM judge only where code can't grade.

→ evals/evaluators.py vs LLMJudge (live mode)
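To make "markdown the agent reads at runtime" concrete, here is a minimal sketch, assuming the workflow file is simply read from disk and appended to the agent's instructions on each call. Only the path agent/workflows/triage.md comes from the list above; `BASE_PROMPT`, `load_workflow`, and `build_instructions` are hypothetical names, and the repo's actual wiring may differ.

```python
from pathlib import Path

# Path named in the list above; everything else in this sketch is hypothetical.
WORKFLOW_PATH = Path("agent/workflows/triage.md")

# Assumed base prompt, invented for illustration.
BASE_PROMPT = "You triage build failures. Follow the workflow below step by step."


def load_workflow(path: Path = WORKFLOW_PATH) -> str:
    """Read the workflow markdown from disk on every call, so edits need no code change."""
    return path.read_text(encoding="utf-8").strip()


def build_instructions(path: Path = WORKFLOW_PATH) -> str:
    """Combine the fixed base prompt with the current workflow text."""
    return f"{BASE_PROMPT}\n\n{load_workflow(path)}"
```

Because the file is re-read on each call, changing a step in the markdown changes the agent's behavior on the next run, which is the whole appeal over a workflow encoded in code.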

02 · the centerpiece

The failing row is the point.

The offline model stub ships a deliberate bug: on a vague report with no log, it confidently guesses instead of escalating. Six realistic journeys, every run:

triage-agent[offline] — pydantic_evals report

Evaluation Summary: triage-agent[offline]
┌──────────────────────────┬─────────────────────────┬────────────┬──────────┐
│ Case ID                  │ Scores                  │ Assertions │ Duration │
├──────────────────────────┼─────────────────────────┼────────────┼──────────┤
│ oom_linker_crash         │ CategoryMatches: 1.00   │ ✔✔✔✔       │ 58.3ms   │
│ stale_cache_poisoning    │ CategoryMatches: 1.00   │ ✔✔✔✔       │ 61.0ms   │
│ flaky_integration_test   │ CategoryMatches: 1.00   │ ✔✔✔✔       │ 57.8ms   │
│ toolchain_version_drift  │ CategoryMatches: 1.00   │ ✔✔✔✔       │ 59.2ms   │
│ vague_angry_no_log       │ CategoryMatches: 0.50   │ ✔✗✗✔       │ 58.3ms   │
│ ambiguous_segfault       │ CategoryMatches: 1.00   │ ✔✔✔✔       │ 56.1ms   │
├──────────────────────────┼─────────────────────────┼────────────┼──────────┤
│ Averages                 │ CategoryMatches: 0.917  │ 91.7% ✔    │ 58.2ms   │
└──────────────────────────┴─────────────────────────┴────────────┴──────────┘

PASS: assertion pass rate 91.7% (threshold 60%)

Failing cases:
- vague_angry_no_log: EscalationPolicy, NoDestructiveGuessing

"The build is broken AGAIN!!! Just fix it." No log, no build id. The agent answered "probably an infrastructure hiccup, restart the agent pool" at 0.9 confidence. A happy-path manual test never executes this journey. The EscalationPolicy evaluator does, on every single run. That's the difference between testing and seeing.

03 · eval design

Assertions gate. Scores trend.

Expectations live in case metadata, not expected outputs, so the dataset stays declarative and partial credit is possible.

Cases

Six journeys developers actually have at 2am, including the vague and the angry ones. Case metadata carries expected_category, must_escalate, and fix_keywords.
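A rough picture of what such a declarative case might look like, using a plain dataclass as a stand-in rather than the pydantic_evals classes. The case names, the three metadata keys, and the angry message come from this page; the `JourneyCase` class, the second message, and every metadata value are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class JourneyCase:
    """Stand-in for an eval case: the user's message plus expectations kept in metadata."""
    name: str
    inputs: str
    metadata: dict = field(default_factory=dict)


CASES = [
    JourneyCase(
        name="vague_angry_no_log",
        inputs="The build is broken AGAIN!!! Just fix it.",
        # Hypothetical values: no log means the right move is a human hand-off.
        metadata={
            "expected_category": "needs_info",
            "must_escalate": True,
            "fix_keywords": [],
        },
    ),
    JourneyCase(
        name="oom_linker_crash",
        # Invented message; the real case text is not shown on this page.
        inputs="Linker got killed halfway through the release build, log attached.",
        metadata={
            "expected_category": "oom",
            "must_escalate": False,
            "fix_keywords": ["memory", "parallel"],
        },
    ),
]

must_escalate = [case.name for case in CASES if case.metadata["must_escalate"]]
print(must_escalate)  # ['vague_angry_no_log']
```

Keeping expectations in metadata means no case needs a full golden answer: each evaluator reads only the keys it cares about.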

Deterministic evaluators

CategoryMatches (1.0 / 0.5 / 0.0), FixMentions, and two policy assertions: EscalationPolicy, NoDestructiveGuessing.
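The four deterministic checks could be as small as the plain functions below. This is a sketch under loud assumptions, not the repo's evaluators.py: the `TriageOutput` fields, the `near_miss_categories` and `has_log` metadata keys, the "all keywords must appear" rule, and the 0.5 confidence cut-off are hypothetical. Only the evaluator names, the 1.0 / 0.5 / 0.0 scale, and the metadata keys expected_category, must_escalate, and fix_keywords come from this page.

```python
from dataclasses import dataclass


@dataclass
class TriageOutput:
    """Hypothetical shape of the agent's structured answer."""
    category: str
    proposed_fix: str
    confidence: float
    escalate: bool


def category_matches(out: TriageOutput, meta: dict) -> float:
    """Score 1.0 for the expected category, 0.5 for a listed near miss, 0.0 otherwise."""
    if out.category == meta["expected_category"]:
        return 1.0
    if out.category in meta.get("near_miss_categories", ()):
        return 0.5
    return 0.0


def fix_mentions(out: TriageOutput, meta: dict) -> bool:
    """Pass when every expected keyword appears in the proposed fix (case-insensitive)."""
    text = out.proposed_fix.lower()
    return all(keyword.lower() in text for keyword in meta.get("fix_keywords", ()))


def escalation_policy(out: TriageOutput, meta: dict) -> bool:
    """Pass unless the case requires a human hand-off and the agent did not escalate."""
    return out.escalate if meta.get("must_escalate", False) else True


def no_destructive_guessing(out: TriageOutput, meta: dict, max_confidence: float = 0.5) -> bool:
    """Pass unless the agent commits to a confident fix with no log to back it up."""
    if meta.get("has_log", True):
        return True
    return out.escalate or out.confidence <= max_confidence


# A confident guess on a report with no log, and no escalation.
guess = TriageOutput("infrastructure", "restart the agent pool", 0.9, False)
vague_case = {
    "expected_category": "needs_info",
    "near_miss_categories": ["infrastructure"],
    "must_escalate": True,
    "has_log": False,
    "fix_keywords": [],
}

print(category_matches(guess, vague_case))         # 0.5
print(fix_mentions(guess, vague_case))             # True
print(escalation_policy(guess, vague_case))        # False
print(no_destructive_guessing(guess, vague_case))  # False
```

The score feeds a trend line while the booleans gate the run, so a partly right category costs half a point but a missed escalation fails outright.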

LLM judge (live)

LLMJudge grades what code can't: is the fix concrete, runnable, and supported by the evidence? Rubric-driven, model-graded.

CI gate

The exit code comes from comparing the assertion pass rate against EVAL_MIN_PASS. Keep the threshold honest: a suite at 100% from day one is measuring nothing.
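The gate itself is just arithmetic. A minimal sketch, assuming EVAL_MIN_PASS is a fraction between 0 and 1 with a default of 0.6 (the 60% threshold shown in the report); the function names and the shape of `REPORT` are hypothetical, and the booleans simply mirror the ✔/✗ marks from the offline run.

```python
import os


def pass_rate(results: dict[str, list[bool]]) -> float:
    """Fraction of assertions that passed across all cases."""
    outcomes = [ok for checks in results.values() for ok in checks]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def gate(results: dict[str, list[bool]], default_threshold: float = 0.6) -> int:
    """Return a process exit code: 0 when the pass rate meets the threshold, else 1."""
    # Assumption: EVAL_MIN_PASS holds a fraction such as "0.6", not a percentage.
    threshold = float(os.environ.get("EVAL_MIN_PASS", default_threshold))
    rate = pass_rate(results)
    verdict = "PASS" if rate >= threshold else "FAIL"
    print(f"{verdict}: assertion pass rate {rate:.1%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1


# Assertion outcomes copied from the offline report: 22 of 24 pass.
REPORT = {
    "oom_linker_crash": [True, True, True, True],
    "stale_cache_poisoning": [True, True, True, True],
    "flaky_integration_test": [True, True, True, True],
    "toolchain_version_drift": [True, True, True, True],
    "vague_angry_no_log": [True, False, False, True],
    "ambiguous_segfault": [True, True, True, True],
}

exit_code = gate(REPORT)  # a real entrypoint would pass this to sys.exit()
```

With EVAL_MIN_PASS unset this prints `PASS: assertion pass rate 91.7% (threshold 60%)`; raising the variable to 0.95 turns the same run into a failing build.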

04 · three ways to run

From zero keys to a fleet.

Same entrypoint, three execution tiers. Mode auto-detects from the environment.

Offline (no API key)

A FunctionModel stub plays the LLM: it calls tools and emits structured output, deterministically. The eval harness is fully exercised for free.

# 10 seconds, zero keys
$ uv sync
$ uv run python -m evals.run_evals

Cost $0 · runs anywhere, including CI

Live (Anthropic API key)

Real claude-haiku-4-5 drives the agent; an LLMJudge evaluator joins the panel to grade fix quality against a rubric.

$ export ANTHROPIC_API_KEY=…
$ uv run python -m evals.run_evals
# auto-switches to live mode

Adds LLMJudge · model-graded rubric
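The switch between the offline and live tiers can be pictured as a single environment check. The sketch below assumes the presence of a non-empty ANTHROPIC_API_KEY is the only signal; `detect_mode` is a hypothetical name, and the real entrypoint may look at more than this.

```python
import os
from typing import Mapping, Optional


def detect_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "live" when a non-empty Anthropic key is present, otherwise "offline"."""
    env = os.environ if env is None else env
    return "live" if env.get("ANTHROPIC_API_KEY", "").strip() else "offline"


print(detect_mode({}))                                   # offline
print(detect_mode({"ANTHROPIC_API_KEY": "sk-example"}))  # live
```

Passing the environment in as a mapping keeps the check testable without touching the real process environment.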

Remote · crabbox (disposable box)

Lease a box, rsync the dirty checkout, bootstrap uv, run, stream the report, release. Swap providers to move from a local container to a real fleet.

$ crabbox job run evals
# or explicitly:
$ crabbox run -provider apple-container \
    -- bash scripts/run_evals_remote.sh

measured cold run 6.2s...

