Realistic Evals, or You're Blind

pydantic ai × pydantic evals × crabbox

A runnable demo built from the seven anti-lessons of "7 Anti-Lessons from Building a Pydantic AI Agent" (PyCon DE): one agent, two tools, a markdown workflow — and an eval suite of real user journeys that catches the bug manual testing never will. Then the whole thing runs on a disposable Linux box in 6.2 seconds.

uv run python -m evals.run_evals

"Unlike unit tests, evals are an emerging art/science. Anyone who claims to know exactly how your evals should be defined can safely be ignored."

— Pydantic Evals documentation, quoted on stage at PyCon DE & PyData 2026

01 · the architecture

Seven myths, deleted.

Every anti-lesson from the talk maps to a concrete artifact in this repo. The architecture is the absence of architecture.

"We need a multi-agent system"

One agent. Built one, deleted it.

→ agent/triage_agent.py — exactly one Agent

"Agents need sophisticated planning"

A numbered list beat the workflow engine.

→ the workflow is six markdown bullet points

"Give the agent lots of specific tools"

Two high-level tools replaced dozens.

→ search_runbooks + get_build_context

"Encode workflows in code"

Markdown the agent reads at runtime won.

→ agent/workflows/triage.md

"It works when I test it"

Simple tests ≠ real user journeys.

→ evals/dataset.py — vague, angry, ambiguous cases

"Automate everything"

Human in the driver's seat, not the trunk.

→ EscalationPolicy evaluator, asserted in CI

"Apply what made you successful before"

Deterministic checks first. LLM judge only where code can't grade.

→ evals/evaluators.py vs LLMJudge (live mode)
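The "markdown the agent reads at runtime" idea can be sketched in a few lines of plain Python. This is a minimal sketch, not the repo's code: `load_workflow`, `build_instructions`, the framing sentence, and the four demo steps are all hypothetical (the real agent/workflows/triage.md is described as six bullet points and is not reproduced here).

```python
from pathlib import Path
import tempfile


def load_workflow(path: Path) -> str:
    """Read the workflow markdown on every call, so editing the file changes behavior."""
    return path.read_text(encoding="utf-8").strip()


def build_instructions(workflow_md: str) -> str:
    # Hypothetical framing sentence; the real prompt wording is not shown in the source.
    return "Follow this triage workflow step by step:\n\n" + workflow_md


# Stand-in for agent/workflows/triage.md. These steps are invented for illustration.
demo_steps = "\n".join([
    "1. Read the report and any attached log.",
    "2. Call get_build_context when a build id is given.",
    "3. Call search_runbooks with the strongest error signal.",
    "4. Escalate to a human when the evidence is missing.",
])

with tempfile.TemporaryDirectory() as tmp:
    workflow_path = Path(tmp) / "triage.md"
    workflow_path.write_text(demo_steps, encoding="utf-8")
    instructions = build_instructions(load_workflow(workflow_path))

print(instructions)
```

The resulting string would be handed to the single agent as its instructions; changing the workflow then means editing a text file, not a code path.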

02 · the centerpiece

The failing row is the point.

The offline model stub ships a deliberate bug: on a vague report with no log, it confidently guesses instead of escalating. Six realistic journeys, every run:

triage-agent[offline] — pydantic_evals report

Evaluation Summary: triage-agent[offline]
┌──────────────────────────┬─────────────────────────┬────────────┬──────────┐
│ Case ID                  │ Scores                  │ Assertions │ Duration │
├──────────────────────────┼─────────────────────────┼────────────┼──────────┤
│ oom_linker_crash         │ CategoryMatches: 1.00   │ ✔✔✔✔       │ 58.3ms   │
│ stale_cache_poisoning    │ CategoryMatches: 1.00   │ ✔✔✔✔       │ 61.0ms   │
│ flaky_integration_test   │ CategoryMatches: 1.00   │ ✔✔✔✔       │ 57.8ms   │
│ toolchain_version_drift  │ CategoryMatches: 1.00   │ ✔✔✔✔       │ 59.2ms   │
│ vague_angry_no_log       │ CategoryMatches: 0.50   │ ✔✗✗✔       │ 58.3ms   │
│ ambiguous_segfault       │ CategoryMatches: 1.00   │ ✔✔✔✔       │ 56.1ms   │
├──────────────────────────┼─────────────────────────┼────────────┼──────────┤
│ Averages                 │ CategoryMatches: 0.917  │ 91.7% ✔    │ 58.2ms   │
└──────────────────────────┴─────────────────────────┴────────────┴──────────┘

PASS: assertion pass rate 91.7% (threshold 60%)

Failing cases:
- vague_angry_no_log: EscalationPolicy, NoDestructiveGuessing
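The EscalationPolicy failure on vague_angry_no_log comes down to a check of roughly this shape. This is a plain-Python sketch under stated assumptions, not the evaluator in evals/evaluators.py: the `TriageResult` fields, the "infrastructure" and "unknown" category labels, and the function signature are hypothetical. Only `must_escalate`, the 0.9 confidence, and the quoted answer come from the report.

```python
from dataclasses import dataclass


@dataclass
class TriageResult:
    # Hypothetical output shape; the real structured output is not shown in the source.
    category: str
    confidence: float
    escalate: bool
    suggested_fix: str


def escalation_policy(output: TriageResult, metadata: dict) -> bool:
    """Pass unless the case demands escalation and the agent answered anyway."""
    if metadata.get("must_escalate", False):
        return output.escalate
    return True


vague_case_metadata = {"must_escalate": True}

# The stub's deliberate bug: a confident guess where a hand-off was required.
buggy_output = TriageResult(
    category="infrastructure",  # assumed label
    confidence=0.9,
    escalate=False,
    suggested_fix="probably an infrastructure hiccup, restart the agent pool",
)
# What a well-behaved answer might look like (illustrative values).
escalated_output = TriageResult(
    category="unknown", confidence=0.2, escalate=True, suggested_fix=""
)

print(escalation_policy(buggy_output, vague_case_metadata))      # False
print(escalation_policy(escalated_output, vague_case_metadata))  # True
```

Because the check reads only the case metadata and the structured output, it costs nothing to run on every case, every time.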

"The build is broken AGAIN!!! Just fix it." — no log, no build id. The agent answered "probably an infrastructure hiccup, restart the agent pool" at 0.9 confidence. A happy-path manual test never executes this journey. The EscalationPolicy evaluator does — every single run. That's the difference between testing and seeing.

03 · eval design

Assertions gate. Scores trend.

Expectations live in case metadata, not expected outputs — the dataset stays declarative and partial credit is possible.

Cases

Six journeys developers actually have at 2am, including the vague and the angry ones. Each case's metadata carries expected_category, must_escalate, and fix_keywords.
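A declarative case list with expectations in metadata might look like the sketch below. It uses plain dataclasses to stay dependency-free and is not the contents of evals/dataset.py: `JourneyCase`, `REQUIRED_KEYS`, the category labels, the keyword lists, and the wording of the first report are all hypothetical. Only the case names, the three metadata keys, and the angry report text come from the page.

```python
from dataclasses import dataclass, field


@dataclass
class JourneyCase:
    name: str
    report: str                                   # what the user actually typed
    metadata: dict = field(default_factory=dict)  # expectations live here


REQUIRED_KEYS = {"expected_category", "must_escalate", "fix_keywords"}

CASES = [
    JourneyCase(
        name="oom_linker_crash",
        report="Link step dies with 'Killed' on the big target.",  # invented wording
        metadata={
            "expected_category": "oom",            # assumed label
            "must_escalate": False,
            "fix_keywords": ["memory", "parallel"],  # assumed keywords
        },
    ),
    JourneyCase(
        name="vague_angry_no_log",
        report="The build is broken AGAIN!!! Just fix it.",
        metadata={
            "expected_category": "unknown",        # assumed label
            "must_escalate": True,
            "fix_keywords": [],
        },
    ),
]


def missing_metadata(case: JourneyCase) -> list[str]:
    """Return the expectation keys a case forgot to declare."""
    return sorted(REQUIRED_KEYS - set(case.metadata))


must_escalate_names = [c.name for c in CASES if c.metadata["must_escalate"]]
print(must_escalate_names)
```

No case declares an exact expected output, so evaluators can award partial credit instead of doing an all-or-nothing comparison.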

Deterministic evaluators

CategoryMatches (1.0 / 0.5 / 0.0), FixMentions, and two policy assertions: EscalationPolicy, NoDestructiveGuessing.
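The page does not say what earns the 0.5 in CategoryMatches, so the sketch below assumes one plausible rule: half credit when the prediction falls in a per-case set of near-miss categories. That rule, the `near_misses` parameter, and the keyword-fraction scoring in `fix_mentions` are assumptions, not the repo's implementation.

```python
def category_matches(
    predicted: str,
    expected: str,
    near_misses: frozenset[str] = frozenset(),
) -> float:
    """Score 1.0 for an exact match, 0.5 for an assumed near miss, else 0.0."""
    if predicted == expected:
        return 1.0
    if predicted in near_misses:
        return 0.5
    return 0.0


def fix_mentions(fix_text: str, keywords: list[str]) -> float:
    """Fraction of expected keywords that appear in the proposed fix."""
    if not keywords:
        return 1.0  # nothing was required, so nothing is missing
    text = fix_text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)


# Category labels and keywords here are illustrative only.
print(category_matches("cache", "cache"))
print(category_matches("infrastructure", "unknown", frozenset({"infrastructure"})))
print(fix_mentions("Clear the cache, then rebuild.", ["cache", "rebuild"]))
```

Scores like these feed trend lines; the pass/fail gate is left to the boolean policy assertions.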

LLM judge (live)

LLMJudge grades what code can't: is the fix concrete, runnable, and supported by the evidence? Rubric-driven, model-graded.

CI gate

Exit code from pass rate vs EVAL_MIN_PASS. Honest threshold: a suite at 100% from day one is measuring nothing.
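The gate arithmetic is small enough to show in full. In the offline report, six cases carry four assertions each and two fail, so 22 of 24 pass. A minimal sketch, assuming EVAL_MIN_PASS is a fraction such as 0.6 (the page only shows the threshold rendered as 60%); the `gate` function is hypothetical:

```python
import os


def gate(passed: int, total: int, min_pass: float) -> int:
    """Return a process exit code: 0 when the pass rate meets the threshold."""
    rate = passed / total if total else 0.0
    ok = rate >= min_pass
    verdict = "PASS" if ok else "FAIL"
    print(f"{verdict}: assertion pass rate {rate:.1%} (threshold {min_pass:.0%})")
    return 0 if ok else 1


# Assumed format: a fraction between 0 and 1, defaulting to 0.6.
min_pass = float(os.environ.get("EVAL_MIN_PASS", "0.6"))

exit_code = gate(passed=22, total=24, min_pass=min_pass)
# A real entrypoint would finish with sys.exit(exit_code) so CI sees the result.
```

Raising the threshold over time is how the known failing row turns from a documented gap into a blocking one.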

04 · three ways to run

From zero keys to a fleet.

Same entrypoint, three execution tiers. Mode auto-detects from the environment.
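Auto-detection between the first two tiers can be as simple as checking for the key. This is a sketch of one way to do it, assuming the presence of ANTHROPIC_API_KEY is the only signal; `detect_mode` is a hypothetical name, and the remote tier is handled outside the process by crabbox, so it does not appear here.

```python
import os
from typing import Mapping


def detect_mode(env: Mapping[str, str] = os.environ) -> str:
    """Pick 'live' when an Anthropic key is set and non-empty, else 'offline'."""
    return "live" if env.get("ANTHROPIC_API_KEY") else "offline"


print(detect_mode({}))                                # offline
print(detect_mode({"ANTHROPIC_API_KEY": "example"}))  # live
```

Passing the environment in as a mapping keeps the function testable without touching the real process environment.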

Offline (no API key)

A FunctionModel stub plays the LLM — calls tools, emits structured output, deterministic. The eval harness is fully exercised for free.

# 10 seconds, zero keys
$ uv sync
$ uv run python -m evals.run_evals

cost $0 · runs anywhere, incl. CI

Live (Anthropic key)

Real claude-haiku-4-5 drives the agent; an LLMJudge evaluator joins the panel to grade fix quality against a rubric.

$ export ANTHROPIC_API_KEY=…
$ uv run python -m evals.run_evals
# auto-switches to live mode

adds LLMJudge · model-graded rubric

Remote · crabbox (disposable box)

Lease a box, rsync the dirty checkout, bootstrap uv, run, stream the report, release. Swap providers — local container to real fleet.

$ crabbox job run evals
# or explicitly:
$ crabbox run -provider apple-container \
    -- bash scripts/run_evals_remote.sh

measured cold run 6.2s...
