Evals or You're Blind — Pydantic AI × Pydantic Evals × crabbox
pydantic ai × pydantic evals × crabbox
Realistic evals, or you're blind.
A runnable demo built from the seven anti-lessons of the talk "7 Anti-Lessons from Building a Pydantic AI Agent" (PyCon DE): one agent, two tools, a markdown workflow, and an eval suite of real user journeys that catches the bug manual testing never will. The whole thing then runs on a disposable Linux box in 6.2 seconds.
uv run python -m evals.run_evals
"Unlike unit tests, evals are an emerging art/science; anyone who claims to know for sure exactly how your evals should be defined can safely be ignored."
— Pydantic Evals documentation, quoted on stage at PyCon DE & PyData 2026
01 · the architecture
Seven myths, deleted.
Every anti-lesson from the talk maps to a concrete artifact in this repo. The architecture is the absence of architecture.
"We need a multi-agent system"
One agent. Built one, deleted it.
→ agent/triage_agent.py — exactly one Agent
"Agents need sophisticated planning"
A numbered list beat the workflow engine.
→ the workflow is six markdown bullet points
"Give the agent lots of specific tools"
Two high-level tools replaced dozens.
→ search_runbooks + get_build_context
"Encode workflows in code"
Markdown the agent reads at runtime won.
→ agent/workflows/triage.md
"It works when I test it"
Simple tests ≠ real user journeys.
→ evals/dataset.py — vague, angry, ambiguous cases
"Automate everything"
Human in the driver's seat, not the trunk.
→ EscalationPolicy evaluator, asserted in CI
"Apply what made you successful before"
Deterministic checks first. LLM judge only where code can't grade.
→ evals/evaluators.py vs LLMJudge (live mode)
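The fourth item above, markdown the agent reads at runtime, can be sketched in a few lines. This is a minimal illustration, not the repo's code: `load_workflow`, `build_instructions`, and `FALLBACK_WORKFLOW` are hypothetical names, and the fallback text exists only so the sketch runs outside the repo. The point is that editing agent/workflows/triage.md changes the agent's behavior without a code change.

```python
from pathlib import Path

# Hypothetical fallback so the sketch runs outside the repo.
# The real steps live in agent/workflows/triage.md.
FALLBACK_WORKFLOW = "1. Read the report.\n2. Escalate if evidence is missing.\n"


def load_workflow(path: Path) -> str:
    """Return the workflow markdown, read fresh from disk on every call."""
    if path.is_file():
        return path.read_text(encoding="utf-8")
    return FALLBACK_WORKFLOW


def build_instructions(path: Path) -> str:
    """Prepend a short framing line to the workflow text."""
    return "Follow this triage workflow step by step:\n\n" + load_workflow(path)
```

The returned string would then be handed to the agent as its instructions; how the repo wires that up is not shown here.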
02 · the centerpiece
The failing row is the point.
The offline model stub ships a deliberate bug: on a vague report with no log, it confidently guesses instead of escalating. Six realistic journeys, every run:
triage-agent[offline] — pydantic_evals report
Evaluation Summary: triage-agent[offline]
┌─────────────────────────┬────────────────────────┬────────────┬──────────┐
│ Case ID                 │ Scores                 │ Assertions │ Duration │
├─────────────────────────┼────────────────────────┼────────────┼──────────┤
│ oom_linker_crash        │ CategoryMatches: 1.00  │ ✔✔✔✔       │ 58.3ms   │
│ stale_cache_poisoning   │ CategoryMatches: 1.00  │ ✔✔✔✔       │ 61.0ms   │
│ flaky_integration_test  │ CategoryMatches: 1.00  │ ✔✔✔✔       │ 57.8ms   │
│ toolchain_version_drift │ CategoryMatches: 1.00  │ ✔✔✔✔       │ 59.2ms   │
│ vague_angry_no_log      │ CategoryMatches: 0.50  │ ✔✗✗✔       │ 58.3ms   │
│ ambiguous_segfault      │ CategoryMatches: 1.00  │ ✔✔✔✔       │ 56.1ms   │
├─────────────────────────┼────────────────────────┼────────────┼──────────┤
│ Averages                │ CategoryMatches: 0.917 │ 91.7% ✔    │ 58.2ms   │
└─────────────────────────┴────────────────────────┴────────────┴──────────┘
PASS: assertion pass rate 91.7% (threshold 60%)
Failing cases:
- vague_angry_no_log: EscalationPolicy, NoDestructiveGuessing
"The build is broken AGAIN!!! Just fix it." No log, no build id. The agent answered "probably an infrastructure hiccup, restart the agent pool" at 0.9 confidence. A happy-path manual test never executes this journey. The EscalationPolicy evaluator does, on every single run. That's the difference between testing and seeing.
03 · eval design
Assertions gate. Scores trend.
Expectations live in case metadata, not in expected outputs, so the dataset stays declarative and partial credit is possible.
Cases
Six journeys developers actually have at 2am, including the vague and the angry ones. Each case's metadata carries expected_category, must_escalate, and fix_keywords.
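A declarative case list along these lines might look as follows. This is a plain-Python stand-in rather than the repo's evals/dataset.py: `JourneyCase` and `cases_requiring_escalation` are hypothetical names, and the category labels, keywords, and the first report's wording are invented for illustration. Only the three metadata keys and the quoted angry report come from the text above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class JourneyCase:
    """One user journey: the raw report plus the expectations to grade against."""
    name: str
    report: str  # what the developer actually typed
    metadata: dict[str, Any] = field(default_factory=dict)


CASES = [
    JourneyCase(
        name="oom_linker_crash",
        report="Linker got killed during the release build, log attached.",  # hypothetical wording
        metadata={
            "expected_category": "infra/oom",  # hypothetical label
            "must_escalate": False,
            "fix_keywords": ["memory", "jobs"],  # hypothetical keywords
        },
    ),
    JourneyCase(
        name="vague_angry_no_log",
        report="The build is broken AGAIN!!! Just fix it.",
        metadata={
            "expected_category": "unknown",  # hypothetical label
            "must_escalate": True,
            "fix_keywords": [],
        },
    ),
]


def cases_requiring_escalation(cases: list[JourneyCase]) -> list[JourneyCase]:
    """Select the journeys where a human must be pulled in."""
    return [c for c in cases if c.metadata.get("must_escalate", False)]
```

Because the expectations are data, an evaluator can read whichever keys it cares about and ignore the rest, which is what makes partial credit straightforward.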
Deterministic evaluators
CategoryMatches (1.0 / 0.5 / 0.0), FixMentions, and two policy assertions: EscalationPolicy and NoDestructiveGuessing.
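The sketch below shows what deterministic graders of this kind could look like as plain functions. It makes several assumptions: the `TriageResult` shape is hypothetical, the 0.5 rule (same category family, taken as the text before the first `/`) is a guess at how partial credit might be awarded, and treating FixMentions as a keyword-hit fraction is likewise an assumption. The repo's evaluators in evals/evaluators.py may differ on all three; NoDestructiveGuessing is not sketched.

```python
from dataclasses import dataclass


@dataclass
class TriageResult:
    # Hypothetical output shape; the repo's real output model may differ.
    category: str
    suggested_fix: str
    escalate: bool


def category_matches(predicted: str, expected: str) -> float:
    """Score 1.0 for an exact match, 0.5 for the same family, 0.0 otherwise.

    The "family" rule is an assumption made for this sketch.
    """
    predicted, expected = predicted.strip().lower(), expected.strip().lower()
    if predicted == expected:
        return 1.0
    if predicted.split("/")[0] == expected.split("/")[0]:
        return 0.5
    return 0.0


def fix_mentions(fix_text: str, fix_keywords: list[str]) -> float:
    """Return the fraction of expected keywords that appear in the proposed fix."""
    if not fix_keywords:
        return 1.0  # nothing required, nothing missed
    text = fix_text.lower()
    hits = sum(1 for kw in fix_keywords if kw.lower() in text)
    return hits / len(fix_keywords)


def escalation_policy(result: TriageResult, must_escalate: bool) -> bool:
    """Pass unless the case requires escalation and the agent did not escalate."""
    return result.escalate if must_escalate else True


# The confident guess on a report that required a human fails the policy check.
guess = TriageResult("infra/network", "restart the agent pool", escalate=False)
assert escalation_policy(guess, must_escalate=True) is False
```

Scores like `category_matches` are meant to trend over time, while boolean checks like `escalation_policy` are the ones that gate.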
LLM judge (live)
LLMJudge grades what code can't: is the fix concrete, runnable, and supported by the evidence? Rubric-driven, model-graded.
CI gate
The exit code comes from comparing the assertion pass rate against EVAL_MIN_PASS. The threshold is deliberately honest: a suite at 100% from day one is measuring nothing.
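A gate of this shape fits in a dozen lines. The sketch assumes EVAL_MIN_PASS is a fraction between 0 and 1 with a default of 0.60 (matching the 60% threshold in the report above); the actual format and default in the repo are not confirmed, and `gate_exit_code` and `read_threshold` are hypothetical names.

```python
import os


def gate_exit_code(passed: int, total: int, min_pass: float) -> int:
    """Return 0 when the assertion pass rate meets the threshold, else 1."""
    if total == 0:
        return 1  # an empty suite should not pass silently
    return 0 if passed / total >= min_pass else 1


def read_threshold(default: float = 0.60) -> float:
    """Read EVAL_MIN_PASS as a fraction in [0, 1]; the format is an assumption."""
    return float(os.environ.get("EVAL_MIN_PASS", default))
```

With the report above, 22 of 24 assertions pass (six cases, four assertions each, two failures), so `gate_exit_code(22, 24, 0.60)` returns 0, and raising the threshold to 0.95 would turn the same run into a failing build.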
04 · three ways to run
From zero keys to a fleet.
Same entrypoint, three execution tiers. Mode auto-detects from the environment.
Offline (no API key)
A FunctionModel stub plays the LLM: it calls tools and emits structured output, deterministically. The eval harness is fully exercised for free.
# 10 seconds, zero keys
$ uv sync
$ uv run python -m evals.run_evals
cost $0 · runs anywhere, incl. CI
Live (Anthropic key)
Real claude-haiku-4-5 drives the agent; an LLMJudge evaluator joins the panel to grade fix quality against a rubric.
$ export ANTHROPIC_API_KEY=…
$ uv run python -m evals.run_evals
# auto-switches to live mode
adds LLMJudge · model-graded rubric
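The auto-switch between offline and live can be as simple as checking for the key. This is a guess at the mechanism, not the repo's code: `detect_mode` is a hypothetical name, and treating a blank value as "not set" is a choice made for this sketch.

```python
import os
from collections.abc import Mapping


def detect_mode(env: Mapping[str, str] = os.environ) -> str:
    """Return "live" when a non-empty ANTHROPIC_API_KEY is set, else "offline"."""
    return "live" if env.get("ANTHROPIC_API_KEY", "").strip() else "offline"
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.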
Remote · crabbox (disposable box)
Lease a box, rsync the dirty checkout, bootstrap uv, run, stream the report, release. Swap providers to go from a local container to a real fleet.
$ crabbox job run evals
# or explicitly:
$ crabbox run -provider apple-container \
    -- bash scripts/run_evals_remote.sh
measured cold run 6.2s...