Build Your Own Eval Harness from Scratch with Bun and claude -p


alexop.dev


Published: Jun 14, 2026

...building your own coding agent from scratch.

## Setup: Bun and the claude CLI

Two prerequisites, both one-liners:

```bash
# 1. Bun: the runtime that runs our harness
curl -fsSL https://bun.sh/install | bash

# 2. The Claude Code CLI: the agent we're testing, and our judge
npm install -g @anthropic-ai/claude-code

# sanity check: this should print a model's reply
claude -p "say hi in three words" --output-format json
```

The key flag we'll lean on is `--output-format json`, which makes the CLI print one machine-readable envelope instead of a stream of human text. Make a folder, drop in an empty `evals.ts`, and let's fill it.

## Step 1: drive the agent from code

First, a function that runs the agent on a prompt and hands back its reply. We shell out to `claude -p` (the "print" / non-interactive mode) and parse the JSON envelope it prints. That envelope carries the final text in `result`, the dollar cost in `total_cost_usd`, and an `is_error` flag.

```typescript
// evals.ts
import { spawnSync } from "bun";

// Run the agent on `prompt` inside `cwd`; return its final reply.
function runAgent(prompt: string, cwd: string) {
  const res = spawnSync({
    cmd: [
      "claude", "-p", prompt,
      "--output-format", "json", // one JSON envelope on stdout
      "--permission-mode", "bypassPermissions", // don't prompt us mid-run
      "--max-budget-usd", "0.50", // hard safety cap per run
    ],
    cwd,
    stdout: "pipe",
    stderr: "pipe",
    timeout: 180_000,
  });

  const envelope = JSON.parse(res.stdout.toString());
  return {
    text: envelope.result ?? "",
    ok: res.exitCode === 0 && envelope.is_error !== true,
    cost: Number(envelope.total_cost_usd ?? 0),
  };
}
```

Evals are slow (each one is a real model call) but we want them simple and ordered, not a clever async pipeline. Synchronous spawning keeps the whole harness readable top-to-bottom. You can parallelize later; correctness first.

## Step 2: give it a sandbox to act in

Letting an agent loose in your real repo is a bad idea, and it makes runs non-repeatable. Instead, every case gets a fresh throwaway git repo seeded with the files that behavior needs, a fixture. When the run is done, you can inspect or delete it.

```typescript
import { mkdtempSync, mkdirSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join, dirname } from "node:path";

// Make a throwaway git repo seeded with `files`; return its path.
function makeSandbox(files: Record<string, string>) {
  const dir = mkdtempSync(join(tmpdir(), "eval-"));
  spawnSync({ cmd: ["git", "init", "-q"], cwd: dir });
  for (const [path, content] of Object.entries(files)) {
    const target = join(dir, path);
    mkdirSync(dirname(target), { recursive: true });
    writeFileSync(target, content);
  }
  return dir;
}
```

If you're testing "does it notice the plan already exists," you need one file, a fake `docs/plans/checkout.md`, not a clone of your codebase. Small fixtures isolate the behavior and run fast.

## Step 3: grade with cheap, deterministic checks

Now the grading. Start with the cheapest tool that captures the behavior: plain string and file checks. They're free, instant, and never flaky. Reach for the LLM judge only for what these can't express.

```typescript
import { existsSync } from "node:fs";

const has = (haystack: string, needle: string) =>
  haystack.toLowerCase().includes(needle.toLowerCase());

type Checks = {
  required_substrings?: string[]; // must appear in the reply
  forbidden_substrings?: string[]; // must NOT appear
  required_files?: string[]; // must exist in the sandbox after the run
};

// Returns [label, passed] for each check.
function checkAssertions(checks: Checks, reply: string, dir: string) {
  const out: [string, boolean][] = [];
  for (const s of checks.required_substrings ?? [])
    out.push([`contains "${s}"`, has(reply, s)]);
  for (const s of checks.forbidden_substrings ?? [])
    out.push([`excludes "${s}"`, !has(reply, s)]);
  for (const f of checks.required_files ?? [])
    out.push([`created ${f}`, existsSync(join(dir, f))]);
  return out;
}
```

If a behavior has two halves, say, "records the repo's path and its origin URL," write two checks, not one. Otherwise a half-right answer passes. One requirement, one assertion.

## Step 4: grade fuzzy behavior with an LLM judge

Some behaviors have no keyword. "Did it read the repo before asking its first question?" "Did it explain the trade-off?" For those, you hand the reply to a second, cheaper model and ask it to grade each...
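To make the judge idea concrete, here is a minimal sketch of the two pure pieces such a judge would likely need: building the grading prompt and parsing the verdicts. This is an illustration under stated assumptions, not the article's implementation. The names `Verdict`, `buildJudgePrompt`, and `parseVerdicts` are hypothetical, and so is the JSON verdict format the prompt asks for.

```typescript
// Hypothetical helper names; the verdict JSON shape is an assumption.
type Verdict = { criterion: string; pass: boolean };

// Build a prompt that asks the judge to grade each criterion separately.
function buildJudgePrompt(reply: string, criteria: string[]): string {
  const numbered = criteria.map((c, i) => `${i + 1}. ${c}`).join("\n");
  return [
    "You are grading an AI agent's reply against the criteria below.",
    'Answer with only a JSON array of {"criterion": string, "pass": boolean},',
    "one entry per criterion, in the same order.",
    "",
    "Criteria:",
    numbered,
    "",
    "Reply to grade:",
    reply,
  ].join("\n");
}

// Parse the judge's answer. Verdicts are matched to criteria by position,
// and anything missing or malformed counts as a failure.
function parseVerdicts(raw: string, criteria: string[]): Verdict[] {
  let parsed: unknown = null;
  // Tolerate prose or code fences around the array (assumed judge behavior).
  const start = raw.indexOf("[");
  const end = raw.lastIndexOf("]");
  if (start !== -1 && end > start) {
    try {
      parsed = JSON.parse(raw.slice(start, end + 1));
    } catch {
      parsed = null;
    }
  }
  const items: unknown[] = Array.isArray(parsed) ? parsed : [];
  return criteria.map((criterion, i) => {
    const item = items[i] as { pass?: unknown } | undefined;
    return { criterion, pass: item?.pass === true };
  });
}
```

The prompt from `buildJudgePrompt` could be sent through the same kind of `claude -p` call as `runAgent` in Step 1, with the returned text passed to `parseVerdicts`. Treating unparseable output as a failure keeps a confused judge from silently passing a case.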
