I built an agentic coding harness across three CLI hosts


Towards AI


I Built an Agentic Coding Harness Across Three CLI hosts. Here’s How It Works

Caspar Bannink

8 min read· May 13, 2026


This article is a work in progress. I will keep updating it as the kit evolves.

Last spring, an agent rebuilt my email-templating system for the third time. Same logic, different repo, no memory of the previous two attempts. The speed of vibecoding was getting taxed by the cost of the agent forgetting what I had already shipped. The fix was not a smarter prompt. It was a smaller vocabulary.

I built agentic-coding-kit. 17 agents, 12 slash commands, 35+ PowerShell tools, a Playwright visual pipeline, and a memory system that compounds across sessions. It installs across Claude Code, OpenCode, and GitHub Copilot CLI from one canonical source. This post explains the full architecture.

The core idea: slash commands are workflows, agents are leaves

Anthropic’s own docs say it plainly: “the main agent writes code, edits files, and runs commands itself, dispatching subagents in the background.” Every successful open-source kit I surveyed follows this pattern. I learned it the hard way by building 21 wrapping orchestrator agents and killing them all.

The kit has 12 commands: 8 workflows plus 4 utility commands (/bootstrap-harness, /kit-init, /wiki-init, /analyze).
| Command | What it does | Typical spawns |
| --- | --- | --- |
| `/build` | Scope, explore, implement, review, verify | 3-5 agents |
| `/review` | Surface, specialist, adversarial, false-positive verifier | 4-9 agents |
| `/investigate` | Hypothesis-driven debugging, evidence collection | 2-4 agents |
| `/plan` | Clarify scope, map files, stop for approval | 1-2 agents |
| `/refactor` | Principle-driven restructuring with consequence tracing | 2-4 agents |
| `/redesign` | Aesthetic lock, capture, per-component design, visual diff | 4-8 agents |
| `/security-review` | Adversarial audit by attack class | 3-6 agents |
| `/analyze` | Multi-angle research with claim verification | 4-8 agents |
| `/bootstrap-harness` | Detect repo conventions, write to conventions.md | 1-2 agents |
| `/kit-init` | Initialize .kit/context/ in a new repo | 0 agents |
| `/wiki-init` | Bootstrap .wiki/ from code evidence | 1-2 agents |

These commands run differently on each host. In Claude Code and OpenCode, they run as slash commands where the main session acts as orchestrator and spawns leaf agents via the Task tool. In Copilot CLI (which does not support custom slash commands or orchestrator-style spawning), each workflow is a shell script: kit-build.sh chains copilot --agent workflow-explorer -p "..." followed by copilot --agent workflow-implementer -p "..." as direct sequential commands. Different host architecture, same leaf agents, same phases.

How the workflows actually work

Every workflow starts the same way: scope-classifier.ps1 reads the git diff and classifies the task as ISOLATED (single file, 0 spawns), SHARED (multi-file, 3 spawns typical), or CRITICAL (auth, schema, migrations: 5 spawns, adversarial pass included). This determines the ceremony tier before anything else runs.
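To make the three tiers concrete, here is a minimal Python sketch of the classification step. It is not the kit's scope-classifier.ps1, which is a PowerShell tool whose rules are not shown here. The function name `classify_scope`, the `Scope` record, and the substring markers are all assumptions made for illustration; only the tier names and typical spawn counts come from the description above.

```python
from dataclasses import dataclass

# Hypothetical path fragments that mark a change as high-risk. The article
# names auth, schema, and migrations; the exact matching rules are assumed.
# Plain substring matching is crude (it would also flag "author.py").
CRITICAL_MARKERS = ("auth", "schema", "migration")

# Typical spawn counts per tier, as quoted in the article.
TYPICAL_SPAWNS = {"ISOLATED": 0, "SHARED": 3, "CRITICAL": 5}


@dataclass(frozen=True)
class Scope:
    tier: str
    typical_spawns: int
    adversarial_pass: bool


def classify_scope(changed_paths: list[str]) -> Scope:
    """Classify a change set, e.g. the lines of `git diff --name-only`."""
    paths = [p.strip().lower() for p in changed_paths if p.strip()]
    if any(marker in p for p in paths for marker in CRITICAL_MARKERS):
        tier = "CRITICAL"
    elif len(paths) <= 1:
        tier = "ISOLATED"
    else:
        tier = "SHARED"
    return Scope(tier, TYPICAL_SPAWNS[tier], adversarial_pass=(tier == "CRITICAL"))


if __name__ == "__main__":
    print(classify_scope(["src/email/render.py"]))
    print(classify_scope(["src/email/render.py", "src/email/layout.py"]))
    print(classify_scope(["db/migrations/0042_add_index.sql"]))
```

The risk check runs before the file count on purpose: in this sketch a single-file change to a migration is still CRITICAL, which matches the idea that the tier is decided by what the change touches before anything else runs.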

/build is the most common. Phase 1 spawns workflow-explorer to map files, trace dependencies, and return a structured brief (with wiki and specialist memory injected via resolvers). Phase 2 spawns workflow-implementer, which writes code using edit-with-lint.ps1 for every change (atomic syntax check, revert on failure). Phase 3 spawns workflow-reviewer (one reviewer, not seven; the adversarial pass only runs at CRITICAL tier). Phase 4 runs test-loop.ps1 and verify-writeback.ps1. If verification fails, the session cannot claim completion. This is the Iron Law.

/investigate runs hypothesis-driven debugging. It collects symptoms, generates ranked hypotheses, runs the cheapest test first, collects evidence, and outputs a build brief. The output is a diagnosis, not a fix. If the diagnosis warrants code changes, it hands off to /build.

/review fans out to specialist reviewers in parallel: code-quality-reviewer, security-reviewer, api-reviewer, testing-reviewer, and optionally performance-reviewer, adversarial-reviewer, data-migration-reviewer. A false-positive-verifier filters noise before the final report. Findings update handoffs.md (active items), memory.md (recurring patterns), and reflections.md (workflow improvements).

/analyze is for research, not code. It spawns 4 explorers in parallel (architecture-explorer, surface-explorer, risk-explorer, ops-explorer), then 4 theorists (pragmatist, skeptic, security-reliability, product-wedge), then a claim-verifier that checks claims against actual code. Useful for "what is this codebase and...
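The "atomic syntax check, revert on failure" behaviour that /build Phase 2 attributes to edit-with-lint.ps1 can be illustrated with a small Python sketch. This is not the kit's tool: the function name `edit_with_lint` is hypothetical, and the sketch assumes the edited file is Python so that the standard library's `ast.parse` can serve as the syntax check. A real tool would pick a linter per file type.

```python
import ast
from pathlib import Path


def edit_with_lint(path: Path, new_source: str) -> bool:
    """Apply an edit only if the result parses; otherwise restore the original.

    Returns True when the edit was kept and False when it was reverted.
    """
    # Remember the previous content (None means the file did not exist yet).
    original = path.read_text(encoding="utf-8") if path.exists() else None
    path.write_text(new_source, encoding="utf-8")
    try:
        # Syntax check only: this catches unparseable code, not logic errors.
        ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
    except SyntaxError:
        # Revert so the working tree never holds a file that fails to parse.
        if original is None:
            path.unlink()
        else:
            path.write_text(original, encoding="utf-8")
        return False
    return True
```

The useful property is that the caller, here an implementing agent, only ever sees two outcomes: the edit landed and parses, or the file is exactly as it was before. That keeps a broken intermediate state from leaking into later phases such as review and testing.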
