We built an agent for CI despite using Claude Code for everything else

Last month, a team running 200K+ CI jobs per week asked us why they shouldn't just point Claude Code at their failing builds. Fair question. We use Claude Code every day. After watching Mendral close 16,000+ CI investigations a month autonomously, here's why a specialist agent outperforms a generalist one, even when both run on the same Anthropic models.

Why we're building this

Coding agents are great for shipping features fast. They're terrible for CI.

Teams adopting AI coding tools are seeing significantly more CI activity. More PRs, more test runs, more failures surfacing. Pipelines are slower because there's more code being tested. Flaky tests that were annoying at 10 engineers become a tax on everyone's productivity at 100. The engineers generating all that code with Copilot and Claude Code aren't the ones debugging the CI failures. They've already moved on.

We spent a decade building and scaling CI systems at Docker and Dagger. The work was always the same: stare at logs, correlate failures, figure out what changed. Mendral is the agent we wished we'd had.

Specialist vs. generalist

Claude Code is a generalist software engineer. Mendral is a specialist. Despite running on the same Anthropic models, Mendral consistently outperforms Claude Code at diagnosing and fixing CI failures, because the useful signal isn't in the code.

When a CI job fails, the signal is in the logs from this run, the logs from the last 50 runs, the test execution history, the failure patterns across branches, and the infrastructure conditions at the time of execution. Claude Code doesn't have access to any of that.

We built a log ingestion pipeline that processes billions of CI log lines per week into ClickHouse, compressed at 35:1 and queryable in milliseconds. Our agent writes its own SQL queries to investigate failures. A typical investigation scans 335K rows across 3+ queries. At P95, it scans 940 million rows. The agent can trace a flaky test back to a dependency bump three weeks ago by correlating across hundreds of CI runs at once, something no human would have the patience to do.
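To make the idea of correlating across runs concrete, here is a minimal Go sketch of one common flakiness heuristic: a test is flagged when it both passed and failed on the same commit. It is an illustration only and not Mendral's actual query or schema. The `TestRun` type, its fields, and the `FlakyTests` function are hypothetical, and the sketch aggregates in memory rather than in ClickHouse.

```go
package main

import (
	"fmt"
	"sort"
)

// TestRun is a hypothetical, simplified row of CI test history.
type TestRun struct {
	Test   string // test name
	Commit string // commit SHA the test ran against
	Passed bool   // outcome of this execution
}

// FlakyTests returns the sorted names of tests that both passed and
// failed on the same commit. Same code with different outcomes is a
// common (but not the only) signal of flakiness.
func FlakyTests(runs []TestRun) []string {
	type key struct{ test, commit string }
	type outcome struct{ passed, failed bool }

	// Record which outcomes were observed per (test, commit) pair.
	seen := make(map[key]outcome)
	for _, r := range runs {
		k := key{r.Test, r.Commit}
		o := seen[k]
		if r.Passed {
			o.passed = true
		} else {
			o.failed = true
		}
		seen[k] = o
	}

	// A test is flaky if any single commit saw both outcomes.
	flaky := make(map[string]bool)
	for k, o := range seen {
		if o.passed && o.failed {
			flaky[k.test] = true
		}
	}

	names := make([]string, 0, len(flaky))
	for name := range flaky {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	runs := []TestRun{
		{"TestLogin", "abc123", true},
		{"TestLogin", "abc123", false}, // same commit, different outcome
		{"TestUpload", "abc123", false},
		{"TestUpload", "def456", true}, // different commit: likely a real fix
	}
	fmt.Println(FlakyTests(runs))
}
```

In this sketch `TestUpload` is not flagged, because its outcomes differ across commits rather than within one. That is the distinction between a flaky test and a genuine regression or fix.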

The whole implementation is ours, from the system prompt to every tool. Our agent can grab specific logs from a run, query historical failure rates across months, trace which commit introduced a regression, check if a test has been flaky on other branches, and cross-reference all of this in seconds. Claude Code can't, because it doesn't have the tools or the data.

One agent to the customer, a team of agents behind the scenes

From the outside, Mendral is one agent. You install a GitHub App, it joins your Slack, and it starts investigating CI failures. Internally, it's a team of specialized agents coordinating through our Go backend.

We use all three Anthropic tiers (Haiku, Sonnet, Opus). Picking the wrong one is costly either way: a model that is too large for the task wastes money, and one that is too small can't do the job.

Opus handles root cause analysis and implementation. When the agent forms a hypothesis about why a test is failing, reasons about complex interactions between test suites, or writes a non-trivial fix that touches CI configuration and test code at once, Opus takes over. The cost is higher. For root cause work, the quality justifies it.

Sonnet collects facts and deduplicates issues. It reads logs, writes SQL queries, gathers evidence from the repository, and correlates failures with code changes. Sonnet is the right balance of intelligence and cost for structured, evidence-gathering work.

Haiku handles log parsing and data extraction: classifying failure types, formatting structured output, extracting relevant snippets from raw logs. The solution space is constrained and we need throughput. We process thousands of these per day.

Routing is something we keep iterating on. Work that required Sonnet six months ago sometimes runs fine on Haiku today, so we re-evaluate model assignments regularly. A full investigation might involve a dozen sub-agent calls across all three tiers.
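The division of labor described above can be sketched as a small routing table in Go. This is an assumption about shape, not Mendral's implementation: the `TaskKind` values, the `routing` map, and the fallback to Sonnet for unknown tasks are all hypothetical names and choices made for illustration.

```go
package main

import "fmt"

// Tier names a model tier. The string values are illustrative labels,
// not real API model identifiers.
type Tier string

const (
	Haiku  Tier = "haiku"
	Sonnet Tier = "sonnet"
	Opus   Tier = "opus"
)

// TaskKind is a hypothetical classification of sub-agent work.
type TaskKind int

const (
	LogParsing TaskKind = iota
	FailureClassification
	EvidenceGathering
	IssueDeduplication
	RootCauseAnalysis
	FixImplementation
)

// routing maps each task kind to a tier, following the split described
// in the text. Keeping it as data makes re-evaluating an assignment a
// one-line change.
var routing = map[TaskKind]Tier{
	LogParsing:            Haiku,
	FailureClassification: Haiku,
	EvidenceGathering:     Sonnet,
	IssueDeduplication:    Sonnet,
	RootCauseAnalysis:     Opus,
	FixImplementation:     Opus,
}

// TierFor returns the tier for a task kind. Falling back to Sonnet for
// unknown kinds is an assumption of this sketch.
func TierFor(kind TaskKind) Tier {
	if tier, ok := routing[kind]; ok {
		return tier
	}
	return Sonnet
}

func main() {
	fmt.Println(TierFor(LogParsing), TierFor(EvidenceGathering), TierFor(RootCauseAnalysis))
}
```

Expressing the assignments as a table rather than as branching logic fits the point about regular re-evaluation: moving a task from Sonnet to Haiku touches one entry.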

The agent loop

Our agent loop runs on our Go backend. We don't use LangChain, LangGraph, or any off-the-shelf agent framework. We need full control over execution, concurrency, and failure handling.

The core loop is straightforward: the agent receives a trigger (a CI failure, a Slack message, a scheduled analysis), assembles context, makes an LLM call, processes tool calls, and iterates until it reaches a conclusion or exhausts its budget.
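A minimal Go sketch of such a loop is shown below. It is a simplification under stated assumptions rather than the production code: the `Model`, `Tool`, `Reply`, and `ToolCall` types are hypothetical, the LLM call is replaced by a plain function, context is a slice of strings, and the budget is a simple step count.

```go
package main

import (
	"errors"
	"fmt"
)

// ToolCall is a hypothetical request from the model to run a tool.
type ToolCall struct {
	Name  string
	Input string
}

// Reply is a hypothetical model response: either tool calls to run,
// or a final conclusion when ToolCalls is empty.
type Reply struct {
	Conclusion string
	ToolCalls  []ToolCall
}

// Model stands in for an LLM call over the accumulated transcript.
type Model func(transcript []string) (Reply, error)

// Tool is an in-process tool: input in, result out.
type Tool func(input string) (string, error)

// ErrBudgetExhausted is returned when the step budget runs out first.
var ErrBudgetExhausted = errors.New("agent budget exhausted")

// RunLoop iterates model call -> tool calls until the model concludes
// or maxSteps model calls have been made.
func RunLoop(model Model, tools map[string]Tool, trigger string, maxSteps int) (string, error) {
	transcript := []string{"trigger: " + trigger}
	for step := 0; step < maxSteps; step++ {
		reply, err := model(transcript)
		if err != nil {
			return "", err
		}
		if len(reply.ToolCalls) == 0 {
			return reply.Conclusion, nil // the model reached a conclusion
		}
		for _, call := range reply.ToolCalls {
			tool, ok := tools[call.Name]
			if !ok {
				// Feed the error back so the model can recover.
				transcript = append(transcript, "error: unknown tool "+call.Name)
				continue
			}
			out, err := tool(call.Input)
			if err != nil {
				transcript = append(transcript, "error: "+err.Error())
				continue
			}
			transcript = append(transcript, call.Name+": "+out)
		}
	}
	return "", ErrBudgetExhausted
}

func main() {
	// Scripted stub model: asks for logs once, then concludes.
	model := func(transcript []string) (Reply, error) {
		if len(transcript) == 1 {
			return Reply{ToolCalls: []ToolCall{{Name: "fetch_logs", Input: "run-1"}}}, nil
		}
		return Reply{Conclusion: "concluded after " + transcript[len(transcript)-1]}, nil
	}
	tools := map[string]Tool{
		"fetch_logs": func(in string) (string, error) { return "logs for " + in, nil },
	}
	conclusion, err := RunLoop(model, tools, "ci failure", 5)
	fmt.Println(conclusion, err)
}
```

Tool errors are appended to the transcript instead of aborting the loop, so the model sees the failure on its next turn. That is one possible failure-handling policy; a real system would likely also bound tokens, wall-clock time, and cost.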

Some tools are pure Go functions: querying ClickHouse, fetching GitHub metadata, looking up repository structure, checking PR status. These are fast, deterministic operations that don't need isolation, so they run in-process.

Some tools require a sandbox. When the agent needs to clone a repository, run tests, apply patches, or execute arbitrary code to validate a fix, it needs an isolated environment. We provision Firecracker microVMs on Blaxel for this. Each sandbox is a lightweight VM with its own kernel,...
