What we learned using AI agents to refactor a monolith


by Nancy Wang, Wayne Duso, K.J. Valencik April 20, 2026 - 9 min


AI agents are increasingly used to refactor large codebases, but many teams lack a clear understanding of where they succeed and where they fail. At 1Password, we applied agentic tooling to a multi-million-line Go monolith, and in this post we'll share what worked, what broke, and what it means for teams adopting AI in production systems.

Here’s the situation: 1Password runs a large Go monolith called B5. It has been the foundation of our product for years and continues to perform well in production, in terms of both reliability and scale. Now, Unified Access is designed to support both human and agent-driven workflows at high request rates and low latency. As we continue adding and enhancing its capabilities, we need clearer service boundaries and more independent scaling characteristics. Over time, that means evolving parts of the system in a way that preserves the privacy, performance, reliability, and security properties we have already established.

Coming up with an actionable plan for tackling this problem sounded like a good job for agents. In our case, this meant applying agentic refactoring: using AI agents to analyze, plan, and execute changes across a codebase, from dependency mapping to system decomposition.

There’s a version of this story where agentic tooling analyzes a large codebase, produces a clean extraction plan, and service decomposition follows a predictable path from there. Parts of that story did play out as expected. We built an agentic toolchain that analyzed millions of lines of code and gave us a clear, defensible extraction order, and that work has meaningfully improved how we think about decomposing the system. What ended up being more valuable, though, was what we learned once we applied those tools to real changes in a live production environment. That is the part that tends to get glossed over, and it is the part that actually determines whether this approach works.

Building the analysis layer

The first question we had to answer was sequencing. In a system that handles sensitive data at scale, extraction order is a correctness constraint. If you get the sequence wrong, you can introduce subtle failures that are difficult to detect and even harder to unwind later.

To make that problem tractable, we built an agentic toolchain that combined a few different sources of truth. We used Go SSA analysis to understand code structure, SQL parsing to identify data dependencies, and a DataDog MCP integration to bring in runtime coupling data. Together, these gave us a domain ownership map, a coupling graph, and a prioritized extraction order.

The output largely matched what you would expect from experienced engineers looking at the system. It suggested starting with Vault, which has its own API, dataset, and security boundary, followed by Billing, then AuthN and AuthZ, with Identity remaining as the core.

One pattern that worked especially well was using agents to build deterministic tooling rather than relying on them for ongoing interpretation. In this case, agents helped write parts of the SSA analyzer, and the analyzer then produced a reproducible domain map. That distinction matters because once the tool exists, you are reasoning over a stable artifact rather than debating what the model believes the system looks like.

An unexpected benefit of this work was that the instrumentation we added to support the analysis also improved our end-to-end transaction visibility in DataDog, which has been useful beyond this project.

Finding the human to agent ratio

In parallel with the extraction analysis, we applied the same approach to a long-standing cleanup task in the codebase. Our Go server used MustBegin to start database transactions, which panics on failure.
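To make the difference concrete, here is a minimal sketch of a call site before and after such a migration. Everything in it is an assumption made for illustration: the DB and Tx types, the method signatures, and the saveBefore and saveAfter functions are hypothetical stand-ins, not the code or library used in B5.

```go
package main

import (
	"context"
	"fmt"
)

// DB and Tx are hypothetical stand-ins for a database handle and a
// transaction. They are not the types used in B5.
type DB struct{}
type Tx struct{}

// Commit is a no-op in this sketch.
func (tx *Tx) Commit() error { return nil }

// Begin fails when the request context is already cancelled or past its
// deadline, and reports that failure as an ordinary error.
func (db *DB) Begin(ctx context.Context) (*Tx, error) {
	if err := ctx.Err(); err != nil {
		return nil, fmt.Errorf("begin transaction: %w", err)
	}
	return &Tx{}, nil
}

// MustBegin is the panicking style: any failure to begin becomes a panic.
func (db *DB) MustBegin(ctx context.Context) *Tx {
	tx, err := db.Begin(ctx)
	if err != nil {
		panic(err)
	}
	return tx
}

// saveBefore shows a call site before migration.
func saveBefore(ctx context.Context, db *DB) error {
	tx := db.MustBegin(ctx) // panics if the context is cancelled
	return tx.Commit()
}

// saveAfter shows the same call site after migration.
func saveAfter(ctx context.Context, db *DB) error {
	tx, err := db.Begin(ctx)
	if err != nil {
		return err // the caller decides how to handle the failure
	}
	return tx.Commit()
}

// cancelledContext returns a context that has already been cancelled.
func cancelledContext() context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return ctx
}

// panics reports whether f panicked.
func panics(f func()) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	ctx := cancelledContext()
	db := &DB{}
	fmt.Println("before panics:", panics(func() { _ = saveBefore(ctx, db) }))
	fmt.Println("after error:", saveAfter(ctx, db))
}
```

With a cancelled context, the migrated call site returns an error the caller can handle, while the original one panics.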
That behavior made sense early on because it surfaced database issues quickly during development, but at production scale it is not the behavior you want when connections time out or request contexts are cancelled. In those cases, returning a clean error is the correct outcome. The migration required updating more than 3,000 call sites across production and test code, which is why it had been sitting in the backlog.

The approach we took was highly structured. We generated a deterministic manifest of every call site using SSA, classified those sites into a small number of patterns, and defined explicit templates for each one. From there, we wrote a detailed playbook that described exactly how agents should execute the migration, including a list of common failure modes and clear instructions on when to stop and escalate instead of guessing. To scale execution, we ran multiple agents in parallel using git worktrees so that changes remained isolated.

Execution itself took a matter of hours. The majority of the time was spent building the tooling and...
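The manifest step described above can be approximated with the Go standard library alone. The sketch below is a simplified stand-in, not the analyzer we built: it matches calls of the form x.MustBegin(...) syntactically with go/ast instead of using SSA, it only looks inside function bodies, and the CallSite type, the findMustBegin function, and the sample source are all hypothetical.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"sort"
)

// CallSite is a hypothetical manifest entry: where a MustBegin call occurs.
type CallSite struct {
	File string
	Line int
	Func string // enclosing function or method name
}

// findMustBegin parses one Go source file and returns every call of the
// form x.MustBegin(...) found inside a function body. The result is sorted
// by file and line so repeated runs produce the same manifest.
func findMustBegin(filename, src string) ([]CallSite, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, 0)
	if err != nil {
		return nil, err
	}
	var sites []CallSite
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil {
			continue
		}
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			call, ok := n.(*ast.CallExpr)
			if !ok {
				return true
			}
			// Purely syntactic match: any selector named MustBegin counts.
			if sel, ok := call.Fun.(*ast.SelectorExpr); ok && sel.Sel.Name == "MustBegin" {
				pos := fset.Position(call.Pos())
				sites = append(sites, CallSite{File: pos.Filename, Line: pos.Line, Func: fn.Name.Name})
			}
			return true
		})
	}
	sort.Slice(sites, func(i, j int) bool {
		if sites[i].File != sites[j].File {
			return sites[i].File < sites[j].File
		}
		return sites[i].Line < sites[j].Line
	})
	return sites, nil
}

// sample is a made-up source file with one MustBegin call and one Begin call.
const sample = `package demo

func Save(db DB) {
	tx := db.MustBegin()
	_ = tx
}

func Load(db DB) {
	db.Begin()
}
`

func main() {
	sites, err := findMustBegin("demo.go", sample)
	if err != nil {
		panic(err)
	}
	for _, s := range sites {
		fmt.Printf("%s:%d in %s\n", s.File, s.Line, s.Func)
	}
}
```

Because the output is a sorted list rather than a model's reading of the code, it can be diffed between runs and split into batches for parallel work. A syntactic match like this cannot tell which type MustBegin belongs to, which is one reason a type-aware analysis is the stronger choice for a real migration.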
