Estimating the Productivity of an Autonomous AI Software Engineer | Cognition
Six months ago, every CTO was concerned their team wasn't using enough tokens. That trend has reversed as token usage and AI spend have skyrocketed. Engineering leaders are now trying to figure out how to measure actual output, because not every token delivers real value. Some save engineering hours and accelerate projects; others are wasted on useless sessions and bad prompting.
For any organization of sufficient scale, it's essentially impossible to measure value by hand over thousands of sessions and billions of tokens. We set out to automate this with AI.
Predicting real ROI is hard, for reasons we’ll go into later. So we focused on a key sub-task: estimating how many productive engineering hours each Devin session is worth.
LLMs are notoriously bad at time estimates, so we expected this to be a struggle. After carefully tuning the system, however, our model has an $r_{log}$ of 0.74 and appears to be unbiased. Individual predictions aren’t perfect, but the model is good enough to be used for estimating aggregated totals. They’re also convertible to dollar amounts using engineering salaries, getting us closer to business value.
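To make these two claims concrete, here is a minimal sketch of how such numbers could be computed. It rests on assumptions: this section doesn't define $r_{log}$, so we take it to mean the Pearson correlation between log-transformed predicted hours and log-transformed human estimates, and we use the mean log ratio as one possible bias check. The function names and the sample numbers are hypothetical and are not drawn from the dataset.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log_correlation(predicted_hours, human_hours):
    # Assumption: r_log is Pearson r computed on log-transformed hours.
    return pearson(
        [math.log(p) for p in predicted_hours],
        [math.log(h) for h in human_hours],
    )


def mean_log_ratio(predicted_hours, human_hours):
    # Average of log(predicted / human). A value near 0 suggests no
    # systematic over- or under-estimation on the log scale.
    ratios = [math.log(p / h) for p, h in zip(predicted_hours, human_hours)]
    return sum(ratios) / len(ratios)


# Made-up numbers for illustration only.
predicted = [0.5, 2.0, 8.0, 1.0]
human = [1.0, 2.0, 4.0, 1.0]
print(log_correlation(predicted, human))
print(mean_log_ratio(predicted, human))
```

Working on the log scale treats a 2x miss on a 30-minute task the same as a 2x miss on an 8-hour task, which seems like a natural fit for estimates that span orders of magnitude.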
In our system, an agent reviews each completed Devin session — first classifying whether it produced useful output, then estimating how long a human engineer would have taken to produce the same work. We validated it by asking human engineers how long they would have spent on the same tasks.
The system is now running with customers. To our knowledge, this is the first automated system measuring AI engineering productivity in production.
Choosing a Metric
How did we land on productive engineering hours as our metric? The first question we needed to answer was what to estimate. Ideally, we'd measure dollar impact directly, such as revenue attributable to features shipped or costs avoided by bugs fixed. In practice, this is still an unsolved problem in our field. It’s incredibly hard for an engineer to know how many dollars of business value they created through the PRs shipped last week.
On the other end of the spectrum, we could measure raw activity: lines of code, commits, PRs, tokens consumed. These are easy to collect but don't correspond to effort. A mechanical refactor can touch thousands of lines in an afternoon; a two-line bug fix can represent hours of investigation. Many valuable tasks — triaging bugs, running analytics queries, reviewing code — produce no code at all.
The middle ground we decided to measure is human engineering hours: how long would a human engineer have taken to produce the same output? Hours are already how organizations value engineering work — salaries and contractor rates are denominated in time. When leadership evaluates an investment, they think in terms of time and cost savings. Hours are standardized across organizations, independent of business context, and convertible to dollars via engineering rates.
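The hours-to-dollars conversion is simple arithmetic. The sketch below shows one way it might look; the function names, the 2,080-hour working year (40 hours times 52 weeks), and the sample figures are all illustrative assumptions rather than values from our system.

```python
def hourly_rate(annual_cost_usd, working_hours_per_year=2080):
    # 2080 = 40 hours/week * 52 weeks. An illustrative default, not a
    # figure from the post; organizations may use a fully loaded rate instead.
    return annual_cost_usd / working_hours_per_year


def hours_to_dollars(productive_hours, annual_cost_usd, working_hours_per_year=2080):
    # Dollar value = productive hours * hourly engineering rate.
    return productive_hours * hourly_rate(annual_cost_usd, working_hours_per_year)


# Hypothetical per-session estimates of productive engineering hours.
session_hours = [1.5, 0.5, 4.0]
total_hours = sum(session_hours)  # 6.0

print(hours_to_dollars(total_hours, annual_cost_usd=208_000))  # 600.0
```

Summing hours before converting keeps the dollar figure an aggregate, which matches how the estimates are meant to be used.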
But not all hours are equal. For example, if all PRs created by a session were closed, it likely wasn’t valuable. We wanted to measure only productive engineering hours. So, we also had to build a system to classify whether the sessions were actually productive.
Collecting a Dataset
We collected a ground-truth dataset by asking Devin users to review recent representative sessions, and estimate how long each completed session would have taken without Devin. Our dataset consists of 258 sessions from 126 users across a diverse set of enterprise customers. We collected the data via live interviews and a survey.
Every Devin session has a full execution trace: the user's request, every action taken, the resulting code, and the codebase context. This gives us a record of production engineering work at a level of detail that is difficult to obtain from surveys, aggregate activity metrics, or open-source benchmarks alone.
The charts below break down our dataset. It reflects real enterprise workloads, spanning a variety of languages, frameworks, session types, and hour estimates.
Filtering for Useful Work
As we reviewed our dataset, we realized that not all sessions correspond to useful work, and that we would need to build a classifier to filter out unproductive sessions.
For sessions with a PR, this is relatively straightforward: if any PR from the session is merged, we include the estimate; if not, we discard it. This is slightly lossy, since a session whose PRs were all closed may still have delivered productive work, but we wanted to err on the side of being conservative.
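The merge rule for sessions with a PR can be sketched in a few lines. This is an illustration under stated assumptions, not our production code: the `Session` and `PullRequest` types, the `state` strings, and the function names are all hypothetical. Sessions without a PR return `None` here to signal that this rule doesn't apply to them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PullRequest:
    state: str  # hypothetical states: "merged", "closed", "open"


@dataclass
class Session:
    session_id: str
    estimated_hours: float
    pull_requests: List[PullRequest] = field(default_factory=list)


def passes_pr_filter(session: Session) -> Optional[bool]:
    # No PRs: the merge rule doesn't apply, so return None and leave the
    # decision to a separate classifier.
    if not session.pull_requests:
        return None
    # Keep the estimate if at least one PR from the session was merged.
    return any(pr.state == "merged" for pr in session.pull_requests)


def productive_pr_hours(sessions: List[Session]) -> float:
    # Sum estimated hours only over sessions that pass the merge rule.
    return sum(s.estimated_hours for s in sessions if passes_pr_filter(s) is True)


sessions = [
    Session("a", 2.0, [PullRequest("closed"), PullRequest("merged")]),
    Session("b", 3.0, [PullRequest("closed")]),
    Session("c", 1.0),  # no PR, not covered by this rule
]
print(productive_pr_hours(sessions))  # 2.0
```

Using `any` rather than `all` reflects the rule as stated: one merged PR is enough to count the session, while a session with only closed or open PRs is dropped.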
Sessions without a PR are more complicated. We built a classifier to filter out unproductive sessions, which removes around 1-20% of sessions, depending on the customer. Many non-PR sessions are genuinely productive, e.g., finding unused dependencies, scanning for security vulnerabilities, reviewing a pull request, running analytics queries, triaging...