What "done" means when you're shipping AI features
Posted on June 1, 2026
Ah, the good old days of sprint reviews. Engineering said they shipped [something]. All tests passed. No P0 bugs. The demo worked like a charm. "Works as designed" had been achieved once again. Sadly, we don’t have the luxury of this simplicity anymore (if we ever did).
The definition of "done" has always been the problem. For AI, it has to be completely rewritten.
We built our definitions of done for deterministic software.
Inputs go in, outputs come out, and the output is the same on Tuesday as it was on Monday. That was the whole point. We were building predictable user experiences. QA tested the edges, engineering wrote the unit tests, the spec described the behavior, and "done" was the moment all three lined up. That worked because the software behaved like a vending machine. Put the money in, press the button, and get the soda.
AI features don’t behave like vending machines.
The same prompt produces different outputs across users, sessions, model updates, and contexts your QA team never imagined. The thing you shipped isn’t a fixed, consistent, and predictable object. It’s a distribution of behaviors. Some of them you’ve seen and can predict. Others, you haven’t. Most of the surprises live in the ones you haven’t. Your test suite was designed to verify the properties of your code. It was not designed to verify the experience your customer is having at 11pm in Auckland with a question your spec didn’t anticipate.
"All tests pass" is still important and means the same thing it always did. It just no longer means done.
Now, with AI, done is a calibration of acceptable variance in output and, more importantly, in behavior, not a binary verdict on specification adherence. You define done by the range of conditions under which the feature behaves well enough, the failure modes you’ve decided to tolerate, the monitoring you’ve put in place for the ones you haven’t, and the rollback you’re ready to execute and have actually rehearsed.
There’s a lot in that last sentence, so let me break it down into three steps for creating an AI-ready definition of done.
Step 1: Write acceptance criteria as distributions, not assertions
"When the user asks or inputs X, the system returns Y." That’s an assertion. It’s verifiable because it’s binary. For an AI feature it’s often meaningless.
Try this framing instead: "For 80% of inputs in category X, the system returns a response that meets quality bar Y; for the remaining 20%, the failure mode is degraded but not embarrassing." That’s a distribution. It’s harder to write, harder to test, and infinitely more honest about what you’re actually shipping.
You don’t need every criterion to be probabilistic. You need to be honest with yourself and the team about which criteria are which and, most importantly, stop pretending that the probabilistic ones are deterministic (i.e., fixed and predictable). The teams I see getting this right have two sets of criteria on the same feature:
The deterministic ones for the parts of the system that still behave like vending machines (e.g., auth, billing, navigation)
The distributional (or probabilistic) ones for the parts that don’t (e.g., system feedback, context-specific responses, custom user requests)
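To make the distributional framing concrete, here is a minimal sketch of what such a criterion could look like as a check. Everything in it is an assumption for illustration: the `DistributionalCriterion` class, the `evaluate` function, and the three grading labels ("pass", "degraded", "embarrassing") are hypothetical names, and how each response gets graded (human review, a rubric, a sample audit) is left to your team.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DistributionalCriterion:
    """Hypothetical acceptance criterion stated as a pass rate, not an assertion."""
    category: str
    min_pass_rate: float       # e.g. 0.80 for "80% of inputs meet the quality bar"
    max_embarrassing: int = 0  # failures may be degraded, but not embarrassing


def evaluate(criterion: DistributionalCriterion, labels: Iterable[str]) -> dict:
    """Summarize graded outcomes ('pass', 'degraded', 'embarrassing') against the criterion."""
    outcomes = list(labels)
    if not outcomes:
        raise ValueError("need at least one graded outcome")
    pass_rate = outcomes.count("pass") / len(outcomes)
    embarrassing = outcomes.count("embarrassing")
    met = (pass_rate >= criterion.min_pass_rate
           and embarrassing <= criterion.max_embarrassing)
    return {"pass_rate": pass_rate, "embarrassing": embarrassing, "met": met}


if __name__ == "__main__":
    criterion = DistributionalCriterion(category="category X", min_pass_rate=0.80)
    graded_sample = ["pass"] * 85 + ["degraded"] * 15  # illustrative data only
    print(evaluate(criterion, graded_sample))
```

The deterministic criteria stay as ordinary unit tests. Only the parts that don’t behave like vending machines get a check like this, run against a graded sample rather than a single input.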
Step 2: Build the failure triage capacity into the launch, not after
Traditionally, teams shipped the feature, watched the dashboard, and triaged when complaints surfaced at a level worth reacting to. For AI features that workflow is too slow. By the time the complaint reaches engineering, the user has formed an opinion about your product.
To support AI products, write the triage playbook before the AI feature launches. Be clear about who owns model-quality issues, UX issues, content issues, and, perhaps, any PR disasters the model may inadvertently trigger. The launch isn’t just the feature going live. The launch is the moment the team that will field the consequences is actually ready to handle them. This is still not the case in most teams I work with, especially in the enterprise.
You are not done when the feature ships. You are done when the people downstream of the feature know what to do when it misbehaves. And it will, because it’s AI.
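One way to keep the playbook from staying theoretical is to treat ownership as a launch gate. The sketch below assumes a simple mapping from issue category to owner. The category keys mirror the four kinds of issue named above, while the owner names and the `unowned_categories` and `launch_ready` helpers are hypothetical placeholders, not a prescribed structure.

```python
from typing import Dict, List, Optional

# Issue categories the playbook must cover before launch.
REQUIRED_CATEGORIES = ["model_quality", "ux", "content", "pr_incident"]

# Hypothetical owners; None means nobody has signed up yet.
PLAYBOOK: Dict[str, Optional[str]] = {
    "model_quality": "ml-oncall",
    "ux": "design-lead",
    "content": "content-lead",
    "pr_incident": None,
}


def unowned_categories(playbook: Dict[str, Optional[str]]) -> List[str]:
    """Return the required categories that have no named owner."""
    return [c for c in REQUIRED_CATEGORIES if not playbook.get(c)]


def launch_ready(playbook: Dict[str, Optional[str]]) -> bool:
    """The launch gate passes only when every category has an owner."""
    return not unowned_categories(playbook)


if __name__ == "__main__":
    print("Unowned:", unowned_categories(PLAYBOOK))
    print("Launch ready:", launch_ready(PLAYBOOK))
```

A real playbook would also carry response steps and escalation paths. The point of the gate is only that an empty owner slot blocks the launch instead of being discovered afterwards.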
Step 3: Define the signal that means "this isn’t done after all" and rehearse the rollback
Every AI feature should ship with a tripwire. You can choose whatever metric makes the most sense for your product and context (and brand). It could be error rate, off-tone complaints per thousand sessions, escalations to human review, hallucination rate flagged by a sample audit, or anything else that indicates a sub-optimal user experience. Then set the threshold and decide before launch what you do when it’s crossed.
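As a rough illustration, a tripwire can be as small as a metric, a threshold, and a pre-agreed action. The sketch below uses off-tone complaints per thousand sessions as the metric. The `Tripwire` class, the `check` function, the threshold of 5, and the minimum-sessions guard are all assumptions made for the example, not recommended values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tripwire:
    """Hypothetical tripwire: a metric, a threshold, and the action agreed before launch."""
    metric: str
    threshold_per_1k: float  # maximum tolerated events per thousand sessions
    min_sessions: int        # avoid tripping on a tiny, noisy sample
    action: str              # what the team does when the wire is tripped


def check(tripwire: Tripwire, flagged_events: int, sessions: int) -> Optional[str]:
    """Return the pre-agreed action if the threshold is crossed, else None."""
    if sessions < tripwire.min_sessions:
        return None  # not enough traffic to judge yet
    rate_per_1k = flagged_events / sessions * 1000
    return tripwire.action if rate_per_1k > tripwire.threshold_per_1k else None


if __name__ == "__main__":
    wire = Tripwire(
        metric="off_tone_complaints_per_1k_sessions",
        threshold_per_1k=5.0,
        min_sessions=500,
        action="rollback",
    )
    print(check(wire, flagged_events=2, sessions=1000))
    print(check(wire, flagged_events=8, sessions=1000))
```

The design choice worth noting is that the action lives next to the threshold. The decision about what to do is made when the tripwire is written, not when it fires.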
Once the wire is tripped, what do you do? Not theoretically,...