Why Traditional Testing Doesn't Work for AI Applications


Evals: The Unit Tests for the Non-Deterministic Parts of Your App
June 22, 2026, Konstantin Gredeskoul, kig.re

For about twenty-five years, my mental model of “is this code correct?” was simple and comforting: feed it a known input, assert on a known output, watch the dot turn green. 2 + 2 had better be 4 every single time, or someone has done something unspeakable to my computer.

Then I started wiring language models into real applications, and that comforting model quietly fell apart. You send the same prompt twice and get two different answers. Both might be correct. One might be subtly, expensively wrong. And there’s no exception, no stack trace, no red dot, just a confident paragraph of text that happens to be hallucinated nonsense.

So how do you test the part of your app that refuses to be deterministic?

You write evals. This post is about what they are, why they suddenly matter, and how to write one, with a tiny but real Ruby app and an eval harness that tests it from end to end. No ML PhD required. If you can write an RSpec test, you can write an eval.

What’s actually new here?

Let’s be precise about what changed, because the hype machine is bad at this.

For a normal function, the mapping from input to output is fixed. slugify("Hello World") returns "hello-world" today, tomorrow, and on the heat-death afternoon of the universe. Your test pins that mapping in place, and any future change that breaks it turns the dot red.
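As a minimal sketch of what "pinning the mapping" looks like, here is a hypothetical slugify written only for this illustration (it is not taken from any library), with a tiny hand-rolled assertion standing in for a test framework:

```ruby
# Hypothetical slugify, written only for this illustration.
def slugify(text)
  text.downcase.strip.gsub(/[^a-z0-9]+/, "-").gsub(/\A-+|-+\z/, "")
end

# Minimal assertion helper standing in for RSpec or Minitest.
def assert_equal(expected, actual)
  raise "expected #{expected.inspect}, got #{actual.inspect}" unless expected == actual
end

# Same input, same output, on every run: the check can be a plain equality.
100.times { assert_equal "hello-world", slugify("Hello World") }
puts "slugify is pinned"
```

Running the check a hundred times is pointless for a pure function, which is exactly the property the rest of this post has to live without.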

A language model is not a function in this sense. Each response is a sample from a probability distribution over possible responses. Ask Claude to classify a customer message and it’ll usually give you the right label, but “usually” is doing a lot of work in that sentence, and the exact words, formatting, and edge-case behavior will drift:

when you reword the prompt,

when you upgrade the model,

when the input is slightly weird in a way you didn’t anticipate.
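To make "a sample from a distribution" concrete, here is a small simulation. It does not call any real model: fake_classify is a hypothetical stand-in that draws a label from made-up weights, and the label names are assumptions for illustration only.

```ruby
# Hypothetical stand-in for a model call: it samples a label from a fixed
# distribution instead of calling a real API. Labels and weights are made up.
LABEL_WEIGHTS = {
  "new_client" => 0.85,
  "existing_client" => 0.10,
  "spam" => 0.05
}.freeze

def fake_classify(_message, rng)
  roll = rng.rand # uniform float in [0, 1)
  cumulative = 0.0
  LABEL_WEIGHTS.each do |label, weight|
    cumulative += weight
    return label if roll < cumulative
  end
  LABEL_WEIGHTS.keys.last # guard against floating-point rounding
end

rng = Random.new(42) # seeded so the simulation itself is repeatable
answers = Array.new(20) { fake_classify("I need help filing my taxes", rng) }

# Count how often each label came back for the very same input.
puts answers.tally
```

The same input goes in twenty times, and the tally shows how the answers spread across labels. A single equality assertion cannot describe that; a rate can.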

NOTE

This is the uncomfortable part for those of us with grey in our hair: your test suite can be 100% green and your feature can still be getting worse. The determinism we leaned on for decades doesn’t extend past the API boundary into the model. Evals are how we get a measurement back, and with it the confidence.

An eval (short for evaluation) is a test for non-deterministic, model-driven behavior. Instead of asserting “output == expected,” it asks a softer but far more useful question: across a representative set of inputs, how often does the model do the right thing, and is that good enough to ship?

If unit tests pin behavior, evals measure it. That shift, from a boolean to a score with a threshold, is the whole idea.
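The shape of "a score with a threshold" can be sketched in a few lines. Everything here is an assumption for illustration: the labelled cases, the label names, the 0.9 threshold, and the keyword-based classify method, which stands in for a real model call so the sketch runs without an API key.

```ruby
# Hypothetical labelled cases: [input message, expected label].
EVAL_CASES = [
  ["I need someone to file my taxes this year", "new_client"],
  ["Where do I upload my W-2? I signed up last month", "existing_client"],
  ["Boost your SEO ranking today!!!", "spam"],
  ["Do you handle small-business returns?", "new_client"]
].freeze

# Stand-in for the model call. A real eval would call the model here.
def classify(message)
  return "spam" if message.include?("SEO")
  return "existing_client" if message.include?("signed up")

  "new_client"
end

# Fraction of cases where the classifier returned the expected label.
def score(cases)
  passed = cases.count { |message, expected| classify(message) == expected }
  passed.to_f / cases.size
end

THRESHOLD = 0.9 # assumed bar for "good enough to ship"

accuracy = score(EVAL_CASES)
verdict = accuracy >= THRESHOLD ? "PASS" : "FAIL"
puts format("accuracy %.2f (threshold %.2f): %s", accuracy, THRESHOLD, verdict)
```

The verdict is still a boolean at the end, so it can gate a build, but it is derived from a measured rate rather than from a single exact match.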

Why you can’t skip them

I know the temptation, because I’ve given in to it. You paste a clever prompt into the playground, try four or five examples, they all look great, you ship it, you move on. The LLM equivalent of “works on my machine.”

Here’s what that costs you later:

No regression detection. You tweak the prompt to fix one annoying edge case. Did you just quietly break the three things that were working? Without an eval, you genuinely do not know. You’re flying blind and calling it confidence.

No safe model upgrades. A new, cheaper, faster model comes out, and they come out constantly now. Is it as good as the one you have in production for your task? “The benchmarks look great” is not an answer. The benchmarks aren’t running your prompt against your data. Your eval is.

No shared definition of “good.” On a team, “the bot feels worse this week” is a vibe, not a bug report. An eval turns that vibe into a number everyone can see, and a number you can argue with is infinitely more useful than a feeling you can’t.
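The regression point is worth a sketch, because the overall score alone can hide a regression. The case ids and pass/fail values below are invented for illustration; in practice they would come from two eval runs over the same cases, one before and one after a prompt or model change.

```ruby
# Hypothetical per-case results from two eval runs over the same cases:
# true means the model returned the expected label for that case.
BASELINE_RESULTS = {
  "case-1" => true, "case-2" => true, "case-3" => false, "case-4" => true
}.freeze
CANDIDATE_RESULTS = {
  "case-1" => true, "case-2" => false, "case-3" => true, "case-4" => true
}.freeze

# Fraction of cases that passed in a run.
def pass_rate(results)
  results.values.count(true).to_f / results.size
end

# Ids of cases that passed before the change and fail after it.
def regressions(before, after)
  before.select { |id, passed| passed && !after.fetch(id) }.keys
end

puts format("baseline %.2f, candidate %.2f",
            pass_rate(BASELINE_RESULTS), pass_rate(CANDIDATE_RESULTS))
puts "regressed: #{regressions(BASELINE_RESULTS, CANDIDATE_RESULTS).inspect}"
```

In this made-up data both runs have the same pass rate, yet one case that used to work is now broken. Comparing per-case results, not just the headline number, is what surfaces it.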

An eval is the seatbelt that lets you actually iterate on an AI feature instead of freezing it in fear the moment it sort-of works. And iteration is the entire game.

A tiny, real app to test

Enough philosophy. Let’s build something small enough to read in one sitting and real enough to be worth testing.

My wife runs a tax-prep practice, and I’m building a little service to qualify the leads that come in through her website’s contact form. So our example is a lead qualifier: given a free-text message a stranger typed into a form, sort it into an intent bucket so the humans know who to call back first.

It’s a perfect specimen for this discussion: it’s genuinely useful, it’s the kind of thing every business actually wants, and it’s exactly the sort of “soft” task that used to require a fragile pile of regexes and now takes one good prompt.

I’m doing this in plain Ruby (no Rails), because it fits my stack and because a single file you can run with ruby is the best kind of example. One gem:

# Gemfile
source "https://rubygems.org"

gem "anthropic" # the official Anthropic Ruby SDK

bundle install
export ANTHROPIC_API_KEY="sk-ant-..." # get one at console.anthropic.com

The qualifier

The whole feature...
