Why Traditional Testing Doesn't Work for AI Applications

Evals: The Unit Tests for the Non-Deterministic Parts of Your App

June 22, 2026 · Konstantin Gredeskoul · kig.re

For about twenty-five years, my mental model of “is this code correct?” was simple and comforting: feed it a known input, assert on a known output, watch the dot turn green. 2 + 2 had better be 4 every single time, or someone has done something unspeakable to my computer.

Then I started wiring language models into real applications, and that comforting model quietly fell apart. You send the same prompt twice and get two different answers. Both might be correct. One might be subtly, expensively wrong. And there’s no exception, no stack trace, no red dot — just a confident paragraph of text that happens to be hallucinated nonsense.

So how do you test the part of your app that refuses to be deterministic?

You write evals. This post is about what they are, why they suddenly matter, and how to write one, with a tiny but real Ruby app and an eval harness that tests it from end to end. No ML PhD required. If you can write an RSpec test, you can write an eval.

What’s actually new here?

Let’s be precise about what changed, because the hype machine is bad at this.

For a normal function, the mapping from input to output is fixed. slugify("Hello World") returns "hello-world" today, tomorrow, and on the heat-death afternoon of the universe. Your test pins that mapping in place, and any future change that breaks it turns the dot red.
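To make the contrast concrete, here is a minimal sketch of that kind of pinned test. The `slugify` below is a hypothetical implementation written only for illustration; the point is that calling it a thousand times yields exactly one distinct result, which is what makes an equality assertion safe.

```ruby
# Hypothetical slugify, written only to illustrate a deterministic mapping.
def slugify(text)
  text.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-+|-+\z/, "")
end

# The same input yields the same output on every call,
# so asserting on exact equality is safe.
results = Array.new(1_000) { slugify("Hello World") }
raise "slugify is not deterministic" unless results.uniq == ["hello-world"]

puts "ok: #{results.uniq.inspect}"
```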

A language model is not a function in this sense. It’s a sample from a probability distribution over possible responses. Ask Claude to classify a customer message and it’ll usually give you the right label — but “usually” is doing a lot of work in that sentence, and the exact words, formatting, and edge-case behavior will drift:

when you reword the prompt,

when you upgrade the model,

when the input is slightly weird in a way you didn’t anticipate.
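One way to picture "a sample from a probability distribution" is the toy simulation below. It does not call any model: `fake_classify`, the labels, and the weights are all invented for illustration, and a seeded random number generator stands in for the model's sampling. Tallying repeated calls on one unchanged input shows why a single spot check tells you little.

```ruby
# Stand-in for a model call: NOT a real API. It samples a label from a fixed,
# made-up distribution to mimic how repeated calls on one prompt can disagree.
def fake_classify(_message, weights, rng)
  roll = rng.rand
  cumulative = 0.0
  weights.each do |label, weight|
    cumulative += weight
    return label if roll < cumulative
  end
  weights.keys.last # guard against floating-point rounding
end

# Hypothetical labels and probabilities, chosen only for the demonstration.
weights = { "complaint" => 0.9, "question" => 0.07, "spam" => 0.03 }
rng = Random.new(42) # seeded so the simulation is repeatable

# Same input, 200 calls: count how often each label comes back.
tally = Array.new(200) { fake_classify("My order arrived broken.", weights, rng) }.tally
puts tally
```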

NOTE

This is the uncomfortable part for those of us with grey in our hair: your test suite can be 100% green and your feature can still be getting worse. The determinism we leaned on for decades doesn’t extend past the API boundary into the model. Evals are how we get a measurement back, and the confidence that comes with it.

An eval (short for evaluation) is a test for non-deterministic, model-driven behavior. Instead of asserting “output == expected,” it asks a softer but far more useful question: across a representative set of inputs, how often does the model do the right thing — and is that good enough to ship?

If unit tests pin behavior, evals measure it. That shift — from a boolean to a score with a threshold — is the whole idea.
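The shape of that idea fits in a few lines. The sketch below is an assumption-laden illustration, not the harness from this post: the examples, the labels, the keyword-based `stub_classifier` (standing in for a model so the code runs offline), and the 0.9 threshold are all hypothetical.

```ruby
# Hypothetical labelled examples; a real dataset would come from actual traffic.
examples = [
  { input: "What are your opening hours?", expected: "question" },
  { input: "My order arrived broken.",     expected: "complaint" },
  { input: "CHEAP WATCHES click here",     expected: "spam" },
  { input: "Do you ship to Canada?",       expected: "question" }
]

# Runs the classifier over every example and returns the fraction it got right.
def accuracy(examples, classifier)
  correct = examples.count { |ex| classifier.call(ex[:input]) == ex[:expected] }
  correct.to_f / examples.size
end

# Keyword stub standing in for the model so the sketch runs without an API key.
stub_classifier = lambda do |text|
  case text
  when /click here/i    then "spam"
  when /broken|refund/i then "complaint"
  else "question"
  end
end

threshold = 0.9 # assumed bar; pick yours from what the task can tolerate
score = accuracy(examples, stub_classifier)
puts format("accuracy %.2f (threshold %.2f)", score, threshold)
puts score >= threshold ? "PASS" : "FAIL"
```

Swapping the stub for a real model call leaves `accuracy` and the threshold check unchanged, which is the property that makes the score comparable from one run to the next.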

Why you can’t skip them

I know the temptation, because I’ve given in to it. You paste a clever prompt into the playground, try four or five examples, they all look great, you ship it, you move on. The LLM equivalent of “works on my machine.”

Here’s what that costs you later:

No regression detection. You tweak the prompt to fix one annoying edge case. Did you just quietly break the three things that were working? Without an eval, you genuinely do not know. You’re flying blind and calling it confidence.

No safe model upgrades. A new, cheaper, faster model comes out — and they come out constantly now. Is it as good as the one you have in production for your task? “The benchmarks look great” is not an answer. The benchmarks aren’t running your prompt against your data. Your eval is.

No shared definition of “good.” On a team, “the bot feels worse this week” is a vibe, not a bug report. An eval turns that vibe into a number everyone can see, and a number you can argue with is infinitely more useful than a feeling you can’t.

An eval is the seatbelt that lets you actually iterate on an AI feature instead of freezing it in fear the moment it sort-of works. And iteration is the entire game.
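The regression cost is easy to underestimate, so here is a small sketch of what an eval lets you see. The per-example results are invented for illustration (a real run would produce them), and `pass_rate` and `regressions` are hypothetical helper names. The baseline and candidate could be two prompt versions or two models; the comparison is the same.

```ruby
# Each entry: example id => whether that variant answered it correctly.
# These results are invented for illustration; a real eval run would produce them.
baseline  = { "msg-01" => true, "msg-02" => true,  "msg-03" => false, "msg-04" => true }
candidate = { "msg-01" => true, "msg-02" => false, "msg-03" => true,  "msg-04" => true }

# Fraction of examples a variant got right.
def pass_rate(results)
  results.values.count(true).to_f / results.size
end

# Examples that passed before and fail now: the quiet breakage a spot check misses.
def regressions(before, after)
  before.select { |id, passed| passed && after[id] == false }.keys
end

puts format("baseline %.2f, candidate %.2f", pass_rate(baseline), pass_rate(candidate))
puts "regressions: #{regressions(baseline, candidate).inspect}"
```

In this made-up data both variants score 0.75, yet `msg-02` went from passing to failing. The aggregate number alone would have hidden that, which is why keeping per-example results is worth the small extra bookkeeping.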

A tiny, real app to test

Enough philosophy. Let’s build something small enough to read in one sitting and real enough to be worth testing.

My wife runs a tax-prep practice, and I’m building a little service to qualify the leads that come in through her website’s contact form. So our example is a lead qualifier: given a free-text message a stranger typed into a form, sort it into an intent bucket so the humans know who to call back first.

It’s a perfect specimen for this discussion — it’s genuinely useful, it’s the kind of thing every business actually wants, and it’s exactly the sort of “soft” task that used to require a fragile pile of regexes and now takes one good prompt.

I’m doing this in plain Ruby (no Rails), because it fits my stack and because a single file you can run with ruby is the best kind of example. One gem:

    # Gemfile
    source "https://rubygems.org"

    gem "anthropic" # the official Anthropic Ruby SDK

    bundle install
    export ANTHROPIC_API_KEY="sk-ant-..." # get one at console.anthropic.com

The qualifier

The whole feature...
