Unit Testing's Eval Twin


If you come from a software background and are starting to integrate agentic AI into your solutions, it can be quite a paradigm shift. Agents are unpredictable black boxes, which can make them feel intractable at times. In this article, I draw on intuition you may already have from traditional testing to help you write effective evals that produce predictable, well-aligned agents.

Your first stab at this will typically involve writing a prompt, running the agent to see what happens, noticing some pathological behaviour, and iterating. This is expensive, time-consuming, and manual. Just as with software testing, you’re going to want to automate this pretty soon. Let's dive into how!

Unit evals

Just like our testing strategy, we think of evals as a pyramid. At the bottom, we have "unit" style evals. These test a single step in an agentic loop. Specifically, they take a transcript and assert that the agent's next step(s) are as we expect, typically by asserting that the agent calls certain tools with specific arguments. The goal here is to verify that the agent is following the instructions in your system prompt, rather than specifically verifying that these lead to the final desired outcome.

I'd recommend:

Think about where in the workflow you want the agent to use the tools provided to it, and provide examples of this in your system prompt.

Generate scenarios (transcripts) that should trigger these tools, and assert that the agent does indeed use them.

In the context of our MCP tooling:

Does the agent read a key memory when starting a new task?

Does the agent continue to read relevant memories as the task evolves, e.g. when moving from implementation to testing?

Does the agent report outdated memories back after completing a task, before responding to a user?
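To make the first of these questions concrete, here is a minimal sketch of a unit eval in Python. Everything in it is an assumption made for illustration rather than our actual harness: the `Message` and `ToolCall` types, the `read_memory` tool name, the `project-conventions` key, and the `scripted_step` function, which stands in for a real model call so the example runs offline. The shape is what matters: a transcript goes in, one step comes out, and we assert on the tool call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class Message:
    role: str  # "system", "user", "assistant" or "tool"
    content: str = ""
    tool_calls: list[ToolCall] = field(default_factory=list)


# A single agent step maps a transcript to the next assistant message.
AgentStep = Callable[[list[Message]], Message]


def assert_calls_tool(step: AgentStep, transcript: list[Message],
                      name: str, **expected_args) -> ToolCall:
    """Run ONE step and check that it calls `name` with the expected arguments."""
    reply = step(transcript)
    for call in reply.tool_calls:
        args_match = all(call.arguments.get(k) == v
                         for k, v in expected_args.items())
        if call.name == name and args_match:
            return call
    raise AssertionError(
        f"expected a call to {name!r} with {expected_args}, got {reply.tool_calls}"
    )


def scripted_step(transcript: list[Message]) -> Message:
    """Hypothetical stand-in for a real model call (replace with your agent)."""
    already_read = any(m.role == "tool" for m in transcript)
    if not already_read:
        call = ToolCall("read_memory", {"key": "project-conventions"})
        return Message("assistant", tool_calls=[call])
    return Message("assistant", content="Starting the task.")


# Scenario: the agent has just been handed a new task.
new_task = [
    Message("system", "Read the 'project-conventions' memory before starting any task."),
    Message("user", "Add retry logic to the HTTP client."),
]

call = assert_calls_tool(scripted_step, new_task, "read_memory",
                         key="project-conventions")
```

With a real model behind `step`, you would typically run each scenario several times and track a pass rate, since a single sample says little about a non-deterministic agent.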

Just like unit tests, these are cheap, and enable you to iterate rapidly. If you have no evals at all, I highly recommend starting here. You'll at least have some evidence that your agent is following the instructions you gave it, if nothing else. However, be aware that you are imposing a lot of your own assumptions on the agent, and you may have been wrong about the best approach to the problem. To begin verifying the final outcome, we need to take a step up the pyramid.

NB: This style of eval is extremely effective, especially for closed tasks and smaller models. However, frontier models often benefit from more freedom. This ties nicely into the ever-evolving debate about harness design, which we will cover in a future article.

Integration evals

The next rung on our pyramid is integration-style evals. At this level, we start introducing tool calling to test a full agentic loop. Our goal here is to test how the agent behaves with a given input, to better understand how the rest of our system needs to behave to set the agent up for success. We're trying to answer: assuming the rest of the system works, does our agent actually achieve its goal? This is where integration tests distinguish themselves from end-to-end tests, which operate on a production-like view of the world.

Use these tests to figure out how the rest of the system needs to work. For example, how should a coordinator agent present a task to a sub-agent to maximise success? What context does it need to include? These tests allow you to quickly identify what shape the output from one agent or tool needs to take, so you can iterate that tool or agent towards an output that works for this agent. You will quickly find that a bit of effort spent enriching or formatting the data you present to the agent yields big improvements in performance.

Additionally, these tests allow you to shake out a lot of assumptions you made at the unit level. Fakes (lightweight working implementations of a dependency) are highly effective here. They allow the agent to interact with a realistic version of the system. By capturing the side effects of the agent's tool calls in these fakes, you can start to validate the end result of a series of steps towards achieving a task. By composing reusable test facets, you can quickly populate these fakes, creating realistic scenarios for your agent.

We have a number of these in our eval suite:

Does our episode extraction retain certain key bits of context?

Does the reflection agent properly create/update reflections, capturing key lessons learnt or other information from the episodes?

Does the coding agent read and properly utilise the information in the memories to write better code?

Do sub-agents behave as expected within a coordinator agent?

This gives a good tradeoff for quickly iterating on the shape of the inputs. For example, does a reflection with this information in it successfully steer the agent towards a better outcome? This allows you to understand not just how to prompt your agent, but also how to present the state of the world to it to set it up for success.
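As a rough illustration of the fakes and facets described in this section, the Python sketch below runs a full loop against an in-memory fake and then asserts on the recorded side effects and the final answer. All names here are hypothetical (`FakeMemoryStore`, `with_memory`, `build_store`, `run_loop`, and the `read_memory` and `report_outdated` tools), and `scripted_agent` stands in for a real model so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeMemoryStore:
    """In-memory stand-in for a memory backend; records every side effect."""
    memories: dict[str, str] = field(default_factory=dict)
    reads: list[str] = field(default_factory=list)
    outdated_reports: list[str] = field(default_factory=list)

    def read_memory(self, key: str) -> str:
        self.reads.append(key)
        return self.memories.get(key, "")

    def report_outdated(self, key: str) -> str:
        self.outdated_reports.append(key)
        return "ok"


# A facet is a small reusable function that adds one slice of state to the fake.
Facet = Callable[[FakeMemoryStore], None]


def with_memory(key: str, text: str) -> Facet:
    def apply(store: FakeMemoryStore) -> None:
        store.memories[key] = text
    return apply


def build_store(*facets: Facet) -> FakeMemoryStore:
    store = FakeMemoryStore()
    for facet in facets:
        facet(store)
    return store


def run_loop(agent, store: FakeMemoryStore, task: str, max_steps: int = 10) -> str:
    """Drive the agent to a final answer, executing tool calls against the fake."""
    transcript = [("user", task)]
    for _ in range(max_steps):
        action = agent(transcript)
        if action["type"] == "final":
            return action["content"]
        result = getattr(store, action["tool"])(**action["args"])
        transcript.append(("tool", result))
    raise RuntimeError("agent did not finish within max_steps")


def scripted_agent(transcript):
    """Hypothetical stand-in for a real model (replace with your agent)."""
    tool_results = [content for role, content in transcript if role == "tool"]
    if len(tool_results) == 0:
        return {"type": "tool", "tool": "read_memory", "args": {"key": "http-client"}}
    if len(tool_results) == 1:
        return {"type": "tool", "tool": "report_outdated", "args": {"key": "http-client"}}
    return {"type": "final", "content": f"Done, following: {tool_results[0]}"}


# Compose facets into a scenario, then run the whole loop.
store = build_store(
    with_memory("http-client", "Use exponential backoff for retries."),
    with_memory("style", "Prefer small functions."),
)
answer = run_loop(scripted_agent, store, "Add retry logic to the HTTP client.")

# Assert on the outcome and the captured side effects, not on each step.
assert store.reads == ["http-client"]
assert store.outdated_reports == ["http-client"]
assert "exponential backoff" in answer
```

Because the scenario is assembled from facets, trying a differently worded or differently structured memory is a one-line change, which is what makes iterating on the shape of the inputs quick at this level.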

At this level, the differences...
