A framework for verifiable analysis of AI behavior

Introducing Analysis Plans

The Docent Team*
* Correspondence to: selena@transluce.org

Transluce | Published: June 17, 2026

Developing an AI agent is a complex data analysis problem. To know whether the agent is working correctly, we need to track not just benchmark scores but the details of its behavior: How do its strategies change over the course of training? Why does the new scaffold perform worse? Is there reward hacking? Answering these questions requires a combination of quantitative and qualitative analysis tailored to the dataset at hand.

Coding agents have the potential to accelerate this work, but they’re prone to subtle mistakes: they might parse data incorrectly, make unjustified assumptions, or cherry-pick examples and present a misleading narrative. These mistakes aren’t apparent in the final output. To trust a conclusion, we need to verify exactly how it was produced. But reviewing all the actions taken by a coding agent is tedious—crucial methodology decisions get buried in hundreds of lines of logs.

This problem motivated us to develop analysis plans, a framework for verifiable analysis of AI behavior. Analysis plans are specified in a Python API that any coding agent can work with. When a plan is ready to run, it appears in a web interface that lets humans understand and verify every step.

Analysis plans contain two types of steps:

Query steps filter, group, and join your data using DQL (Docent’s subset of SQL). Each step is displayed with its query and an interactive table of the results.

Reading steps use an LLM to analyze data from a query step, producing a text summary and/or a structured judgment. Claims made by the LLM come with citations to specific items in its context.

These two step types can be customized and combined to build complex analysis pipelines. Readings can accept any data that a query produces, and queries can run over any reading results. At each step, results are traced to the exact computation that produced them, enabling you to inspect, audit, and refine the flow.
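To make the composition concrete, here is a minimal, library-free sketch of how query and reading steps might chain while keeping provenance. It does not use the Docent SDK: `StepResult`, `query_step`, `reading_step`, and `lineage` are hypothetical names, the "reading" is a plain Python function standing in for an LLM call, and the data is invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StepResult:
    """Rows produced by one step, plus a pointer to the step that fed it."""
    name: str
    kind: str                      # "query" or "reading"
    rows: tuple
    parent: Optional["StepResult"] = None


def query_step(name, rows, predicate, parent=None):
    # Stand-in for a DQL query: keep only the rows matching the predicate.
    kept = tuple(r for r in rows if predicate(r))
    return StepResult(name, "query", kept, parent)


def reading_step(name, source, judge):
    # Stand-in for an LLM reading: `judge` maps one row to a judgment dict.
    judged = tuple({**row, **judge(row)} for row in source.rows)
    return StepResult(name, "reading", judged, source)


def lineage(result):
    # Walk back through parents to list every step behind a result.
    chain = []
    while result is not None:
        chain.append(f"{result.kind}:{result.name}")
        result = result.parent
    return list(reversed(chain))


# Toy data, invented purely for illustration.
RUNS = [
    {"run": "a", "resolved": 1.0, "edited_tests": True},
    {"run": "b", "resolved": 1.0, "edited_tests": False},
    {"run": "c", "resolved": 0.0, "edited_tests": True},
]

resolved = query_step("resolved runs", RUNS, lambda r: r["resolved"] == 1.0)
flags = reading_step("flag cheating", resolved,
                     lambda r: {"suspicious": r["edited_tests"]})
suspicious = query_step("suspicious only", flags.rows,
                        lambda r: r["suspicious"], parent=flags)

print([r["run"] for r in suspicious.rows])
print(lineage(suspicious))
```

The final result carries a chain back through the reading to the original query, which is the property that lets a reviewer audit how any one row was produced.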

Let’s see what this looks like by detecting cheating on a popular software engineering benchmark.

Demo: identifying suspicious behaviors in SWE-rebench

Cheating is a recurring problem when interpreting evaluation results: models famously hard-code tests, falsely claim success, and exploit unclean environments to copy solutions. Measuring rates of cheating is essential for understanding how much of a benchmark score represents a valid demonstration of model capability. In about 15 minutes, we used Docent to discover instances of cheating on SWE-rebench, a software engineering benchmark that measures how many recent GitHub PRs an agent can resolve. You can view the SWE-rebench traces in Docent at this link.

1. Explain to your coding agent what you want to learn

We start by prompting Claude Code to score agent runs for potential cheating. Thanks to the Docent skill, Claude knows how to turn our question into an analysis plan. It writes a short Python script like the following.

```python
from docent import Docent

client = Docent()

# Query step: select resolved runs, ordered by ID, capped at 200.
runs = client.query(
    COLLECTION_ID,
    "SELECT agent_runs.id AS run FROM agent_runs "
    "WHERE CAST(metadata_json->'scores'->>'resolved' AS DOUBLE PRECISION) = 1.0 "
    "ORDER BY agent_runs.id LIMIT 200",
    name="Sample 200 resolved runs by UUID",
)

DETECTOR_PROMPT = "..."  # omitted for brevity
OUTPUT_SCHEMA = { ... }  # omitted for brevity

# Reading step: score each selected run for cheating.
detect = client.read(
    prompt_template=[runs.run.as_type("agent_run"), DETECTOR_PROMPT],
    model="openai/gpt-5.5",
    reasoning_effort="medium",
    output_schema=OUTPUT_SCHEMA,
    name="Flag cheating from trajectory",
)
```

When Claude runs this script, the Docent SDK doesn’t execute readings immediately. Instead, it builds up a graph of all the readings that are being requested (in this case, just one) and uploads it to Docent as an analysis plan. This lets us review the proposed analysis before waiting for LLM calls to complete.
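The deferred behavior described above can be pictured with a small toy model. This is only a sketch of the general pattern (record requests first, execute after approval), not how the Docent SDK is implemented: `Plan`, `PlannedReading`, and `fake_model` are hypothetical names, and the model is a stub function rather than a real LLM call.

```python
from dataclasses import dataclass, field


@dataclass
class PlannedReading:
    name: str
    prompt: str
    status: str = "pending"     # becomes "done" only after the plan runs
    result: object = None


@dataclass
class Plan:
    readings: list = field(default_factory=list)
    approved: bool = False

    def read(self, name, prompt):
        # Record the request; do not call any model yet.
        node = PlannedReading(name, prompt)
        self.readings.append(node)
        return node

    def approve(self):
        self.approved = True

    def run(self, model):
        # Execute recorded readings only once the plan has been approved.
        if not self.approved:
            raise PermissionError("plan has not been approved")
        for node in self.readings:
            node.result = model(node.prompt)
            node.status = "done"


calls = []


def fake_model(prompt):
    # Stand-in for an LLM call; records that it was invoked.
    calls.append(prompt)
    return f"judgment for: {prompt}"


plan = Plan()
detect = plan.read("Flag cheating from trajectory", "Did this run cheat?")
print(detect.status, len(calls))   # the reading is recorded, not executed

plan.approve()
plan.run(fake_model)
print(detect.status, len(calls))
```

Separating "describe the work" from "do the work" is what makes a review step possible: the expensive calls happen only after someone has looked at the full list of requested readings.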

2. Anatomy of an analysis plan

Figure: An analysis plan awaiting approval.

This analysis plan starts with a DQL query to select successful runs by filtering the agent run metadata to resolved=1. After that, it passes the results to a reading step, which scores runs for cheating.
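For readers who want to see what the filtering query does, the sketch below reproduces its shape against an in-memory SQLite database from the Python standard library. It is an approximation under stated assumptions: the table layout and rows are invented, SQLite stands in for DQL, and `json_extract` with a `REAL` cast replaces the `->` / `->>` operators and `DOUBLE PRECISION` cast used in the plan's query.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agent_runs (id TEXT PRIMARY KEY, metadata_json TEXT)")

# Toy rows, invented for illustration only.
toy_runs = [
    ("run-3", {"scores": {"resolved": 1.0}}),
    ("run-1", {"scores": {"resolved": 0.0}}),
    ("run-2", {"scores": {"resolved": 1.0}}),
]
conn.executemany(
    "INSERT INTO agent_runs VALUES (?, ?)",
    [(run_id, json.dumps(meta)) for run_id, meta in toy_runs],
)

# Same shape as the plan's query: keep resolved = 1, order by id, cap the sample.
rows = conn.execute(
    "SELECT agent_runs.id AS run FROM agent_runs "
    "WHERE CAST(json_extract(metadata_json, '$.scores.resolved') AS REAL) = 1.0 "
    "ORDER BY agent_runs.id LIMIT 200"
).fetchall()

resolved_ids = [row[0] for row in rows]
print(resolved_ids)
```

Ordering by ID before applying the limit makes the selection reproducible: rerunning the query over the same data returns the same runs.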

A reading step has the following components:

A prompt template with parameters. To detect cheating, we create a rubric that defines suspiciousness, taking one agent run as a parameter. The Docent UI shows the full prompt template, with parameters represented inline. Clicking on the "Context: run" chip below exposes additional detail on what run-level metadata is passed to the LLM. By default, no metadata is passed.

Arguments from a previous query step. The previous query step in our plan produced a table of runs where resolved=1. For each row in this table, Docent will substitute the full text of the...
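The per-row substitution can be sketched as a simple template expansion. This is an illustration only, assuming the parameter is replaced by text looked up for the run that each row refers to; `RUN_PARAM`, `expand`, and the transcripts are hypothetical and do not reflect Docent's internals.

```python
RUN_PARAM = object()  # hypothetical marker meaning "insert this row's run here"


def expand(template, rows, load_text):
    """Build one prompt per row by replacing the marker with that row's run text."""
    prompts = []
    for row in rows:
        parts = [load_text(row["run"]) if part is RUN_PARAM else part
                 for part in template]
        prompts.append("\n\n".join(parts))
    return prompts


# Toy transcripts and rows, invented for illustration.
TRANSCRIPTS = {"a": "transcript A", "b": "transcript B"}
template = [RUN_PARAM, "Rubric: flag suspicious behavior."]
query_rows = [{"run": "a"}, {"run": "b"}]

prompts = expand(template, query_rows, lambda run_id: TRANSCRIPTS[run_id])
print(len(prompts))
print(prompts[0])
```

Each row of the query result yields exactly one prompt, so the number of LLM calls a reading will make can be read directly off the row count of the step that feeds it.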
