We got our AI code reviewer to run your code

Building TREX: Code Execution and Artifact Generation for AI Code Review | Greptile


Building TREX: Code execution and artifact generation for AI code review
Shlok Mehrotra | 2026-06-17


I'm Shlok, a software engineer at Greptile. We recently built a code reviewer that, in addition to reviewing pull requests, actually runs the code and shows you what went wrong.

In 1976, Michael Fagan published a paper introducing formal code inspection at IBM. Developers would print out listings, sit in a room together, and read through the code line by line.

Today we still read a diff on a screen. AI tools have made that faster, though most of them are still just reading the code. This approach works for a lot of bugs, the ones that announce themselves plainly in code.

The problem is that there's a whole category of bugs that don't show up in the code at all; they exist only when the program is running. Think of the logic error that needs a specific sequence of state, the UI regression that appears after the page loads, or the race condition that needs a real request. You can read the diff perfectly and still miss these bugs completely.

Static code review has a ceiling. It can reason about what the code says. It can't tell you what it does. TREX (which stands for "Test, Run, Execute") is Greptile's response to that ceiling: an execution layer built directly into code review.

Orchestrating agents without wasting context

TREX started as a product completely separate from Greptile: a standalone agent that generated and ran tests. We hoped that bugs would surface as a result. They didn't. Generating tests wasn't the same activity as finding bugs. When the separate TREX agent wrote tests, they weren't relevant to what the user was trying to do. This created unnecessary noise, and the tests still missed edge cases. It sounds obvious in hindsight, but the lesson took us longer to learn than we expected.

We'd built these agents to be separate on the assumption that separation would give each agent its own context window. It also meant the two agents ran without sharing knowledge. They often overlapped, exploring the same parts of the codebase twice without either knowing what the other had already found, which wasted compute.

The obvious fix seemed to be combining them into one agent. We tried that and ran into a different problem: a single agent handling the full review got overloaded. Between spinning up services, taking screenshots, and running tests, there was too much context for one agent to manage cleanly.

The solution was to have TREX share the main Greptile reviewer's context rather than exist as an entirely separate product. It was the first time we were managing agents from within an agent. Unlike an independent agent, TREX doesn't start from scratch: it inherits what the Greptile reviewer agent already found, has its own context window, and is scoped to the specific problem it's been asked to investigate.

The Greptile reviewer agent acts as an orchestrator. It reads the diff, identifies issues worth investigating, and spins up a dedicated TREX agent per issue, all running in parallel. The TREX agents have the liberty, the compute, and the knowledge of the orchestrator agent.
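The fan-out described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Greptile's implementation: `Issue`, `SubagentResult`, `run_subagent`, and `orchestrate` are hypothetical names, and the subagent body is a placeholder where real work (starting services, running tests) would go. It shows the three properties named in the text: one subagent per issue, concurrent execution, and a read-only copy of the reviewer's notes handed to each subagent alongside its own private transcript.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Issue:
    """One spot the reviewer wants checked by running code (hypothetical type)."""
    file: str
    summary: str


@dataclass
class SubagentResult:
    issue: Issue
    inherited_notes: tuple[str, ...]  # what the reviewer already found
    transcript: list[str] = field(default_factory=list)  # private to this subagent


async def run_subagent(issue: Issue, reviewer_notes: tuple[str, ...]) -> SubagentResult:
    """Stand-in for one execution agent scoped to a single issue."""
    # A fresh transcript per subagent models "its own context window".
    result = SubagentResult(issue=issue, inherited_notes=reviewer_notes)
    result.transcript.append(f"investigating {issue.file}: {issue.summary}")
    await asyncio.sleep(0)  # placeholder for real work
    result.transcript.append("done")
    return result


async def orchestrate(issues: list[Issue], reviewer_notes: list[str]) -> list[SubagentResult]:
    """Fan out one subagent per issue and run them concurrently."""
    notes = tuple(reviewer_notes)  # immutable snapshot, shared read-only
    # gather() returns results in the same order as the issues passed in.
    return list(await asyncio.gather(*(run_subagent(i, notes) for i in issues)))


if __name__ == "__main__":
    demo_issues = [
        Issue("checkout.py", "total may go negative"),
        Issue("Modal.tsx", "overlay hidden behind header"),
    ]
    for r in asyncio.run(orchestrate(demo_issues, ["feature is behind an auth gate"])):
        print(r.issue.file, r.transcript)
```

Passing the notes as an immutable tuple is one way to let subagents inherit knowledge without being able to overwrite each other's view of it; whether the real system shares context this way is not stated in the post.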

A good example of this is a UI feature hidden behind an auth gate. Testing it locally means setting up the environment, handling authentication, and getting the feature flag into the right state. A subagent figures all of that out on its own and comes back with a screenshot of the rendered feature.

Designing multi-modal artifacts to show the work

The first version of TREX reported findings as bullet points listing what was tested and what happened. This was a reasonable starting point, but it didn't give reviewers enough information.

An agent or a human reviewer reading a bullet point like "Tested the checkout flow, found failure" wouldn't find it very useful. They couldn't tell where in the process something went wrong. If the test failed, was it the setup? The assertion? An environment issue? We also found that an early version of the agent would sometimes hallucinate how thoroughly it had tested something, claiming to have tried things it hadn't. Bullet points gave us no way to verify those claims.

The fix was to back the bullet point list with a multi-modal artifact set for each TREX finding: screenshots, logs, API traces, execution scripts. Each modality covers a different part of the story. Having a comprehensive picture of everything that was tested for a specific issue is what actually matters.
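One way to picture "bullet points backed by artifacts" is as a small data model in which every claim carries the evidence that supports it. The sketch below is an assumption made for illustration, not TREX's actual schema: `ArtifactKind`, `Artifact`, `Claim`, `Finding`, and the file paths are all hypothetical. The point it shows is that once claims and artifacts are linked, an unsupported claim (the kind an agent might hallucinate) can be detected mechanically.

```python
from dataclasses import dataclass, field
from enum import Enum


class ArtifactKind(Enum):
    # The four modalities listed in the text.
    SCREENSHOT = "screenshot"
    LOG = "log"
    API_TRACE = "api_trace"
    SCRIPT = "script"


@dataclass(frozen=True)
class Artifact:
    kind: ArtifactKind
    path: str  # where the captured file lives (hypothetical layout)


@dataclass
class Claim:
    """One bullet point, plus whatever evidence backs it."""
    text: str
    evidence: list[Artifact] = field(default_factory=list)


@dataclass
class Finding:
    title: str
    claims: list[Claim]

    def unverified_claims(self) -> list[Claim]:
        """Claims with no artifact attached, so a reader cannot check them."""
        return [c for c in self.claims if not c.evidence]


if __name__ == "__main__":
    finding = Finding(
        title="Checkout flow fails",
        claims=[
            Claim("Ran the checkout script", [Artifact(ArtifactKind.SCRIPT, "run/checkout.sh")]),
            Claim("Tried an expired card"),  # no evidence attached
        ],
    )
    for claim in finding.unverified_claims():
        print("no evidence for:", claim.text)
```

A reviewer, human or agent, could treat anything returned by `unverified_claims()` as untested until an artifact is attached.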

The first artifact that made us say "Wow" was video. If you push an animation change, TREX captures a video of it playing. You can see exactly what the animation looks like without...
