Teaching AI to Reason About Software

Teaching AI to Reason About Software - Soteria

Back to Blog Teaching AI to Reason About Software A team at AWS taught a small language model to catch bugs in C by training it on Soteria’s symbolic execution traces, beating a model four times its size. We look at what they did and why it matters.

Azalea Raad, CEO · 1 July 2026 RSS A team at Amazon Web Services (AWS), working independently of Soteria, recently published research that uses Soteria to teach AI models to reason about code. Their headline result is striking: by continuing to train an 8-billion-parameter model on just a few thousand symbolic execution traces from Soteria, they made it better at catching bugs in C than a model four times its size. Specifically, by training models on the rich semantic execution traces produced by Soteria, they showed significant improvements in the models’ ability to understand code and detect correctness violations. In other words, they showed that Soteria can teach AI how programs actually work! In this post, we take a closer look at that work, explain why Soteria’s traces are uniquely valuable as AI training data and explore what this means for the future of AI-assisted software engineeing. Today's AI training Source Code Model Training Code Generation

With Soteria Source Code Symbolic Execution Semantic Traces Model Training Program Reasoning Better Code Generation

AI Can Write Code. Can It Tell Whether Code Is Correct?

Modern AI coding assistants are transforming the way and pace at which software is developed, allowing engineers to move faster than ever before. They can generate entire functions, implement complex APIs and produce substantial software systems from a short prompt. Despite this progress, today’s AI systems still suffer from a fundamental limitation. They can generate code that looks correct, but they often struggle to understand whether that code is actually correct. Large language models learn from enormous quantities of source code. They see billions of lines of software from repositories, documentation, tutorials and technical discussions. However, software engineering is not simply about producing syntactically valid code. A human software engineer develops an understanding of programs by observing how they run. We learn through debugging sessions, execution traces, test failures and careful inspection of program state. This process teaches us something deeper than syntax. It teaches us semantics! Furthermore, today’s frontier models (such as Mythos) have demonstrated just how powerful AI can be when applied to program analysis and bug detection. However, these models are extremely large and expensive to train and deploy. If we can teach smaller, more efficient models to reason about programs just as effectively, or further improve the reasoning capabilities of state of the art models such as Mythos, we can make high quality AI-assisted software verification both more capable and significantly more accessible. Mind The Gap

The AWS researchers measured this gap directly. They built an evaluation of 500 verification tasks in C, covering memory safety, overflows, termination, reachability and data races, drawn from SV-COMP 2025 , a software-verification benchmark. For each task, a model has to decide one thing: does the property hold, or is it violated? Across 14 models from six families, a clear pattern emerged: models are very good at confirming that a property holds (most scoring above 90%) but much worse at detecting when one is violated. Four of the 14 models caught fewer than half of the real bugs, and accuracy fell off sharply as programs grew longer: a model dropped below 10% on programs of just 100–200 lines! Here is the kind of program that trips models up: terminate.c 1 extern unsigned int nondet_uint(void); 2 3 int main() { 4 unsigned int x = nondet_uint(); 5 if (x > 0) { 6 while (x != 0) { 7 x = x - 2; 8 } 9 } 10 return 0; 11 }

Does this loop always terminate? When x is odd, the repeated subtraction wraps around past zero and the loop runs forever. One 675-billion-parameter model spotted the unsigned wrapping but still concluded, incorrectly, that the loop must reach zero. The code is short; it’s the reasoning about how it executes that’s hard. That is the gap. AI has learned to write code. The harder problem is teaching it to understand how code runs, and once it does, to find bugs before they reach production. Making Program Semantics Visible

This is where Soteria enters the picture. Soteria is a symbolic execution tool designed to reason about software behaviour at a deep semantic level. Rather than executing a program using a single concrete input, symbolic execution systematically explores many possible execution paths and reasons about what can happen across all of them. (You can read more about Soteria in our previous post). Along the way, Soteria records a great deal: program states, symbolic values, path conditions, branch decisions and the precise circumstances...

Teaching AI to Reason About Software

Related Articles

(no title)

Is AI ruining our skills? Early results are in – and they're not good

The Anatomy of an AI-Native Org

ZCode – Harness for GLM-5.2

Apertus – Open Foundation Model for Sovereign AI