Getting Confidence in (Agentic) Code - UCSD GenAI and Programming SP26
Light (default)
Rust
Coal
Navy
Ayu
UCSD GenAI and Programming SP26
Unit 4: Getting Confidence in (Agentic) Code
section to this file, add a matching ["Label", "anchor-id"] line to<br>that array (in order) or the section won't appear in the sidebar. --><br>As programmers and software engineers, we talk a lot about code being<br>“correct” or “right” or “working”. We ship code, in products or<br>programming assignments, when we feel it's “done” (or when we've reached a<br>deadline), and it invariably has “bugs” – that is, it is not correct in the<br>strict sense.
At the same time, we have a lot of confidence that our code does<br>what we intended, for the most part. After all, we wrote it, and we thought<br>carefully about what it was supposed to do as we developed each helper<br>function, algorithmic loop, and API call.
As systems get larger, it is harder and harder to have this confidence. Web<br>browsers, operating systems, IDEs, and more are vast, complex codebases with<br>millions of lines of code, decades of history, and many authors. Having<br>confidence in these comes from a few main sources:
Trust, in the social sense. A trusted author wrote it, a trusted person did a<br>code review, and so on, and we trust the community's process and collective<br>judgment. Another kind of trust is being around for a while and not<br>changing very much. gcc has earned a certain level of trust because we<br>know it doesn't get overhauled every release and has been a known quantity in<br>production for decades.
Verification, in the software sense. A program may be very complex, but we<br>observe that a simpler artifact, like test cases or a type-based<br>specification, passes when run against the implementation. We can justifiably<br>have confidence in the code up to what we see from the verifier and our read<br>of the spec.
In reality, most large systems do have critical bugs. It's a bummer! However,<br>the software engineering and verification community has made a lot of progress<br>over the past few decades in bringing tool-based confidence to systems. Type<br>systems like Rust's aim to eliminate entire classes of correctness issues<br>related to memory management. Techniques like property-based testing and<br>fuzzing have found many critical bugs before (and after) release. We're<br>striving towards robustness and confidence in our code.
With agents capable of writing orders of magnitude more code than humans in the<br>same amount of time, the calculus of trust and confidence is undergoing a<br>significant shift. In Unit 2 we talked about slowing<br>Claude down to a pace we could review. At least partially, we were forcing<br>ourselves into the social and code review kind of trust. I (Joe) had a lot of<br>domain knowledge about web programming that I could bring to bear on the actual<br>code that was being generated, and acted as a trusted reviewer.
But what about important systems where confidence is critical yet no one person<br>can reasonably review all the code? Clearly agents are capable of building<br>large systems. But can they build large systems while giving us confidence<br>in them?<br>People are surely trying: the Claude C<br>Compiler got<br>widespread attention for trying to reproduce a gcc-like system in an<br>automated way; a rewrite of a JavaScript engine from Zig to<br>Rust was a conversation-starter<br>just this week; Claude documents a “Ralph Wiggum<br>loop” where agents work for a long<br>time, iterating on their own artifacts, until “done”.
We don't have the traditional signals of social trust for this code! In 2020,<br>the existence of a million line codebase implied something about human effort<br>and attention that translated into some “banked” trust. That is not true of<br>million-line codebases generated in days by agents. In those cases, the<br>confidence in the built system rests on the quality of the verifier. That is,<br>what properties need to be true of the system for us to have confidence in<br>it, and what verifiers can we write to check that those properties hold.<br>For example, if the Claude C compiler passes gcc's test suite, we can have<br>some confidence in it.
Q: What might give us more confidence?
In this unit, we are going to explore the interplay between agentic code<br>generation and verifiers. Two common “verifier”/“property” pairs:
Unit tests verify the property of input-output correctness (on a finite<br>set of examples (for a single run))
Type systems (that are sound) verify the property that a variable will<br>always hold a value of a particular type
CS curricula could benefit from including these going forward!<br>The actual space of verifiers is vast – there are static verifiers like<br>type systems or symbolic execution or static analysis, there are dynamic<br>verifiers like predicate checks or valgrind or asan or humble assert<br>statements. There are test-based and input-generation harnesses for<br>these like fuzzers or oracles or handwritten inputs.
Q: Define each of the terms in the preceding paragraph, using a competent<br>model or a web search...