Getting Confidence in (Agentic) Code

UCSD GenAI and Programming SP26

Unit 4: Getting Confidence in (Agentic) Code

As programmers and software engineers, we talk a lot about code being “correct” or “right” or “working”. We ship code, in products or programming assignments, when we feel it's “done” (or when we've reached a deadline), and it invariably has “bugs” – that is, it is not correct in the strict sense.

At the same time, we have a lot of confidence that our code does what we intended, for the most part. After all, we wrote it, and we thought carefully about what it was supposed to do as we developed each helper function, algorithmic loop, and API call.

As systems get larger, it is harder and harder to have this confidence. Web browsers, operating systems, IDEs, and more are vast, complex codebases with millions of lines of code, decades of history, and many authors. Having confidence in these comes from a few main sources:

Trust, in the social sense. A trusted author wrote it, a trusted person did a code review, and so on, and we trust the community's process and collective judgment. Another kind of trust comes from being around for a while and not changing very much: gcc has earned a certain level of trust because we know it doesn't get overhauled every release and it has been a known quantity in production for decades.

Verification, in the software sense. A program may be very complex, but we observe that a simpler artifact, like test cases or a type-based specification, passes when run against the implementation. We can justifiably have confidence in the code up to what we see from the verifier and our read of the spec.

In reality, most large systems do have critical bugs. It's a bummer! However, the software engineering and verification community has made a lot of progress over the past few decades in bringing tool-based confidence to systems. Type systems like Rust's aim to eliminate entire classes of correctness issues related to memory management. Techniques like property-based testing and fuzzing have found many critical bugs before (and after) release. We're striving towards robustness and confidence in our code.

With agents capable of writing orders of magnitude more code than humans in the same amount of time, the calculus of trust and confidence is undergoing a significant shift. In Unit 2 we talked about slowing Claude down to a pace we could review. At least partially, we were forcing ourselves into the social and code review kind of trust. I (Joe) had a lot of domain knowledge about web programming that I could bring to bear on the actual code that was being generated, and acted as a trusted reviewer.

But what about important systems where confidence is critical yet no one person can reasonably review all the code? Clearly agents are capable of building large systems. But can they build large systems while giving us confidence in them? People are surely trying: the Claude C Compiler got widespread attention for trying to reproduce a gcc-like system in an automated way; a rewrite of a JavaScript engine from Zig to Rust was a conversation-starter just this week; Claude documents a “Ralph Wiggum loop” where agents work for a long time, iterating on their own artifacts, until “done”.

We don't have the traditional signals of social trust for this code! In 2020, the existence of a million-line codebase implied something about human effort and attention that translated into some “banked” trust. That is not true of million-line codebases generated in days by agents. In those cases, the confidence in the built system rests on the quality of the verifier. That is, what properties need to be true of the system for us to have confidence in it, and what verifiers can we write to check that those properties hold? For example, if the Claude C Compiler passes gcc's test suite, we can have some confidence in it.

Q: What might give us more confidence?

In this unit, we are going to explore the interplay between agentic code generation and verifiers. Two common “verifier”/“property” pairs:

Unit tests verify the property of input-output correctness (on a finite set of examples, for a single run)

Type systems (that are sound) verify the property that a variable will always hold a value of a particular type
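To make the limits of these two pairs concrete, here is a minimal sketch in Python. The function `largest` and its test are hypothetical, invented for illustration, and the bug is planted on purpose. The unit test checks three input-output examples and passes, and the `-> int` annotation holds on every path, yet the function is still wrong on inputs that neither verifier covers.

```python
def largest(xs: list[int]) -> int:
    """Hypothetical helper: return the largest element of a non-empty list."""
    best = 0  # Planted bug: wrong starting value when every element is negative.
    for x in xs:
        if x > best:
            best = x
    return best


def test_largest_examples() -> None:
    # A unit test: a finite set of input-output examples, checked on this run.
    assert largest([1, 2, 3]) == 3
    assert largest([7]) == 7
    assert largest([4, 9, 2]) == 9


test_largest_examples()
print("all unit tests passed")

# The annotation "-> int" is satisfied here too, but the answer is wrong:
print(largest([-3, -1]))  # prints 0, although the largest element is -1
```

Both verifiers report success, so our confidence extends only as far as the properties they check: these three examples, and the fact that the result is an integer.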

CS curricula could benefit from including these going forward! The actual space of verifiers is vast – there are static verifiers like type systems or symbolic execution or static analysis, and there are dynamic verifiers like predicate checks or Valgrind or ASan (AddressSanitizer) or humble assert statements. There are test-based and input-generation harnesses for these, like fuzzers or oracles or handwritten inputs.
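As one small illustration of how a few of these pieces fit together, the sketch below pairs a random input generator (in the spirit of a fuzzer, though far simpler than a real coverage-guided one) with an oracle and a predicate check. Everything here is an assumption made for the example: `largest` is a hypothetical, deliberately buggy implementation under test, Python's built-in `max` plays the role of the oracle, and `is_valid_maximum` is a hand-written predicate.

```python
import random
from typing import Callable, Optional


def largest(xs: list[int]) -> int:
    """Hypothetical implementation under test (deliberately buggy)."""
    best = 0  # Planted bug: wrong when every element is negative.
    for x in xs:
        if x > best:
            best = x
    return best


def is_valid_maximum(xs: list[int], result: int) -> bool:
    """Predicate check: result is an element of xs and no element exceeds it."""
    return result in xs and all(x <= result for x in xs)


def find_counterexample(
    impl: Callable[[list[int]], int],
    trials: int = 1000,
    seed: int = 0,
) -> Optional[list[int]]:
    """Generate random non-empty lists and return the first one on which impl
    disagrees with the oracle (built-in max) or fails the predicate check."""
    rng = random.Random(seed)  # Seeded so that a failure can be reproduced.
    for _ in range(trials):
        xs = [rng.randint(-10, 10) for _ in range(rng.randint(1, 5))]
        result = impl(xs)
        if result != max(xs) or not is_valid_maximum(xs, result):
            return xs
    return None


counterexample = find_counterexample(largest)
print("counterexample:", counterexample)
```

The oracle and the predicate are two different ways of stating the property. The oracle needs a trusted reference implementation, while the predicate only needs a description of what a correct answer looks like, which is often all we have for new code.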

Q: Define each of the terms in the preceding paragraph, using a competent model or a web search...
