AI-generated code that works – and proves it

AI-generated code that works — and proves it | Code Metal Skip to main contentAI-generated code that works — and proves it Combining AI with formal methods to build trusted code translation systems for safety- and mission-critical software.

Published on May 18, 2026 ← Back to Research How do you know that code behaves as expected? For decades, the standard answer in software engineering has been straightforward: write tests, run them, and check that they pass. But testing only tells you that a program behaved correctly on the inputs that were tried. It says nothing about all the inputs that were not. As Edsger Dijkstra famously observed more than fifty years ago: “Program testing can be used to show the presence of bugs, but never to show their absence.” While testing remains one of the most effective tools for discovering bugs, safety- and mission-critical systems cannot rely on post-deployment patches when failures occur. A bug in avionics software, semiconductor tooling, embedded systems, financial infrastructure, or autonomous systems can have severe consequences. In these environments, the question is not merely whether software appears to work during testing, but whether we can establish stronger guarantees about all possible behaviors of the system. That is the role of formal methods. Formal methods are mathematically rigorous techniques for proving that a program satisfies a specification. Rather than reasoning only about observed executions, formal methods reason about every execution permitted by the system. For decades, these techniques have been used in some of the world’s most demanding software environments, including aerospace, rail infrastructure, operating systems, and cryptographic systems [1,2,3,4,5,11]. The challenge in scaling formal methods to general software has historically been cost: building machine-checked proofs traditionally required deep expertise and enormous engineering effort, limiting the use of formal methods to domains where correctness was absolutely critical. Today, however, the equation is beginning to change. Trusted code translation at Code Metal At Code Metal, this is the central problem we are trying to solve: building AI-driven code translation systems for domains where correctness is non-negotiable. Whether translating CUDA kernels into OpenCL, converting M files into VHDL, or modernizing legacy C++ into Rust, the challenge is not merely generating code that compiles, but proving that the translated system preserves the behavior and guarantees of the original one. This is exactly the kind of setting where formal methods can provide uniquely strong guarantees. The companies we work with in aerospace, defense, semiconductors, and automotive operate under strict correctness requirements where failures can lead to recalls, certification issues, security vulnerabilities, or mission failure. In these settings, checking that the code compiles is not enough. The real question is: Does the translated code preserve the behavior and guarantees of the original one? That is a fundamentally difficult problem. Two programs written in different languages and targeting different architectures will almost never be identical internally. A Rust implementation and its C++ counterpart may differ substantially in memory layout, optimization strategies, execution order, and low-level behavior. What matters instead is whether they agree at the level of observable behavior: Do the same inputs produce the same outputs? Are the same invariants preserved? Are the same safety and compliance guarantees maintained? Does the translated system preserve the intended semantics of the original? Code Metal’s approach uses a full spectrum of software-assurance techniques throughout the translation pipeline. Lightweight validation techniques such as differential and property-based testing, type and compatibility analysis, and static analysis continuously guide and validate the AI system during generation, helping uncover behavioral discrepancies, semantic mismatches, vulnerabilities, and undefined behavior early in the process. At the stronger end of the spectrum, formal verification is used whenever possible to establish rigorous guarantees about behavioral equivalence between the original and translated systems. Validation and verification are integrated throughout the entire pipeline rather than treated as a final post-processing step. Source programs are analyzed, translated, continuously checked for semantic consistency, and optimized for the target hardware architecture while preserving intended behavior. The goal is not merely to generate code quickly, but to generate software that can be trusted.

Why now: AI and formal methods Historically, formal methods were powerful but difficult to scale. Building machine-checked proofs often required years of specialized work. Systems such as the formally verified seL4 microkernel [6] and the CompCert verified C compiler [7] demonstrated that full...

AI-generated code that works – and proves it

Related Articles

Elevated error rates on requests to multiple models

Donald Trump and sons to be 'forever' exempt from tax audits

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self- Play

Old Reddit Is Down

The ultimate female fantasy – A feminist critique of Beauty and the Beast