Formal Verification Benchmarks Are the Key to Ironclad Software Infrastructure

Galois - Formal Verification Benchmarks Are the Key to Ironclad Software Infrastructure

Tools

News & Insights

About Us

Get Started

GET IN TOUCH We take pride in personally connecting with all interested partners, collaborators and potential clients. Please email us with a brief description of how you would like to be connected with Galois and we will do our best to respond within one business day.

Email contact@galois.com PHONE 503.626.6616

Formal Verification Benchmarks Are the Key to Ironclad Software Infrastructure

Santiago Cuéllar and Chris Liberatore June 9, 2026

Formal verification is the gold standard for software correctness, and we argue that AI could finally make it practical at industrial scale. Seriously. We are inviting you to dream of a world where critical infrastructure ships with ironclad guarantees of safety. Where we can patch our buggy systems once and for all. Where even vibe code can be verified and trusted. But are we actually moving in that direction? If so, what is missing—and how do we accelerate the process to ensure we get there? For all their incredible utility, modern LLMs are spectacular mimics. Having ingested the entire internet—including my old grad school Rocq spaghetti proofs—they appear highly capable of synthesizing complex mathematics. And yet, LLMs are notorious for hallucinating or lying. How do we have confidence that AIs are reasoning about code, not just parrotting things they have memorized? To drive them towards the future we envision, we must understand their capabilities and failures. The foundation that makes the verification revolution possible is something surprisingly mundane: measurement – building benchmarks that capture the realities of verification engineering, expose where models fail, and drive progress toward industrial-scale verification. But taming the AI models that enable scalable verification is a massive undertaking, and no single team can do it in a silo. We need a community-wide effort to build the rigorous tests that will actually push this field forward. We must design high quality formal verification benchmarks. Galois is leading the charge with two new projects—Metron and Praxis—measuring AI capability in reasoning about code, and how well it can use this reasoning to harden and upgrade existing codebases. Why Are Benchmarks Important? “To think is to forget differences, generalize, make abstractions.” - Jorge Luis Borges, “Funes, the Memorious” To confidently assess the correctness of AI output in FV applications, we need a way to measure and compare it against well-reasoned and agreed-upon benchmarks. Benchmarks tell us if LLMs generalize their reasoning abilities beyond memorization . We know these models generalize to some extent, but their massive pre-training data blurs the line between genuine reasoning and sophisticated regurgitation. How do we tell which is which? Recent systems have already demonstrated the ability to verify nontrivial cryptographic software and synthesize complex proofs, but we need more confidence that these capabilities will generalize to new domains, scale to industrial workflows, or robustly handle unfamiliar verification challenges. Benchmarks help us understand when and how LLMs fail . We mostly use AI as a black box. We don’t yet know which verification tasks models naturally master during training and what improves with scale. More importantly, when an AI gets stuck, we don’t know if the bottleneck is a fundamental flaw in its reasoning, shortcoming of its training, or just a workflow roadblock that better prompting and external tools could fix. Well-designed benchmarks provide these intermediary signals that indicate the boundary of current capabilities and help characterize failure modes. Moreover, benchmarks shape what frontier labs optimize for. Making rigorous verification matter and performance easily measurable is one of the clearest ways to steer the field toward a future where machine-checked proofs of software correctness become abundant, cheap, and woven into the fabric of everyday computing. What Makes a Good Benchmark? AI capabilities are evolving rapidly, and verification benchmarks must evolve with them. But the direction is already clear: the next generation of evaluations must satisfy three core requirements. Real-world tasks. Benchmarks are only useful if they measure the capabilities we actually care about. We want AI systems that can reason about the correctness of production-grade infrastructure, not just solve puzzles that don’t reflect reality. Accordingly, benchmark tasks should reflect the realities of deployable software and the kinds of verification problems that emerge in real engineering environments. Maps of capability . Pass/fail evaluations are useful (and using them was key to expanding capabilities of LLMs!), but they conceal the most important information: where reasoning breaks down. Just as there are many ways to get a question wrong, there are also...

Formal Verification Benchmarks Are the Key to Ironclad Software Infrastructure

Related Articles

US Government directive to suspend access to Fable 5 and Mythos 5

Is AI ruining our skills? Early results are in – and they're not good

The Anatomy of an AI-Native Org

Apertus – Open Foundation Model for Sovereign AI

Britain Became as Poor as Mississippi