Vericoding: The End of "Trust Me Bro, The AI Wrote It".
Sign in
Subscribe
Vibe coding won. That debate is over. GitHub says 46% of all new code is AI-generated. Among Y Combinator's Winter 2025 cohort, 21% of startups have codebases that are 91% or more AI-generated. Gartner forecasts 60% of all new software code will be AI-generated by end of 2026. The machines are writing the code. Congratulations.<br>Now the bad news.<br>AI co-authored code has 2.74x more security vulnerabilities than human-written code. 45% of it fails security tests. Vibe-coded projects accumulate technical debt 3x faster. METR ran a randomized controlled trial with experienced developers and found they were 19% slower with AI tools, while believing they were 24% faster. 95% of developers report feeling productive while measurably producing lower-quality code.<br>AWS themselves have said it plainly: review capacity, not developer output, is the bottleneck in 2026. We can generate code at machine speed. We cannot verify it at machine speed. That asymmetry is the whole problem.<br>And it's already exploding. The Tea App breach exposed 72,000 government IDs because a database was just… open. The Orchids platform had a zero-click vulnerability that sat unfixed for months. Security experts are predicting "catastrophic explosions" as vibe-coded applications hit production at scale. Not might happen. Are happening.<br>The software industry automated code generation but left verification in 2019. That's like building a car factory that runs at 10,000 units per hour and staffing the quality inspection line with two guys and a clipboard.<br>Vibe Coding vs. Vericoding<br>The term vericoding was introduced by researchers at the Beneficial AI Foundation in late 2025. Where vibe coding generates potentially buggy code from natural language descriptions, vericoding generates formally verified code from formal specifications. Code that is mathematically proven correct. Not tested. Not probably correct. Proven.<br>The difference is not incremental. It is categorical.<br>A test suite says: "I tried 1,000 inputs and none of them broke." Formal verification says: "I have a mathematical proof that no input, out of the infinite space of all possible inputs, can break this." Tests demonstrate the presence of bugs. Proofs demonstrate their absence.<br>The vericoding benchmark tested 12,504 formal specifications across three verification languages: Dafny, Verus/Rust, and Lean. Using off-the-shelf LLMs with no special training or fine-tuning, vericoding success rates hit 82% in Dafny. Pure Dafny verification improved from 68% to 96% over the past year. The curve is steep and accelerating.<br>Dafny is a verification-aware programming language created by K. Rustan M. Leino, now at AWS's Automated Reasoning Group. It compiles through Boogie into SMT-LIB2 queries that are checked by Z3, an SMT solver. If Z3 says UNSAT, your code is proven correct against its specification. Not probably. Proven.<br>AWS uses this exact stack to verify their authorization engine, the one handling over a billion API calls per second. They verified Cedar, their authorization policy language, in Dafny and proved it satisfies security properties with mathematical certainty. Then they used differential testing against quadrillions of production authorizations to confirm the Rust implementation matches. The result? A 65% performance improvement and provable security.<br>This is not academic technology looking for a problem. This is production infrastructure running at planetary scale.<br>The Gap Everyone Sees, Nobody Has Closed<br>Every vericoding system today, including the research from AWS's own DafnyPro and ATLAS teams, assumes you already have a formal specification. Someone has to write:<br>requires amount = 0That someone is currently a verification expert. There are maybe a few thousand people on Earth who can write Dafny specifications fluently. That's not a market, that's a mailing list!<br>The pipeline academia is building looks like this:<br>Formal Spec → LLM → Verified Code → Production<br>DafnyPro hits 86% on this. ATLAS generates thousands of verified programs. The vericoding benchmark shows 82% success. All impressive. All starting from a formal spec that a human expert already wrote.<br>The pipeline we have starts earlier, where the actual problem is:<br>Natural Language → Formal Spec → Human Review → LLM → Verified Code → Production<br>That first step, natural language to formal spec, is the reason vericoding hasn't become a product. It's the hardest translation, the one the researchers explicitly scope out of their papers. The vericoding benchmark authors said it directly: "We acknowledge that spec generation is an important problem, but focus on the task of generating implementations and formal proofs in this work."<br>We're building the part they left out. And we have an unfair advantage.<br>We Already Do This. It's Called PreFlight.<br>If you've been following this blog, you know ICME's PreFlight. You write a guardrail policy in plain...