Metamorphic testing with Lean4-verified mutations finds compiler miscompilations

Compiler Testing — Part 2: Metamorphic Testing with Verified Identities | nowarp

Skip to main content<br>Miscompilations and semantic drifts are bugs where the compiler accepts a valid program and silently emits code that behaves differently. On smart-contract platforms this is the bug class that can drain user funds. The post covers differential testing, metamorphic mutations and their sound synthesis — each mutation is proven equivalence-preserving in the developed framework written in Lean4. We will also discuss some of the miscompilations found , some of which resulted in bug bounties.

This post may be useful to you if you:

Develop, maintain, or test a programming language, especially one targeting smart contracts

Do security audits on new compilers or runtime environments

Do security audits of smart contracts and want to include additional checks on compiler correctness that may be an additional attack vector

Own a smart contract project and want to understand the compiler-level attack surface

The post is organized as follows:

Background — scope, introduction of miscompilation errors and oracles used

Verified mutations — the executable semantics, what makes a mutation sound, and the hand-written corpus of identities by category

Synthesizing mutations — automatically synthesizing sound mutations with Lean feedback

Testing harness — program generation and mutation, input generation

Findings – some real-world findings found using this approach, including bug bounty findings

Conclusion — summary and further work

If you're experienced, here's the whole approach in essence; dive into the interesting parts later:

Build a Lean4 eDSL implementing a subset of the target language's semantics.

Hand-write textbook identities and prove them sound in that semantics — we will consider arithmetic over the 256-bit word ring, linear/poly MBA, some QF_BV, and ad-hoc identities that hit common cryptographic builtins and peephole optimizers.

Use the executable semantics to synthesize more sound mutations (bounded, with ad-hoc filters): generator + bottom-up search + Lean kernel as an oracle — simpler than counterexample-based approaches.

Run the fuzzing pipeline, leveraging both differential and metamorphic testing.

It is not that hard to build, and — implemented properly — it does find bugs in corner cases that other fuzz-testing tools miss.

Background

The previous post targeted compiler crashes and quality issues only.

This part targets more tricky bugs: semantic drifts and miscompilations. This means that the implementation of the smart-contract/program is correct and matches the specification, but because of the bug in the compiler, the actual implementation behaves differently. The targeted bugs may be located in the compiler, corresponding tooling like linkers and the runtime environment.

Miscompilation bugs

The most popular early article on incorrect compilation is "Reflections on Trusting Trust" by Ken Thompson in 1984 [1]. While the whole idea has been known for years, these bugs are better known nowadays in Web3, where they can easily affect smart contracts with real money.

Miscompilations are invisible to smart-contract auditors – the source code is correct, but the actual implementation diverges because of the bug in the trusted base. Worse, they affect all the contracts uploaded on-chain — any contract containing the specific pattern may misbehave or just imply some interesting behavior.

The only exploited bugs so far are the Vyper non-reentrancy hack ($69.3M) and the Cetus hack ($213M) – not in the compiler but related to unexpected semantics in the math library. Other bugs could have caused losses but were fixed before being exploited, such as the TSTORE-poison codegen bug in solc and the billion-dollar Move verifier bug.

That's why compiler bugs are worth hunting.

Oracles

The main problem when hunting miscompilations is how to detect when the program miscompiles. When working with ICE, you get the oracle for free: if the compiler panicked, you got a bug. With miscompilations you have to understand how the compiled program should behave and what the correct behavior is.

In fuzzing, this problem is solved by the oracle: a part of the tool that decides whether a program's output is correct. There are different approaches, but we will consider the ones we use: differential testing and metamorphic testing.

Differential testing

The key idea of differential testing: execute another implementation of SUT with the same inputs and compare the results.

For compiler testing we could use different strategies, to name some:

Different implementation of the same language for a different platform (e.g. resolc for Polkadot and solc for Ethereum)

The same compiler with different options (e.g. different optimization passes enabled)

Different underlying platform (e.g. solc-compiled EVM bytecode may be executed on evmone and revm)

Emit semantically equivalent code for similar languages and...

Metamorphic testing with Lean4-verified mutations finds compiler miscompilations

Related Articles

Is AI ruining our skills? Early results are in – and they're not good

The Anatomy of an AI-Native Org

Apertus – Open Foundation Model for Sovereign AI

How to Earn a Billion Dollars

Italy's Meloni says Trump 'made up' story that she 'begged' him for photo at G7