BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution


91 problems in LiveCodeBench-Plus

−41.3 pts avg Pass@1 drop on evolved Hard split

30 algorithm categories (up from 19 in seeds)

+8.7 Pass@1 from RL self-improvement

The saturation problem

Frontier models solve almost everything. Static benchmarks have stopped telling models apart, and they have stopped providing useful training signal.

On LiveCodeBench, state-of-the-art models exceed 99% Pass@1 on the newest easy split and over 90% on average. Building new, sufficiently hard datasets by hand is slow and expensive — a bottleneck for continued progress.

The key inversion: evolve the solution, not the statement

Most benchmark generation methods are problem-centric: they start by writing a new task and hope it requires new reasoning. In practice, this often produces surface-level variants of existing problems, while still relying on increasingly strong models to solve and validate them. BenchEvolver flips the direction. We evolve solutions first, then derive tasks from them. Because the reasoning structure changes before the problem statement is written, the resulting benchmarks impose genuinely new algorithmic demands while retaining executable ground truth by construction.

🧬

Generate in solution space

Mutate the reference solution to force a dominant algorithmic lift, then derive the statement and tests around the evolved, executable solution.

Verify by consistency

Brute-force triangulation and statement-faithfulness checks, rather than a single LLM judge, ensure that statement, solution, and tests define the same task.

📉

Select by real failure

Difficulty is measured, not assigned: a candidate is accepted only if a panel of target models empirically fails more often on it than on the seed.
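The acceptance rule can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the function names, the `margin` parameter, and the model names are hypothetical, and averaging failure rates over the panel is an assumed aggregation.

```python
from statistics import mean
from typing import Mapping


def failure_rate(pass_at_1: Mapping[str, float]) -> float:
    """Average failure rate (1 - Pass@1) over a panel of target models."""
    return 1.0 - mean(pass_at_1.values())


def accept_by_difficulty(
    seed_pass: Mapping[str, float],
    candidate_pass: Mapping[str, float],
    margin: float = 0.0,  # hypothetical knob: required extra failure over the seed
) -> bool:
    """Accept only if the panel fails measurably more on the candidate."""
    return failure_rate(candidate_pass) > failure_rate(seed_pass) + margin


# Hypothetical panel: the seed is solved 8/8 times, the candidate 4/8 times.
seed = {"model_a": 8 / 8, "model_b": 8 / 8}
candidate = {"model_a": 4 / 8, "model_b": 4 / 8}
accepted = accept_by_difficulty(seed, candidate)
```

The point of the sketch is that difficulty enters only through measured pass rates; nothing in the rule depends on a label assigned by the proposer.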

See it in action

One mutation, a whole new algorithm

The surface story stays familiar; the underlying computation jumps to a different regime. The same solution-centric principle works across two very different coding domains.

Example 1: Competitive programming (LiveCodeBench)

Seed · Pass@1 8/8 Copy Arrays

Count arrays whose adjacent differences match the original and whose entries satisfy per-index bounds.

one unknown: copy[i] = original[i] + d
bounds: u_i ≤ copy[i] ≤ v_i
solve: intersect intervals for d → O(N)

ALGORITHMIC LIFT

Evolved · Pass@1 4/8 XOR-Linked Sequence

Now adjacent XORs must match. The feasible sets are no longer contiguous — interval intersection fails.

one unknown: copy[i] = x XOR p_i
bounds: u_i ≤ x XOR p_i ≤ v_i
solve: XOR sets non-contiguous → digit-DP, O(N·bits)

Why it is harder: the seed is solved by an O(N) interval intersection over one free variable. Switching addition to XOR turns the constraints into u_i ≤ x ⊕ p_i ≤ v_i, whose solution sets are non-contiguous, so solving them requires a bitwise digit-DP or trie. The parent's shortcut is provably insufficient.
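To make the gap concrete, here is a hedged sketch of both solvers. It assumes non-negative integers, takes x = copy[0], and sets p_i = original[0] XOR original[i]. The function names are illustrative. For the evolved task the sketch swaps in a different technique from the digit-DP named above: it splits each range [u_i, v_i] into aligned power-of-two blocks (the same structure a binary trie exposes), maps each block through XOR, and sweeps over the block endpoints. That costs O(N·bits·log(N·bits)) rather than O(N·bits).

```python
from typing import List, Sequence, Tuple

Bounds = Sequence[Tuple[int, int]]


def count_copy_arrays(original: Sequence[int], bounds: Bounds) -> int:
    """Seed: copy[i] = original[i] + d, so intersect the intervals for d."""
    lo = max(u - a for a, (u, _) in zip(original, bounds))
    hi = min(v - a for a, (_, v) in zip(original, bounds))
    return max(0, hi - lo + 1)


def dyadic_blocks(lo: int, hi: int) -> List[Tuple[int, int]]:
    """Split [lo, hi] into aligned blocks (start, k) covering 2**k values each."""
    blocks = []
    while lo <= hi:
        k = 0
        # Grow the block while it stays aligned at lo and inside [lo, hi].
        while lo % (1 << (k + 1)) == 0 and lo + (1 << (k + 1)) - 1 <= hi:
            k += 1
        blocks.append((lo, k))
        lo += 1 << k
    return blocks


def count_xor_linked(original: Sequence[int], bounds: Bounds) -> int:
    """Evolved: count x with u_i <= x ^ p_i <= v_i for every i."""
    n = len(original)
    events = []
    for a, (u, v) in zip(original, bounds):
        p = original[0] ^ a
        for start, k in dyadic_blocks(max(u, 0), v):
            # XOR with p maps an aligned block onto another aligned block.
            x_start = ((start >> k) ^ (p >> k)) << k
            events.append((x_start, 1))
            events.append((x_start + (1 << k), -1))
    events.sort()
    total = cover = prev = 0
    for pos, delta in events:
        if cover == n:  # every constraint covers the stretch [prev, pos)
            total += pos - prev
        cover += delta
        prev = pos
    return total


original = [1, 2, 3, 4]
bounds = [(1, 2), (2, 3), (3, 4), (4, 5)]
seed_count = count_copy_arrays(original, bounds)
evolved_count = count_xor_linked(original, bounds)
```

Within one constraint the mapped blocks are disjoint, so a coverage count of N means x satisfies all N constraints. The seed solver never needs this machinery because addition preserves contiguity.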

Example 2: Scientific coding (SciCode)

Seed · forward simulation RK4 Integrator

Implement a classical fourth-order Runge–Kutta integrator for a driven damped pendulum, returning the full state-space trajectory.

# given f, state, dt, n
...
runge_kutta_4th_order(...)  # integrate forward → trajectory

ALGORITHMIC LIFT

Evolved · inverse problem Fit ODE Trajectory (Gauss–Newton)

Estimate the unknown initial state and ODE parameters from sparse observations — turning integration into a full nonlinear solver.

# RK4 forward sim, then:
damped Gauss-Newton
+ finite-diff Jacobian
+ backtracking line search

Why it is harder: the seed performs a single forward simulation of a known ODE. The evolved task inverts it — recovering unknown initial conditions and parameters from noisy, sparsely sampled observations. This requires repeated RK4 simulation inside a damped Gauss–Newton loop with finite-difference Jacobians and a backtracking line search, a qualitatively harder numerical-optimization pipeline.
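The shape of the evolved pipeline can be illustrated on a deliberately small stand-in. The sketch below is an assumption-laden toy, not the SciCode task: it replaces the driven damped pendulum with the scalar ODE y' = -k·y, uses noise-free observations, and fits only two unknowns (y0 and k). All function names, the step size `h`, and the small `ridge` term that keeps the 2×2 normal equations invertible are hypothetical choices.

```python
from typing import List, Sequence


def rk4_decay(y0: float, k: float, dt: float, n: int) -> List[float]:
    """Classical RK4 on the toy ODE y' = -k * y; returns [y_0, ..., y_n]."""
    f = lambda y: -k * y
    ys = [y0]
    for _ in range(n):
        y = ys[-1]
        k1 = f(y)
        k2 = f(y + 0.5 * dt * k1)
        k3 = f(y + 0.5 * dt * k2)
        k4 = f(y + dt * k3)
        ys.append(y + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6)
    return ys


def residuals(p: Sequence[float], steps, obs, dt: float) -> List[float]:
    """Simulated minus observed values at the observed step indices."""
    traj = rk4_decay(p[0], p[1], dt, max(steps))
    return [traj[s] - o for s, o in zip(steps, obs)]


def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))


def fit_decay(steps, obs, dt, init, iters=50, h=1e-6, ridge=1e-9):
    """Gauss-Newton with a finite-difference Jacobian and backtracking."""
    p = list(init)
    for _ in range(iters):
        r = residuals(p, steps, obs, dt)
        cost = dot(r, r)
        if cost < 1e-24:
            break
        # Forward finite-difference Jacobian, one column per parameter.
        cols = []
        for j in range(2):
            q = list(p)
            q[j] += h
            rq = residuals(q, steps, obs, dt)
            cols.append([(a - b) / h for a, b in zip(rq, r)])
        # Solve (J^T J + ridge * I) step = -J^T r by Cramer's rule.
        a11 = dot(cols[0], cols[0]) + ridge
        a12 = dot(cols[0], cols[1])
        a22 = dot(cols[1], cols[1]) + ridge
        g1, g2 = dot(cols[0], r), dot(cols[1], r)
        det = a11 * a22 - a12 * a12
        step = [(-g1 * a22 + g2 * a12) / det, (-g2 * a11 + g1 * a12) / det]
        # Backtracking line search: halve the step until the cost decreases.
        t = 1.0
        while t > 1e-10:
            cand = [p[0] + t * step[0], p[1] + t * step[1]]
            rc = residuals(cand, steps, obs, dt)
            if dot(rc, rc) < cost:
                p = cand
                break
            t *= 0.5
        else:
            break  # no improving step found
    return p


# Sparse observations generated from known ground truth (y0 = 2.0, k = 0.5).
steps = [0, 5, 10, 20, 40]
truth = rk4_decay(2.0, 0.5, 0.1, 40)
obs = [truth[s] for s in steps]
y0_hat, k_hat = fit_decay(steps, obs, 0.1, init=[1.0, 1.0])
```

Even in this toy, the forward integrator becomes an inner routine that is called several times per iteration, which is the structural jump the evolved task demands.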

The framework

A closed loop: Proposer → Evaluator → Memory

A Proposer evolves solutions and writes tasks; an Evaluator validates and measures empirical difficulty; a Memory module feeds accepted lineages and past failures back into search — turning repeated sampling into adaptive evolution.

Overview of BenchEvolver. A saturated seed task is mutated in solution space; the evaluator filters candidates for validity, diversity, and difficulty; memory records outcomes with reasons; accepted candidates become new parents.
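The control flow of the loop can be sketched as follows. This is a schematic under stated assumptions: `propose` and `evaluate` are hypothetical callables standing in for the Proposer and Evaluator, tasks are opaque values, and memory is reduced to a flat list of outcomes with reasons.

```python
from typing import Any, Callable, List, Tuple

Outcome = Tuple[Any, bool, str]  # (candidate, accepted, reason)


def evolve(
    seed: Any,
    propose: Callable[[Any, List[Outcome]], Any],
    evaluate: Callable[[Any, Any], Tuple[bool, str]],
    rounds: int,
) -> Tuple[List[Any], List[Outcome]]:
    """Run the Proposer -> Evaluator -> Memory loop for a fixed budget."""
    parents = [seed]
    memory: List[Outcome] = []
    for _ in range(rounds):
        parent = parents[-1]
        # The proposer sees past outcomes, so search adapts to earlier failures.
        candidate = propose(parent, memory)
        accepted, reason = evaluate(candidate, parent)
        memory.append((candidate, accepted, reason))
        if accepted:
            parents.append(candidate)  # accepted candidates become new parents
    return parents, memory


# Toy stand-ins: a "task" is an integer difficulty level.
def toy_propose(parent: int, memory: List[Outcome]) -> int:
    return parent + len(memory) % 2


def toy_evaluate(candidate: int, parent: int) -> Tuple[bool, str]:
    harder = candidate > parent
    return harder, "harder than parent" if harder else "not harder"


lineage, log = evolve(0, toy_propose, toy_evaluate, rounds=4)
```

Rejected candidates are recorded alongside accepted ones, since the reasons for failure are what the proposer conditions on in later rounds.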

🛠️

Proposer

Mutates the parent solution into a structurally different one, then derives a natural statement, public examples, and tiered hidden tests — all anchored by executing the evolved reference.

⚖️

Evaluator

Triangulates the reference, a brute-force solver, and a statement-only oracle to catch inconsistencies; runs bounded repair; then accepts only if the target panel empirically fails more.
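The triangulation step might look like the sketch below. The three solvers are passed in as plain callables; how the statement-only oracle is actually produced is not specified here, so it is treated as a black box. The function name and the toy task (sum of 1..n) are hypothetical.

```python
from typing import Any, Callable, Iterable, List, Tuple

Solver = Callable[[Any], Any]


def triangulate(
    reference: Solver,
    brute_force: Solver,
    statement_oracle: Solver,
    small_inputs: Iterable[Any],
) -> List[Tuple[Any, Tuple[Any, Any, Any]]]:
    """Return every input on which the three solvers do not all agree."""
    disagreements = []
    for x in small_inputs:
        outputs = (reference(x), brute_force(x), statement_oracle(x))
        if not (outputs[0] == outputs[1] == outputs[2]):
            disagreements.append((x, outputs))
    return disagreements


# Toy task: sum of 1..n. The oracle misreads the statement as "sum of 0..n-1".
reference = lambda n: n * (n + 1) // 2
brute_force = lambda n: sum(range(1, n + 1))
misread_oracle = lambda n: sum(range(n))

consistent = triangulate(reference, brute_force, reference, range(6))
inconsistent = triangulate(reference, brute_force, misread_oracle, range(6))
```

An empty result says only that the three views agree on the inputs tried; a non-empty one pinpoints inputs to hand to the bounded repair step.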

🧠

Memory

Local memory tracks each seed's lineage and error patterns; global memory enforces diversity across seeds — a family that already succeeded must clear a higher...
