LEVI: Stronger Search Architectures Can Substitute for Larger LLMs | Temoor Tanveer
A harness-first evolutionary framework for code and prompt optimization. Better scores than frontier-model runs of GEPA, OpenEvolve, ShinkaEvolve, AdaEvolve, and EvoX, at 3.3–6.7× lower cost.
June 2026
Most LLM-guided evolutionary systems get their results the expensive way: by pointing a frontier model at the problem and burning through hundreds or thousands of costly calls. LEVI takes the opposite bet: a stronger search architecture can substitute for a larger model and drastically reduce cost. Fix the harness so that the archive, rather than the model, is what preserves diverse solutions; route each mutation to the model that actually fits the job; and stop re-scoring redundant examples. Strong results then follow even from small open-source models at a fraction of the budget.
This holds across two very different settings:
Code optimization. On systems-research benchmarks, LEVI beats the best published frontier-budget runs of GEPA, OpenEvolve, ShinkaEvolve, AdaEvolve, and EvoX on six of seven problems, at 3.3–6.7× lower cost.
Prompt optimization. On four GEPA-suite benchmarks, LEVI matches or exceeds GEPA using less than half the rollouts.
LEVI reaches higher final performance using a fraction of the budget. Left: on Transaction Scheduling code optimization, LEVI exceeds every baseline's final score within the first ~50 evaluations (≈15× sample efficiency). Right: on HotpotQA prompt optimization with Qwen3-8B, LEVI outperforms GEPA with fewer than half the rollouts (~2.75K vs ~6.87K).
Why this is a harness problem, not a model problem
Frontier-model dependence is largely an artifact of how existing frameworks allocate search, not a fundamental requirement.
LLM-guided evolution pairs a language model with a search loop: the user provides a problem and a scoring function, the LLM mutates candidate solutions, an evaluator scores them, and a solution database keeps the promising ones around. The paradigm was introduced by FunSearch (Romera-Paredes et al., 2024), which produced novel mathematical results, including the cap-set improvement, without a frontier-scale model. It was scaled up by AlphaEvolve (Novikov et al., 2025), which extended the loop to stronger LLMs and larger codebases, and it now spans math, code optimization, heuristic design, systems research, and prompt optimization.
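To make the loop concrete, here is a minimal sketch of the generic mutate, score, and keep cycle described above. It is a toy illustration, not LEVI's implementation: the candidate is a single float, `mutate` is a hypothetical stand-in for the LLM call (a random perturbation), and the top-`keep` list is an assumed, deliberately naive solution database.

```python
import random


def score(candidate: float) -> float:
    """Toy stand-in for the user-supplied scoring function (higher is better)."""
    return -(candidate - 3.0) ** 2


def mutate(parent: float, rng: random.Random) -> float:
    """Hypothetical placeholder for the LLM mutation call: a random perturbation."""
    return parent + rng.gauss(0.0, 0.5)


def evolve(seed: float, steps: int, keep: int = 8, rng_seed: int = 0):
    """Run the generic loop and return the best (score, candidate) pair."""
    rng = random.Random(rng_seed)
    database = [(score(seed), seed)]            # naive database of (score, candidate)
    for _ in range(steps):
        _, parent = rng.choice(database)        # draw a parent from the database
        child = mutate(parent, rng)             # the "LLM" proposes a variant
        database.append((score(child), child))  # the evaluator scores it
        database.sort(reverse=True)             # keep only the promising ones
        del database[keep:]
    return database[0]


best_score, best = evolve(seed=0.0, steps=200)
print(f"best candidate {best:.3f} with score {best_score:.4f}")
```

A top-k database like this one is exactly the kind of archive that tends to collapse into a single basin, since nothing in it rewards being different from what is already stored.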
The catch is cost. A single run can require thousands of calls to expensive frontier models, and reported systems-research runs often cost $15–30 per problem on GPT-5 or Gemini 3.0 Pro. That price raises the barrier to entry and slows iteration. But it is not obviously inherent to the paradigm; FunSearch produced its results without a frontier model at all. We argue the cost comes from over-relying on larger models while under-investing in the search architecture, and it shows up along three separate axes:
Per-evaluation. When the archive fails to preserve diversity, the search collapses into a single basin and leans on a strong model to generate escapes. Existing frameworks patch this after the fact with islands, embedding-based novelty filters, or LLM judges, each compensating for convergence rather than preventing it.
Per-dollar. Mutation calls are treated uniformly, so frontier-model prices get paid even for local edits a small model could handle.
Per-rollout. Every candidate is re-scored on the full validation set, spending rollouts on redundant examples, which is especially painful in prompt optimization.
LEVI fixes all three. Rather than building the harness around the assumption of a strong model, we ask what the search architecture should look like when the budget is limited.
LEVI
Three components, one per cost axis: a diversity-preserving solution database, a role-aware mutation router, and a rank-preserving proxy benchmark.
LEVI uses one bootstrapped seed pass to initialize solution families, calibrate the CVT-MAP-Elites database, and build the proxy benchmark. During search, the archive feeds an asynchronous mutation–evaluation loop: most mutations use a small LLM for local refinement, while periodic paradigm shifts use a stronger LLM to propose structurally new candidates.
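The split between cheap local refinement and occasional paradigm shifts can be sketched as a small routing function. This is a sketch under stated assumptions, not LEVI's router: the text only says that paradigm shifts are periodic, so the fixed `shift_every` period, the model names `small-llm` and `strong-llm`, and the `MutationRequest` type are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MutationRequest:
    role: str    # "refine" or "paradigm_shift"
    model: str   # model tier that should handle the request


def route(step: int, shift_every: int = 20,
          small_model: str = "small-llm",
          strong_model: str = "strong-llm") -> MutationRequest:
    """Send most mutations to the small model; every `shift_every` steps
    (an assumed schedule), request a structural change from the strong model."""
    if shift_every <= 0:
        raise ValueError("shift_every must be positive")
    if step > 0 and step % shift_every == 0:
        return MutationRequest("paradigm_shift", strong_model)
    return MutationRequest("refine", small_model)


requests = [route(step) for step in range(100)]
strong_calls = sum(r.model == "strong-llm" for r in requests)
print(f"{strong_calls} of {len(requests)} mutations used the strong model")
```

Under this assumed schedule only a small share of calls is billed at the stronger model's price, which is the effect the per-dollar axis is after.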
LEVI follows an asynchronous AlphaEvolve-style loop. The solution database stores evaluated candidates, samplers draw parents from it, the router sends mutation requests to the appropriate model, and resulting candidates are evaluated in parallel before being inserted back. Parents are sampled with a softmax over scores, with worker-specific temperatures balancing exploration and exploitation across the parallel pool. The three pieces below are best read as extensions of each other: the archive provides the structure that makes principled routing possible, and principled routing is what makes a diversity-preserving archive...
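The softmax parent sampling mentioned above can be illustrated with a short sketch. The scores and the per-worker temperatures below are hypothetical values chosen for illustration, and the helper names are not taken from LEVI. The point is only that a low temperature concentrates sampling on the best candidates (exploitation), while a high temperature spreads it across the archive (exploration).

```python
import math
import random


def softmax_weights(scores, temperature):
    """Softmax over scores at the given temperature (numerically stabilized)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_parent(scores, temperature, rng):
    """Return the index of a parent drawn with softmax probabilities."""
    weights = softmax_weights(scores, temperature)
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


scores = [0.2, 0.5, 0.9]                 # hypothetical candidate scores
worker_temperatures = [0.05, 0.5, 5.0]   # hypothetical: one temperature per worker
rng = random.Random(0)

for temperature in worker_temperatures:
    weights = softmax_weights(scores, temperature)
    parent = sample_parent(scores, temperature, rng)
    print(temperature, [round(w, 3) for w in weights], "sampled index:", parent)
```

Giving each worker in the parallel pool its own temperature means some workers keep refining the current best while others keep drawing from weaker but different parents.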