VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

Introduction

VeriEvol asks a simple scaling question: are harder multimodal math prompts enough on their own, or must their answers also be verifiably reliable?

Why Verification Matters

In visual mathematical reasoning, wrong answers become especially damaging at the RL stage: every rollout can repeatedly convert a noisy label into reward signal. VeriEvol moves answer reliability into the data-construction stage, before any policy update.

Two Independent Axes

Prompt difficulty scales through route-specific evolution operators, while answer reliability scales through offline hypothesis-test falsification. This decoupling makes each part extensible and auditable.

RL-Compatible Outputs

Accepted examples are materialized as standard prompt-answer-reward tuples, so the resulting VeriEvol-RL data can plug into existing GRPO-style recipes without changing the optimizer.
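As a rough illustration of what a prompt-answer-reward tuple could look like in a GRPO-style loop, the sketch below pairs a verified answer with a rule-based reward and normalizes rewards within one rollout group. The `RLExample` fields, the exact-match reward, and the use of the population standard deviation are assumptions made for this example, not the paper's released schema.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass(frozen=True)
class RLExample:
    # Hypothetical field names; the released data format may differ.
    image_path: str
    prompt: str
    answer: str  # reference answer that passed verification


def exact_match_reward(example: RLExample, response: str) -> float:
    """Return 1.0 when the normalized response equals the reference answer."""
    return 1.0 if response.strip().lower() == example.answer.strip().lower() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


example = RLExample("triangle.png", "What is the area of the shaded triangle?", "12")
rollouts = ["12", "10", " 12 ", "15"]  # four sampled responses for one prompt
rewards = [exact_match_reward(example, r) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

Because the advantage depends only on rewards computed against the stored answer, a wrong reference answer would flip the sign of the signal for every rollout in the group, which is why reliability is settled before training.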

Method

The framework turns low-difficulty image-question seeds into verified training samples through prompt difficulty control, answer reliability control, and closed-loop refinement.

Type-aware evolution generates harder image-grounded prompts. HTV-Agent verifies candidate answers through solver hypotheses, refutation channels, a conflict-aware decider, and a deterministic gate. Accepted samples are reused for SFT, RL, curriculum prioritization, and seed refresh.

Prompt Difficulty Control

Task routing sends seeds to route-specific operators such as decomposition, rephrasing, constraint strengthening, multi-hop reasoning, inverse solving, and visual grounding.
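The routing step can be pictured as a lookup from a seed's task type to the operators allowed for it. In the minimal sketch below, the operator names come from the text above, while the route names (`geometry`, `chart`, `default`), the route-to-operator assignment, and the instruction template are hypothetical.

```python
from typing import Dict, List

# Hypothetical routing table; only the operator names are taken from the text.
ROUTE_OPERATORS: Dict[str, List[str]] = {
    "geometry": ["constraint strengthening", "inverse solving", "visual grounding"],
    "chart": ["multi-hop reasoning", "decomposition", "visual grounding"],
    "default": ["rephrasing", "decomposition"],
}


def route_seed(task_type: str) -> List[str]:
    """Return the evolution operators for a task type, falling back to the default route."""
    return ROUTE_OPERATORS.get(task_type, ROUTE_OPERATORS["default"])


def build_evolution_prompts(question: str, task_type: str) -> List[str]:
    """Build one rewrite instruction per operator on the seed's route."""
    return [
        f"Rewrite the question using {op}, keeping it answerable from the image: {question}"
        for op in route_seed(task_type)
    ]


prompts = build_evolution_prompts("What is the length of side AB?", "geometry")
```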

Answer Reliability Control

HTV-Agent accepts an answer only after multiple solver hypotheses and counter-evidence channels fail to refute it.
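One way to read this acceptance rule is as a deterministic gate over the outputs of earlier stages. The sketch below assumes solver hypotheses arrive as answer strings and each refutation channel reports a boolean; the `Verdict` type, the `min_agreement` threshold, and the order of checks are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Verdict:
    accepted: bool
    answer: Optional[str]
    reason: str


def gate(hypotheses: List[str], refuted: List[bool], min_agreement: float = 1.0) -> Verdict:
    """Accept only when solvers agree and no counter-evidence channel refutes the answer."""
    if not hypotheses:
        return Verdict(False, None, "no solver hypotheses")
    answer, votes = Counter(hypotheses).most_common(1)[0]
    if votes / len(hypotheses) < min_agreement:
        return Verdict(False, None, "solver hypotheses conflict")
    if any(refuted):
        return Verdict(False, None, "refuted by a counter-evidence channel")
    return Verdict(True, answer, "survived all refutation attempts")


accepted = gate(["12", "12", "12"], [False, False])
rejected = gate(["12", "12", "12"], [False, True])
```

Keeping the final decision as a pure function of recorded inputs means the same trace always yields the same verdict, which makes each accepted sample auditable after the fact.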

Closed-Loop Refinement

Rollout-based difficulty estimates prioritize useful samples and refresh the seed pool, keeping the construction loop focused on learnable challenges.
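A simple way to picture a rollout-based difficulty estimate is the pass rate of the current policy on each sample. The sketch below keeps samples whose pass rate falls inside a band and ranks them by closeness to 0.5; the band limits and the ranking rule are assumptions chosen for illustration.

```python
from typing import Dict, List


def pass_rate(rewards: List[float]) -> float:
    """Fraction of rollouts that earned the reward for one sample."""
    return sum(rewards) / len(rewards)


def prioritize(rollout_rewards: Dict[str, List[float]],
               low: float = 0.1, high: float = 0.9) -> List[str]:
    """Keep samples that are neither trivial nor unsolved, most informative first."""
    rates = {sid: pass_rate(r) for sid, r in rollout_rewards.items()}
    learnable = [sid for sid, p in rates.items() if low <= p <= high]
    return sorted(learnable, key=lambda sid: abs(rates[sid] - 0.5))


ranked = prioritize({
    "a": [1, 1, 1, 1],  # always solved: little left to learn
    "b": [1, 0, 0, 0],
    "c": [1, 1, 0, 0],
    "d": [0, 0, 0, 0],  # never solved: no positive signal yet
})
```

Samples outside the band produce identical rewards across a rollout group, so they contribute no group-relative signal in a GRPO-style update.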

Scaling And Verification

VeriEvol combines data-volume scaling with explicit verification, then validates the effect through training dynamics and verifier ablations.

Evolved prompts reach a higher terminal GRPO reward while sustaining higher policy entropy during training.

HTV-Agent improves raw single-call accuracy with complementary gains from self-consistency voting and programmatic checks.
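To make the two complementary signals concrete, the sketch below combines a majority vote over repeated judge calls with a programmatic numeric comparison. The function names, the strict-majority rule, and the choice to require both signals are assumptions for this example; the ablation itself may combine them differently.

```python
from fractions import Fraction
from typing import List


def numerically_equal(candidate: str, recomputed: str) -> bool:
    """Compare two answers as exact rationals, falling back to string equality."""
    try:
        return Fraction(candidate.strip()) == Fraction(recomputed.strip())
    except (ValueError, ZeroDivisionError):
        return candidate.strip() == recomputed.strip()


def majority_verdict(judgments: List[bool]) -> bool:
    """Self-consistency vote: accept only on a strict majority of 'correct' judgments."""
    return sum(judgments) * 2 > len(judgments)


def combined_verdict(candidate: str, recomputed: str, judgments: List[bool]) -> bool:
    """Require both the programmatic check and the vote to pass (an assumed policy)."""
    return numerically_equal(candidate, recomputed) and majority_verdict(judgments)


ok = combined_verdict("0.50", "1/2", [True, True, False])
```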

10K → 250K: Evolved SFT data raises average accuracy from 35.42 to 54.73.

10K → 130K: Verified RL data scales average accuracy from 54.52 to 59.12.

+3.88: Full VeriEvol gain over the un-evolved RL baseline.

+4.51 pp: Full HTV-Agent gain over raw single-call judging in the ablation.

Experiments

All internal rows share the Qwen2.5-VL-7B-Instruct backbone and evaluate on five held-out visual mathematical reasoning benchmarks.

Method                              MathVista Mini ↑  MathVision Mini ↑  MathVerse VO ↑  DynaMath Worst ↑  We-Math Strict ↑  Avg. ↑

External 7B-size baselines
OpenMMReasoner-7B                   79.50             43.60              63.80           34.90             53.81             55.12
ReVisual-R1-7B                      73.10             48.80              53.60           27.50             42.00             49.00
OVR-7B                              72.10             38.20              54.60           33.50             44.60             48.60
MMR1-Math-v0                        72.00             29.00              55.40           27.90             31.90             43.24
WeThink-7B                          70.90             27.20              44.70           24.40             48.00             43.04
VLAA-Thinker-7B                     68.00             26.40              48.20           22.40             41.50             41.30
VL-Rethinker-7B                     73.70             28.40              46.40           17.80             36.30             40.52
ThinkLite-VL-7B                     71.60             24.60              42.90           16.50             41.80             39.48
MM-Eureka-Qwen-7B                   72.60             32.10              45.40           23.00             21.80             38.98
InternVL3-8B                        70.50             28.60              33.90           23.00             37.50             38.70
Qwen2.5-VL-7B-Instruct              69.20             21.80              34.10           18.00             32.30             35.08

Supervised Fine-Tuning (ours)
Seed-only SFT                       74.10             36.16              65.02           26.96             56.76             51.80
VeriEvol-SFT                        76.60             39.80              67.01           30.94             59.33             54.73

Reinforcement Learning (ours)
RL-Origin                           77.00             41.45              69.04           30.54             58.19             55.24
RL-Evol                             78.10             45.39              69.67           32.14             60.00             57.06
RL-Evol + Verifier (full VeriEvol)  79.00             47.37              70.46           35.73             63.05             59.12

Full VeriEvol adds +3.88 average points over RL-Origin, including +1.82 from evolved prompts and +2.06 from HTV-Agent verification.
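The decomposition can be checked directly from the three RL rows. The short script below recomputes each average from the five benchmark scores reported above and then takes the differences; the variable names are only for illustration.

```python
# Per-benchmark scores copied from the RL rows of the results table, in column order:
# MathVista Mini, MathVision Mini, MathVerse VO, DynaMath Worst, We-Math Strict.
rl_rows = {
    "RL-Origin": [77.00, 41.45, 69.04, 30.54, 58.19],
    "RL-Evol": [78.10, 45.39, 69.67, 32.14, 60.00],
    "RL-Evol + Verifier": [79.00, 47.37, 70.46, 35.73, 63.05],
}

# Average each row and round to the two decimals used in the table.
avg = {name: round(sum(scores) / len(scores), 2) for name, scores in rl_rows.items()}

evolution_gain = round(avg["RL-Evol"] - avg["RL-Origin"], 2)
verifier_gain = round(avg["RL-Evol + Verifier"] - avg["RL-Evol"], 2)
total_gain = round(avg["RL-Evol + Verifier"] - avg["RL-Origin"], 2)

print(avg)
print(evolution_gain, verifier_gain, total_gain)
```

The recomputed averages match the Avg. column, and the two stage-wise gains sum to the total.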

Release

The paper describes the release of prompts, data, models, code, and the full verifier trace of every accepted sample. Public links were not included in the provided PDF, so this page keeps the paper available and reserves release slots for the upcoming artifacts.

Paper PDF
GitHub Repository
Data and verifier traces (pending URL)
Model checkpoints (pending URL)

Citation

@article{li2026verievol,
  title={VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct},
  author={Li, Haoling and Zheng, Kai and Wu, Jie and Xu, Can and Sun, Qingfeng and Hu, Han and Yang, Yujiu},
  year={2026}
}
