FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale | Frontier-CS Blog Posts FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale<br>We release FrontierSmith, a system that converts closed-ended coding problems into open-ended optimization tasks for training long-horizon coding agents.
TL;DR. We are releasing FrontierSmith , a system for synthesizing open-ended coding problems at scale. Starting from closed-ended programming tasks, FrontierSmith produces optimization-style problems with continuous scores, filters them for real solution diversity, and builds runnable training environments. In our experiments, models trained on FrontierSmith data can outperform models trained on human-curated open-ended data, and the resulting tasks elicit genuinely long-horizon agent behavior.<br>The Data Bottleneck<br>Code agents are getting very good at repetitive software work. Across research labs and startups, the next important question is increasingly about whether AI can solve open-ended optimization problems that matter in the real world: chip placement and routing, logistics, power-grid scheduling, database tuning, kernel optimization, and many others.<br>But our previous FrontierCS on Harbor blog showed a clear weakness. Today’s code agents are much less reliable in long-horizon, open-ended optimization than they are on traditional contest or math-style tasks.<br>We think a major reason is data.<br>Classic RLVR settings have huge amounts of high-quality training data. Competitive programming alone has more than 100,000 public problems, and the broader coding-data industry is continuously producing more. By contrast, if we add together open-ended optimization benchmarks such as FrontierCS, ALE-bench, KernelBench, and the recent MLS-Bench, we still only get hundreds of tasks.
Closed-ended coding data is abundant. High-quality open-ended coding tasks are still rare.<br>That gap is the bottleneck FrontierSmith targets. Frontier labs may already understand the value of open-ended optimization, but without enough scalable training tasks, it is hard to run the kind of training that made closed-ended coding models so strong.<br>From Closed to Open<br>The core idea of FrontierSmith is simple: do not ask an LLM to invent high-quality open-ended problems from scratch. Start from closed-ended problems instead.<br>Closed-ended coding tasks are already abundant. Given a LeetCode-style or competitive-programming-style problem, FrontierSmith applies principled mutations that turn it into a high-quality open-ended optimization problem.
FrontierSmith starts from closed-ended seed problems, mutates them into open-ended candidates, filters them by solution diversity, and builds runnable training environments.<br>We describe a problem as a tuple: an objective (O), input constraints (C_I), and output constraints (C_O). FrontierSmith mutates the formulation along three axes:<br>Changing the goal ((O \rightarrow O’)): replace an exact or binary goal with an optimization objective.<br>Restricting the output ((C_O \rightarrow C_O’)): add constraints that make exact solutions infeasible at scale.<br>Generalizing the input ((C_I \rightarrow C_I’)): relax input assumptions so a previously tractable problem becomes hard.
FrontierSmith mutates problem formulations by changing goals, restricting outputs, or generalizing inputs.<br>A simple example is minimum spanning tree. The original problem has a clean greedy solution. If we add a degree constraint and require every vertex in the tree to have degree at most (D), the problem becomes a degree-constrained spanning tree problem. When (D = 2), this reduces to a classic TSP-like setting. At realistic scales, exact optimality is no longer practical, and solution quality becomes continuous.<br>Filtering for Real Diversity<br>Mutation creates many candidates, but not every candidate is useful. Some are still effectively closed-ended. Some are open-ended in wording but dominated by one obvious strategy.<br>Our key filtering signal is idea divergence . We cannot ask an LLM to prove whether a problem is P or NP-hard, or whether an optimum is reachable under a fixed compute budget. We can, however, sample solutions from different solvers and ask whether they explore meaningfully different algorithmic ideas.<br>Open-ended problems tend to produce diverse solution strategies. Closed-ended problems are often dominated by a single “gold idea.”
Idea divergence is estimated both semantically, by comparing solver strategies, and behaviorally, by comparing score vectors across generated tests.<br>FrontierSmith uses this signal in two stages. Before building an execution environment, an LLM judge compares sampled solutions and estimates whether they use the same strategy. After tests and verifiers are built, FrontierSmith also compares solution score vectors across test cases. Candidates with low divergence are discarded.<br>Building Training Environments<br>The surviving ideas are converted into clean, runnable training environments. We reuse the...