Ornith-1.0: Self-Scaffolding LLMs for Agentic Coding


Ornith-1.0: Self-Scaffolding LLMs for Agentic Coding | DeepReinforce Blog | Jun. 2026

Aloha! 🌺

Today, we are introducing Ornith-1.0, a self-improving family of open-source models built specifically for agentic coding tasks. Ornith-1.0 spans the full spectrum, from compact 9B Dense models suitable for edge-device deployment to 397B MoE frontier-scale models optimized for maximum performance, with variants including 9B Dense, 31B Dense, 35B MoE, and 397B MoE. Built on top of pretrained Gemma 4 and Qwen 3.5, it achieves state-of-the-art performance among open-source models of comparable size on coding benchmarks.

The key innovation behind Ornith-1.0 is a self-improving training framework. Instead of relying on human-designed harnesses to drive solution generation in RL, Ornith-1.0 learns to generate both solution rollouts and the task-specific harnesses that guide those rollouts. By jointly optimizing the scaffold and the resulting solution, the model can discover better search trajectories and generate higher-quality solutions.

Ornith-1.0 achieves state-of-the-art performance among open-source models of comparable size across a broad range of agentic coding benchmarks. Ornith-1.0-397B (77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified) surpasses Claude Opus 4.7 (70.3 on TB-2.1 and 80.8 on SWE-Bench Verified) and outperforms leading open-source models of similar size, including MiniMax M3 (66.0 on TB-2.1 and 80.5 on SWE-Bench Verified) and DeepSeek-V4-Pro (67.9 on TB-2.1 and 80.6 on SWE-Bench Verified). Ornith-1.0-9B, which can be easily deployed on edge devices, matches or exceeds the performance of much larger models such as Gemma 4-31B and Qwen 3.6-35B.

At the flagship scale, Ornith-1.0-397B achieves 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified, surpassing Claude Opus 4.7 on both benchmarks and outperforming leading open-source models of similar size, including MiniMax M3 and DeepSeek-V4-Pro.

Ornith-1.0-35B significantly outperforms similarly sized models, including Qwen 3.5-35B, Qwen 3.6-35B, and Gemma 4-31B. Despite having only 35B parameters, it even surpasses Qwen 3.5-397B on Terminal-Bench 2.1 (64.4 vs. 53.5) while matching its performance across several other coding and agentic benchmarks.

The edge-deployable Ornith-1.0-9B also delivers remarkably strong results, achieving 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified. Despite being a compact 9B-parameter model, it matches or exceeds the performance of much larger models such as Gemma 4-31B, demonstrating that strong agentic coding capabilities can be achieved even in resource-efficient deployments.

A Self-improving Strategy for LLM Training

At the core of Ornith-1.0 is a self-improving training framework that jointly learns to solve tasks and to construct the scaffolds that guide those solutions. Rather than relying on a fixed, human-designed harness shared across a task category, Ornith-1.0 treats the scaffold as a learnable object that co-evolves with the policy.

Each RL step proceeds in two stages: conditioned on a task and the scaffold previously used for it, the model first proposes a refined scaffold; conditioned on that scaffold and the task description, it then generates a solution rollout. Reward from the rollout is propagated to both stages, so the model is optimized not only to produce better answers but to author the orchestration that elicits them.
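To make the two-stage step concrete, here is a minimal Python sketch under stated assumptions. The names `rl_step`, `StageSample`, `generate`, and `score` are hypothetical and do not come from the Ornith-1.0 codebase, and the prompt layout is invented for illustration. It only shows the data flow described above: one scaffold proposal, one solution rollout conditioned on it, and a single rollout reward attached to both stages.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StageSample:
    """One trainable sample: which stage produced it, its prompt, output, and reward."""
    stage: str      # "scaffold" or "solution"
    prompt: str
    output: str
    reward: float


def rl_step(
    task: str,
    prev_scaffold: str,
    generate: Callable[[str, str], str],   # hypothetical: (stage, prompt) -> text
    score: Callable[[str, str], float],    # hypothetical: (task, solution) -> reward
) -> Tuple[str, List[StageSample]]:
    # Stage 1: propose a refined scaffold, conditioned on the task and the
    # scaffold previously used for it.
    scaffold_prompt = f"TASK:\n{task}\n\nPREVIOUS SCAFFOLD:\n{prev_scaffold}"
    scaffold = generate("scaffold", scaffold_prompt)

    # Stage 2: generate a solution rollout, conditioned on the new scaffold
    # and the task description.
    solution_prompt = f"SCAFFOLD:\n{scaffold}\n\nTASK:\n{task}"
    solution = generate("solution", solution_prompt)

    # The rollout reward is shared by both stages, so the scaffold proposal
    # is credited (or penalized) for the solution it led to.
    reward = score(task, solution)
    samples = [
        StageSample("scaffold", scaffold_prompt, scaffold, reward),
        StageSample("solution", solution_prompt, solution, reward),
    ]
    return scaffold, samples
```

In a training loop, the returned scaffold would presumably be stored and passed back as `prev_scaffold` the next time the same task is sampled, while the two samples feed whatever policy-gradient update is in use.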

Repeated over training, this yields a feedback loop in which scaffolds are continually mutated and selected toward those that induce higher-reward trajectories, allowing per-task-category strategies to emerge automatically and driving sustained capability gains without hand-engineered harness design.

Addressing Reward Hacking in Self-improvement

Allowing the model to author its own scaffold naturally introduces a risk of reward hacking. A self-generated scaffold can learn to satisfy the verifier without performing the task, for example by reading the visible test files and hardcoding the expected artifacts (touching the checked-for file or writing the literal expected output), or by copying an oracle solution present in the environment.

We defend against this in three layers. First, we fix the outer trust boundary: the environment, the tool surface, and test isolation are immutable and outside the model's reach, so the model evolves only the inner policy scaffold: its memory, error-handling, and orchestration logic.

Second, a deterministic monitor enforces that boundary at the level it can be specified exactly, flagging any attempt to read withheld paths, modify verification scripts, or invoke actions outside the sanctioned tool surface, and assigning such trajectories zero reward with exclusion from the advantage computation.
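A rule-based monitor of this kind could look roughly like the sketch below. Everything specific here is an assumption made for illustration: the tool names, the withheld directory, the verifier path, the `Action` record, and the mean-baseline advantage are all hypothetical, since the post does not describe the actual tool surface or advantage estimator.

```python
import posixpath
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List, Optional, Sequence

# Hypothetical trust boundary; the real tool surface and paths are not public.
ALLOWED_TOOLS = frozenset({"read_file", "write_file", "run_tests"})
WITHHELD_DIRS = (PurePosixPath("/env/hidden_tests"),)
VERIFIER_FILES = (PurePosixPath("/env/verify.sh"),)


@dataclass(frozen=True)
class Action:
    tool: str
    path: Optional[str] = None


def _is_under(path: PurePosixPath, root: PurePosixPath) -> bool:
    return path == root or root in path.parents


def violates_boundary(actions: Sequence[Action]) -> bool:
    """Return True if any action steps outside the sanctioned surface."""
    for action in actions:
        if action.tool not in ALLOWED_TOOLS:
            return True  # action outside the sanctioned tool surface
        if action.path is None:
            continue
        # Normalize first so "a/../hidden_tests" cannot slip past the check.
        path = PurePosixPath(posixpath.normpath(action.path))
        if action.tool == "read_file" and any(_is_under(path, d) for d in WITHHELD_DIRS):
            return True  # read of a withheld path
        if action.tool == "write_file" and path in VERIFIER_FILES:
            return True  # modification of a verification script
    return False


def masked_advantages(rewards: Sequence[float], flagged: Sequence[bool]) -> List[Optional[float]]:
    """Mean-baseline advantages (an assumed estimator) over unflagged rollouts.

    Flagged rollouts get None: they contribute neither to the baseline nor
    to the update, matching "exclusion from the advantage computation".
    """
    kept = [r for r, f in zip(rewards, flagged) if not f]
    baseline = sum(kept) / len(kept) if kept else 0.0
    return [None if f else r - baseline for r, f in zip(rewards, flagged)]
```

The checks are deliberately literal: a monitor like this only covers violations that can be written down exactly, which is why a further layer is needed for gaming that stays inside the allowed tools.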

Third, because intent-level gaming can occur entirely within the allowed tool surface, a frozen LLM judge acts as a veto on top of the verifier rather than as the primary reward.
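One way to read "veto rather than primary reward" is that the judge can only take reward away, never grant it. The sketch below illustrates that reading; `layered_reward`, `verifier`, and `judge_flags_gaming` are hypothetical names, and setting a vetoed reward to zero is an assumption, since the post does not say what value a vetoed trajectory receives.

```python
from typing import Callable


def layered_reward(
    trajectory: str,
    verifier: Callable[[str], float],           # hypothetical: task verifier score
    judge_flags_gaming: Callable[[str], bool],  # hypothetical: frozen LLM judge verdict
) -> float:
    # The verifier remains the only source of positive reward.
    reward = verifier(trajectory)
    if reward <= 0.0:
        # Nothing to veto, so the judge is not consulted.
        return reward
    # The judge can cancel a passing trajectory it considers gamed,
    # but it can never raise the reward of a failing one.
    return 0.0 if judge_flags_gaming(trajectory) else reward
```

Under this design a judge that is overly lenient leaves the verifier's score unchanged, so the model cannot earn reward by pleasing the judge alone.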

Asynchronous RL Training

For RL training,...
