Matching frontier performance through training and harness engineering

Fireworks AI

Open-source agents with frontier advisors: matching frontier performance through training and harness engineering

PUBLISHED 6/3/2026

Table of Contents

Combining an open-source agent harness, frontier tool use, and Fireworks-native post-training lifts performance through system-level orchestration.

The Test

Open source is competitive on quality, dominant on cost

A hybrid harness: open-source worker, frontier advisor as a callable tool

Post-training on Fireworks

Legal Agent Benchmark

How LAB scores a model

Our first step with Harvey in post-training a frontier-scale model

TL;DR. We explore two system-level techniques on Harvey’s Legal Agent Benchmark (LAB) that reduce reliance on single frontier model calls while reaching frontier-level performance at lower cost.

Harness engineering: an open-source GLM 5.1 worker self-triggers Claude Opus 4.7 as a callable advisor on sub-tasks where it improves outcomes, reaching 18 / 100 all-pass at $368 versus 14 / 100 for Opus end-to-end at $954.

Post-training on Fireworks: supervised fine-tuning (SFT) of Kimi K2.6 on LAB trajectories reaches 15 / 100 all-pass at $84, while reinforcement fine-tuning (RFT) improves mean score from 0.863 to 0.886 across 46 rollout steps.

Both approaches run on the Fireworks platform used for training and serving, removing the traditional gap between experimentation and production.
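The trade-off in the TL;DR figures can be made concrete with a little arithmetic. The sketch below uses only the all-pass counts and total costs quoted above; "cost per fully passing task" is a derived metric introduced here for illustration, not one the benchmark reports, and the `Run` class and `RUNS` list are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    all_pass: int    # tasks fully passing, out of the 100-task slice
    cost_usd: float  # total cost across the 100 tasks

    @property
    def cost_per_pass(self) -> float:
        # Illustrative derived metric: dollars spent per fully passing task.
        return self.cost_usd / self.all_pass


# Figures taken from the TL;DR above.
RUNS = [
    Run("Claude Opus 4.7 end-to-end", 14, 954.0),
    Run("GLM 5.1 + Opus 4.7 advisor", 18, 368.0),
    Run("Kimi K2.6 SFT", 15, 84.0),
]

baseline = RUNS[0]
for run in RUNS:
    share = run.cost_usd / baseline.cost_usd
    print(f"{run.name}: ${run.cost_per_pass:.2f} per fully passing task, "
          f"{share:.0%} of the Opus end-to-end cost")
```

Under this framing the hybrid harness costs roughly $20 per fully passing task against roughly $68 for Opus end-to-end, which is consistent with the ~39% total-cost figure in the Figure 1 caption.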

“On Fireworks, combining open-source worker models with frontier tool use and post-training closes much of the gap to frontier performance on Legal Agent Benchmark, while improving cost efficiency and system controllability.”
Niko Grupen, Head of Applied Research at Harvey

Figure 1 (Harness all-pass vs. cost): All-pass / 100 vs. total cost across the configurations we ran on the 100-task LAB slice: Claude Opus 4.7 (closed baseline), Kimi K2.6 (base and SFT), GLM 5.1 (worker only), and GLM 5.1 + Opus 4.7 advisor. The harness adds 6 tasks fully passing on top of GLM 5.1 alone and 4 above Opus, at $368 across the 100 tasks. That’s about 3× higher than GLM 5.1 alone on the worker side, still ~39% of Opus’s $954 standalone cost. GLM 5.1 + Opus 4.7 advisor beats Claude Opus on cost and quality. All-pass standard error ≈ 2.5pp (≈2.5 tasks / 100) across re-runs of identical configurations on the 100-task slice.

The Test

As a Harvey LAB research partner, Fireworks took an initial 100-task slice and ran it across the most capable open-source and closed-source models, then layered in the two interventions we think the field has been under-investing in: a hybrid harness with an open-source worker and frontier advisor, and Fireworks-native post-training capabilities.
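To make the hybrid-harness idea concrete, here is a minimal sketch of the control flow: a worker handles each sub-task and may itself decide to consult an advisor exposed as a callable tool. This is not the harness used in the experiments. The names `WorkerTurn`, `run_subtask`, `stub_worker`, and `stub_advisor` are hypothetical, the models are replaced by plain Python stubs, and the cap on advisor calls is an assumed cost guard rather than a documented setting.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class WorkerTurn:
    text: str
    # Set by the worker when it wants to consult the advisor (self-triggered).
    advisor_question: Optional[str] = None


def run_subtask(
    subtask: str,
    worker: Callable[[str, Optional[str]], WorkerTurn],
    advisor: Callable[[str], str],
    max_advisor_calls: int = 2,  # assumed budget to bound frontier spend
) -> Tuple[str, int]:
    """Return the worker's final text and the number of advisor calls made."""
    advice: Optional[str] = None
    calls = 0
    while True:
        turn = worker(subtask, advice)
        # Finish when the worker does not ask for help or the budget is spent.
        if turn.advisor_question is None or calls >= max_advisor_calls:
            return turn.text, calls
        advice = advisor(turn.advisor_question)
        calls += 1


def stub_worker(subtask: str, advice: Optional[str]) -> WorkerTurn:
    # Stand-in for the open-source worker: asks for help only on a hard topic.
    if "indemnification" in subtask and advice is None:
        return WorkerTurn("draft pending", f"How should I handle: {subtask}?")
    suffix = f" (using advice: {advice})" if advice else ""
    return WorkerTurn(f"draft for {subtask}{suffix}")


def stub_advisor(question: str) -> str:
    # Stand-in for the frontier advisor.
    return "cap liability at fees paid"


easy_text, easy_calls = run_subtask("summarize the notice clause", stub_worker, stub_advisor)
hard_text, hard_calls = run_subtask("draft indemnification clause", stub_worker, stub_advisor)
print(easy_calls, hard_calls)
```

The design point the sketch illustrates is that the advisor is only paid for on sub-tasks where the worker requests it, so most of the token volume stays on the cheaper model.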

The 100-task slice is a distribution-mirrored subset of the 1,250-task LAB release, preserving the practice-area mix of the full benchmark. This mirrors the sampling approach Harvey used for the Initial Results in the launch post.
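One way to build a distribution-mirrored subset is proportional stratified sampling. The sketch below is an assumption about how such a slice could be drawn, not a description of the sampling Harvey or Fireworks actually used. The function name `stratified_slice`, the practice-area labels, and the per-area counts are all hypothetical; only the 1,250 and 100 totals come from the text.

```python
import random
from collections import Counter, defaultdict


def stratified_slice(tasks, key, n, seed=0):
    """Sample n tasks so each group's share matches its share of the full set."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for task in tasks:
        groups[key(task)].append(task)

    total = len(tasks)
    # Integer quota per group, plus the remainder used to break rounding ties.
    alloc = {g: (n * len(items)) // total for g, items in groups.items()}
    remainder = {g: (n * len(items)) % total for g, items in groups.items()}

    # Largest-remainder rule: hand leftover seats to the biggest remainders.
    leftover = n - sum(alloc.values())
    for g in sorted(groups, key=lambda g: remainder[g], reverse=True)[:leftover]:
        alloc[g] += 1

    sample = []
    for g, items in groups.items():
        sample.extend(rng.sample(items, alloc[g]))
    return sample


# Hypothetical practice-area mix for a 1,250-task benchmark.
mix = {"litigation": 500, "corporate": 375, "ip": 250, "tax": 125}
tasks = [(area, i) for area, count in mix.items() for i in range(count)]

subset = stratified_slice(tasks, key=lambda t: t[0], n=100)
counts = Counter(area for area, _ in subset)
print(dict(counts))
```

With the hypothetical mix above, each practice area keeps the same proportion in the 100-task slice as in the full 1,250-task set.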

The exercise was necessary because intelligence is jagged: a model that nails frontier mathematics or competitive code generation can still struggle with structured legal drafting, and there is no shortcut around domain-specific evaluation. LAB is the cleanest public lab we know of for the question the industry has been arguing about for two years:

can open-source models do frontier-quality legal AI?

The joint team’s setup runs both halves of the answer on one platform: Fireworks trains, evaluates, and serves on the same infrastructure, so a model fine-tuned against LAB is the same model, bit-for-bit, that serves production traffic. No research-to-production gap to cross.

Open source is competitive on quality, dominant on cost

On LAB’s continuous mean-score metric, GLM 5.1 ranks highest among the open-source models we evaluated, with a mean score of 0.8921, which puts it directly alongside the frontier: Claude Opus 4.7 at 0.911 and GPT-5.5 at 0.892. Kimi K2.6 (0.863) and DeepSeek V4 Pro (0.871) come in just below, both still clearly viable for production legal workloads.

On the LAB all-pass metric, the production-readiness measure, the closed frontier holds a small lead: Opus 4.7 at 14 / 100, GPT-5.5 at 11 / 100, GLM 5.1 at 12 / 100. That gap is where the rest of this post lives; the two...
