Making Optimization Work When Labels Are Scarce · Gnosys Labs

Case study · Early evidence · As of 2026-06

Gnosys is an autonomous model engineer: it improves prompts and classifiers when ground truth is too sparse for conventional optimization. On ToxicChat, a public safety benchmark, under realistic label scarcity, it improved a classifier past both the team's starting point and GEPA (a standard prompt optimizer), across two runs of our current method. This note describes what we did, what we found, and where the method underperformed.

Summary. We evaluated Gnosys on ToxicChat under realistic label scarcity (about 200 verified labels, only about 8 of them harmful). Holding the underlying optimizer fixed, we compared two approaches: running GEPA directly against the labels, and running the Gnosys system, which engineers a trustworthy objective before improving the model. Across two runs, each scored on held-out data, Gnosys outperformed both the team's starting classifier and GEPA on the metric safety teams actually deploy against: harm caught at a fixed false positive budget. These are early results (two single-seed runs); replication is underway.

Results

We report harm caught: the share of harmful messages flagged, holding the false positive rate fixed at 5% (one in twenty) for every method, so a difference reflects additional harm caught at the same cost rather than a change of threshold. Both runs below are scored on a held-out set the system never saw.
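A minimal sketch of how this metric can be computed, assuming each classifier emits a numeric harm score per message and that held-out labels are 1 for harmful and 0 for benign. The function name `harm_caught_at_fpr` and the strict-threshold handling of ties are illustrative assumptions, not the evaluation code used for these runs.

```python
def harm_caught_at_fpr(scores, labels, max_fpr=0.05):
    """Share of harmful messages flagged at a threshold whose false
    positive rate on benign messages does not exceed max_fpr."""
    benign = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    harmful = [s for s, y in zip(scores, labels) if y == 1]
    if not benign or not harmful:
        raise ValueError("need at least one benign and one harmful example")

    # Number of benign messages we are allowed to flag under the budget.
    allowed = int(max_fpr * len(benign))

    # Flag only scores strictly above the (allowed + 1)-th highest benign
    # score, so at most `allowed` benign messages are flagged (ties make the
    # rule more conservative, never less).
    threshold = benign[min(allowed, len(benign) - 1)]
    caught = sum(s > threshold for s in harmful)
    return caught / len(harmful), threshold
```

Because the threshold is set from the benign scores alone, every method is compared at the same cost, and only the share of harm caught varies.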

Method Headline run (3,000) Prior run (1,000)

Gnosys 0.777 0.909

Starting classifier 0.731 0.788

GEPA 0.702 0.848

In both runs, Gnosys improved on both the starting classifier and GEPA. In the headline run GEPA not only trailed Gnosys but fell below the starting classifier (0.731 to 0.702); in the prior run it improved on the starting point. This inconsistency is the central difficulty under sparse labels: optimization sometimes helps and sometimes harms, and without trustworthy measurement there is no way to tell which has happened.

The comparison is intentionally conservative: both approaches use the same underlying optimizer. The only difference is that Gnosys engineers the objective the optimizer works against.

The problem

Teams running high-stakes AI classifiers, in content moderation, fraud, claims review, and risk scoring, share one constraint: the ground truth they need is a human judgment that is expensive, slow, and sometimes never arrives. They can verify only a small set of examples while decisions accumulate on everything else.

Tuning the model against the few labels on hand is where the difficulty concentrates. Here "few" is literal: about 200 verified labels, of which roughly 8 were actual harm, against several thousand unlabeled messages. With that little verified signal, an optimizer fits the noise in those examples rather than the underlying pattern, and the direction it moves depends on which handful of labels it happened to receive.
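To make the noise concrete, here is a small sketch of the uncertainty around a recall estimate built from only 8 harmful examples. It uses the standard Wilson score interval; the "6 of 8 caught" figure is a hypothetical input chosen for illustration, not a result from these runs.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical: a classifier flags 6 of the 8 verified harmful messages.
low, high = wilson_interval(6, 8)
print(f"recall 0.75, interval {low:.2f} to {high:.2f}")
```

With 8 positives the interval runs from roughly 0.41 to 0.93, so two prompts whose true recall differs by ten points or more can easily swap places on the verified set. This is the sense in which an optimizer ends up fitting the draw of labels rather than the pattern.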

How Gnosys is different

GEPA improves whatever evaluation signal it is given. That is its job, it does it well, and Gnosys uses it. But Gnosys goes further. As an autonomous model engineer it judges whether the available signal is trustworthy enough to optimize against, engineers a better objective from the sparse labels when it is not, and rewrites the prompts and classifier against that objective.

Prompt optimization is one step in the loop. Gnosys automates the entire engineering cycle.

Rather than trusting a handful of labels directly, Gnosys fuses the small verified set with the large unlabeled pool into a calibrated estimate of quality, with per-slice calibration and an explicit check that flags when the signal is not trustworthy enough to act on. In both runs, optimizing against that calibrated objective improved on both the starting classifier and GEPA using the same labels.
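The general idea of fusing a small verified set with a large unlabeled pool can be illustrated with a standard difference estimator, in the spirit of prediction-powered inference. This is a swapped-in textbook technique, not the Gnosys method, which this note does not specify. The sketch assumes a cheap proxy quality score is available for every message (for example an automated judge), omits per-slice calibration, and uses a hypothetical `max_half_width` tolerance as the trust check.

```python
from math import sqrt
from statistics import mean, stdev

def fused_estimate(proxy_unlabeled, proxy_labeled, truth_labeled,
                   z=1.96, max_half_width=0.05):
    """Estimate mean quality from a large proxy-scored pool, corrected by
    the proxy's average error on the small verified set.

    Returns (estimate, half_width, trusted). `trusted` is False when the
    interval is too wide to act on (hypothetical tolerance).
    """
    # Average gap between verified truth and the proxy on labeled examples.
    gaps = [t - p for t, p in zip(truth_labeled, proxy_labeled)]
    estimate = mean(proxy_unlabeled) + mean(gaps)

    # Variance has two parts: the pool (large N, small) and the
    # correction term (small n, usually dominant).
    se = sqrt(stdev(proxy_unlabeled) ** 2 / len(proxy_unlabeled)
              + stdev(gaps) ** 2 / len(gaps))
    half_width = z * se
    return estimate, half_width, half_width <= max_half_width
```

The design point the sketch shares with the text is the explicit `trusted` flag: when the correction term is estimated from too few verified examples, the interval widens and the estimate is reported as not safe to optimize against.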

The evidence, slice by slice

The figures below are computed against the held-out test labels, that is, full ground truth that a deployment would not have; they are not estimates the system produced from the sparse labels. They are point estimates on small positive subsets, so we report the count of harmful examples alongside each. Because a single aggregate can hide a regression within a category of interest, we report every slice, including losses. All figures compare Gnosys against GEPA on the headline run.

By message length (a complete split of the test set):

Length Harmful examples vs. GEPA

Short (under ~80 characters) 81 −18.5 pts

Medium 51 +21.6 pts

Long / multi-step (200+ characters) 106 +20.8 pts

By harmful-content category (a safety team's working slices):

Category Harmful examples vs. GEPA

Violence-related 21 +23.8 pts

Jailbreak attempts (independently verified) 49 +8.2 pts

Sexual content 63 −7.9 pts
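Because the subsets are small, it can help to read each point difference as a number of messages. The sketch below does that arithmetic, assuming each "pts" figure is a percentage-point difference in harm caught within the slice; the implied counts are derived here by rounding and are not figures reported in this note.

```python
# (harmful examples in slice, Gnosys vs. GEPA in percentage points)
by_length = {
    "Short": (81, -18.5),
    "Medium": (51, 21.6),
    "Long / multi-step": (106, 20.8),
}
by_category = {
    "Violence-related": (21, 23.8),
    "Jailbreak attempts": (49, 8.2),
    "Sexual content": (63, -7.9),
}

def implied_examples(n, pts):
    """Approximate net number of harmful messages behind a point difference."""
    return round(n * pts / 100)

for name, (n, pts) in {**by_length, **by_category}.items():
    print(f"{name}: {implied_examples(n, pts):+d} of {n}")

# Net difference across the complete length split.
net_by_length = sum(implied_examples(n, pts) for n, pts in by_length.values())
print("net across length slices:", net_by_length)
```

Under these assumptions the violence-related gain corresponds to about 5 messages out of 21, and the length slices net to about 18 messages out of 238, or roughly 7.6 points, in line with the 0.777 versus 0.702 headline gap.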

The gains concentrated where judging the content requires the most reasoning: violent intent, deliberate jailbreaks, and...
