Making Optimization Work When Labels Are Scarce · Gnosys Labs
Case study · Early evidence · As of 2026-06
Making Optimization Work When Labels Are Scarce
Gnosys is an autonomous model engineer: it improves prompts and classifiers when ground truth is too sparse for conventional optimization. On ToxicChat, a public safety benchmark, under realistic label scarcity, it improved a classifier past both the team's starting point and GEPA (a standard prompt optimizer), across two runs of our current method. This note describes what we did, what we found, and where the method underperformed.
Summary. We evaluated Gnosys on ToxicChat under realistic label scarcity (about 200 verified labels, only about 8 of them harmful). Holding the underlying optimizer fixed, we compared two approaches: running GEPA directly against the labels, and running the Gnosys system, which engineers a trustworthy objective before improving the model. Across two held-out runs, Gnosys outperformed both the team's starting classifier and GEPA on the metric safety teams actually deploy against: harm caught at a fixed false positive budget. These are early results (two single-seed runs); replication is underway.
Results
We report harm caught: the share of harmful messages flagged, holding the false positive rate fixed at 5% (one in twenty) for every method, so a difference reflects additional harm caught at the same cost rather than a change of threshold. Both runs below are scored on a held-out set the system never saw.
| Method | Headline run (3,000) | Prior run (1,000) |
| --- | --- | --- |
| Gnosys | 0.777 | 0.909 |
| Starting classifier | 0.731 | 0.788 |
| GEPA | 0.702 | 0.848 |

In both runs, Gnosys improved on both the starting classifier and GEPA. In the headline run GEPA not only trailed Gnosys but fell below the starting classifier (0.731 to 0.702); in the prior run it improved on the starting point. This inconsistency is the central difficulty under sparse labels: optimization sometimes helps and sometimes harms, and without trustworthy measurement there is no way to tell which has happened.
The comparison is intentionally conservative: both approaches use the same underlying optimizer. The only difference is that Gnosys engineers the objective the optimizer works against.
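The metric used in this section, harm caught at a fixed false positive rate, can be sketched as follows. This is a minimal illustration rather than the evaluation code behind the reported numbers; the function name `harm_caught_at_fpr`, the tie-handling rule, and the toy scores are assumptions made for the example.

```python
import numpy as np

def harm_caught_at_fpr(scores, labels, max_fpr=0.05):
    """Share of harmful messages flagged when benign false positives are capped.

    scores: higher means "more likely harmful" (assumed convention).
    labels: True for harmful, False for benign.
    Returns (harm_caught, threshold); a message is flagged if score > threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    harmful = scores[labels]
    benign = np.sort(scores[~labels])[::-1]  # benign scores, highest first
    if harmful.size == 0 or benign.size == 0:
        raise ValueError("need at least one harmful and one benign example")

    # How many benign messages may be flagged without exceeding the budget.
    # The small epsilon guards against floating-point under-rounding.
    allowed = int(np.floor(max_fpr * benign.size + 1e-9))

    # Flag strictly above the (allowed + 1)-th highest benign score, so at
    # most `allowed` benign messages are flagged (ties only lower the count).
    threshold = -np.inf if allowed >= benign.size else benign[allowed]
    return float(np.mean(harmful > threshold)), float(threshold)

# Toy data (hypothetical): 20 benign and 4 harmful messages.
benign_scores = [i / 100 for i in range(20)]   # 0.00 .. 0.19
harmful_scores = [0.5, 0.185, 0.1, 0.9]
scores = benign_scores + harmful_scores
labels = [False] * 20 + [True] * 4

rate, threshold = harm_caught_at_fpr(scores, labels)
print(f"harm caught at 5% FPR: {rate:.2f} (threshold {threshold:.2f})")
```

Because each method gets its own threshold chosen to meet the same 5% budget, a higher value means more harm caught at the same cost, not a looser threshold.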
The problem
Teams running high-stakes AI classifiers, in content moderation, fraud, claims review, and risk scoring, share one constraint: the ground truth they need is a human judgment that is expensive, slow, and sometimes never arrives. They can verify only a small set of examples while decisions accumulate on everything else.
Tuning the model against the few labels on hand is where the difficulty concentrates. Here "few" is literal: about 200 verified labels, of which roughly 8 were actual harm, against several thousand unlabeled messages. With that little verified signal, an optimizer fits the noise in those examples rather than the underlying pattern, and the direction it moves depends on which handful of labels it happened to receive.
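To make the noise concrete, here is a small sketch of how wide the uncertainty on a recall estimate is with about 8 harmful examples. It uses the standard Wilson score interval for a binomial proportion; the scenario (6 of 8 caught) and the comparison case (150 of 200) are hypothetical numbers chosen for illustration, not results from the study.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical: a classifier catches 6 of the 8 verified harmful examples.
low_small, high_small = wilson_interval(6, 8)

# Same observed rate (75%), but with 200 harmful examples for comparison.
low_large, high_large = wilson_interval(150, 200)

print(f"8 positives:   {low_small:.2f} to {high_small:.2f}")
print(f"200 positives: {low_large:.2f} to {high_large:.2f}")
```

With 8 positives the interval spans roughly half of the possible range, so two candidate prompts whose true recall differs by several points can easily swap order depending on which handful of labels was drawn.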
How Gnosys is different
GEPA improves whatever evaluation signal it is given. That is its job, it does it well, and Gnosys uses it. But Gnosys goes further. As an autonomous model engineer, it judges whether the available signal is trustworthy enough to optimize against, engineers a better objective from the sparse labels when it is not, and then rewrites the prompts and classifier against that objective.
Prompt optimization is one step in the loop. Gnosys automates the entire engineering cycle.
Rather than trusting a handful of labels directly, Gnosys fuses the small verified set with the large unlabeled pool into a calibrated estimate of quality, with per-slice calibration and an explicit check that flags when the signal is not trustworthy enough to act on. In both runs, optimizing against that calibrated objective improved on both the starting classifier and GEPA using the same labels.
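This note does not specify how Gnosys performs the fusion, so the following is not the Gnosys method. It is a generic difference estimator (in the spirit of prediction-powered inference), swapped in only to illustrate the general idea of combining a small verified set with a large unlabeled pool and refusing to act when the result is too uncertain. All names (`fused_estimate`, `is_trustworthy`, `max_se`) and the toy numbers are assumptions.

```python
import numpy as np

def fused_estimate(proxy_unlabeled, proxy_labeled, truth_labeled):
    """Estimate a mean quality score from a large pool plus a small verified set.

    proxy_*: a cheap automatic score per message (for example, a model's guess
             at whether the classifier's decision was correct).
    truth_labeled: the verified value for the same quantity on the small set.
    Returns (estimate, standard_error).
    """
    proxy_unlabeled = np.asarray(proxy_unlabeled, dtype=float)
    proxy_labeled = np.asarray(proxy_labeled, dtype=float)
    truth_labeled = np.asarray(truth_labeled, dtype=float)
    if proxy_labeled.shape != truth_labeled.shape:
        raise ValueError("labeled proxy and truth must align")
    if proxy_unlabeled.size < 2 or proxy_labeled.size < 2:
        raise ValueError("need at least two examples in each set")

    # The verified set measures how wrong the proxy is on average...
    correction = truth_labeled - proxy_labeled
    # ...and that measured bias is added to the proxy mean on the big pool.
    estimate = proxy_unlabeled.mean() + correction.mean()

    # Uncertainty comes from both the pool and the small verified set.
    se = np.sqrt(
        proxy_unlabeled.var(ddof=1) / proxy_unlabeled.size
        + correction.var(ddof=1) / correction.size
    )
    return float(estimate), float(se)

def is_trustworthy(se, max_se=0.05):
    """Hypothetical gate: only optimize against estimates that are tight enough."""
    return se <= max_se

# Toy example (hypothetical numbers): the proxy is uninformative on the
# verified set, so the corrected estimate is very uncertain.
est, se = fused_estimate(
    proxy_unlabeled=[0.2, 0.4, 0.6, 0.8],
    proxy_labeled=[0.5, 0.5],
    truth_labeled=[1.0, 0.0],
)
print(f"estimate={est:.2f}, se={se:.2f}, act on it: {is_trustworthy(se)}")
```

In a per-slice variant, the same computation would be repeated within each slice, with the gate applied slice by slice; that extension is likewise an assumption rather than a description of the system.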
The evidence, slice by slice
The figures below are computed against the held-out test labels, that is, full ground truth that a deployment would not have; they are not estimates the system produced from the sparse labels. They are point estimates on small positive subsets, so we report the count of harmful examples alongside each. Because a single aggregate can hide a regression within a category of interest, we report every slice, including losses. All figures compare Gnosys against GEPA on the headline run.
By message length (a complete split of the test set):
| Length | Harmful examples | vs. GEPA |
| --- | --- | --- |
| Short (under ~80 characters) | 81 | −18.5 pts |
| Medium | 51 | +21.6 pts |
| Long / multi-step (200+ characters) | 106 | +20.8 pts |
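Because the length slices are a complete split, their count-weighted average should reproduce the headline gap between Gnosys and GEPA. The short check below works through that arithmetic using only the figures reported in this note. It assumes the slice figures are differences in harm caught at the same operating point as the headline numbers; the implied example counts are derived by rounding and are not reported values.

```python
# (slice name, harmful examples, Gnosys minus GEPA in percentage points)
length_slices = [
    ("Short", 81, -18.5),
    ("Medium", 51, 21.6),
    ("Long / multi-step", 106, 20.8),
]

total_harmful = sum(n for _, n, _ in length_slices)

# Count-weighted average of the per-slice differences.
weighted_gap = sum(n * d for _, n, d in length_slices) / total_harmful

# Headline run: Gnosys 0.777 vs. GEPA 0.702, expressed in points.
headline_gap = (0.777 - 0.702) * 100

# Implied change in harmful messages caught per slice (derived, rounded).
implied_counts = [round(n * d / 100) for _, n, d in length_slices]

print(f"harmful examples in split: {total_harmful}")
print(f"weighted slice gap: {weighted_gap:.1f} pts")
print(f"headline gap:       {headline_gap:.1f} pts")
print(f"implied extra catches per slice: {implied_counts}")
```

The weighted gap comes to about 7.6 points, in line with the 7.5-point headline difference once rounding is allowed for. Under the stated assumption, the net gain corresponds to roughly 18 additional harmful messages caught out of 238, with the loss on short messages amounting to about 15 examples.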
By harmful-content category (a safety team's working slices):
| Category | Harmful examples | vs. GEPA |
| --- | --- | --- |
| Violence-related | 21 | +23.8 pts |
| Jailbreak attempts (independently verified) | 49 | +8.2 pts |
| Sexual content | 63 | −7.9 pts |
The gains concentrated where judging the content requires the most reasoning: violent intent, deliberate jailbreaks, and...