Mining a Terms-of-Service fairness rubric from labelled data with DSPy and GEPA

GEPA wrote its own legal rubric and caught 33% more unfair contract clauses | by Tassos Yalanopoulos | Empirical Engineer | Jun, 2026 | Medium


Empirical Engineer

Methodical exploration of AI tools: proof-of-concept projects you can follow, question, and reproduce. Code and numbers open on GitHub.

GEPA wrote its own legal rubric and caught 33% more unfair contract clauses

Tassos Yalanopoulos

7 min read· 7 hours ago


A bare one-line prompt caught 65% of the unfair clauses in a stack of contracts. After an automatic optimiser rewrote that prompt, it caught 87%.

That optimiser is GEPA. You hand it a task, labelled examples, and a way to score outputs. It evolves the prompt for you.

The model doing the classifying the whole time? Claude Haiku. Cheap and small. The lift came from a better prompt, not a bigger model.

Everything below runs on public data, in a small repo: github.com/anastasiosyal/dspy-gepa-optimizer

The key dependency is the open-source DSPy framework.

Case Study: Is this contract clause unfair?

The task is to read one clause from a Terms-of-Service contract and decide whether it is unfair to the consumer (LexGLUE unfairToS). The dataset contains real clauses labelled by legal experts. “Unfair” follows specific legal criteria that a general model doesn’t reliably know: unilateral termination, price-change-at-will, forced arbitration, choice of foreign law, broad content licences.

I balanced the data 50/50 so accuracy is honest. The raw unfairToS data is heavily skewed (~89% fair / 11% unfair), so a lazy model that always says “fair” would score 89% and look great while catching zero violations.

Experiment Setup

The starting point is a deliberately bare prompt: “Decide whether this Terms-of-Service clause is unfair to the consumer.” No criteria. Let it discover them.
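The balancing argument from the case study can be checked with a few lines of arithmetic. This is a minimal sketch on synthetic labels (not the real unfairToS data); the helper names `accuracy_and_recall` and `always_fair` are hypothetical and exist only for this illustration.

```python
def accuracy_and_recall(gold, pred, positive="unfair"):
    """Return (accuracy, recall on the positive class) for two label lists."""
    correct = sum(g == p for g, p in zip(gold, pred))
    positives = [p for g, p in zip(gold, pred) if g == positive]
    caught = sum(p == positive for p in positives)
    recall = caught / len(positives) if positives else 0.0
    return correct / len(gold), recall


def always_fair(labels):
    """The 'lazy model': predicts 'fair' for every clause."""
    return ["fair"] * len(labels)


# Synthetic label sets that only mimic the class ratios quoted in the text.
skewed = ["fair"] * 89 + ["unfair"] * 11
balanced = ["fair"] * 50 + ["unfair"] * 50

skewed_acc, skewed_recall = accuracy_and_recall(skewed, always_fair(skewed))
balanced_acc, balanced_recall = accuracy_and_recall(balanced, always_fair(balanced))

print(f"skewed:   accuracy={skewed_acc:.2f} recall={skewed_recall:.2f}")
print(f"balanced: accuracy={balanced_acc:.2f} recall={balanced_recall:.2f}")
```

On the skewed set the lazy model looks strong on accuracy while its recall on the unfair class is zero; on the balanced set the same behaviour drops to chance-level accuracy, which is why the balanced split makes the accuracy figure easier to trust.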

The clauses are split three ways, each split sampled balanced 50/50 and non-overlapping: 200 train (what GEPA optimises on), 120 validation (what it scores candidate prompts against mid-run), and 300 test (locked away for the final number). The test set isn’t touched until the run is over, so every headline figure is measured on clauses GEPA never saw. The split is fixed for every run, so the four runs are directly comparable.

Baseline: 77.7% accuracy, but only 65% recall on the unfair class. It was missing about a third of the violations.

Reading those misses showed what it lacked. In some examples, it didn’t know that “governed by the laws of the Netherlands”, “we may discontinue our services”, or “may update pricing at any time” are unfair. It had intuition, not the rubric.

The Impact of Reflective Evolution

With GEPA, misses are fed back as targeted feedback.
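The three-way split described in the setup above can be sketched as follows. This is an illustration under stated assumptions, not the repo's code: the function name `balanced_split`, the seed value, and the synthetic clause pool are all hypothetical, and the real pipeline may sample differently.

```python
import random


def balanced_split(examples, sizes, seed=0):
    """Cut non-overlapping, 50/50 balanced splits from (text, label) pairs.

    `sizes` maps split name to split size; each size must be even.
    """
    rng = random.Random(seed)  # fixed seed keeps the split identical across runs
    pools = {"fair": [], "unfair": []}
    for text, label in examples:
        pools[label].append((text, label))
    for pool in pools.values():
        rng.shuffle(pool)

    splits, offset = {}, 0
    for name, size in sizes.items():
        half = size // 2
        # Consecutive slices of each shuffled pool never overlap.
        part = (pools["fair"][offset:offset + half]
                + pools["unfair"][offset:offset + half])
        if size % 2 or len(part) != size:
            raise ValueError(f"cannot build balanced split {name!r} of size {size}")
        rng.shuffle(part)
        splits[name] = part
        offset += half
    return splits


# Synthetic stand-in for the labelled clauses (assumption, not real data).
pool = ([(f"fair clause {i}", "fair") for i in range(900)]
        + [(f"unfair clause {i}", "unfair") for i in range(350)])
splits = balanced_split(pool, {"train": 200, "val": 120, "test": 300}, seed=42)

for name, part in splits.items():
    n_unfair = sum(label == "unfair" for _, label in part)
    print(name, len(part), n_unfair)
```

Slicing each class pool by a running offset is one simple way to guarantee that no clause lands in two splits, and seeding the generator is what keeps the split fixed so that separate runs stay comparable.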

Figure: Baseline vs average & best runs

From a one-line prompt, GEPA added clauses to the rubric. It wrote them as general rules, not memorised specifics, which is what makes them transferable. The three clauses the baseline missed each map to a mechanism GEPA added:

“governed by the laws of the Netherlands” became a new line in the prompt, a generic choice-of-law rule: “a clause stating the terms ‘shall be governed by … the law of [jurisdiction] … binds the consumer to a particular legal regime, which disadvantages the consumer.” Note the [jurisdiction] placeholder: it never memorised “Netherlands.”

“we may discontinue our services” and “may update pricing at any time” became a generic unilateral-change rule: “the provider can unilaterally change/modify/cancel terms, services, pricing … unfair even if the clause promises notice.”

The rubric never names a company or a country; it names principles. So the gain is the rubric generalising, not the model memorising the test. The optimiser discovered the decision criteria and encoded them, and the violation catch rate went from 65% to 86.5% on average across 4 runs (91% on the best run). That’s a 33% relative improvement ((86.5 − 65) / 65 ≈ 0.33), without a human having to add criteria to the prompt manually: the AI does it.

How Does this Work with DSPy?

DSPy is the framework: you declare your task as a typed program instead of writing prompt strings. GEPA is one of DSPy’s optimisers: it takes that declared program and rewrites its prompt for you. GEPA is also an open-source project of its own (gepa-ai/gepa: Optimize prompts, code, and more with AI-powered Reflective Text Evolution).

A dspy.Signature declares the typed inputs and outputs, and its docstring is the instruction the model runs on. A dspy.Module (ChainOfThought here) makes it runnable:

from typing import Literal

import dspy

Verdict = Literal["fair", "unfair"]

class ClauseFairness(dspy.Signature):
    """Decide whether this Terms-of-Service clause is unfair to the consumer."""

    text: str = dspy.InputField(
        desc="A single clause from an online Terms-of-Service contract.")
    label: Verdict =...
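The signature covers the task; the other ingredient named earlier is a way to score outputs, with misses fed back as targeted feedback. The sketch below shows one plausible shape for that scoring step in plain Python, without importing DSPy. The names `ScoredFeedback` and `fairness_feedback` and the feedback wording are hypothetical; the repo's actual metric is not shown in this text, and how the result is handed to the GEPA optimiser is left out on purpose.

```python
from dataclasses import dataclass


@dataclass
class ScoredFeedback:
    """A numeric score plus a text note the optimiser could reflect on."""
    score: float
    feedback: str


def fairness_feedback(clause_text, gold_label, predicted_label):
    """Score one prediction and describe the error in words (hypothetical helper)."""
    if predicted_label == gold_label:
        return ScoredFeedback(1.0, f"Correct: the clause is '{gold_label}'.")
    if gold_label == "unfair":
        note = ("Missed violation: the annotators labelled this clause 'unfair' "
                "but it was predicted 'fair'. Identify the mechanism that "
                "disadvantages the consumer.")
    else:
        note = ("False alarm: the annotators labelled this clause 'fair' "
                "but it was predicted 'unfair'.")
    return ScoredFeedback(0.0, f"{note} Clause: {clause_text!r}")


# Example using one of the clauses the baseline missed.
miss = fairness_feedback(
    "These terms are governed by the laws of the Netherlands.",
    gold_label="unfair",
    predicted_label="fair",
)
print(miss.score)
print(miss.feedback)
```

Returning text alongside the score is the design point: a bare 0 or 1 says that the prompt failed, while the note says how, which is the kind of signal a reflective optimiser can turn into a new rubric line.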
