RedlineBench: how models handle a multi-turn, real-world contract negotiation

Crosby · micro1 RedlineBench | Crosby Intelligence

1. Why we built this

Contract redlining is judgment-dense and strategically complex. It is closer to poker than to math. There are early game moves, countermoves, tradeoffs, and end games, all undertaken with incomplete information. Party leverage, business priorities, counterparty tolerance, and the value of each move are often uncertain. There is rarely a single right move.

That creates two challenges for benchmark design. First, the benchmark has to reflect the complexity of the workflow itself. A useful redline is not just a legally correct clause edit. It is a move in a negotiation, shaped by deal context, party posture, timing, and the need to preserve momentum toward execution. Second, the judgment behind a strong redline is often only implicit in the attorney’s work product. A golden response may show what an attorney changed, but not why the issue mattered or how the attorney weighed it against the rest of the negotiation. A useful benchmark therefore has to preserve attorney output in its native form while also converting the most important redlining judgments into structured evaluation criteria.

The Crosby-micro1 RedlineBench is designed around those challenges. It uses multi-turn SaaS MSA negotiations, document-native redlines, attorney-authored golden responses, and rubrics tied to the decisions attorneys considered most important at each stage. By collecting data in the real workflow of contract negotiation, the benchmark evaluates models as negotiation participants, not merely drafting assistants.

2. Summary of Findings

GPT-5.5 has the highest overall turn-weighted rubric score, but the spread across models is narrow, suggesting that the benchmark remains challenging across the frontier model set.

01 Issue prioritization

Issue prioritization is a shared weakness. Models struggle to identify the issues attorneys collectively treat as most important, especially when initiating redlines on a clean template.

02 Over-acceptance

Models exhibit a systematic over-acceptance bias when forced to accept or reject counterparty redlines. This pattern suggests that models lack a genuine understanding of the commercial stakes behind redlined terms and instead default to agreement regardless of substance.

03 Surgicalness

Claude Fable 5 leads on surgicalness. Among the models, Fable 5 comes closest to attorney drafting behavior, with the lowest reliance on block edits and the shortest average edit length.

04 The gap

Current models remain meaningfully short of attorney-grade redlining. The gap is not limited to legal correctness. Models remain weaker on strategic issue selection, vendor-side commercial judgment, drafting precision, and adaptive position management across turns.
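The surgicalness finding above rests on two measurable properties of a redline: how often a model replaces a whole block of text instead of editing within it, and how long its edits are on average. A minimal sketch of how such measures could be computed follows. The `Edit` record, the word-based length measure, and the `block_threshold` cutoff are all assumptions made for illustration; the benchmark's own definitions are not stated here.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    """One tracked change in a redline (hypothetical structure)."""
    deleted: str   # text struck from the prior draft
    inserted: str  # text added in its place


def edit_length(edit: Edit) -> int:
    """Length of an edit in words, counting deleted and inserted text."""
    return len(edit.deleted.split()) + len(edit.inserted.split())


def surgicalness(edits: list[Edit], block_threshold: int = 40) -> dict[str, float]:
    """Return block-edit share and average edit length for one redline.

    block_threshold is an assumed cutoff, not a value from the benchmark.
    """
    if not edits:
        return {"block_edit_share": 0.0, "avg_edit_length": 0.0}
    lengths = [edit_length(e) for e in edits]
    blocks = sum(1 for n in lengths if n >= block_threshold)
    return {
        "block_edit_share": blocks / len(edits),
        "avg_edit_length": sum(lengths) / len(edits),
    }


# Two narrow edits: a changed notice period and an added carve-out.
example = [
    Edit(deleted="thirty (30)", inserted="sixty (60)"),
    Edit(deleted="", inserted="except in cases of gross negligence"),
]
print(surgicalness(example))
```

Under these assumptions, a lower block-edit share and a shorter average edit length would both point toward more surgical drafting, which is the direction the finding describes for attorney work product.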

3. Designing RedlineBench

3.1 Simulating Multi-turn SaaS MSA Negotiations

The benchmark is structured as a multi-turn simulation rather than a set of isolated redlining tasks. Each negotiation proceeds through four alternating attorney turns, requiring each side to respond to the evolving contract, prior counterparty redlines, and its own legal and commercial objectives. This design allows attorneys to develop granular rubrics that capture how positions shift, tradeoffs are managed, and deal strategy develops over the course of a negotiation, rather than offering only a static view of redlining judgment.

The SaaS MSA scenarios operationalize this design through three simulated technology transactions involving AgentCo, a Series A HR technology company offering an AI-powered product called TalentFlow, and larger enterprise counterparties. The scenarios share a common commercial foundation but vary the initiating document, negotiating posture, and deal stakes to test how redlining decisions change under different transaction conditions.

01 Scenario 1

Scenario 1 begins with LargeCo sending its SaaS MSA template to AgentCo. AgentCo reviews the customer-side template and initiates the first round of redlines.

02 Scenario 2

Scenario 2 reverses the paper. AgentCo sends its own SaaS MSA template, and LargeCo initiates the first round of redlines.

03 Scenario 3

Scenario 3 increases both the complexity and the commercial pressure of the transaction. The deal is scaled from a pilot to a production deployment approximately ten times larger, and AgentCo is instructed that the contract is a must-win opportunity. Instead of receiving a clean SaaS MSA, AgentCo receives a services agreement from GiantCo and must adapt it to fit the SaaS transaction while avoiding excessive redlining that could jeopardize the deal.

Across these scenarios and turns, the attorney-authored redlines and corresponding rubrics create the basis for evaluating model outputs against the legal, commercial, and strategic judgments reflected in the simulated negotiations.
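The scenario setup described above can be sketched as a small data structure: each scenario names whose paper starts the negotiation and who makes the first round of redlines, and the four turns then alternate between the two sides. The `Scenario` class, its field names, and `turn_order` are hypothetical names chosen for illustration. The first mover in Scenario 3 is an assumption inferred from AgentCo receiving GiantCo's agreement.

```python
from dataclasses import dataclass

TURNS_PER_NEGOTIATION = 4  # "four alternating attorney turns" per the text


@dataclass(frozen=True)
class Scenario:
    name: str
    paper_owner: str  # party whose document starts the negotiation
    first_mover: str  # party that makes the first round of redlines
    responder: str    # party that answers the first round


def turn_order(scenario: Scenario) -> list[str]:
    """Alternate between the two parties, starting with the first mover."""
    sides = (scenario.first_mover, scenario.responder)
    return [sides[i % 2] for i in range(TURNS_PER_NEGOTIATION)]


SCENARIOS = [
    Scenario("Scenario 1", paper_owner="LargeCo",
             first_mover="AgentCo", responder="LargeCo"),
    Scenario("Scenario 2", paper_owner="AgentCo",
             first_mover="LargeCo", responder="AgentCo"),
    # Assumption: AgentCo moves first, since it receives GiantCo's paper.
    Scenario("Scenario 3", paper_owner="GiantCo",
             first_mover="AgentCo", responder="GiantCo"),
]

for scenario in SCENARIOS:
    print(scenario.name, turn_order(scenario))
```

Keeping the paper owner separate from the first mover makes the contrast between Scenarios 1 and 2 explicit: the party redlining first is the one that did not supply the template.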

3.2 Evaluation Dimensions

The five evaluation dimensions provide the organizing framework for scoring model-generated redline outputs using...
