Agent performance was never the hard part

Jack Holt

As a Principal Engineer at Faculty, I spent a year building an agent harness for large enterprises in regulated domains. Our goal was production-level performance. Details are anonymised.

Agent performance was never the hard part. The hard parts are persisting and linking agents’ outputs, decomposing the agents along organisational lines, making them resilient to the real world, and pushing performance through the human-accuracy ceiling.

Our setup

We used a lightweight, domain-agnostic data model. Every agent emits steps: a record of actions it actually took (e.g., “read the document”, “emailed the broker”). A run is a single invocation: an ordered set of steps whose success or failure is asserted by the agent (“email server was down, run failed”), not inferred by the harness. A task represents a business outcome, such as “insurance quote complete”, and owns many runs. The run history is effectively a long-horizon resume for the task: an insurance quote that takes three weeks of broker correspondence is one task and N runs, each picking the workflow up with new information rather than resuming a paused process.

Figure: Lightweight, domain-agnostic data model.
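The task/run/step model can be sketched in a few dataclasses. This is a minimal illustration, not the harness's actual schema: the class names, fields, and the `history` helper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Step:
    """One action the agent actually took, e.g. "read the document"."""
    action: str


@dataclass
class Run:
    """A single invocation: ordered steps plus the agent's own verdict."""
    steps: list = field(default_factory=list)
    succeeded: Optional[bool] = None  # None until the agent asserts an outcome
    reason: str = ""

    def record(self, action: str) -> None:
        self.steps.append(Step(action))

    def finish(self, succeeded: bool, reason: str = "") -> None:
        # The agent asserts the outcome; the harness only stores it.
        self.succeeded = succeeded
        self.reason = reason


@dataclass
class Task:
    """A business outcome, e.g. "insurance quote complete"; owns many runs."""
    outcome: str
    runs: list = field(default_factory=list)

    def new_run(self) -> Run:
        run = Run()
        self.runs.append(run)
        return run

    def history(self) -> list:
        # Flattened step history that a later run can read to pick the work up.
        return [step.action for run in self.runs for step in run.steps]


task = Task("insurance quote complete")
first = task.new_run()
first.record("read the document")
first.record("emailed the broker")
first.finish(True)

second = task.new_run()  # a new run, not a resumed process
second.finish(False, "email server was down, run failed")
```

Note that nothing in these types mentions insurance: the domain lives entirely in the strings the agent writes.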

The harness holds no domain knowledge: “get a quote by doing X” never enters it. We built customisation around three consumers: a builder with the domain knowledge who writes the agent, a dashboard user who turns its outputs into business value, and a manager who owns the production bar. The harness connects them using the task/run/step model. Customisation is convention-based discovery, not configuration: an opinionated library, like Next.js for enterprise agents.

We prioritised first-class support for eval-driven development. The production bar in a regulated domain is near-human accuracy with audit-grade reliability, so roughly 75% of the engineering went into quantifying agent performance rather than building agents. When running evaluations is low-friction, implementing agent improvements is intuitive and easy.

Observation 1: Agent design is org design

A valuable business decision is rarely one agent’s output. “Offer this insurance quote” requires many complex inputs: N teams each own a sub-determination, and the decision exists only once their results compose. Each team has the domain knowledge to build its own agent; none of them should need to understand the others’ agents to ship. Friction maps onto team boundaries. This is Conway’s Law in action: the system was going to mirror the organisation whether we designed for it or not. The choice was whether that mirroring was explicit and safe, or implicit and a source of coupling.

We made it explicit by modelling the dependency structure as what it already is: a DAG. Each agent is a node, each cross-team dependency is an edge. The organisation’s decision graph and the agent graph are the same graph.
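As a rough illustration of treating the agent graph as a DAG, the sketch below uses Python's standard `graphlib` module. The agent names and the `DEPENDS_ON` mapping are hypothetical; the real harness discovers its edges from trigger subscriptions rather than from a hard-coded dictionary.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical agents: each key depends on the agents in its value set.
DEPENDS_ON = {
    "risk_assessment": set(),
    "pricing": {"risk_assessment"},
    "compliance_check": {"risk_assessment"},
    "offer_quote": {"pricing", "compliance_check"},
}


def execution_order(depends_on):
    """Return agents so that every upstream agent precedes its downstream ones."""
    return list(TopologicalSorter(depends_on).static_order())


def is_acyclic(depends_on):
    """True if the dependency graph is a valid DAG (no team depends on itself)."""
    try:
        TopologicalSorter(depends_on).prepare()
    except CycleError:
        return False
    return True


order = execution_order(DEPENDS_ON)
```

A cycle check like `is_acyclic` is one place where the explicit graph pays off: a circular dependency between teams becomes a validation error instead of a production deadlock.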

On a successful run the harness derives a trigger event from the output and records it. A downstream agent subscribes to the triggers it depends on; when one fires, it hydrates its state from the upstream task in the shared model and proceeds. No message bus, no new orchestration tier: the edge is a row. To keep the harness’s infrastructure footprint light, trigger state lives as one collection in an existing database: a lookup table, task X → trigger → task Y.
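To make "the edge is a row" concrete, here is a minimal sketch using an in-memory SQLite database as a stand-in for the existing database. The table layout, column names, function names, and the downstream task id scheme are all assumptions for illustration.

```python
import sqlite3


def open_store():
    """Create the two hypothetical tables: subscriptions and recorded edges."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE subscriptions (trigger_name TEXT, agent TEXT)")
    db.execute(
        "CREATE TABLE triggers "
        "(upstream_task TEXT, trigger_name TEXT, downstream_task TEXT)"
    )
    return db


def subscribe(db, agent, trigger_name):
    db.execute("INSERT INTO subscriptions VALUES (?, ?)", (trigger_name, agent))


def on_run_finished(db, upstream_task, trigger_name, succeeded):
    """Record one edge per subscriber, but only when the run succeeded."""
    if not succeeded:
        return []  # no trigger, therefore no edge and no downstream run
    agents = [
        row[0]
        for row in db.execute(
            "SELECT agent FROM subscriptions WHERE trigger_name = ? ORDER BY agent",
            (trigger_name,),
        )
    ]
    started = []
    for agent in agents:
        downstream_task = f"{agent}:{upstream_task}"  # assumed id scheme
        db.execute(
            "INSERT INTO triggers VALUES (?, ?, ?)",
            (upstream_task, trigger_name, downstream_task),
        )
        started.append(downstream_task)
    return started


db = open_store()
subscribe(db, "pricing", "risk_assessed")
subscribe(db, "compliance", "risk_assessed")
started = on_run_finished(db, "risk-17", "risk_assessed", succeeded=True)
```

In a real deployment each downstream task would then hydrate its state from the upstream task before running; that step is omitted here.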

Two properties fall out of this structure rather than being built on top of it.

The first is provenance. Because every edge is a recorded task → trigger → task row, the causal chain behind any output is queryable. In a regulated domain that is not a convenience: “why did the agent do this” has to be answerable, and the trigger table makes the answer a queryable artefact instead of something reconstructed from logs after the fact.
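A provenance query over such a table amounts to walking edges backwards from the task in question. The sketch below assumes the rows are available as `(upstream_task, trigger, downstream_task)` tuples; the sample rows and the function name are hypothetical.

```python
# Hypothetical rows of the trigger table: (upstream_task, trigger, downstream_task).
TRIGGER_ROWS = [
    ("risk-17", "risk_assessed", "pricing-17"),
    ("risk-17", "risk_assessed", "compliance-17"),
    ("pricing-17", "price_set", "quote-17"),
]


def provenance(rows, task):
    """Collect every recorded edge that leads, directly or transitively, to `task`."""
    parents = {}
    for upstream, trigger, downstream in rows:
        parents.setdefault(downstream, []).append((upstream, trigger))

    chain, frontier, seen = [], [task], {task}
    while frontier:
        current = frontier.pop()
        for upstream, trigger in parents.get(current, []):
            chain.append((upstream, trigger, current))
            if upstream not in seen:  # visit each upstream task once
                seen.add(upstream)
                frontier.append(upstream)
    return chain


chain = provenance(TRIGGER_ROWS, "quote-17")
```

For this sample data the chain reads from the output back to its root: `quote-17` ran because `pricing-17` fired `price_set`, which in turn ran because `risk-17` fired `risk_assessed`. The unrelated `compliance-17` edge is not included.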

The second is failure isolation. A trigger is emitted only on successful completion, so a failed upstream run simply produces no edge. Downstream agents don’t receive a corrupt input to defend against; they receive nothing, and don’t run. Cascading failure is prevented not by error handling but by the absence of the trigger: upstream success is a structural precondition of the edge, not a check performed afterwards. We can still query the negative space to answer questions like “why didn’t the agent run”.
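One way to picture both halves of this, the missing edge and the "negative space" query, is the sketch below. The `FinishedRun` type, the subscription mapping, and both function names are assumptions for illustration, not the harness's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinishedRun:
    task: str
    trigger: str      # the trigger this run would emit on success
    succeeded: bool   # asserted by the agent, not inferred by the harness
    reason: str = ""


def emitted_edges(runs, subscriptions):
    """Edges exist only for successful runs; failures leave no row at all."""
    return [
        (run.task, run.trigger, agent)
        for run in runs
        if run.succeeded
        for agent, wanted in subscriptions.items()
        if wanted == run.trigger
    ]


def why_not_run(agent, runs, subscriptions):
    """Negative space: failed upstream runs that withheld the agent's trigger."""
    wanted = subscriptions[agent]
    return [
        (run.task, run.reason)
        for run in runs
        if run.trigger == wanted and not run.succeeded
    ]


runs = [
    FinishedRun("risk-17", "risk_assessed", True),
    FinishedRun("risk-18", "risk_assessed", False, "email server was down"),
]
subscriptions = {"pricing": "risk_assessed"}  # agent -> trigger it depends on

edges = emitted_edges(runs, subscriptions)
blocked = why_not_run("pricing", runs, subscriptions)
```

The downstream agent contains no error handling for the failed upstream run: it is never invoked for `risk-18`, yet the reason remains queryable from the run record.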

A team could build, evaluate, and ship its agent against a single contract: emit a trigger on success, subscribe to the triggers you depend on. There is no synchronous coordination with any other team. The coordination problem was moved into a primitive small enough that nobody had to negotiate it.

Our leverage came from providing a harness that let builders stop fighting their organisational shape.

Figure: Organisation dependencies modelled as a DAG: agents are nodes, triggers are edges.

Observation 2: The world is unreliable, so the agent must be retryable

The internal APIs these agents had to call were built to serve humans through dashboards, not machines through automation. They were...
