Faster inference won't save you

ramstar30002 pts1 comments

Faster inference won't save you - Graphcoder

We went into Graphcoder assuming agent latency was mainly an inference problem.

That lasted until we watched real sessions run. The obvious stalls were not the model thinking. They were the gaps around each turn. A tool would finish, then its result had to get back across the user's connection before the loop could decide what to do next. On a good connection that was annoying. On hotel Wi-Fi it was the product.

OpenAI's WebSocket mode for the Responses API was the first hint that this mattered. Same inference, but OpenAI reported 40%+ lower end-to-end latency on long workflows. We treated that as the starting point.

WebSocket mode makes the transport faster. It does not remove the round trip around each tool call, and those round trips add up:

Path<br>Round trip per turn<br>20-turn task

Cloud to cloud, same region<br>~2ms<br>~40ms

Laptop on good fiber<br>~80ms<br>~1.6s

Hotel or airplane Wi-Fi<br>up to 800ms<br>up to 16s

For us, that made the next step hard to ignore: if transport alone could make workflows feel that different, the loop itself shouldn't be trapped behind the laptop's request-response path.

The log is the state

Moving the loop off the laptop buys back latency, but it also breaks the thing local agents quietly rely on: process memory.

A local agent can be simple. The model responds, the agent parses, a tool runs, the result is appended, and the model gets called again. The whole loop lives in one place, so state can hide in ordinary objects, files, buffers, and whatever happens to still be reachable on the next line of code.

Once the loop spans machines, local memory stops being state. Messages, tool calls, file edits, approvals, failures, retries, and partial progress all have to survive reconnects and restarts. Process memory becomes a cache.

Graphcoder keeps durable history first and derives state from it. Formally, that history is an append-only log:

L=⟨e1,e2,…,en⟩L = \langle e_1, e_2, \ldots, e_n \rangleL=⟨e1​,e2​,…,en​⟩

An event carries what happened and what it depends on:

e=(id,type,payload,deps)e = (\text{id}, \text{type}, \text{payload}, \text{deps})e=(id,type,payload,deps)

A tool result is not state. It is one record in the history. So is a user message, a file edit, or a worker finishing. State is what you get after replaying the prefix you have seen:

St=P(L[1..t])S_t = P(L[1..t])St​=P(L[1..t])

where PPP is the projection for the view you care about.

The same history can produce different views. The UI reads it one way. The filesystem reads it another way. The agent reads it into the context it needs for the next model call. One source of truth, several projections.

One log, several projections. UI state, filesystem state, and agent state are different reads of the same history.

Sharding the log

The first version of this idea is one log. That is the design you want if it holds: one append path, one replay order, one place to debug.

It does not hold for long. Once many agents are running, unrelated events start queueing behind the same sequence number. The log becomes a bottleneck.

So Graphcoder shards the log by owner:

partition(e)→k\text{partition}(e) \rightarrow kpartition(e)→k

Each shard is still append-only:

Lk=⟨ek,1,ek,2,…⟩L_k = \langle e_{k,1}, e_{k,2}, \ldots \rangleLk​=⟨ek,1​,ek,2​,…⟩

That fixes writes. It also means reads can no longer say "replay the log." There is no single log anymore.

Instead, a projection starts from the event it cares about and follows dependencies backward:

C(e)=closure({e},deps)C(e) = \text{closure}(\{e\}, \text{deps})C(e)=closure({e},deps)

A read is valid when that slice is closed:

∀e∈C, deps(e)⊂C\forall e \in C,\ \text{deps}(e) \subset C∀e∈C, deps(e)⊂C

So the projection gets the history it needs, not the history that merely happened nearby. That is the reason sharding works for an event log rather than turning every read into a distributed replay.

Highlighted events are the dependency closure for one root event. Gray events are real history, but this read does not need them.

Running ahead of authority

Once the log is authoritative, the easy mistake is making the UI wait for it.

Without speculation, every user action takes the slow path: client to authority, authority back to client, then paint. That is correct, but it spends a round trip before showing the user something the client already knows.

If the client has seen prefix LcL_cLc​ and submits intent iii, it can render:

S′=P(Lc+speculative(i))S' = P(L_c + \text{speculative}(i))S′=P(Lc​+speculative(i))

Graphcoder treats local intent as a temporary tail on confirmed history. The UI stays explicit about the difference: confirmed prefix first, pending intent at the end.

The client projects past the authority cursor. The speculative tail eventually hardens, rebases, or rolls back.

The log is still the authority. When the authority catches up, the tail hardens into history, rebases over a different fact, or rolls back.

That...

history state deps text authority graphcoder

Related Articles