I built a real-time trading platform with an AI. Then the clock started lying. - Farid Saïd
In my previous article, I described giving an AI read-only SSH access to my production infrastructure: auditing, documenting, monitoring. The logical next step was to let it not just read, but build.
So I did. With Claude as a co-pilot, I built a real application: a price-matching platform for OTC interest-rate swaps. Real-time, WebSocket, SSO authentication, an engine that automatically pairs the complementary interests of several counterparties. FastAPI, asyncio, SQLite. Shipped to production with a pilot broker and around fifteen bank counterparties behind him.
And it worked. For a month.
Then the broker messaged me: “The session cut off while there was still time on the clock.” And also: “A trader's notionals are flickering, appearing and disappearing.” And: “The timing is weird, the sessions run too long.”
My clock had started lying. Here's how I found out why, and what it taught me about the word “asynchronous.”
The symptom: sessions that don't end on time
The heart of the platform is a fixed-duration matching session. You configure it for, say, ten minutes: a countdown, then a matching window, then it's over. Everyone sees the same clock.
Except no. In production, a session configured for 600 seconds lasted 1202 (double). Another, 1803 (triple). Going back through about forty sessions, I didn't find a clean factor of two. I found this:
1.0×, 1.2×, 1.41×, 1.5×, 2.0×, 2.5×, 2.67×, 3.0×, 3.5×… up to 5.51×.
A session meant to last ten minutes could run for almost an hour. And nobody understood why, least of all me.
The first wrong turn: “it's the timer”
The obvious reflex: the timer code has a bug. It double-arms, it forgets to stop, something.
I isolated and tested it on its own. Two-second countdown, three-second matching: exact, a single transition, no double-arming. The timer was perfect in isolation.
Worse (or better): in production, the countdown was exact: 30 seconds, dead on. Only the matching phase drifted. What's the difference? During the countdown, almost nothing happens. During matching, it's alive: orders arrive over WebSocket, every client polls the API every three seconds, a price feed pushes data several times a second, and the server broadcasts state to everyone continuously.
The timer wasn't broken. It was starved.
The clue that unlocked everything: the shape of the error
Here's the detail that flipped the diagnosis, and it has been my rule ever since:
The shape of your error distribution tells you what kind of bug you have.
A discrete bug (a timer that resets, or that counts twice) produces discrete errors: clean integer multiples. ×2, ×3, never ×2.67. But I didn't have integer multiples. I had a continuum: 1.2×, 1.41×, 2.67×, 3.5×, 5.51×. A continuous ramp.
And a continuous ramp doesn't look like a logic bug. It looks like contention: the more load, the slower, proportionally. The stretch factor tracked the session's activity. From there, I was no longer hunting a timer bug. I was hunting whatever blocked the loop.
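The "shape" test can be made mechanical. The sketch below is only an illustration, not the analysis I ran at the time: the helper name and the 0.05 tolerance are assumptions, and the list holds just the sample factors quoted above, not all forty sessions.

```python
def integer_multiple_share(factors, tolerance=0.05):
    """Fraction of stretch factors lying within `tolerance` of a whole number."""
    near = [f for f in factors if abs(f - round(f)) <= tolerance]
    return len(near) / len(factors)

# Stretch factors quoted in the article (observed duration / configured duration).
observed = [1.0, 1.2, 1.41, 1.5, 2.0, 2.5, 2.67, 3.0, 3.5, 5.51]

share = integer_multiple_share(observed)
print(f"{share:.0%} of factors are near-integer")
```

A double-arming bug would put nearly every factor on a whole number. Here most of them sit between integers, which is what points toward contention rather than logic.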
The cause: one slow client blocked everyone
asyncio runs on a single thread. The whole server (the timer, the orders, the broadcasts, the WebSocket heartbeats) shares one event loop. That loop is cooperative: until a piece of code yields (with an await that actually cedes control), nothing else runs.
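One way to see a blocked loop is to measure how late a plain sleep wakes up. This is a minimal sketch with made-up names and shortened durations (`lag_probe` and `blocking_handler` are hypothetical, not platform code), showing a synchronous call freezing an unrelated task.

```python
import asyncio
import time

async def lag_probe(interval, samples):
    """Sleep `interval` repeatedly and record how late each wake-up is."""
    lags = []
    for _ in range(samples):
        start = time.monotonic()
        await asyncio.sleep(interval)
        lags.append(time.monotonic() - start - interval)
    return lags

async def blocking_handler():
    """Hypothetical handler that never yields while it works."""
    await asyncio.sleep(0.02)  # let the probe take a few clean samples first
    time.sleep(0.1)            # synchronous call: the whole loop is frozen here

async def demo():
    lags, _ = await asyncio.gather(lag_probe(0.01, 10), blocking_handler())
    return lags

lags = asyncio.run(demo())
print(f"worst wake-up lag: {max(lags) * 1000:.0f} ms")
```

The probe asks for 10 ms and, on one sample, gets it back roughly 100 ms late, although it has nothing to do with the handler. asyncio's debug mode (`asyncio.run(..., debug=True)`) also logs callbacks that hold the loop for longer than `loop.slow_callback_duration`.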
My timer counted await asyncio.sleep(1) calls. In theory, each loop = one second. In practice, sleep(1) only resumes when the loop has time to call it back. If the loop is busy elsewhere, every “second” of the timer lasts one second plus the lag. Count enough late loops, and your ten-minute session runs fifty.
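The accumulation is easy to reproduce in miniature. The sketch below is an illustration under assumptions, not the production timer: ticks are shortened to 10 ms, `hog` is a hypothetical stand-in for load, and `deadline_timer` shows one common alternative (anchoring the end to a monotonic clock) for comparison only.

```python
import asyncio
import time

TICK = 0.01   # stand-in for the article's one-second tick, shortened for the demo
TICKS = 10    # nominal duration = TICKS * TICK

async def hog(stop: asyncio.Event) -> None:
    """Simulates load: holds the event loop for 10 ms at a time without yielding."""
    while not stop.is_set():
        time.sleep(0.01)        # blocking call: nothing else can run meanwhile
        await asyncio.sleep(0)  # brief yield so other tasks get a turn

async def counting_timer() -> float:
    """Counts sleeps; the lag of every tick is added to the total."""
    start = time.monotonic()
    for _ in range(TICKS):
        await asyncio.sleep(TICK)
    return time.monotonic() - start

async def deadline_timer() -> float:
    """Anchors the end to a monotonic deadline; late ticks are not carried over."""
    start = time.monotonic()
    deadline = start + TICKS * TICK
    while (remaining := deadline - time.monotonic()) > 0:
        await asyncio.sleep(min(TICK, remaining))
    return time.monotonic() - start

async def demo() -> tuple[float, float]:
    stop = asyncio.Event()
    load = asyncio.create_task(hog(stop))
    counted, anchored = await asyncio.gather(counting_timer(), deadline_timer())
    stop.set()
    await load
    return counted, anchored

counted, anchored = asyncio.run(demo())
print(f"nominal {TICKS * TICK:.2f}s, counted {counted:.2f}s, anchored {anchored:.2f}s")
```

Under the same load, the counting timer overshoots by the sum of all the late ticks, while the anchored one overshoots by at most the last one. Anchoring does not cure the starvation itself: the loop is still blocked, and everything else on it is still late.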
The main culprit: the broadcast function. It was declared async all right, but it sent to each client sequentially, one after another, with no timeout, and it was awaited on every timer tick and every trade.
It took just one slow or half-dead client (a frozen tab, an expired token, a full TCP buffer on the network side) for the send to that client to block the entire broadcast loop. And as long as that loop blocked, the timer tick waited. And so did every other client's heartbeat.
One broken client, and the whole floor slows down.
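The failure mode suggests its own counter-pattern: send to all clients concurrently and give each send a deadline. The sketch below assumes a hypothetical `FakeClient` in place of a real WebSocket connection and an arbitrary 50 ms budget. It illustrates the technique and is not the platform's actual code.

```python
import asyncio

SEND_TIMEOUT = 0.05  # hypothetical per-client budget, in seconds

class FakeClient:
    """Hypothetical stand-in for a WebSocket connection."""
    def __init__(self, name, delay):
        self.name, self.delay, self.received = name, delay, []

    async def send(self, message):
        await asyncio.sleep(self.delay)  # a stuck client takes long to accept data
        self.received.append(message)

async def send_one(client, message):
    """Send to one client; report failure instead of stalling on it."""
    try:
        await asyncio.wait_for(client.send(message), timeout=SEND_TIMEOUT)
        return True
    except asyncio.TimeoutError:
        return False

async def broadcast(clients, message):
    """Send to all clients concurrently; return those that missed the deadline."""
    results = await asyncio.gather(*(send_one(c, message) for c in clients))
    return [c for c, ok in zip(clients, results) if not ok]

async def demo():
    clients = [FakeClient("fast-1", 0.0), FakeClient("stuck", 10.0), FakeClient("fast-2", 0.0)]
    loop = asyncio.get_running_loop()
    start = loop.time()
    slow = await broadcast(clients, "tick")
    return [c.name for c in slow], loop.time() - start, clients

slow_names, elapsed, clients = asyncio.run(demo())
print(slow_names, f"{elapsed:.2f}s")
```

With sequential sends and no timeout, the same three clients would hold the broadcast for the full ten seconds of the stuck one. Here the whole call is bounded by the per-client budget, and the caller gets back the list of clients to disconnect or retry.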
The moment it all clicked
The beauty of a real root cause is that it doesn't explain one symptom. It explains all of them.
“The session runs too long” → the timer is stretched by starvation.
“Cut off with time on the clock” → the WebSocket heartbeat arrived late; the browser thought the connection was dead and disconnected; after a grace period the client was ejected, the clock frozen on its last tick.
“The notionals flicker” → broadcasts arrived late and out of order.
The storm of “token...