Why did one day of AI cost more than a month of servers?


junueno.dev · 2026-06-29 · llm, claudecode, idempotency, ai

Same old story: I’m running the SaaS our CFO shipped to production in two days. A non-engineer exec builds something fast with Claude Code, and the engineer (me) goes through the back end one piece at a time. Every time I look, something crawls out.

This time it wasn’t “where the secrets live,” and it wasn’t “there isn’t a single test.” This time, money burned.

One day I was staring at the LLM API cost graph, and there was a single day sticking up like Mount Fuji. Every other day hugs the floor; that one day pokes the sky. Roughly half of the whole month’s bill landed on that one day.

I’ll be honest, my stomach dropped when I saw the number. Because that single day of AI usage alone cost more than a full month of servers. Running the entire server fleet for a month is cheaper than letting the AI talk for one day. How is that a thing?

So I go ask the person who built it (the CFO): “What did you do that day?”

The answer:

“Honestly, I don’t remember what I did.”

Come on.

But this isn’t a story about blame (well, half of it isn’t). The deeper I dug, the more I landed on: of course they don’t remember. It wasn’t a human that burned the money. It was the retry machinery.

The hunt. At first I assumed they’d just hammered it all day

My first read was, roughly: “You built a bunch of features that day, tested them in prod over and over, and hit the expensive LLM every time. Death by a thousand cuts.”

And it looked plausible. The commit history for that day was packed from morning to evening, with twenty-plus changes around the AI generation flow. So “slow burn from human repetition” had a face.

Then I actually dug into the app-side logs (task queue, DB, requests), and the picture was completely different. It wasn’t a slow burn. The same heavy batch was being re-run, in full, by a machine, over and over. For a single tenant, a job that normally runs once had run 21 times.

A human doesn’t press the same button 21 times in a day. The thing pressing the button wasn’t human.

The scariest part was “it succeeds, then it falls over”

This is the core of the whole incident, so let me go slow.

The batch called several LLMs in sequence and saved the results to the DB. The flow, roughly:

1. Fire a pile of queries at several LLMs (this is where the money goes)

2. Write the returned results to the DB

The problem was in step 2: the write referenced a column that was supposed to have been added but wasn’t there yet. The DB didn’t have the column, so it threw “column does not exist” and the job returned a 500.

When you hear “it failed,” you naturally picture “the call bombed and wasted a shot.” Nope. Every LLM call succeeded. All 200s. Which means every one of them was billed, properly. You paid, you got the result back, and then it tripped on the very last step — the save.
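To make the billing arithmetic concrete, here is a minimal Python simulation of that shape of failure. Everything in it is hypothetical: `fake_llm`, `save_results`, the batch of five queries, and the loop standing in for the queue. Only the count of 21 runs comes from the incident.

```python
# Simulation only: no real LLM, DB, or queue is involved.
billed_calls = 0  # stands in for the provider's usage meter


def fake_llm(query: str) -> str:
    """Pretend LLM call: it always succeeds, and it is always billed."""
    global billed_calls
    billed_calls += 1
    return f"answer to {query}"


def save_results(results: list[str]) -> None:
    """Pretend DB write that hits the missing column every time."""
    raise RuntimeError("column does not exist")


def run_batch(queries: list[str]) -> None:
    results = [fake_llm(q) for q in queries]  # step 1: paid, succeeds
    save_results(results)                     # step 2: fails, results are lost


queries = [f"q{i}" for i in range(5)]  # hypothetical batch size
failed_runs = 0
for _ in range(21):  # the queue re-running the job
    try:
        run_batch(queries)
    except RuntimeError:
        failed_runs += 1  # the job reports a 500, the queue tries again

print(failed_runs, billed_calls)  # 21 105
```

Not one LLM call failed, and not one result was kept: 21 runs of 5 calls is 105 billed calls for zero saved rows.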

If I put it in restaurant terms: you finish the full course, you pay the check, and right as you go to say “thanks for the meal,” you trip, fall, and lose your memory. You come to, back at your seat, and start eating the same full course again. Twenty-one times. What you ate (= what you were billed for) doesn’t un-happen, but every round starts from zero.

There’s a term, “retry storm.” Usually you picture it as “the call fails, fails again, fails again” — a flurry of misses. But this wasn’t misses. It was a storm of throwing away the hits (the successes) and drawing a fresh hit each time. That’s the counterintuitive part, and the scariest.

How did this happen? There were two culprits

The machine repeated it 21 times because of two pitfalls working together.

Pitfall 1: the deploy order was backwards.

The code shipped to production assuming a new column existed, but the migration that adds that column hadn’t been applied to prod yet. Code first, schema second. In that order, the code reaches for a column that isn’t there and fails deterministically. And “deterministically” is the kicker: it’s the kind of failure that never fixes itself no matter how many times you retry.
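One cheap guard against this ordering problem is to check the schema before spending anything. The sketch below is an assumption-heavy illustration, not what the app actually runs: it uses SQLite from the standard library as a stand-in for the production database, and the `reports` table, the `query` and `summary` columns, and the `SchemaNotReady` exception are all made up for the example.

```python
import sqlite3


class SchemaNotReady(Exception):
    """Raised before any paid work when the expected columns are absent."""


def missing_columns(conn, table, required):
    # PRAGMA table_info is SQLite-specific; PostgreSQL and MySQL expose the
    # same information through information_schema.columns.
    # `table` is a trusted constant here, never user input.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    present = {row[1] for row in rows}  # row[1] is the column name
    return set(required) - present


def run_batch(conn, queries, call_llm):
    # Fail fast, for free, if the migration has not been applied yet.
    missing = missing_columns(conn, "reports", ["query", "summary"])
    if missing:
        raise SchemaNotReady(f"apply the migration first: {sorted(missing)}")
    for q in queries:
        conn.execute(
            "INSERT INTO reports (query, summary) VALUES (?, ?)",
            (q, call_llm(q)),  # the paid call happens only after the check
        )
    conn.commit()
```

The job still fails when the column is missing, but it fails before the first billed call instead of after the last one.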

Pitfall 2: when it fails, the task queue kindly re-runs it.

A managed task queue sees a job die with a 500 and goes “oh, that failed, let me run it again for you,” automatically. For a transient network blip, that’s the correct kindness. But this failure was “the column doesn’t exist.” No amount of re-running grows the column. It kept repeating an unfixable failure, infinitely, out of kindness.
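One way to keep the kindness for network blips and withhold it from missing columns is to classify the failure before answering the queue. This is a rough sketch under two assumptions that may not match every queue: that the queue retries on a non-2xx response and stops on a 2xx, and that the handler can see the attempt number. The exception classes, `MAX_ATTEMPTS`, and the `parked` list are hypothetical names.

```python
class TransientError(Exception):
    """Might succeed on a later attempt (timeout, connection reset)."""


class PermanentError(Exception):
    """Will fail the same way every time (missing column, bad config)."""


MAX_ATTEMPTS = 3  # hypothetical retry budget
parked = []  # stands in for a dead-letter store that a human reviews


def handle_task(job, task_id: str, attempt: int) -> int:
    """Run the job and return the HTTP status to report to the queue."""
    try:
        job()
    except PermanentError as exc:
        parked.append((task_id, str(exc)))
        return 200  # acknowledge so the queue stops; the failure is parked
    except TransientError as exc:
        if attempt >= MAX_ATTEMPTS:
            parked.append((task_id, str(exc)))
            return 200  # out of budget: stop retrying, keep the evidence
        return 503  # ask the queue for another attempt
    return 200
```

Answering 200 for a failed job only makes sense if something alerts on `parked`; otherwise the retry storm just becomes a silent failure. Where the queue supports a maximum attempt count in its own configuration, setting that as well gives a second line of defense.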

And because the batch wasn’t idempotent (it didn’t skip already-processed work), every re-run starts over from the top. So every round carries the full LLM bill.
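A minimal sketch of what “skip already-processed work” could look like: checkpoint each LLM result under a key as soon as it comes back, and let a re-run reuse it. The `checkpoint` dict here stands in for durable storage that does not depend on the table being migrated; that storage, the `(tenant, query)` key, and every function name are assumptions for illustration.

```python
def run_batch(tenant, queries, call_llm, checkpoint, save):
    """Re-runnable batch: a checkpointed (tenant, query) pair is never paid for again."""
    results = {}
    for q in queries:
        key = (tenant, q)
        if key not in checkpoint:
            checkpoint[key] = call_llm(q)  # paid call, recorded immediately
        results[q] = checkpoint[key]
    save(results)  # may still fail; the paid results survive in checkpoint


checkpoint = {}
billed = []  # every paid call is logged here


def call_llm(q):
    billed.append(q)
    return q.upper()


def broken_save(results):
    raise RuntimeError("column does not exist")


for _ in range(21):  # same 21 runs as the incident
    try:
        run_batch("tenant-a", ["q1", "q2"], call_llm, checkpoint, broken_save)
    except RuntimeError:
        pass

print(len(billed))  # 2
```

The save still fails 21 times, but only the first run pays. Once the column exists, the next run writes the checkpointed results without calling the LLM at all.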

Deterministic failure × automatic retry × non-idempotent. When those three mesh, money burns quietly. No wonder the person doesn’t remember — they didn’t do anything. The thing holding down the button was the queue.

When I laid it out, the CFO scrunched...
