AI's favorite coding language is also the most expensive


Michael Carroll


I ran the same agentic coding task 19 times across three languages. Ruby won on efficiency & speed. Python flunked big time.

May 19, 2026

Trying my own rage bait

About 3 months ago I posted a study on LinkedIn that suggested Ruby is a strong language choice for agentic coding, which got more than a few accusations of being “rage bait”.

In my last post, I doubled down on the claim that AI agentic coding performs better (faster, fewer tokens) with Ruby than with TypeScript and Python, the two other languages Coolhand Labs maintains packages for. Recently, I found an opportunity to quantify and test that observation. I needed to add the same relatively straightforward feature across all three client packages. Seemed like a great opportunity to give my own rage bait a try.

For each package I used Claude Code + Sonnet 4.6 to:

Plan a solution to the GitHub issue

Approve the plan (no revisions)

Have a separate agent perform code review (Opus 4.7 for this step)

Create a PR

Keep fixing test and lint failures until the PR CI was green

Here are the results¹:

Note: The slowest Ruby run is hidden by the middle TS run. This is how Claude (using JS) chose to render it ;)

Maybe not just rage bait after all?

As you can see:

Ruby is generally faster & more token efficient

TypeScript is a close second, and its results may be a victim of one really bad run

Python totally flunked the test… until I made some changes

But what surprised me is just how much slower Python is. I ended up running a lot more Python sessions to understand why. The answer says something interesting about Python, your AI, and the state of agentic coding in general.

Finally, Ruby is fast at something!

My first run of the same feature across all three repos had an asymmetry I didn’t notice at first: my Python repo was already set up with a custom instruction to always run lint and tests before finishing — Ruby and TypeScript weren’t:

Python swallows tokens whole.

This instruction is useful, so I added it to all three repos and ran again. With a level playing field, the ordering flipped:

Telling Ruby and TypeScript to run the linting & test commands actually reduced their tool calls.

One thing I noticed in those Python runs was that it was generating far more linting and testing tool calls than the other two. I had Python running flake8, black, and pytest as three separate verification steps. Reading around pointed me to ruff as a faster, consolidated alternative. That should close the gap, right?

Still looking pretty ruff for Python…

Ruff did what it was supposed to: lint/test rounds per session dropped from 3–5 down to 2. But Python’s median token cost barely moved, and actually crept slightly higher. What was actually going on?

LLMs are just as bad at setting up Python as everybody else

There’s a famous xkcd comic about Python environments:

Your LLM knows this pain well. A typical Python session looked like this:

Claude: make verify
    hit a pytest collection error
Claude: python
    failed
Claude: python3.12
    failed
Claude: /usr/local/bin/python3.12
    failed
Claude: .venv/bin/python

… and on and on, with lots of cat pyproject.toml, cat .venv/pyvenv.cfg, ls ~/.pyenv/versions/ thrown in.

Honestly, it would be easy to make fun of Claude for flailing around like this, but untangling the complexities of the Python env is something a typical human engineer does too. Even experienced Python devs (especially experienced Python devs) have all burned endless time in Python configuration hell.

What’s really interesting is that with LLMs we can now quantify, in token counts, how much Python configuration hell costs. Before I standardized the toolchain, Python sessions averaged about 25 bash invocations each; Ruby averaged 17 and TypeScript 15. Most of the delta wasn’t extra pytest runs or ruff checks. It was the interpreter resolution loop, repeated uv sync calls re-resolving the environment, and explicit env-debug commands that showed up in nearly every session.
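To make the bash-invocation breakdown concrete, here is a minimal sketch of how a session's commands could be bucketed into verification, env-debug, and everything else. This is not the analysis actually used for this post: the bucket names, the prefix lists, and the sample session are all assumptions for illustration, loosely based on the commands quoted above.

```python
from collections import Counter

# Hypothetical prefix lists, guessed from the commands quoted in the post.
# A real analysis would need rules tuned to the actual session logs.
VERIFY_PREFIXES = ("make verify", "pytest", "ruff")
ENV_DEBUG_PREFIXES = (
    "python",                  # bare interpreter probes: python, python3.12, ...
    "/usr/local/bin/python",
    ".venv/bin/python",
    "cat pyproject.toml",
    "cat .venv/pyvenv.cfg",
    "ls ~/.pyenv",
    "uv sync",
)


def classify(command: str) -> str:
    """Put one bash command into a coarse bucket by its prefix."""
    cmd = command.strip()
    if cmd.startswith(VERIFY_PREFIXES):
        return "verify"
    if cmd.startswith(ENV_DEBUG_PREFIXES):
        return "env-debug"
    return "other"


def tally(commands: list[str]) -> Counter:
    """Count how many commands in a session fall into each bucket."""
    return Counter(classify(c) for c in commands)


# Made-up session for illustration only; not a real log from these runs.
session = [
    "make verify",
    "python",
    "python3.12",
    "/usr/local/bin/python3.12",
    ".venv/bin/python",
    "cat pyproject.toml",
    "ls ~/.pyenv/versions/",
    "uv sync",
    "git status",
]
print(tally(session))
```

Prefix matching is crude (for example, python -m pytest would land in env-debug here), but it is enough to see where a session's bash calls are going.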

Bashing it to pieces: Bash invocations per session.

Standardizing your environment yields big results

After four Python ruff sessions running a median of 14 minutes 18 seconds and 4.9M tokens, I shipped a standardization PR: a canonical make verify target that runs ruff and pytest in one call, a pinned .python-version file (should have had that in the first place), and a committed uv.lock. Then I ran again. The first post-standardization session came in at 9 minutes 31 seconds and 3.78M tokens. That's a 33% drop in active time and 23% fewer tokens versus the prior median!

Python session token cost across all configurations
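For anyone who wants to check the percentages, the arithmetic is short. The sketch below uses only the figures quoted above; the helper names are mine, for illustration.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes:seconds duration to total seconds."""
    return minutes * 60 + seconds


def percent_drop(before: float, after: float) -> float:
    """Relative reduction from before to after, as a percentage."""
    return (before - after) / before * 100


# Median of the four ruff sessions vs. the first post-standardization session.
time_drop = percent_drop(to_seconds(14, 18), to_seconds(9, 31))
token_drop = percent_drop(4.9e6, 3.78e6)

print(f"active time: {time_drop:.0f}% lower")  # 858 s -> 571 s, prints 33
print(f"tokens: {token_drop:.0f}% lower")      # 4.9M -> 3.78M, prints 23
```

Note that this compares a single session against a median of four, so it is an early signal rather than a settled result.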

But the session showed the AI was still flailing. Claude used make verify three times, hit a pytest...
