AI’s favorite coding language is also the most expensive
Michael Carroll
I ran the same agentic coding task 19 times across three languages. Ruby won on efficiency & speed. Python flunked big time.
May 19, 2026
Trying my own rage bait
About 3 months ago I posted a study on LinkedIn that suggested Ruby is a strong language choice for agentic coding, which got more than a few accusations of being “rage bait”.
In my last post, I doubled down on the claim that AI agentic coding performs better (faster, with fewer tokens) in Ruby than in TypeScript or Python, the two other languages Coolhand Labs maintains packages for.
Recently, I found an opportunity to test and quantify that observation. I needed to add the same relatively straightforward feature across all three client packages. It seemed like a great opportunity to give my own rage bait a try.
For each package I used Claude Code + Sonnet 4.6 to:
Plan a solution to the GitHub issue
Approve the plan (no revisions)
Have a separate agent perform code review (Opus 4.7 for this step)
Create a PR
Keep fixing test and lint failures until the PR CI was green
Here are the results¹:
Note: The slowest Ruby run is hidden by the middle TS run. This is how Claude (using JS) chose to render it ;)
Maybe not just rage bait after all? As you can see:
Ruby is generally faster & more token efficient
TypeScript is a close second, and its results may be a victim of one really bad run
Python totally flunked the test… until I made some changes
But what surprised me is just how much slower Python is. I ended up running a lot more Python sessions to understand why. The answer says something interesting about Python, your AI, and the state of agentic coding in general.
Finally, Ruby is fast at something!
My first run of the same feature across all three repos had an asymmetry I didn’t notice at first: my Python repo was already set up with a custom instruction to always run lint and tests before finishing, while the Ruby and TypeScript repos weren’t:
Python swallows tokens whole.
This instruction is useful, so I added it to all three repos and ran again. With a level playing field, the ordering flipped:
Telling Ruby and TypeScript to run the linting & test commands actually reduced their tool calls.
One thing I noticed in those Python runs was that they generated far more linting and testing tool calls than the Ruby and TypeScript runs. I had Python running flake8, black, and pytest as three separate verification steps. Reading around pointed me to ruff as a faster, consolidated alternative. That should close the gap, right?
Still looking pretty ruff for Python…
Ruff did what it was supposed to: lint/test rounds per session dropped from 3–5 down to 2. But Python’s median token cost barely moved, and actually crept slightly higher.
What was actually going on?
LLMs are just as bad at setting up Python as everybody else
There’s a famous xkcd comic about Python environments:
Your LLM knows this pain well.
A typical Python session looked like this:
Claude: make verify (hit a pytest collection error)
Claude: python (failed)
Claude: python3.12 (failed)
Claude: /usr/local/bin/python3.12 (failed)
Claude: .venv/bin/python
… and on and on, with lots of cat pyproject.toml, cat .venv/pyvenv.cfg, and ls ~/.pyenv/versions/ thrown in.
Honestly, it would be easy to make fun of Claude for flailing around like this, but untangling the complexities of a Python environment is something a typical human engineer does too. Experienced Python devs, especially experienced Python devs, have all burned endless time in Python configuration hell.
What’s really interesting is that with LLMs we can now quantify, in token counts, how much Python configuration hell costs. Python sessions before I standardized the toolchain averaged about 25 bash invocations per session; Ruby averaged 17, TypeScript 15. Most of the delta wasn’t extra pytest runs or ruff checks. It was the interpreter resolution loop, repeated uv sync calls re-resolving the environment, and explicit env-debug commands that showed up in nearly every session.
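To make that kind of breakdown concrete, here is a minimal sketch of how one might bucket a session's bash commands into verification runs versus environment wrangling. The prefix lists, the function names, and the sample session are all assumptions modeled on the examples in this post; this is not the analysis behind the numbers above.

```python
from collections import Counter

# Assumed prefixes for commands that actually verify the code.
VERIFY_PREFIXES = ("make verify", "pytest", "ruff")

# Assumed prefixes for commands that only locate or inspect the Python
# environment (interpreter probing, config dumps, re-syncing).
ENV_PREFIXES = (
    "python",
    "/usr/local/bin/python",
    ".venv/bin/python",
    "cat pyproject.toml",
    "cat .venv/pyvenv.cfg",
    "ls ~/.pyenv/versions",
    "uv sync",
)


def classify(command: str) -> str:
    """Bucket one bash command by crude prefix matching."""
    cmd = command.strip()
    if cmd.startswith(VERIFY_PREFIXES):
        return "verify"
    if cmd.startswith(ENV_PREFIXES):
        return "env"
    return "other"


def summarize(commands) -> Counter:
    """Count how many commands in a session fall into each bucket."""
    return Counter(classify(c) for c in commands)


# Hypothetical session, shaped like the trace described in the post.
SAMPLE_SESSION = [
    "make verify",
    "python",
    "python3.12",
    "/usr/local/bin/python3.12",
    ".venv/bin/python",
    "cat pyproject.toml",
    "ls ~/.pyenv/versions/",
    "uv sync",
    "make verify",
    "git status",
]

counts = summarize(SAMPLE_SESSION)
for bucket in ("verify", "env", "other"):
    print(f"{bucket}: {counts[bucket]}")
```

For this made-up session the tally is 2 verification runs against 7 environment commands. Prefix matching is deliberately crude (for example, `python -m pytest` would land in the environment bucket), so a real analysis would need a more careful classifier.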
Bashing it to pieces: Bash invocations per session.
Standardizing your environment yields big results
After four Python ruff sessions with a median of 14 minutes 18 seconds and 4.9M tokens, I shipped a standardization PR: a canonical make verify target that runs ruff and pytest in one call, a pinned .python-version file (should have had that in the first place), and a committed uv.lock. Then I ran again.
The first post-standardization session came in at 9 minutes 31 seconds and 3.78M tokens. That’s a 33% drop in active time and 23% fewer tokens versus the prior median!
Python session token cost across all configurations
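For anyone who wants to check the percentages, the arithmetic can be sketched as follows. The helper names are my own; the inputs are the figures quoted above (a prior median of 14 minutes 18 seconds and 4.9M tokens, versus 9 minutes 31 seconds and 3.78M tokens).

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds duration to total seconds."""
    return minutes * 60 + seconds


def percent_drop(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# Prior median vs. first post-standardization session, as quoted in the post.
time_drop = percent_drop(to_seconds(14, 18), to_seconds(9, 31))
token_drop = percent_drop(4.9e6, 3.78e6)

print(f"active time: {time_drop:.1f}% lower")
print(f"tokens: {token_drop:.1f}% lower")
```

That works out to 858 seconds versus 571 seconds, and the two results round to the 33% and 23% quoted above. Keep in mind this compares a median of four sessions against a single run.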
But the session showed the AI was still flailing. Claude used make verify three times, hit a pytest...