No One Can Compare LLMs

2026-06-12

Recently at work I received access to a Claude account (which I stopped using because I found it *meh*). It works well, but since I prefer ChatGPT I asked for a swap. When I did, my manager seemed surprised - in his opinion, Claude is simply better.

That got me thinking about what “better” actually means.

I generally give people the benefit of the doubt, so let’s assume he’s right. Claude is better for him. So why is ChatGPT better for me?

And I figured: comparing LLMs objectively is almost impossible because the interaction is deeply personalized.

When you interact with an LLM, you’re interacting with it much like you would with another human (the magic of the chat interface). And we all communicate differently - we have our habits, shortcuts, assumptions, biases. Two people rarely write the same prompt for the same task.

The growing use of persistent memory makes this even more visible. The same prompt can produce different results depending on what the model has learned about you over months of interaction.

Thus: there is no objectively better model. There is only a model that aligns better with the way you work.

My Prompting Style

My prompting style is sloppy.

For example, while preparing the rik 0.4.0 release:

Check JJ repository from 0.3.0. I would like to prepare for the new release so check what was changed. Propose modifications to README. Modify changelog. Bump version. Etc.

Or when troubleshooting Kakoune:

I don’t see cursor when in insert mode, fix it.

I don’t care about grammar, capitalization, typos. I frequently steer the task while working. I’d rather fix the output afterward than spend time on the perfect prompt.

Surprisingly though, ChatGPT often gets what I mean in a single shot. Claude is like an assistant I’d fire after a week.

But: De gustibus non est disputandum.

Work Style

Another example: when I’m working on rik, I often create a temporary file in the repository, usually something like test.txt, to check how markers behave in a file. (By the way, check out rik if you haven’t - I’m quite proud of how it’s progressing.)

Claude and ChatGPT react to the file in completely different ways.

Claude finds the new file and immediately decides it’s an important part of the project. It wants to lint it, add tests for it, make sure it’s tracked if it isn’t already, format it, in short - make it consistent with the rest of the directory contents. Nothing I wanted or asked for: frustrating.

GPT, on the other hand, reacts in a way that I find quite amusing. It notices the file name, test.txt, and the recent timestamp, and infers that it’s probably something I created ad hoc and that the file shouldn’t be touched at all.
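To make that inference concrete, here is a toy version of it. This is emphatically not how GPT or any other model decides anything internally; it is only a sketch of the two signals described above (a throwaway-looking name and a recent modification time). The function name `looks_like_scratch`, the list of name stems, and the 24-hour threshold are all made up for illustration.

```python
import time
from pathlib import Path

# Hypothetical file-name stems that suggest a throwaway file.
SCRATCH_STEMS = {"test", "tmp", "temp", "scratch"}


def looks_like_scratch(path, now=None, max_age_hours=24.0):
    """Guess whether `path` is an ad hoc scratch file.

    Both signals must agree: the name looks disposable AND the file
    was modified recently. The threshold is an arbitrary assumption.
    """
    now = time.time() if now is None else now
    age_hours = (now - Path(path).stat().st_mtime) / 3600
    name_is_scratchy = Path(path).stem.lower() in SCRATCH_STEMS
    return name_is_scratchy and age_hours <= max_age_hours
```

A fresh test.txt passes both checks, while a source file, or a test.txt last modified days ago, does not. Someone who keeps scratch files in a separate directory would need a different rule entirely, which is rather the point.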

Again, this is just my approach. Others might keep temporary files in a separate directory, or solve the problem completely differently.

This issue is a totally personal one, and it only becomes visible because of how I work.

That doesn’t mean Claude is worse. Many say it’s better, and I believe them - with one caveat: it’s not objectively better - it’s better for them.

Code

Working on rik has shown me another interesting pattern: people constantly argue online about token usage.

Some claim Claude burns through context windows incredibly quickly. Others say exactly the opposite and complain about GPT. And everybody is right ;-)

The missing “variable” is the code they’re feeding into the model.

Different repositories have different structures, naming conventions, and architectural styles. That changes how well an agent navigates them.

Take file size.

I’ve seen agent-generated projects containing files with tens of thousands of lines of code.

I’m a software engineer who is allergic to that. Once a file grows beyond roughly a thousand lines, I start looking for ways to split it.

Maybe some developers working in agent-first workflows don’t care whether a file contains one thousand lines or a hundred thousand, as long as it works.

Then consistency.

Humans aren’t perfectly consistent. I can call something r, res, result, or tmp_result - all within the same Rust code.

LLMs are much more consistent than I am - they often even rename variables to keep naming uniform.

Imagine an agent searching for result. In my file, the agent might find three occurrences. In a highly uniform agent-generated repository, it may find 1,500.

If those matches span a 100,000-line file, the model suddenly has to ingest a truckload of context. Multiply that by dozens of files, and please give a warm welcome to Mr. Session Limit.
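The arithmetic behind that scenario can be sketched roughly. This is a back-of-the-envelope model, not a measurement of any real agent: it assumes the agent reads a fixed window of lines around every match (`context_lines`), that overlapping windows are read only once, and that a line costs about 10 tokens. Those numbers, the match positions, and the function names are assumptions made up for illustration.

```python
def lines_read(match_lines, total_lines, context_lines=20):
    """Count distinct lines read if the agent opens a window of
    `context_lines` lines above and below every match.
    Line numbers are 1-indexed; overlapping windows are counted once."""
    covered = 0
    last_end = 0  # highest line number already counted (0 = none yet)
    for line in sorted(match_lines):
        start = max(1, line - context_lines)
        end = min(total_lines, line + context_lines)
        if end <= last_end:
            continue  # window lies entirely inside lines already counted
        start = max(start, last_end + 1)
        covered += end - start + 1
        last_end = end
    return covered


def estimated_tokens(match_lines, total_lines,
                     context_lines=20, tokens_per_line=10):
    """Very rough token cost: lines read times an assumed tokens-per-line."""
    return lines_read(match_lines, total_lines, context_lines) * tokens_per_line


# Hand-written file: three matches for `result` in a 400-line file.
small = estimated_tokens([40, 180, 310], total_lines=400)

# Uniform agent-generated file: 1,500 matches, hypothetically spaced
# 66 lines apart, in a 100,000-line file.
big_matches = [50 + i * 66 for i in range(1500)]
big = estimated_tokens(big_matches, total_lines=100_000)

print(small, big)
```

Under these made-up numbers, `small` comes out at 1,230 tokens and `big` at 615,000: a 500-fold difference from the same search, with the model itself unchanged.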

Me - I have many small files. Is that why Claude seems to consume context quickly in my code? Maybe, maybe not. 418, I don’t know.

Same conclusion.

The problem isn’t at all about whether Claude or ChatGPT is better....
