Twilight of the Gods. Fable and 10 more LLMs on a Code Reorganization Task. Comparison.
Other languages
This article is also available in Russian: Гибель богов.
Materials & raw data
All 11 model proposals, the cross-reviews, the theses runs, and the ranking script are published here: Materials & reproduce this experiment.
This is a detailed write-up of one experiment. I took a god node from a real LangGraph agent and asked 5 American and 6 Chinese models first to propose how to untangle it, then to evaluate each other's proposals. After that, I tried three different ways to figure out which of them to trust on the matter.
Contents
The original problem: what a god node is and why it's dangerous
What the plan node actually did
Lemonade from lemons: why all this, and how the experiment is set up
Stage one. The models generate proposals
The proposals table
A bit more on each proposal
Stage two. The models evaluate the proposals
The reviews table
A bit more on each review
Stage three. The wet swimsuit contest: deciding who's good at what
Approach one. Do the scores agree? Picking the best proposal
Approach two. Comparing reviews by theses: picking the best analyst
Approach three. Center of opinion and medoid: picking the best analyst again
Deus ex machina. Picking the best analyst one more time
Takeaways: which model to use as a generator, which as an evaluator, and where your heart will find peace
The original problem
You know how it goes: you're building a practice AI agent with the fellas on a course by Data Sanity, and amid the colorful whirl of rapidly accreting features you suddenly notice that one of the project's internal agents has a state graph (LangGraph) that looks like this:
flowchart TD
    planner_start([START]) --> plan[plan]
    plan -->|search| search[search]
    plan -->|ask_user| ask_user[ask_user / interrupt]
    plan -->|reflect| reflect[reflect]
    plan -->|calculate| calculate[calculate]
    plan -->|finish| finish[finish]
    search -->|last_observation| observe[observe]
    search -->|no hits / backend failure| plan
    observe --> plan
    calculate --> plan
    ask_user --> observe_user[observe_user]
    observe_user --> plan
    reflect --> plan
    finish --> planner_end([END])

At first glance this is just a cute little octopus, nothing to worry about. But once you know how much logic this octopus has to hold in its modest eight-legged head, it becomes clear right away that we're looking at an anti-pattern. In this case, let's call it a god node.
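One rough way to see the octopus shape in numbers is to write the diagram's edges down and count how many of them touch each node. The sketch below is plain Python, not LangGraph code: the edge list is transcribed by hand from the flowchart above with the labels dropped, and the helper name `degree_table` is made up for illustration.

```python
from collections import Counter

# Edges transcribed from the flowchart above (edge labels dropped).
EDGES = [
    ("START", "plan"),
    ("plan", "search"),
    ("plan", "ask_user"),
    ("plan", "reflect"),
    ("plan", "calculate"),
    ("plan", "finish"),
    ("search", "observe"),
    ("search", "plan"),
    ("observe", "plan"),
    ("calculate", "plan"),
    ("ask_user", "observe_user"),
    ("observe_user", "plan"),
    ("reflect", "plan"),
    ("finish", "END"),
]


def degree_table(edges):
    """Return {node: (incoming, outgoing)} for every node in the edge list."""
    incoming = Counter(dst for _, dst in edges)
    outgoing = Counter(src for src, _ in edges)
    nodes = set(incoming) | set(outgoing)
    return {node: (incoming[node], outgoing[node]) for node in nodes}


if __name__ == "__main__":
    table = degree_table(EDGES)
    # Busiest nodes first.
    for node, (n_in, n_out) in sorted(table.items(), key=lambda kv: -sum(kv[1])):
        print(f"{node:13s} in={n_in} out={n_out}")
```

Under this transcription, plan has 6 incoming and 5 outgoing edges, so it touches 11 of the 14 edges in the diagram. The count only describes the shape of the graph; it says nothing about how much logic sits inside the node.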
The plan node hides about 350 lines of logic, including iteration checks, bootstrap questions about region and currency, schema preparation, acquisition-task routing, the LLM call, the subsequent correction of the decision, and so on.
The problem isn't just the size of the function. When important orchestration is hidden inside a single node, the graph stops being a representation of the system. It's harder to explain, harder to debug, harder to test, and more dangerous to change. So the obvious task isn't merely to "chop a big function into pieces" but to lift the hidden control logic up to the graph level, so that the resulting architecture becomes clearer and more amenable to further development.
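To make "lifting control logic up to the graph level" a little more concrete, here is a minimal sketch in plain Python. It is not LangGraph API and not the project's actual code: the node names, the state keys, and the `run` helper are all invented for illustration, and only the two bootstrap questions (region and currency) are modeled. The first function shows routing hidden inside one node; the rest shows the same checks as separate nodes whose return values act as edges.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


# --- Before: every decision is hidden inside a single node ---------------
def plan_monolith(state: State) -> str:
    """Return the next node name; the graph cannot see why it was chosen."""
    if state.get("region") is None:
        state["question"] = "region"
        return "ask_user"
    if state.get("currency") is None:
        state["question"] = "currency"
        return "ask_user"
    return "finish"  # stand-in for the real LLM-driven decision


# --- After: each check is its own node, so routing is visible as edges ---
def check_region(state: State) -> str:
    return "ask_region" if state.get("region") is None else "check_currency"


def ask_region(state: State) -> str:
    # Stand-in for a user interrupt: the answer is read from canned input.
    state["region"] = state["answers"]["region"]
    return "check_currency"


def check_currency(state: State) -> str:
    return "ask_currency" if state.get("currency") is None else "plan"


def ask_currency(state: State) -> str:
    state["currency"] = state["answers"]["currency"]
    return "plan"


def plan(state: State) -> str:
    return "finish"  # only the planning decision itself is left here


NODES: Dict[str, Callable[[State], str]] = {
    "check_region": check_region,
    "ask_region": ask_region,
    "check_currency": check_currency,
    "ask_currency": ask_currency,
    "plan": plan,
}


def run(state: State, entry: str = "check_region", max_steps: int = 20) -> List[str]:
    """Walk the graph from `entry` until "finish"; return the visited nodes."""
    trace: List[str] = []
    node = entry
    while node != "finish":
        if len(trace) >= max_steps:
            raise RuntimeError("step limit reached")
        trace.append(node)
        node = NODES[node](state)
    return trace


if __name__ == "__main__":
    state = {"region": None, "currency": "EUR", "answers": {"region": "EU"}}
    print(run(state))
```

In the second version the trace of visited nodes already tells you which bootstrap question fired, and each check can be tested without touching the planner. The price is more nodes and edges to maintain, which is the trade-off the rest of the experiment revolves around.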
What the plan node actually did
The agent this graph was meant to describe was, broadly, in the business of collecting various parameters for downstream calculations. Some of these parameters it cleverly searched for on the internet; some it asked the user about. And it did all this by a not-fully-deterministic algorithm, because depending on the context of a particular conversation, the right way to obtain the same parameter could vary considerably. Here is the set of real functions that had been packed into the plan node:
| Responsibility | What logic was hidden inside plan |
| --- | --- |
| Iteration loop | Incrementing iterations, entering a new planning step, checking status == "aborted" and max_iters |
| Region bootstrap question | The _needs_region_question() check and a forced transition to ask_user for core.region |
| Currency bootstrap question | The _needs_currency_question() check and a forced transition to ask_user for core.currency |
| Proactive decomposition | Generating dynamic_decompositions for fields that need to be broken into components |
| Assembling acquisition recipes | Calling build_dynamic_recipes() and preparing the task structure for the subsequent field collection |
| Schema preparation | Calling compose_ready_fields(), merging ready component fields into aggregates, and updating... |