Comparing Fable and 10 other LLMs on refactoring a LangGraph god node

Korridzy | 1 pts | 0 comments

Twilight of the Gods. Fable and 10 more LLMs on a Code Reorganization Task. Comparison. - wtf


Twilight of the Gods. Fable and 10 more LLMs on a Code Reorganization Task. Comparison.

Other languages

This article is also available in Russian: Гибель богов.

Materials & raw data

All 11 model proposals, the cross-reviews, the theses runs, and the ranking script are published here: Materials & reproduce this experiment.

This is a detailed write-up of one experiment. I took a god node from a real LangGraph agent and asked 5 American and 6 Chinese models first to propose how to untangle it, then to evaluate each other's proposals. After that, I tried three different ways to figure out which of them to trust on the matter.

Contents

The original problem: What a god node is and why it's dangerous

What the plan node actually did

Lemonade from lemons: Why all this, and how the experiment is set up

Stage one. The models generate proposals

The proposals table

A bit more on each proposal

Stage two. The models evaluate the proposals

The reviews table

A bit more on each review

Stage three. The wet swimsuit contest: Deciding who's good at what

Approach one. Do the scores agree? Picking the best proposal

Approach two. Comparing reviews by theses: Picking the best analyst

Approach three. Center of opinion and medoid: Picking the best analyst again

Deus ex machina. Picking the best analyst one more time

Takeaways: Which model to use as a generator, which as an evaluator, and where your heart will find peace.

The original problem

You know how it goes: you're building a practice AI agent with the fellas on a course by Data Sanity, and amid the colorful whirl of rapidly accreting features you suddenly notice that one of the project's internal agents has a state graph (LangGraph) that looks like this:

flowchart TD
    planner_start([START]) --> plan[plan]

    plan -->|search| search[search]
    plan -->|ask_user| ask_user[ask_user / interrupt]
    plan -->|reflect| reflect[reflect]
    plan -->|calculate| calculate[calculate]
    plan -->|finish| finish[finish]

    search -->|last_observation| observe[observe]
    search -->|no hits / backend failure| plan
    observe --> plan
    calculate --> plan
    ask_user --> observe_user[observe_user]
    observe_user --> plan
    reflect --> plan
    finish --> planner_end([END])

At first glance this is just a cute little octopus, nothing to worry about. But once you know how much logic this octopus has to hold in its modest eight-legged head, it becomes clear right away that we're looking at an anti-pattern. In this case, let's call it a god node.
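To make the octopus shape concrete, here is a small sketch that transcribes the diagram's edges into plain Python data and counts how many edges enter and leave each node. It does not use LangGraph; the helper name `degree_table` is made up for this illustration, and edge counts say nothing about the size of the code inside a node, only about how central it is.

```python
from collections import Counter

# Edges transcribed from the flowchart above as (source, target) pairs.
EDGES = [
    ("START", "plan"),
    ("plan", "search"),
    ("plan", "ask_user"),
    ("plan", "reflect"),
    ("plan", "calculate"),
    ("plan", "finish"),
    ("search", "observe"),
    ("search", "plan"),  # no hits / backend failure
    ("observe", "plan"),
    ("calculate", "plan"),
    ("ask_user", "observe_user"),
    ("observe_user", "plan"),
    ("reflect", "plan"),
    ("finish", "END"),
]


def degree_table(edges):
    """Map each node to an (incoming, outgoing) edge-count pair."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    nodes = sorted(set(out_deg) | set(in_deg))
    return {node: (in_deg[node], out_deg[node]) for node in nodes}


table = degree_table(EDGES)
# The node touching the most edges is the hub of the graph.
hub = max(table, key=lambda node: sum(table[node]))

for node, (incoming, outgoing) in table.items():
    print(f"{node:13s} in={incoming} out={outgoing}")
print("hub:", hub)
```

Every path through the graph passes through `plan`, so any decision that is not drawn as an edge has to live inside that one node.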

The plan node hides about 350 lines of logic, including iterative checks, bootstrap questions about region and currency, schema preparation, acquisition-task routing, the LLM call, the subsequent correction of the decision, and so on.

The problem isn't just the size of the function. When important orchestration is hidden inside a single node, the graph stops being a representation of the system. It's harder to explain, harder to debug, harder to test, and more dangerous to change. So the obvious task isn't merely to "chop a big function into pieces" but to lift the hidden control logic up to the graph level, so that the resulting architecture becomes clearer and more amenable to further development.
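As a rough illustration of what "lifting control logic up to the graph level" can mean, the sketch below pulls a few checks out of a node body into a standalone routing function whose return value names the next node. Everything here is hypothetical: the state keys, the node names `ask_region`, `ask_currency`, and `llm_plan`, and the function `route_before_plan` are invented for the example and are not taken from the real agent. In LangGraph, a function of this kind would typically be registered with `add_conditional_edges`, so each branch shows up as an edge in the graph.

```python
from typing import Optional, TypedDict


class State(TypedDict):
    # Hypothetical state keys, chosen only for this illustration.
    region: Optional[str]
    currency: Optional[str]
    iterations: int
    max_iters: int


def route_before_plan(state: State) -> str:
    """Return the name of the next node.

    Each return value corresponds to one visible edge in the graph,
    instead of an if-branch buried inside a single large node.
    """
    if state["iterations"] >= state["max_iters"]:
        return "finish"
    if state["region"] is None:
        return "ask_region"
    if state["currency"] is None:
        return "ask_currency"
    return "llm_plan"


example: State = {"region": "EU", "currency": None, "iterations": 1, "max_iters": 10}
print(route_before_plan(example))
```

A router like this is a pure function of the state, so it can be unit-tested without calling an LLM or running the graph.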

What the plan node actually did

The agent this graph was meant to describe was, broadly, in the business of collecting various parameters for downstream calculations. Some of these parameters it cleverly searched for on the internet; some it asked the user about. And it did all this by a not-fully-deterministic algorithm, because depending on the context of a particular conversation, the right way to obtain the same parameter could vary considerably. Here is the set of real functions that had been packed into the plan node:

| Responsibility | What logic was hidden inside plan |
| --- | --- |
| Iteration loop | Incrementing iterations, entering a new planning step, checking status == "aborted" and max_iters |
| Region bootstrap question | The _needs_region_question() check and a forced transition to ask_user for core.region |
| Currency bootstrap question | The _needs_currency_question() check and a forced transition to ask_user for core.currency |
| Proactive decomposition | Generating dynamic_decompositions for fields that need to be broken into components |
| Assembling acquisition recipes | Calling build_dynamic_recipes() and preparing the task structure for the subsequent field collection |
| Schema preparation | Calling compose_ready_fields(), merging ready component fields into aggregates, and updating... |
