LLM Councils Show Groupthink



Strange Loop Canon


LLM councils show groupthink: perils and problems of LLM peer review

Rohit Krishnan, Jun 15, 2026


One way to get the best out of LLMs is to use model diversity. The models are not all the same, so if you use their unique natures, you can get better responses. We saw it with the work on MarketBench. And we also saw this when Karpathy came up with LLM Council as a way to get multiple models to work with each other and get us a better answer.

But I started wondering: with people, when you put a bunch of them together in a committee, some things get better but some things do get worse! And relying on an LLM to audit is also error-prone. “Design by committee” is a four-letter word for a reason. LLMs are probably better than us, but surely this process is also somewhat lossy. So what do we lose?

To test it, I set up an experiment with a few committees of models:

First, a blend: I took each solo answer, gave them all to a fourth model, and asked it to write the final version.

Then, the llm-council: essentially a round of peer review, after which a chairperson summarises.

And a “best answer” picker: a selector that just picks one full answer directly.
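The three committee shapes above can be sketched roughly as follows. This is a minimal illustration, not the code used in the experiment: the `ask(model, prompt)` callable, the prompt wording, and the function names are all assumptions, and `fake_ask` is a stub so the sketch runs without any API.

```python
from typing import Callable, Dict, List

# Hypothetical interface: ask(model_name, prompt) -> response text.
Ask = Callable[[str, str], str]


def solo_answers(ask: Ask, models: List[str], prompt: str) -> Dict[str, str]:
    """Each model answers the prompt on its own."""
    return {model: ask(model, prompt) for model in models}


def _numbered(answers: Dict[str, str]) -> str:
    """Join answers under neutral numbers so model names are hidden."""
    return "\n\n".join(
        f"Answer {i + 1}:\n{text}" for i, text in enumerate(answers.values())
    )


def blend(ask: Ask, writer: str, answers: Dict[str, str]) -> str:
    """Blend: a separate model writes the final version from all drafts."""
    return ask(writer, "Write one final answer from these drafts.\n\n" + _numbered(answers))


def peer_review_council(ask: Ask, chair: str, answers: Dict[str, str]) -> str:
    """Peer review: every model reviews all drafts, then a chair summarises."""
    drafts = _numbered(answers)
    reviews = [ask(model, "Review and rank these answers.\n\n" + drafts) for model in answers]
    return ask(
        chair,
        "As chairperson, write the final answer.\n\nDrafts:\n"
        + drafts
        + "\n\nReviews:\n"
        + "\n\n".join(reviews),
    )


def pick_best(ask: Ask, selector: str, answers: Dict[str, str]) -> str:
    """Selector: return one full answer unchanged and discard the rest."""
    reply = ask(selector, "Reply with only the number of the best answer.\n\n" + _numbered(answers))
    index = int(reply.strip()) - 1  # real replies would need sturdier parsing
    return list(answers.values())[index]


def fake_ask(model: str, prompt: str) -> str:
    """Stub standing in for a real model call (illustration only)."""
    if prompt.startswith("Reply with only"):
        return "2"
    return f"[{model}] reply to a {len(prompt)}-character prompt"
```

The structural difference is where information can be lost: the selector returns one draft untouched, while the blend and the peer-review council both rewrite, and only the council adds a review step in which every model sees what the others said.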

With people, the problem with committees is that they “smooth out” all idiosyncrasies. They take out any “spiky” points of view and make things much more normie. Same thing here. So to test this, I had to find some way to grade the various final responses. I broke each answer into small “cards” using Sonnet. A card could be a mechanism, observation, metric, failure mode, image, or some other important detail.

Then I clustered cards that appeared to mean the same thing. If a cluster appeared in only one solo answer, we called it a single-model idea. If it appeared in more than one, it’s shared. Two judges then scored the solo-derived clusters without knowing which model produced them or whether a council kept them.

Now it’s not perfect, but it’s the cleanest way I could find to test the problem of “how to rate which answer is better” without doing human rating.

First, the result: the council does not simply keep the best bits from everyone. It keeps a minority of the good ideas, while peer review seems to give consensus ideas an extra push.

Now, obviously the final summarized versions usually read better. They are calmer, more complete, less jagged, all things you’d expect. But we had misses. Some examples:

A field report noticing that salvaged retail scent cartridges had become status symbols in a squatted mall, used to mask the smell of communal living.

An incident report arguing that logged-but-deprioritized risks are more dangerous than unknown ones, because they manufacture a false sense of control.

A data-recovery plan that asks users to re-confirm suspect fields at their next login (“please re-confirm your shipping address”), quietly crowdsourcing recovery from the one authoritative source.
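The single-model versus shared labelling described before these examples can be sketched like this. It assumes the cards have already been extracted and clustered, so each card arrives as a `(cluster_id, model_name)` pair; that data shape, the function name, and the sample values are illustrative assumptions, not the author's pipeline.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def label_clusters(cards: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Label each cluster by how many distinct solo answers contributed to it."""
    models_by_cluster = defaultdict(set)
    for cluster_id, model in cards:
        models_by_cluster[cluster_id].add(model)
    return {
        cluster_id: "single-model" if len(models) == 1 else "shared"
        for cluster_id, models in models_by_cluster.items()
    }


# Invented example: three solo answers, three idea clusters.
example_cards = [
    ("scent-cartridges", "model_a"),
    ("false-sense-of-control", "model_b"),
    ("audit-the-logs", "model_a"),
    ("audit-the-logs", "model_c"),
    ("audit-the-logs", "model_c"),  # a repeat within one answer counts once
]
example_labels = label_clusters(example_cards)
```

Using a set per cluster means an idea repeated several times inside one answer still counts as a single-model idea; only distinct models make it shared.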

In the final runs, the blended council kept only about a quarter of the good ideas that appeared in just one model’s answer. Remember, these were ideas that two blind judges rated as useful, non-obvious, and worth keeping, and still roughly three quarters did not make it into the final answer.

The peer-review version did not solve this either. The rare ideas survived at about the same rate as in plain blending: 24% versus 22%. But the peer-review council treated shared ideas differently: if several models had raised the same idea, it kept it about a third of the time; if only one model had raised it, about a quarter.

For this test, I ran sixteen open-ended prompts: eight strategy problems and eight writing tasks.
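The survival rates quoted above are coverage ratios: of the good ideas in a group, what fraction shows up in the final answer. A minimal sketch of that bookkeeping follows, along with the difference between an absolute and a relative gap. The function names and the counts are made up for illustration and are not the experiment's data.

```python
from typing import Dict, Set, Tuple


def coverage(labels: Dict[str, str], kept: Set[str]) -> Dict[str, float]:
    """Fraction of clusters in each group that appear in the final answer."""
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for cluster_id, label in labels.items():
        totals[label] = totals.get(label, 0) + 1
        hits[label] = hits.get(label, 0) + (1 if cluster_id in kept else 0)
    return {label: hits[label] / totals[label] for label in totals}


def lifts(single_rate: float, shared_rate: float) -> Tuple[float, float]:
    """Return (absolute gap in rate, gap relative to the single-model rate)."""
    if single_rate == 0:
        raise ValueError("relative lift is undefined when the base rate is zero")
    absolute = shared_rate - single_rate
    return absolute, absolute / single_rate


# Invented counts: 8 single-model ideas (2 kept), 3 shared ideas (1 kept).
demo_labels = {f"solo-{i}": "single-model" for i in range(8)}
demo_labels.update({f"shared-{i}": "shared" for i in range(3)})
demo_kept = {"solo-0", "solo-1", "shared-0"}

demo_rates = coverage(demo_labels, demo_kept)
demo_absolute, demo_relative = lifts(demo_rates["single-model"], demo_rates["shared"])
```

With only three shared ideas in this toy example, a single extra kept idea would swing the shared rate a lot, which is why a small denominator makes the relative lift noisy.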

Figure 1. The experiment path from solo answers to idea coverage.

I plotted what happened to the ideas. The red dot below is the good ideas that only one model came up with. Blue is the good ideas that multiple models came up with. The X-axis shows what share of each actually showed up in the final answer. The selector, for instance, kept about 37% of all good single-model ideas and 24% of the multiple-model ideas, which makes sense because it picks one full answer and discards the others.

Figure 2. Coverage of blind-rated high-value ideas.

The consensus tilt is smaller here, but interesting. In the peer-review council, shared high-value ideas had an 11-percentage-point survival uplift over single-model high-value ideas. Or put another way, a 50% relative lift!

The denominator for shared ideas is small, though. What’s interesting is that this shows how the specific topology of the “council” changes what you’re likely to get: a peer-review round ends up becoming a consensus detector, even more than a single model blending the answers from all the other models.

This is a problem with all cognitive beings. In group decision-making research, back in the 1980s, Stasser and Titus called it biased sampling of shared information: groups are more likely to discuss information that several members already know than information only one has. That line of work led to the...
