AI juries
This is part of a series I’m writing on generative AI.
State: Withdrawn. This text is predicated on an unjustified assumption: that multiple evaluations of the same prompt by the same generative AI are equivalent to independent jurors. That may not be true, in which case the conclusion doesn’t follow.
Condorcet’s Jury Theorem
Imagine a jury of N experts trying to decide if a binary (true/false) fact is true. Each expert has an independent probability p of being right. Since they are experts, assume p > 0.5: they’re at least better than a coin flip.
We can ask for a vote: have each expert state their binary answer and take the most popular answer. Let’s call the probability that the majority of the jury is right j. This “jury accuracy” probability increases dramatically as N grows.
For example, with a jury of N = 25 experts, each likely to be right with a probability p = 0.75, the probability j that the majority is right is already 99.663%!
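For reference, the jury accuracy can be written as a binomial tail sum. This is a sketch that assumes N is odd (so the vote cannot tie) and that the jurors are independent, as in the setup above:

```latex
j \;=\; \sum_{k=(N+1)/2}^{N} \binom{N}{k}\, p^{k}\, (1-p)^{N-k}
```

As a small worked case, N = 3 and p = 0.6 gives j = 3(0.6)^2(0.4) + (0.6)^3 = 0.432 + 0.216 = 0.648.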
The table below shows how powerful this effect is. It shows the jury accuracy j for different jury sizes N (rows) and individual expert accuracies p (columns):
| N (Experts) | p = 0.51 | p = 0.55 | p = 0.60 | p = 0.70 | p = 0.75 | p = 0.80 | p = 0.90 | p = 0.95 |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.510 | 0.550 | 0.600 | 0.700 | 0.750 | 0.800 | 0.900 | 0.950 |
| 3 | 0.515 | 0.575 | 0.648 | 0.784 | 0.844 | 0.896 | 0.972 | 0.993 |
| 5 | 0.519 | 0.593 | 0.683 | 0.837 | 0.896 | 0.942 | 0.991 | 0.999 |
| 11 | 0.527 | 0.633 | 0.753 | 0.922 | 0.966 | 0.988 | 1.000 | 1.000 |
| 15 | 0.531 | 0.654 | 0.787 | 0.950 | 0.983 | 0.996 | 1.000 | 1.000 |
| 25 | 0.540 | 0.694 | 0.846 | 0.983 | 0.997 | 1.000 | 1.000 | 1.000 |
| 35 | 0.547 | 0.725 | 0.886 | 0.994 | 0.999 | 1.000 | 1.000 | 1.000 |
| 45 | 0.554 | 0.751 | 0.914 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 |
| 55 | 0.559 | 0.772 | 0.934 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 |
| 75 | 0.569 | 0.808 | 0.960 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 101 | 0.580 | 0.844 | 0.979 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 201 | 0.612 | 0.923 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 501 | 0.673 | 0.988 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 1001 | 0.737 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 5001 | 0.921 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
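A table like this one can be recomputed with a few lines of Python. The sketch below is not necessarily how the original numbers were produced; the function name `jury_accuracy` is my own, and it assumes an odd N and independent jurors. It sums the binomial tail in log space (via `math.lgamma`) so that large juries such as N = 5001 don't overflow.

```python
import math


def jury_accuracy(n: int, p: float) -> float:
    """Probability that a strict majority of n independent jurors is right.

    Assumes n is odd (no ties) and each juror is right with probability p.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    log_p, log_q = math.log(p), math.log1p(-p)
    terms = []
    # Sum P(K = k) over every k that forms a majority: k = (n + 1) / 2 .. n.
    for k in range(n // 2 + 1, n + 1):
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        terms.append(math.exp(log_binom + k * log_p + (n - k) * log_q))
    # Clamp tiny floating-point overshoot above 1.0.
    return min(1.0, math.fsum(terms))


# Print a few rows in the same layout as the table (N, then one value per p).
for n in (1, 3, 5, 25, 5001):
    row = [f"{jury_accuracy(n, p):.3f}" for p in (0.51, 0.60, 0.75, 0.90)]
    print(n, *row)
```

Working with logarithms costs a little precision, but far less than the three decimals shown in the table.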
The following is roughly the same data in a plot (x axis in log scale):
[Figure: Condorcet’s Jury Theorem]
This “wisdom of the crowds” effect works, at least in this idealized scenario where all experts are independent.
There’s an interesting flip side: if the “experts” are more likely to be wrong (p < 0.5), the jury’s performance plummets. The majority vote just amplifies the error, converging towards 0% accuracy (i.e., being reliably wrong). You can see this by swapping the definitions of “right” and “wrong” in the description above.
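The swap argument can be made explicit. As a sketch, assume N is odd (no ties) and write j(N, p) for the jury accuracy, a notation introduced here for convenience. If K counts the correct votes of jurors with accuracy p, then jurors with accuracy 1 - p produce N - K correct votes in distribution, so:

```latex
j(N, 1-p) \;=\; \Pr\!\left[\,N - K > \tfrac{N}{2}\,\right]
          \;=\; \Pr\!\left[\,K < \tfrac{N}{2}\,\right]
          \;=\; 1 - j(N, p),
\qquad K \sim \mathrm{Binomial}(N, p)
```

For instance, the table’s j = 0.997 at N = 25 and p = 0.75 becomes roughly 0.003 at p = 0.25.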
This model isn’t new; it goes back to the dawn of the French Revolution. It was first proposed in 1785 by the Marquis de Condorcet and is known as his Jury Theorem.
Scaling AI juries
In “Dumb AI and the software revolution” I mentioned a key advantage of generative AI: in addition to working at light speed (answering questions in seconds, not days), it can be scaled instantly:
The infinite monkey is no longer hitting keys at random, just somewhat stupidly. But at this speed and with instant scaling, the difference is monumental.
If you can get an AI juror to answer a binary question with an accuracy p > 0.5, you can achieve arbitrarily high performance simply by scaling resources: running bigger juries.
Even if your AI is barely better than a coin flip (say, p = 0.51), you can still get a jury with j > 0.99 accuracy, though it can be quite expensive (with p = 0.51, you’d need a jury of N = 5001 just to hit j = 0.92).
But, as the table above shows, if you improve your agent’s accuracy p to, say, 0.70, with N = 25 you’re already above 98% accuracy.
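To make the cost side of this trade-off concrete, here is a minimal sketch that searches for the smallest odd jury reaching a target accuracy. The names `jury_accuracy` and `smallest_jury` are hypothetical helpers of mine, and the search relies on the idealized model (independent jurors, p > 0.5), under which accuracy never decreases as the odd jury size grows.

```python
import math


def jury_accuracy(n: int, p: float) -> float:
    """Majority-vote accuracy for odd n and independent jurors (log-space sum)."""
    log_p, log_q = math.log(p), math.log1p(-p)
    terms = []
    for k in range(n // 2 + 1, n + 1):
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        terms.append(math.exp(log_binom + k * log_p + (n - k) * log_q))
    return min(1.0, math.fsum(terms))


def smallest_jury(p: float, target: float) -> int:
    """Smallest odd jury size whose accuracy is at least `target`."""
    if not 0.5 < p < 1.0:
        raise ValueError("p must be in (0.5, 1)")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    if jury_accuracy(1, p) >= target:
        return 1
    # Jury sizes are n = 2 * i + 1. Double i until the target is reached...
    lo, hi = 0, 1
    while jury_accuracy(2 * hi + 1, p) < target:
        lo, hi = hi, hi * 2
    # ...then binary-search between the last miss (lo) and the first hit (hi).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if jury_accuracy(2 * mid + 1, p) >= target:
            hi = mid
        else:
            lo = mid
    return 2 * hi + 1


for p in (0.51, 0.70, 0.75):
    print(p, smallest_jury(p, 0.99))
```

The doubling-then-bisecting search keeps the number of accuracy evaluations logarithmic in the answer, which matters when p is close to 0.5 and the required jury runs into the thousands.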
And, interestingly, scaling the jury has little impact on latency. The jurors can run in parallel, so the total time is determined by the slowest one. I suspect strategies like hedging (e.g., run N + M jurors and take the answers from the first N responders) may be applicable.
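The hedging idea could look something like the following `asyncio` sketch. Everything here is illustrative: `simulated_juror` is a hypothetical stand-in for a real model call (random latency, right with probability p), and `hedged_vote` is my own name for the "launch N + M, keep the first N" strategy.

```python
import asyncio
import random
from collections import Counter


async def simulated_juror(truth: bool, p: float, rng: random.Random) -> bool:
    """Hypothetical juror: sleeps a random time, then is right with probability p."""
    await asyncio.sleep(rng.uniform(0.01, 0.05))
    return truth if rng.random() < p else not truth


async def hedged_vote(make_juror, n: int, extra: int) -> bool:
    """Launch n + extra jurors, keep the first n answers, cancel the stragglers.

    `make_juror` is a zero-argument callable returning a fresh coroutine.
    Use an odd n so the majority vote cannot tie.
    """
    tasks = [asyncio.create_task(make_juror()) for _ in range(n + extra)]
    votes = []
    try:
        # as_completed yields awaitables in the order the jurors finish.
        for finished in asyncio.as_completed(tasks):
            votes.append(await finished)
            if len(votes) == n:
                break
    finally:
        for task in tasks:
            task.cancel()  # no-op for jurors that already answered
    return Counter(votes).most_common(1)[0][0]


async def main() -> None:
    rng = random.Random(0)
    verdict = await hedged_vote(lambda: simulated_juror(True, 0.9, rng), n=5, extra=2)
    print("jury verdict:", verdict)


asyncio.run(main())
```

One caveat with this design: if response time correlates with correctness, the first N responders are no longer a random sample of the jury, so the accuracy estimate from the model above may not carry over.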
What types of questions?
So what kinds of questions can this be useful for? I have a few practical, everyday examples in mind:
- This unit test fails. Is the implementation correct (regarding the tested property)? I intend to use this to determine the next step: ask an agent to fix the implementation, or ask it to fix the test.
- A variation of the above: Does this code implement a specific property (described in natural language)?
- I’ve received a code change (possibly from AI, possibly from a human; it doesn’t matter). Does it adhere to a specific coding style guideline? The jury’s answer determines if I run another agent to fix the style issue.
- Is this function’s documentation (docstring) still accurate for the code it describes?
- I have written a new recipe. Does it uphold one of my principles for my recipes?
The practical trade-off
I’m excited to have a tool where I can create my own AI juries. Based on their observed performance (false positive and false negative rates), I can tweak the contexts (increase p) or adjust the jury size (increase N), until I find the right balance between final...