Fail loudly: a plea to stop hiding bugs

AI juries

This is part of a series I’m writing on generative AI.

State: Withdrawn. This text is predicated on an unjustified assumption: that multiple evaluations of the same prompt by the same generative AI are equivalent to independent jurors. That may not be true, in which case the conclusion doesn’t follow.

Condorcet’s Jury Theorem

Imagine a jury of N experts trying to decide whether a binary (true/false) statement is true. Each expert has an independent probability p of being right. Since they are experts, assume p > 0.5: they’re at least better than a coin flip. Take N to be odd so the vote can never tie.

We can ask for a vote: have each expert state their binary answer and take the most popular answer. Let’s call the probability that the majority of the jury is right j. This “jury accuracy” probability increases dramatically as N grows.

For example, with a jury of N = 25 experts, each likely to be right with a probability p = 0.75, the probability j that the majority is right is already 99.663%!
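The number above is the upper tail of a binomial distribution: the chance that more than half of N independent jurors, each right with probability p, are right. Here is a minimal sketch of that calculation. The function name `jury_accuracy` is my own for illustration, it assumes an odd N, and it works in log space (via `lgamma`) only so that large juries don't overflow floating point.

```python
from math import exp, lgamma, log


def jury_accuracy(n: int, p: float) -> float:
    """Probability that a majority of n independent jurors is right.

    Assumes n is odd (no ties) and 0 < p < 1.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")

    log_p, log_q = log(p), log(1.0 - p)
    total = 0.0
    # Sum P(exactly k jurors are right) over every majority size k.
    for k in range(n // 2 + 1, n + 1):
        log_binomial = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        total += exp(log_binomial + k * log_p + (n - k) * log_q)
    return min(total, 1.0)  # guard against rounding slightly above 1


print(f"{jury_accuracy(25, 0.75):.5f}")  # 0.99663
```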

The table below shows how powerful this effect is. It shows the jury accuracy j for different jury sizes N (rows) and individual expert accuracies p (columns):

| N (experts) | p = 0.51 | p = 0.55 | p = 0.60 | p = 0.70 | p = 0.75 | p = 0.80 | p = 0.90 | p = 0.95 |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.510 | 0.550 | 0.600 | 0.700 | 0.750 | 0.800 | 0.900 | 0.950 |
| 3 | 0.515 | 0.575 | 0.648 | 0.784 | 0.844 | 0.896 | 0.972 | 0.993 |
| 5 | 0.519 | 0.593 | 0.683 | 0.837 | 0.896 | 0.942 | 0.991 | 0.999 |
| 11 | 0.527 | 0.633 | 0.753 | 0.922 | 0.966 | 0.988 | 1.000 | 1.000 |
| 15 | 0.531 | 0.654 | 0.787 | 0.950 | 0.983 | 0.996 | 1.000 | 1.000 |
| 25 | 0.540 | 0.694 | 0.846 | 0.983 | 0.997 | 1.000 | 1.000 | 1.000 |
| 35 | 0.547 | 0.725 | 0.886 | 0.994 | 0.999 | 1.000 | 1.000 | 1.000 |
| 45 | 0.554 | 0.751 | 0.914 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 |
| 55 | 0.559 | 0.772 | 0.934 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 |
| 75 | 0.569 | 0.808 | 0.960 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 101 | 0.580 | 0.844 | 0.979 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 201 | 0.612 | 0.923 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 501 | 0.673 | 0.988 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 1001 | 0.737 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 5001 | 0.921 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

The following is roughly the same data in a plot (x axis in log scale):

[Plot: Condorcet’s Jury Theorem]

This “wisdom of the crowds” effect works, at least in this idealized scenario where all experts are independent.

There’s an interesting flip side: if the “experts” are more likely to be wrong than right (p < 0.5), the jury’s performance plummets. The majority vote just amplifies the error, converging towards 0% accuracy (i.e., being reliably wrong). You can see this by swapping the definitions of “right” and “wrong” in the description above.
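The swap can be written out explicitly. This is a sketch assuming N is odd (so there are no ties); the notation j(N, p) for the jury accuracy is introduced here only for illustration:

```latex
j(N, p) = \sum_{k=(N+1)/2}^{N} \binom{N}{k} \, p^{k} (1-p)^{N-k},
\qquad
j(N, 1-p) = 1 - j(N, p).
```

The second identity holds because a majority of wrong votes is exactly the complement of a majority of right votes. So a jury of N = 25 with p = 0.25 is right only about 0.337% of the time, the mirror image of the 99.663% above.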

This model isn’t new; it dates back to the eve of the French Revolution. It was first proposed in 1785 by the Marquis de Condorcet and is known as his Jury Theorem.

Scaling AI juries

In Dumb AI and the software revolution I mentioned a key advantage of generative AI: in addition to working at light speed (answering questions in seconds, not days), it can be scaled instantly:

The infinite monkey is no longer hitting keys at random, just somewhat stupidly. But at this speed and with instant scaling, the difference is monumental.

If you can get an AI juror to answer a binary question with an accuracy p > 0.5, you can achieve arbitrarily high performance simply by scaling resources: running bigger juries.

Even if your AI is barely better than a coin flip (say, p = 0.51), you can still get a jury with j > 0.99, though it can be quite expensive: with p = 0.51, you’d need a jury of N = 5001 just to hit j = 0.92.

But, as the table above shows, if you improve your agent’s accuracy p to, say, 0.70, with N = 25 you’re already above 98% accuracy.
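To make the cost side of this concrete, here is a rough sketch that finds the smallest odd jury reaching a target accuracy for a given p. The names `jury_accuracy`, `min_jury_size` and `max_n` are mine, not from any library. It relies on the idealized independence assumption and on the fact that, for p > 0.5, accuracy does not decrease as an odd N grows, which is what makes the binary search valid.

```python
from math import exp, lgamma, log


def jury_accuracy(n: int, p: float) -> float:
    """Probability that a majority of n independent jurors (n odd) is right."""
    log_p, log_q = log(p), log(1.0 - p)
    total = 0.0
    for k in range(n // 2 + 1, n + 1):
        log_binomial = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        total += exp(log_binomial + k * log_p + (n - k) * log_q)
    return min(total, 1.0)


def min_jury_size(p: float, target: float, max_n: int = 20_001) -> int:
    """Smallest odd n <= max_n with jury_accuracy(n, p) >= target."""
    if not 0.5 < p < 1.0:
        raise ValueError("p must be in (0.5, 1): jurors must beat a coin flip")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")

    # Search over half-sizes h, where the jury size is n = 2h + 1.
    low, high = 0, (max_n - 1) // 2
    if jury_accuracy(2 * high + 1, p) < target:
        raise ValueError("target not reachable within max_n jurors")
    while low < high:
        mid = (low + high) // 2
        if jury_accuracy(2 * mid + 1, p) >= target:
            high = mid  # mid is big enough; try smaller juries
        else:
            low = mid + 1  # mid is too small
    return 2 * low + 1


for p in (0.55, 0.70, 0.90):
    print(p, min_jury_size(p, target=0.98))
```

Running this for a few values of p shows how quickly the required jury shrinks as individual jurors improve, which is the trade-off the table illustrates.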

And, interestingly, scaling the jury doesn’t impact latency. The questions can be executed in parallel, so the total time is determined by the slowest agent. I suspect strategies like hedging (e.g., run N + M jurors, take the answers from the first N responders) may be applicable.
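The hedging idea could look roughly like the sketch below. Everything here is an assumption for illustration: `ask_juror` is a hypothetical stand-in that simulates a model call with random latency (a real version would call your AI provider), and `hedged_verdict` is just one way to wire it up with `asyncio`.

```python
import asyncio
import random


async def ask_juror(p: float, rng: random.Random) -> bool:
    """Hypothetical juror: random latency, votes correctly with probability p."""
    await asyncio.sleep(rng.uniform(0.0, 0.05))
    return rng.random() < p  # True means "voted for the correct answer"


async def hedged_verdict(n: int, m: int, p: float, seed: int = 0) -> bool:
    """Start n + m jurors in parallel and take a majority of the first n replies."""
    rng = random.Random(seed)
    tasks = [asyncio.create_task(ask_juror(p, rng)) for _ in range(n + m)]

    votes = []
    for finished in asyncio.as_completed(tasks):
        votes.append(await finished)
        if len(votes) == n:
            break  # enough answers; stop waiting for the stragglers

    # Cancel the slow jurors and wait for the cancellations to settle.
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)

    return 2 * sum(votes) > n  # True if the majority voted correctly


print(asyncio.run(hedged_verdict(n=25, m=5, p=0.75)))
```

One caveat with this design: the first N responders are only a fair sample if a juror’s latency is unrelated to its answer. If, say, wrong answers tend to come back faster, hedging may bias the vote.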

What types of questions?

So what kinds of questions can this be useful for? I have a few practical, everyday examples in mind:

This unit test fails. Is the implementation correct (regarding the tested property)? I intend to use this to determine the next step: ask an agent to fix the implementation, or ask it to fix the test.

A variation of the above: Does this code implement a specific property (described in natural language)?

I’ve received a code change (possibly from AI, possibly from a human; it doesn’t matter). Does it adhere to a specific coding style guideline? The jury’s answer determines if I run another agent to fix the style issue.

Is this function’s documentation (docstring) still accurate for the code it describes?

I have written a new recipe. Does it uphold one of my principles for my recipes?

The practical trade-off

I’m excited to have a tool where I can create my own AI juries. Based on their observed performance (false positive and false negative rates), I can tweak the contexts (increase p) or adjust the jury size (increase N), until I find the right balance between final...
