LLM System Design Benchmark
What This Is
This benchmark evaluates how well different LLMs perform on system design tasks. Each model receives the same cold system design prompt — no examples, no hints — and produces a complete design with architecture, capacity estimation, tradeoffs, and failure analysis. Independent LLM judges then score every transcript on 5 dimensions.
I evaluated 9 models on 9 problems, for 81 transcripts in total, scored by 3 judges. See the methodology.
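As a rough illustration of how the evaluation grid adds up, the sketch below enumerates one transcript per (model, problem) pair and collapses a single transcript's judge scores into one number. The placeholder names, the made-up scores, and the unweighted mean over dimensions and judges are all assumptions for illustration; the methodology page defines the aggregation actually used.

```python
from itertools import product
from statistics import mean

# Counts stated in the text above.
N_MODELS, N_PROBLEMS, N_JUDGES, N_DIMENSIONS = 9, 9, 3, 5

# Placeholder identifiers (hypothetical); real model and problem names are not used here.
models = [f"model_{i}" for i in range(N_MODELS)]
problems = [f"problem_{i}" for i in range(N_PROBLEMS)]

# One transcript per (model, problem) pair.
transcripts = list(product(models, problems))

# Number of judge verdicts if each of the 3 judges scores every transcript.
judgments = len(transcripts) * N_JUDGES


def transcript_score(judge_scores):
    """Collapse judge x dimension scores into a single number.

    ASSUMPTION: unweighted mean over dimensions, then over judges.
    The benchmark's real aggregation rule may differ.
    """
    return mean(mean(dims) for dims in judge_scores)


# Made-up scores for one transcript: 3 judges, 5 dimensions each.
# The score scale used here is also an assumption.
example = [
    [4, 5, 4, 4, 3],
    [4, 4, 4, 5, 4],
    [5, 4, 3, 4, 4],
]

print(len(transcripts), "transcripts,", judgments, "judge verdicts")
print(round(transcript_score(example), 2))
```

Averaging per judge first keeps each judge's weight equal even if a judge were to skip a dimension; a weighted scheme would be an equally plausible choice.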
Feedback or requests? Please submit an issue.
Leaderboard
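| Rank | Model | Mean Score | ±CI | Runs |
|------|-------|------------|-----|------|
| 1 | kimi-k2.6 | 4.39 | ±0.13 | 9 |
| 2 | gpt-5.4 | 4.34 | ±0.16 | 9 |
| 3 | claude-sonnet-4.6 | 4.26 | ±0.09 | 9 |
| 4 | gpt-oss-120b | 4.02 | ±0.1 | 9 |
| 5 | deepseek-v4-pro | 4.00 | ±0.11 | 9 |
| 6 | gemini-3.1-pro | 3.87 | ±0.14 | 9 |
| 7 | gemma-4-31b-it | 3.44 | ±0.17 | 9 |
| 8 | gpt-oss-20b | 3.39 | ±0.14 | 9 |
| 9 | minimax-m2.7 | 3.28 | ±0.32 | 9 |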
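For readers wondering how a "Mean Score ±CI" pair over 9 runs might be produced, here is a minimal sketch. It assumes a normal-approximation 95% interval (z = 1.96) over per-run scores; the leaderboard's actual interval method is not stated on this page, and the scores below are made up rather than taken from the benchmark.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_ci(scores, z=1.96):
    """Return (mean, half_width) for a list of per-run scores.

    ASSUMPTION: normal-approximation interval, z * s / sqrt(n), where s is
    the sample standard deviation. The benchmark may use another method
    (for example a t-interval or a bootstrap).
    """
    n = len(scores)
    half_width = z * stdev(scores) / sqrt(n)
    return mean(scores), half_width


# Made-up per-problem scores for one hypothetical model (9 runs).
runs = [4.2, 4.5, 4.1, 4.4, 4.6, 4.3, 4.2, 4.5, 4.4]

m, hw = mean_with_ci(runs)
print(f"{m:.2f} ±{hw:.2f}")
```

With only 9 runs per model, a t-based interval would be somewhat wider than this normal approximation, so overlapping intervals between neighboring ranks are best read as "not clearly separated".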