Shader Benchmark for LLMs

Shader Benchmark Results

Three frontier coding agents generate WGSL shaders from text prompts for 130 mathematical visualization problems (20 frontier, 10 reconstruction, 100 rest). One or more LLM judges score each rendered image from 0 to 100 in each of five categories, for a maximum of 500 per problem.

Model summary

| Model | Score (each render fail counts as 0) | Score excluding failures (rendered only) | Render fails | Best | Worst |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.7 | 235.2 / 500 (12%) | 261.3 / 500 (20%) | 13 of 130 problems (10%) | epicycloids (476) | menger_cube_fractal (19) |
| Gemini 3.1-pro-preview | 224.0 / 500 (8%) | 262.4 / 500 (21%) | 19 of 130 problems (15%) | epicycloids (473) | mandelbulb_fractal (9) |
| Codex GPT-5.5 high | 242.2 / 500 (14%) | 258.1 / 500 (19%) | 8 of 130 problems (6%) | five_pointed_star_polygon (466) | archimedean_spiral_galaxy (14) |

Per-judge scores (rendered only)

| Model | Codex judge | Claude judge | Gemini judge | Problems scored |
| --- | --- | --- | --- | --- |
| Claude Opus 4.7 | 252.9 / 500 | 236.3 / 500 | 294.8 / 500 | 117 |
| Gemini 3.1-pro-preview | 259.5 / 500 | 221.0 / 500 | 306.6 / 500 | 111 |
| Codex GPT-5.5 high | 277.5 / 500 | 219.5 / 500 | 277.6 / 500 | 122 |

The Score column averages over all 130 problems, with 0 imputed for each render fail. The per-judge scores and the “Score excluding failures” column cover rendered problems only. Percentage = (score − 200) / 300. The judges floor around 200 even for unrecognizable images, so the interesting range is 200–500.

A render fail = the WGSL shader the model produced did not compile or the shader_harness exited non-zero. These count as 0 in the “Score” column and are excluded from “Score excluding failures”.
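The aggregation described above can be checked with a few lines of arithmetic. The sketch below assumes the rendered-only score is the unweighted mean of the three per-judge averages; that is an inference that reproduces the published Claude Opus 4.7 figures, not a documented rule. The function names are hypothetical.

```python
from statistics import mean

FLOOR = 200.0    # judges floor around 200 even for unrecognizable images
CEILING = 500.0  # five categories, 100 points each


def rendered_only_score(judge_averages):
    """Assumed: unweighted mean of the per-judge rendered-only averages."""
    return mean(judge_averages)


def overall_score(rendered_mean, n_scored, n_total):
    """Average over all problems, imputing 0 for every render fail."""
    return rendered_mean * n_scored / n_total


def percentage(score):
    """Normalization stated on the page: (score - 200) / 300, as a percent."""
    return (score - FLOOR) / (CEILING - FLOOR) * 100


# Claude Opus 4.7 figures from the summary: 117 of 130 problems rendered.
claude_judges = [252.9, 236.3, 294.8]
rendered = rendered_only_score(claude_judges)
overall = overall_score(rendered, n_scored=117, n_total=130)

print(round(rendered, 1), round(percentage(rendered)))  # 261.3 20
print(round(overall, 1), round(percentage(overall)))    # 235.2 12
```

Because a render fail is imputed as 0 rather than as the judges' floor of about 200, each failure costs more than an unrecognizable but valid render would.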

Per-problem comparison

Problems are split into Frontier, Reconstruction, and Rest. Each score is the sum of the 5 category scores (max 500). The percentage shown next to each score is the final score, (score − 200) / 300, clamped to 0–100%.
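As a small illustration of the clamped final score, the sketch below applies the stated formula to three scores that appear in the Frontier results. The function name `bar_percent` is hypothetical, and rounding to the nearest whole percent is an assumption that happens to match the published figures.

```python
def bar_percent(score, floor=200, ceiling=500):
    """Final score as a percent: (score - 200) / 300, clamped to 0-100."""
    raw = (score - floor) / (ceiling - floor) * 100
    return max(0.0, min(100.0, raw))


# Scores taken from the per-problem results.
for score in (422, 274, 80):
    print(score, f"{bar_percent(score):.0f}%")
# 422 74%
# 274 25%
# 80 0%
```

Any score at or below 200 collapses to 0%, so the percentage alone cannot distinguish a score of 193 from a score of 23.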

Frontier (20 problems)

| Problem | Claude Opus 4.7 | Gemini 3.1-pro-preview | Codex GPT-5.5 high |
| --- | --- | --- | --- |
| braid_word_reduction_ribbons | 422 · 74% · 3j | render fail | 274 · 25% · 3j |
| cellular_potts_tissue_folding | 404 · 68% · 3j | 80 · 0% · 3j | 325 · 42% · 3j |
| coxeter_reflection_kaleidoscope | 422 · 74% · 3j | 23 · 0% · 3j | 22 · 0% · 3j |
| crystal_dislocation_network | 431 · 77% · 3j | 365 · 55% · 3j | 276 · 25% · 3j |
| differentiable_rendering_ambiguity_landscape | 60 · 0% · 3j | 333 · 44% · 3j | 250 · 17% · 3j |
| earthquake_fault_slip_wavefronts | 193 · 0% · 3j | 368 · 56% · 3j | 382 · 61% · 3j |
| error_correcting_code_decoding_landscape | 423 · 74% · 3j | 355 · 52% · 3j | 341 · 47% · 3j |
| fractal_drum_eigenfunctions | 392 · 64% · 3j | 296 · 32% · 3j | render fail |
| mean_curvature_flow_surgery | 98 · 0% · 3j | 412 · 71% · 3j | 257 · 19% · 3j |
| minimal_surface_knot_boundaries | 46 · 0% · 3j | 163 · 0% · 3j | 231 · 10% · 3j |
| navier_stokes_vortex_reconnection | render fail | 41 · 0% · 3j | 364 · 55% · 3j |
| ocean_eddy_lcs | 376 · 59% · 3j | 409 · 70% · 3j | 351 · 50% · 3j |
| optimal_transport_mass_flow_tubes | 58 · 0% · 3j | 53 · 0% · 3j | 339 ·... |
