Shader Benchmark for LLMs

nbardy · 1 pts · 0 comments

Shader Benchmark Results

Three frontier coding agents generate WGSL shaders from text prompts on 130 mathematical visualization problems (20 frontier, 10 reconstruction, 100 rest). One or more LLM judges score each rendered image 0–100 in each of five categories, for a maximum of 500 per problem.

Model summary

| Model | Score (each render fail counts as 0) | Score excluding failures (rendered only) | Render fails | Best | Worst |
|---|---|---|---|---|---|
| Claude Opus 4.7 | 235.2 / 500 (12%) | 261.3 / 500 (20%) | 13 of 130 (10%) | epicycloids (476) | menger_cube_fractal (19) |
| Gemini 3.1-pro-preview | 224.0 / 500 (8%) | 262.4 / 500 (21%) | 19 of 130 (15%) | epicycloids (473) | mandelbulb_fractal (9) |
| Codex GPT-5.5 high | 242.2 / 500 (14%) | 258.1 / 500 (19%) | 8 of 130 (6%) | five_pointed_star_polygon (466) | archimedean_spiral_galaxy (14) |

Per-judge scores (rendered problems only):

| Model | Codex judge | Claude judge | Gemini judge | Problems scored |
|---|---|---|---|---|
| Claude Opus 4.7 | 252.9 / 500 | 236.3 / 500 | 294.8 / 500 | 117 |
| Gemini 3.1-pro-preview | 259.5 / 500 | 221.0 / 500 | 306.6 / 500 | 111 |
| Codex GPT-5.5 high | 277.5 / 500 | 219.5 / 500 | 277.6 / 500 | 122 |

Score: averages over all 130 problems, with 0 imputed for each render fail. The per-judge scores are rendered-only.

Percentage = (score − 200) / 300. The judges floor around 200 even for unrecognizable images, so the interesting range is 200–500.

A render fail = the WGSL shader the model produced did not compile or the shader_harness exited non-zero. These count as 0 in the "Score" column and are excluded from "Score excluding failures".
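The relationship between the two score columns can be checked by hand. The sketch below is an illustration, not the benchmark's own code: it assumes the rendered-only score is the unweighted mean of the three per-judge means, and that the headline score is that mean rescaled by the fraction of problems that rendered (which is what imputing 0 for each render fail amounts to). The function names are hypothetical.

```python
def rendered_only_score(judge_means):
    """Unweighted mean of the per-judge mean scores (rendered problems only)."""
    return sum(judge_means) / len(judge_means)


def score_with_fails_as_zero(rendered_mean, n_rendered, n_total):
    """Average over all problems, imputing 0 for each render fail."""
    return rendered_mean * n_rendered / n_total


# Claude Opus 4.7 row: per-judge means over the 117 rendered problems.
claude_judges = [252.9, 236.3, 294.8]  # Codex, Claude, Gemini judges
rendered = rendered_only_score(claude_judges)
overall = score_with_fails_as_zero(rendered, n_rendered=117, n_total=130)

print(f"{rendered:.1f}")  # 261.3
print(f"{overall:.1f}")   # 235.2
```

This reproduces the Claude Opus 4.7 row. For the Codex GPT-5.5 high row the same arithmetic gives 258.2 rather than the reported 258.1, so the page's exact aggregation may differ slightly from the assumption made here.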

Per-problem comparison

Problems are split into Frontier, Reconstruction, and Rest. Each score is the sum of 5 categories (max 500). The bar shows the final score as (score − 200) / 300, clamped to 0–100%.
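As a small sketch of the bar formula above (the function name `bar_percent` and its parameters are illustrative, not taken from the benchmark), the clamp is why every score below 200 shows as 0%:

```python
def bar_percent(score, floor=200.0, ceiling=500.0):
    """Map a 0-500 score to the bar value: (score - 200) / 300, clamped to 0-100%."""
    raw = (score - floor) / (ceiling - floor) * 100.0
    return max(0.0, min(100.0, raw))


# Scores taken from the Frontier rows of the per-problem comparison.
for score in (422, 80, 250):
    print(score, f"{bar_percent(score):.0f}%")
# 422 74%
# 80 0%
# 250 17%
```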

Frontier (20 problems)

| Problem | Claude Opus 4.7 | Gemini 3.1-pro-preview | Codex GPT-5.5 high |
|---|---|---|---|
| braid_word_reduction_ribbons | 422 · 74% · 3j | render fail | 274 · 25% · 3j |
| cellular_potts_tissue_folding | 404 · 68% · 3j | 80 · 0% · 3j | 325 · 42% · 3j |
| coxeter_reflection_kaleidoscope | 422 · 74% · 3j | 23 · 0% · 3j | 22 · 0% · 3j |
| crystal_dislocation_network | 431 · 77% · 3j | 365 · 55% · 3j | 276 · 25% · 3j |
| differentiable_rendering_ambiguity_landscape | 60 · 0% · 3j | 333 · 44% · 3j | 250 · 17% · 3j |
| earthquake_fault_slip_wavefronts | 193 · 0% · 3j | 368 · 56% · 3j | 382 · 61% · 3j |
| error_correcting_code_decoding_landscape | 423 · 74% · 3j | 355 · 52% · 3j | 341 · 47% · 3j |
| fractal_drum_eigenfunctions | 392 · 64% · 3j | 296 · 32% · 3j | render fail |
| mean_curvature_flow_surgery | 98 · 0% · 3j | 412 · 71% · 3j | 257 · 19% · 3j |
| minimal_surface_knot_boundaries | 46 · 0% · 3j | 163 · 0% · 3j | 231 · 10% · 3j |
| navier_stokes_vortex_reconnection | render fail | 41 · 0% · 3j | 364 · 55% · 3j |
| ocean_eddy_lcs | 376 · 59% · 3j | 409 · 70% · 3j | 351 · 50% · 3j |
| optimal_transport_mass_flow_tubes | 58 · 0% · 3j | 53 · 0% · 3j | 339 ·... |
