Shader Benchmark Results
Three frontier coding agents generate WGSL shaders from text prompts for 130 mathematical visualization problems (20 frontier, 10 reconstruction, 100 rest). Each rendered image is scored 0–100 in each of five categories (500 maximum) by one or more LLM judges.
Model summary
| Model | Score (each render fail counts as 0) | Score excluding failures (rendered only) | Render fails | Best | Worst |
|---|---|---|---|---|---|
| Claude Opus 4.7 | 235.2 / 500 (12%) | 261.3 / 500 (20%) | 13 of 130 (10%) | epicycloids (476) | menger_cube_fractal (19) |
| Gemini 3.1-pro-preview | 224.0 / 500 (8%) | 262.4 / 500 (21%) | 19 of 130 (15%) | epicycloids (473) | mandelbulb_fractal (9) |
| Codex GPT-5.5 high | 242.2 / 500 (14%) | 258.1 / 500 (19%) | 8 of 130 (6%) | five_pointed_star_polygon (466) | archimedean_spiral_galaxy (14) |

Per-judge scores (rendered problems only):

| Model | Codex judge | Claude judge | Gemini judge | Problems scored |
|---|---|---|---|---|
| Claude Opus 4.7 | 252.9 / 500 | 236.3 / 500 | 294.8 / 500 | 117 |
| Gemini 3.1-pro-preview | 259.5 / 500 | 221.0 / 500 | 306.6 / 500 | 111 |
| Codex GPT-5.5 high | 277.5 / 500 | 219.5 / 500 | 277.6 / 500 | 122 |

Score averages over all 130 problems, with 0 imputed for each render fail. Score excluding failures averages over rendered problems only, as do the per-judge scores.
Percentage = (score − 200) / 300. The judges floor around 200 even for unrecognizable images, so the interesting range is 200–500.
A render fail means the WGSL shader the model produced did not compile or the shader_harness exited non-zero. Render fails count as 0 in the “Score” column and are excluded from “Score excluding failures”.
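The two score columns can be cross-checked with a little arithmetic. The sketch below assumes that the rendered-only score is the plain mean of the three per-judge means, and that the headline score scales it by the fraction of problems that rendered (equivalent to imputing 0 for each render fail). The function names are illustrative and do not come from the benchmark's own code.

```python
def rendered_only(judge_means):
    """Mean of the per-judge means (assumed aggregation, rendered problems only)."""
    return sum(judge_means) / len(judge_means)


def with_fails_as_zero(rendered_mean, n_rendered, n_total=130):
    """Average over all problems, counting each render fail as 0."""
    return rendered_mean * n_rendered / n_total


def to_percent(score, floor=200.0, ceiling=500.0):
    """Percentage = (score - 200) / 300, expressed on a 0-100 scale."""
    return 100.0 * (score - floor) / (ceiling - floor)


# Claude Opus 4.7: Codex, Claude, and Gemini judge means; 117 of 130 rendered.
claude_rendered = rendered_only([252.9, 236.3, 294.8])     # about 261.3
claude_overall = with_fails_as_zero(claude_rendered, 117)  # about 235.2

print(round(claude_rendered, 1), round(to_percent(claude_rendered)))
print(round(claude_overall, 1), round(to_percent(claude_overall)))
```

Under these assumptions the Claude Opus 4.7 figures reproduce the reported 261.3 (20%) and 235.2 (12%). Small last-digit differences for other models are possible because the per-judge inputs are themselves rounded.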
Per-problem comparison
Problems are split into Frontier, Reconstruction, and Rest. Each score is the sum of 5 categories (max 500). The bar shows the final score as (score − 200) / 300, clamped to 0–100%.
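The clamped bar percentage can be sketched as follows. This is a minimal illustration of the stated formula, with a hypothetical function name and rounding to the nearest whole percent assumed from the displayed values.

```python
def bar_percent(score, floor=200, ceiling=500):
    """Map a 0-500 score to a 0-100% bar: (score - 200) / 300, clamped."""
    frac = (score - floor) / (ceiling - floor)
    frac = min(1.0, max(0.0, frac))  # clamp to [0, 1]
    return round(100 * frac)


print(bar_percent(422))  # 74
print(bar_percent(274))  # 25
print(bar_percent(80))   # 0 (below the 200 floor, clamped)
```

Scores below 200 therefore all display as 0%, which is why several rendered shaders in the table show 0% despite a nonzero score.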
Frontier (20 problems)

| Problem | Claude Opus 4.7 | Gemini 3.1-pro-preview | Codex GPT-5.5 high |
|---|---|---|---|
| braid_word_reduction_ribbons | 422 · 74% · 3j | render fail | 274 · 25% · 3j |
| cellular_potts_tissue_folding | 404 · 68% · 3j | 80 · 0% · 3j | 325 · 42% · 3j |
| coxeter_reflection_kaleidoscope | 422 · 74% · 3j | 23 · 0% · 3j | 22 · 0% · 3j |
| crystal_dislocation_network | 431 · 77% · 3j | 365 · 55% · 3j | 276 · 25% · 3j |
| differentiable_rendering_ambiguity_landscape | 60 · 0% · 3j | 333 · 44% · 3j | 250 · 17% · 3j |
| earthquake_fault_slip_wavefronts | 193 · 0% · 3j | 368 · 56% · 3j | 382 · 61% · 3j |
| error_correcting_code_decoding_landscape | 423 · 74% · 3j | 355 · 52% · 3j | 341 · 47% · 3j |
| fractal_drum_eigenfunctions | 392 · 64% · 3j | 296 · 32% · 3j | render fail |
| mean_curvature_flow_surgery | 98 · 0% · 3j | 412 · 71% · 3j | 257 · 19% · 3j |
| minimal_surface_knot_boundaries | 46 · 0% · 3j | 163 · 0% · 3j | 231 · 10% · 3j |
| navier_stokes_vortex_reconnection | render fail | 41 · 0% · 3j | 364 · 55% · 3j |
| ocean_eddy_lcs | 376 · 59% · 3j | 409 · 70% · 3j | 351 · 50% · 3j |
| optimal_transport_mass_flow_tubes | 58 · 0% · 3j | 53 · 0% · 3j | 339 ·... |