RL Speedrun

JeanKaddour/sokoban_speedrun: Teach Qwen3 Sokoban. The fastest recipe wins.

Sokoban Speedrun

Fastest recipe for RL fine-tuning Qwen3-4B-Instruct-2507 from 57% to >80% held-out pass@1 solve-rate on Sokoban puzzles, using a single 8xH100 node.

Play Sokoban if the task is unfamiliar.

World Record History

| Record time | FLOPs | Description | Date | Log | held-out pass@1 | Contributors |
| --- | --- | --- | --- | --- | --- | --- |
| 1:27:31 | 1.251 EFLOP | GRPO, LR 1.6e-6 annealed, 75 steps | 2026-06-17 | records/2026-06-17_01 | 0.891 (CI [0.86, 0.92]) | @JeanKaddour |
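As a rough sanity check on the record row, the wall-clock time and FLOP count can be turned into an average throughput. The sketch below assumes the FLOPs column is the total compute spent during the timed run and that it is spread evenly over all 8 GPUs; neither assumption is stated in the table, and the helper name `hms_to_seconds` is made up for illustration.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Values copied from the record row above.
record_seconds = hms_to_seconds("1:27:31")
total_flop = 1.251e18  # 1.251 EFLOP

# Assumption: the FLOP count covers the whole timed run on one 8-GPU node.
node_flops = total_flop / record_seconds
per_gpu_flops = node_flops / 8

print(f"run length:     {record_seconds} s")
print(f"node average:   {node_flops / 1e12:.1f} TFLOP/s")
print(f"per-GPU average: {per_gpu_flops / 1e12:.1f} TFLOP/s")
```

Under those assumptions the run averages roughly 238 TFLOP/s across the node, or about 30 TFLOP/s per GPU.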

Rules

Fastest wall-clock run wins: one training run on one 8xH100 node whose final checkpoint clears the target, timed from training step 1 through the final checkpoint write.

Target: the lower bound of the 95% bootstrap CI on held-out pass@1 must exceed 0.80 on datasets/sokoban_eval.jsonl.

Eval: 8 completions/puzzle, 12,288 tokens, temperature 0.8, top-p 0.95, seed 12345.

Fixed: model, train set, eval set, reward function, hardware.

Open: RL algorithm, loss, schedules, engine, parallelism, domain-agnostic rewards, prompt.

Not allowed: Sokoban-specific hints, heuristics, or few-shot examples.

Verification: maintainers rerun at a second seed; both runs must clear the target.
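The target rule can be illustrated with a small percentile-bootstrap sketch. It assumes each puzzle contributes a count of solved completions out of 8 and that resampling is done over puzzles. The actual procedure in eval_speedrun.py is not shown here and may differ, and the function names, the resample count, and the bootstrap seed are all hypothetical.

```python
import random


def pass_at_1(solved_counts, k=8):
    """Mean solve rate: solved completions over all sampled completions."""
    return sum(solved_counts) / (k * len(solved_counts))


def bootstrap_ci(solved_counts, k=8, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for pass@1, resampling puzzles with replacement.

    Assumption: puzzles are the resampling unit. The seed here is only for the
    bootstrap itself and is unrelated to the eval sampling seed in the rules.
    """
    rng = random.Random(seed)
    n = len(solved_counts)
    stats = []
    for _ in range(n_resamples):
        resample = [solved_counts[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(resample) / (k * n))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def clears_target(solved_counts, threshold=0.80):
    """The rule as stated: the lower CI bound must be strictly above 0.80."""
    lower, _ = bootstrap_ci(solved_counts)
    return lower > threshold


# Toy data: 100 puzzles, each with a count of solved completions out of 8.
toy_counts = [8] * 90 + [0] * 10
print(pass_at_1(toy_counts), bootstrap_ci(toy_counts), clears_target(toy_counts))
```

Because the rule is stated on the lower bound rather than the point estimate, a run with a mean just above 0.80 would typically not qualify. Under the verification rule, the same check would have to pass for the maintainers' second-seed rerun as well.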

Submit

Train, then eval the final checkpoint. Logs, rollouts, source snapshots, and eval JSON are written automatically.

Run python make_record_report.py records/ and fill in the Idea section.

Open a PR adding the record directory plus a leaderboard row. CI runs python verify_record.py records/.

Running the current record

On a local 8xH100 node:

    NODE_GPUS=8 torchrun --standalone --nproc_per_node=3 -m speedrun
    python -m eval_speedrun --eval-checkpoint outputs/<run>/step_000075

Modal

modal_app.py rents an 8xH100 box on Modal. Upload the datasets once, then start a run and eval its checkpoint after it finishes:

    modal volume put nanochat-rl-hf datasets/sokoban_train.jsonl /datasets/sokoban_train.jsonl
    modal volume put nanochat-rl-hf datasets/sokoban_eval.jsonl /datasets/sokoban_eval.jsonl
    modal run --detach modal_app.py
    EVAL_CHECKPOINT=latest modal run modal_app.py
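How modal_app.py resolves `EVAL_CHECKPOINT=latest` is not shown here. One plausible approach, given the zero-padded `step_000075` directory naming used in the local commands, is to pick the highest-numbered step directory under a run's output directory. The following is a minimal sketch under that assumption; `latest_checkpoint` is a hypothetical helper, not a function from the repository.

```python
import re
from pathlib import Path

# Assumption: checkpoints are directories named like "step_000075".
STEP_DIR = re.compile(r"^step_(\d+)$")


def latest_checkpoint(run_dir):
    """Return the step_* subdirectory with the largest step number, or None."""
    best_step, best_path = -1, None
    for child in Path(run_dir).iterdir():
        match = STEP_DIR.match(child.name)
        if match and child.is_dir():
            step = int(match.group(1))
            if step > best_step:
                best_step, best_path = step, child
    return best_path
```

Comparing the parsed integer rather than the directory name keeps the ordering correct even if the zero padding were ever inconsistent.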

Credits

Thanks to @joshua-a-harris and his nanoRL speedrun, nanochat, modded-nanoGPT, nanoRL, ScaleRL, and ReasoningGym.
