Pybench: like pytest, but for catching regressions in noisy benchmarks


GitHub - AnthonyBeeblebrox/pybench: Discover benchmark functions, run them across many seeds, and statistically detect regressions against a saved baseline. · GitHub




pybench

Discover benchmark functions, run them across many seeds, and statistically detect regressions against a saved baseline.

pybench reruns each benchmark on the same stored seeds as its baseline, so the comparison is paired (far more sensitive than a two-sample test), and judges the whole benchmark with a within-seed sign-flip permutation test that respects correlation across metrics and steps.
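To make the paired idea concrete, here is a minimal sketch of a sign-flip permutation test on per-seed differences. It is not pybench's implementation: the function name, the one-sided direction, the number of permutations, and the add-one correction are all assumptions, and it covers only a single scalar metric. Extending it to several metrics and steps would mean applying the same sign flip to every value that belongs to one seed.

```python
import random


def sign_flip_p_value(baseline, current, n_perm=10_000, rng_seed=0):
    """One-sided p-value that `current` scores lower than `baseline`.

    Both lists hold one score per seed, in the same seed order, so
    element i of each list comes from the same seed (a paired design).
    Hypothetical helper, not part of pybench.
    """
    diffs = [c - b for b, c in zip(baseline, current)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(rng_seed)
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis of no change, each paired difference
        # is equally likely to carry either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if sum(flipped) / len(flipped) <= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


baseline_scores = [0.90, 0.91, 0.89, 0.92, 0.90, 0.91, 0.93, 0.88]
regressed_scores = [score - 0.02 for score in baseline_scores]

p_regressed = sign_flip_p_value(baseline_scores, regressed_scores)
p_unchanged = sign_flip_p_value(baseline_scores, baseline_scores)
```

Because every seed drops by the same amount in the regressed run, almost no random sign assignment produces a mean as low as the observed one, so the p-value is small even with only eight seeds. An unchanged run gives a p-value of 1.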

Docs: pybench.readthedocs.io

Install

uv add git+https://github.com/AnthonyBeeblebrox/pybench # or: pip install git+https://github.com/AnthonyBeeblebrox/pybench

Quickstart

Write a bench_* function that takes a seed and returns a score (higher is better; prefix lower-is-better metrics with min:):

    # benchmarks/bench_model.py
    def bench_accuracy(seed: int) -> float:
        return train_and_score(seed)  # a float, or a dict, or a list[dict] of steps

    pybench               # 1st time: samples seeds, saves a baseline, marks NEW
    pybench               # later: reruns on the same seeds, marks PASS / FAIL (exit 1 on fail)
    pybench update --yes  # re-baseline after an intended change
    pybench show          # print current baseline stats (--history for per-commit history)

pybench exits non-zero when any benchmark regresses, so it drops straight into CI like pytest.
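For trying the quickstart without a real model, the sketch below fills in train_and_score with a stand-in. The stand-in is hypothetical (the README does not define train_and_score); what matters is that all randomness derives from the seed, so rerunning a stored seed reproduces the same score and the comparison against the baseline stays paired.

```python
# benchmarks/bench_model.py
import random


def train_and_score(seed: int) -> float:
    """Hypothetical stand-in for a real training run: a noisy accuracy near 0.9."""
    rng = random.Random(seed)  # every random draw comes from this seeded generator
    noisy_accuracy = rng.gauss(0.9, 0.01)
    return min(1.0, max(0.0, noisy_accuracy))  # clamp to a valid accuracy


def bench_accuracy(seed: int) -> float:
    return train_and_score(seed)
```

A benchmark that reads global state, wall-clock time, or an unseeded generator would break this property and add noise that pairing cannot cancel.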

Return formats

    def bench_a(seed): return 0.91                                  # scalar
    def bench_b(seed): return {"accuracy": 0.91, "min:loss": 0.42}  # multiple metrics
    def bench_c(seed):                                              # multi-step curve
        return [{"step": 1, "min:loss": 0.9}, {"step": 10, "min:loss": 0.3}]
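One way to picture how the three shapes relate is to flatten them into a common list of (step, metric, value) rows, oriented so that higher is always better. This is only an illustrative sketch, not pybench's internals: the function name, the "score" label for a bare scalar, the default step index, and the use of negation for min: metrics are all assumptions.

```python
def normalize(result):
    """Flatten a benchmark result into (step, metric, value) rows.

    Values of metrics prefixed with "min:" are negated so that a larger
    value always means a better result. Hypothetical helper.
    """
    if isinstance(result, (int, float)):
        records = [{"score": float(result)}]  # "score" is an assumed metric name
    elif isinstance(result, dict):
        records = [result]
    else:
        records = list(result)

    rows = []
    for index, record in enumerate(records):
        step = record.get("step", index)  # assumed default: position in the list
        for name, value in record.items():
            if name == "step":
                continue
            if name.startswith("min:"):
                rows.append((step, name[len("min:"):], -float(value)))
            else:
                rows.append((step, name, float(value)))
    return rows


scalar_rows = normalize(0.91)
metric_rows = normalize({"accuracy": 0.91, "min:loss": 0.42})
curve_rows = normalize([{"step": 1, "min:loss": 0.9}, {"step": 10, "min:loss": 0.3}])
```

With a single orientation, one test statistic can treat a drop in accuracy and a rise in loss the same way.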

Configuration

Per-benchmark settings are keyword-only parameters with default values on the benchmark function itself; there is no config file:

    def bench_training(seed: int, *, n_seeds: int = 50, alpha: float = 0.01,
                       min_effect: float = 0.02, workers: int = 4) -> list[dict]:
        ...

| Parameter | Default | Meaning |
| --- | --- | --- |
| n_seeds | 30 | Seeds sampled for the baseline |
| alpha | 0.05 | Significance threshold |
| min_effect | None | Minimum relative drop to flag (suppress trivia) |
| workers | | Parallel seed processes (keep 1 for GPU/serial) |
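Settings declared this way can be read back with the standard library's inspect module. The sketch below shows the general technique; whether pybench itself uses inspect.signature is an assumption, and keyword_only_defaults is a hypothetical helper name.

```python
import inspect


def keyword_only_defaults(func):
    """Return {name: default} for keyword-only parameters that have a default."""
    parameters = inspect.signature(func).parameters.values()
    return {
        p.name: p.default
        for p in parameters
        if p.kind is inspect.Parameter.KEYWORD_ONLY
        and p.default is not inspect.Parameter.empty
    }


def bench_training(seed: int, *, n_seeds: int = 50, alpha: float = 0.01,
                   min_effect: float = 0.02, workers: int = 4) -> list:
    return [{"step": 1, "min:loss": 0.9}]  # placeholder body for the example


settings = keyword_only_defaults(bench_training)
```

The positional seed parameter is skipped because it is not keyword-only, so only the four settings end up in the result.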

Commit your baseline

The baseline lives at .pybench/baselines.jsonl (one line per benchmark). Commit it to git; do not gitignore it. History is delegated to git: commit the file after each pybench update, and pybench show --history reconstructs the baseline at every commit that touched it.
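Because the history lives in git, the same information can be recovered with plain git commands. The sketch below is an assumption about how that could be done, not a description of what pybench show --history runs: it lists the commits that touched the baseline file with git log, then reads the file at each commit with git show. The helper names are hypothetical, and the functions expect to be run from the repository root.

```python
import subprocess

BASELINE_PATH = ".pybench/baselines.jsonl"


def log_command(path=BASELINE_PATH):
    """argv for listing the hashes of commits that touched `path`, newest first."""
    return ["git", "log", "--format=%H", "--", path]


def show_command(commit, path=BASELINE_PATH):
    """argv for printing the contents of `path` as of `commit`."""
    return ["git", "show", f"{commit}:{path}"]


def baseline_history(path=BASELINE_PATH):
    """Return [(commit_hash, file_contents), ...] for every commit touching `path`."""
    log = subprocess.run(log_command(path), capture_output=True, text=True, check=True)
    history = []
    for commit in log.stdout.split():
        shown = subprocess.run(show_command(commit, path),
                               capture_output=True, text=True, check=True)
        history.append((commit, shown.stdout))
    return history
```

This only works if the baseline file is committed after each update, which is why the file should not be gitignored.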
