Replay-based regression testing for ROS 2


Kaedim/perception-replay-ci: Replay-based regression testing for ROS 2 perception stacks. Same recorded /scan, two perception versions, automatic pass/fail.



perception-replay-ci

A proof-of-concept Robot CI regression test. Replay a recorded /scan log through two versions of a perception stack (baseline vs. candidate) and automatically detect whether the candidate produces incorrect output.

The thesis: given the same recorded robot sensor log, can we replay it through two versions of a perception stack and automatically catch a regression in the candidate?

This is open-loop replay testing — the robot does not move during the test. The bag is the fixture, the perception nodes are the unit under test, and the comparator decides pass/fail. Full spec in demo_spec.md.

Stack

ROS 2 Humble via RoboStack (robostack-humble conda channel)

TurtleBot3 in Gazebo for capturing the test fixture

MCAP-format rosbags

Python perception nodes (rclpy)

pixi for environment + tasks

Currently osx-arm64 only (see [workspace] platforms in pixi.toml).

Install

pixi install

This pulls the full ROS Humble desktop, TurtleBot3 packages, and the MCAP storage plugin.

End-to-end demo

1. Capture the golden bag (one time)

Three shells:

pixi run sim            # Gazebo + TurtleBot3 in turtlebot3_world
pixi run record-golden  # records /scan /tf /tf_static /odom to bags/golden_obstacle/
pixi run control        # teleop keyboard: drive toward obstacles

Drive forward, approach an obstacle until the front of /scan reads ~0.4–0.5 m, hold, back off, repeat for a second obstacle if you like. Ctrl-C the recorder first when done so the MCAP finalizes cleanly.

2. Replay the bag through each perception version

For each version, three shells:

pixi run baseline                         # or `candidate`
pixi run record-run runs/baseline.jsonl   # or runs/candidate.jsonl
pixi run -- ros2 bag play bags/golden_obstacle

When the bag finishes, Ctrl-C the recorder (it'll log the line count).

3. Compare

pixi run compare

Exits 0 on PASS, 1 on FAIL — CI-friendly.
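The exit-code convention above is what a CI job would key on. A minimal sketch of a wrapper that runs a comparison command and turns its exit status into a verdict; `run_gate` is a hypothetical helper, not part of this repository, and treating every non-zero status as FAIL is an assumption (the README only states 0 and 1):

```python
import subprocess
import sys


def run_gate(cmd):
    """Run a comparison command and map its exit status to a verdict string.

    Assumption: exit 0 means PASS and any non-zero status means FAIL.
    """
    completed = subprocess.run(cmd, check=False)
    return "PASS" if completed.returncode == 0 else "FAIL"


# Example usage in a CI script (requires pixi and both run files to exist):
#   verdict = run_gate(["pixi", "run", "compare"])
#   sys.exit(0 if verdict == "PASS" else 1)
```

In most CI systems calling `pixi run compare` directly as a job step is enough, since a non-zero exit fails the step; a wrapper like this is only useful when the verdict needs to be logged or combined with other checks.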

Example FAIL output:

Test: turtlebot3_laserscan_obstacle_regression
Result: FAIL

Reason:
- Candidate failed to detect obstacle for 16.2s across 4 window(s)
- Minimum observed distance during misses: 0.25m

Disagreement windows:
 99.75s → 108.55s ( 8.80s, n= 45, min_range=0.31m): miss
123.15s → 127.55s ( 4.40s, n= 23, min_range=0.25m): miss
127.95s → 130.35s ( 2.40s, n= 13, min_range=0.25m): miss
131.15s → 131.75s ( 0.60s, n=  4, min_range=0.27m): miss

Recommendation:
Do not deploy candidate perception config.

How it works

Both perception nodes subscribe to /scan, find the minimum range in a forward ±30° wedge, and publish:

/obstacle/detected (std_msgs/Bool): fired when the min range is below the threshold

/obstacle/range (sensor_msgs/Range, stamped with the scan's original timestamp)

The baseline uses a 0.50 m threshold; the candidate uses 0.25 m (the intentional defect). Both publish to the same topics — the workflow runs the bag twice, once per node, never simultaneously.
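The wedge-and-threshold logic described above can be sketched in plain Python, without rclpy, so it can be exercised in isolation. This is an illustration, not the repository's code: the function names are hypothetical, angle 0 is assumed to point straight ahead (as on the TurtleBot3 lidar), and the strict `<` comparison is an assumption.

```python
import math


def min_range_in_wedge(ranges, angle_min, angle_increment, half_width_deg=30.0,
                       range_min=0.0, range_max=math.inf):
    """Return the smallest valid range within +/- half_width_deg of straight ahead.

    The arguments mirror the sensor_msgs/LaserScan fields of the same names.
    Returns math.inf when no valid reading falls inside the wedge.
    """
    half_width = math.radians(half_width_deg)
    best = math.inf
    for i, r in enumerate(ranges):
        angle = angle_min + i * angle_increment
        # Wrap into [-pi, pi) so a 0..2*pi scan still has its front at angle 0.
        angle = (angle + math.pi) % (2.0 * math.pi) - math.pi
        if abs(angle) > half_width:
            continue
        # Skip inf/NaN and out-of-spec readings.
        if not math.isfinite(r) or r < range_min or r > range_max:
            continue
        best = min(best, r)
    return best


def is_obstacle(min_range, threshold):
    """Detection rule; the strict comparison is an assumption."""
    return min_range < threshold


# A 360-beam scan with a 0.3 m return 10 degrees off the forward axis.
scan = [2.0] * 360
scan[350] = 0.3
front = min_range_in_wedge(scan, 0.0, math.radians(1.0))
print(is_obstacle(front, 0.50), is_obstacle(front, 0.25))
```

With the two thresholds from the text, the same 0.3 m reading is detected by the baseline and missed by the candidate, which is the disagreement the comparator is meant to surface.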

record_run.py taps both topics during a replay and writes one JSONL row per scan: {"t": <timestamp>, "detected": bool, "min_range": float}.
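A minimal sketch of writing and reading rows in that shape, using only the standard library. `write_row` and `read_rows` are hypothetical helpers, and storing `t` as float seconds is an assumption; the actual record_run.py may represent the timestamp differently.

```python
import io
import json


def write_row(stream, t, detected, min_range):
    """Append one scan's result as a single JSON line."""
    row = {"t": t, "detected": bool(detected), "min_range": float(min_range)}
    stream.write(json.dumps(row) + "\n")


def read_rows(stream):
    """Parse a JSONL stream back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


# Round trip through an in-memory buffer instead of runs/baseline.jsonl.
buf = io.StringIO()
write_row(buf, 99.75, True, 0.31)
write_row(buf, 99.95, False, 0.62)
buf.seek(0)
rows = read_rows(buf)
```

One line per scan keeps the file appendable while the bag is still playing, and lets the comparator stream both files without loading a single large JSON document.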

compare_runs.py reads two JSONL files, pairs rows by index (timestamps are identical across runs because both come from the same bag), groups disagreements into contiguous windows, classifies them as miss (baseline=true, candidate=false) or false_alarm (the inverse), and emits the pass/fail report.
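The pairing, grouping, and classification steps can be sketched as follows. This is an illustration under assumptions rather than the actual compare_runs.py: the function names and the window dictionary layout are hypothetical, a change of disagreement type is assumed to start a new window, and the rule that turns windows into PASS or FAIL is left out because it is not specified here.

```python
def classify(base_row, cand_row):
    """Return None on agreement, else 'miss' or 'false_alarm'."""
    if base_row["detected"] == cand_row["detected"]:
        return None
    # miss: baseline saw the obstacle, candidate did not.
    return "miss" if base_row["detected"] else "false_alarm"


def disagreement_windows(baseline, candidate):
    """Pair rows by index and group consecutive disagreements into windows.

    zip() stops at the shorter run, so trailing unpaired rows are ignored.
    """
    windows = []
    current = None
    for base_row, cand_row in zip(baseline, candidate):
        kind = classify(base_row, cand_row)
        if kind is None:
            current = None  # agreement closes any open window
            continue
        if current is None or current["kind"] != kind:
            current = {"kind": kind, "start": base_row["t"], "end": base_row["t"],
                       "n": 0, "min_range": base_row["min_range"]}
            windows.append(current)
        current["end"] = base_row["t"]
        current["n"] += 1
        current["min_range"] = min(current["min_range"], base_row["min_range"])
    return windows


def make_rows(flags, ranges):
    """Build rows with integer timestamps for the example below."""
    return [{"t": t, "detected": d, "min_range": r}
            for t, (d, r) in enumerate(zip(flags, ranges))]


ranges = [1.0, 0.4, 0.3, 1.0, 0.6, 0.2]
base = make_rows([False, True, True, False, False, True], ranges)
cand = make_rows([False, False, False, False, True, True], ranges)
result = disagreement_windows(base, cand)
```

Here `result` holds one two-row `miss` window followed by a one-row `false_alarm` window. Pairing by index works only because both runs replay the same bag; if a node dropped scans, matching on `t` would be the safer choice.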

Layout

scripts/
  perception_baseline.py   reference detector (threshold 0.50m)
  perception_candidate.py  intentionally broken (threshold...
