experientiallabs/world-model-harness: World-model-as-a-harness for simulating AI agent environments
examples
wmh
.env.example
.gitignore
AGENTS.md
CLAUDE.md
README.md
pyproject.toml
uv.lock
World Model Harness
Docker as an LLM. Simulate an agent environment from traces instead of standing up a sandbox.
A frontier LLM acts as the environment your agent steps against, reconstructed from OpenTelemetry traces. The harness ingests recorded (state, action) -> observation steps, builds a retrieval index, evolves the base environment prompt with GEPA, and serves the resulting world model locally.
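To make the (state, action) -> observation idea concrete, here is a minimal sketch of a replay buffer with retrieval. Everything in it is an assumption for illustration: the `Step` record, the token-overlap similarity, and the prompt layout are hypothetical stand-ins, not the harness's actual data model or index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Hypothetical record for one recorded transition."""
    state: str
    action: str
    observation: str


def _tokens(text: str) -> set[str]:
    # Lowercased whitespace tokens; a real index would likely use embeddings.
    return set(text.lower().split())


def retrieve(buffer: list[Step], state: str, action: str, k: int = 2) -> list[Step]:
    """Return the k recorded steps most similar to the query (Jaccard overlap)."""
    query = _tokens(state) | _tokens(action)

    def score(step: Step) -> float:
        cand = _tokens(step.state) | _tokens(step.action)
        union = query | cand
        return len(query & cand) / len(union) if union else 0.0

    return sorted(buffer, key=score, reverse=True)[:k]


def build_prompt(base_prompt: str, examples: list[Step], state: str, action: str) -> str:
    """Assemble an environment prompt: base text, retrieved examples, then the query."""
    lines = [base_prompt, "", "Recorded examples:"]
    for ex in examples:
        lines.append(f"- state={ex.state!r} action={ex.action!r} -> {ex.observation!r}")
    lines += ["", f"Current state: {state}", f"Action: {action}", "Observation:"]
    return "\n".join(lines)
```

In this sketch the LLM would complete the text after "Observation:", so the retrieved examples act as few-shot evidence of how the real environment responded to similar actions.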
How It Works
Build from OTel traces: ingest, normalize, split train/held-out, index the replay buffer, and optimize the environment prompt.
Serve or play the built model: agents call WorldModel.step(action) in-process or through the local HTTP backend.
Evaluate reconstruction fidelity with wmh eval against trace files.
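The train/held-out split and the fidelity score above can be sketched as follows. This is an illustration under stated assumptions: hashing trace ids is one common way to get a deterministic split, and normalized exact match is a simple stand-in grader. The function names are hypothetical, and the harness may split and grade differently.

```python
import hashlib


def is_held_out(trace_id: str, held_out_fraction: float = 0.2) -> bool:
    """Assign a trace to the held-out set from a stable hash of its id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < held_out_fraction


def split(trace_ids: list[str], held_out_fraction: float = 0.2) -> tuple[list[str], list[str]]:
    """Partition trace ids into (train, held_out); same input gives same output."""
    train = [t for t in trace_ids if not is_held_out(t, held_out_fraction)]
    held_out = [t for t in trace_ids if is_held_out(t, held_out_fraction)]
    return train, held_out


def _norm(text: str) -> str:
    # Ignore case and whitespace differences when comparing observations.
    return " ".join(text.lower().split())


def fidelity(predicted: list[str], recorded: list[str]) -> float:
    """Fraction of predicted observations that match the recorded ones."""
    if len(predicted) != len(recorded):
        raise ValueError("predicted and recorded must have the same length")
    if not recorded:
        return 0.0
    hits = sum(_norm(p) == _norm(r) for p, r in zip(predicted, recorded))
    return hits / len(recorded)
```

Because the split depends only on the trace id, re-running an evaluation on the same file scores the same held-out steps, which keeps fidelity numbers comparable across prompt versions.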
Quickstart
    uv sync
    wmh providers verify
    wmh build --name airline --file examples/tau-bench/traces.otel.jsonl
    wmh list
    wmh eval examples/tau-bench/traces.otel.jsonl
    wmh eval list
    wmh eval run tau-bench
    wmh eval results
    wmh examples list
    wmh examples run tau-bench -- --trace 0
    wmh serve
    wmh demo --name airline
    wmh play --name airline
wmh build with no flags launches a guided creation wizard on an interactive terminal. Pass --file and related flags, or --no-interactive, for scriptable runs.
World models are named and stored under .wmh/models//. wmh list, wmh serve, wmh demo, and wmh play only use models built locally in that directory.
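Since those commands only see locally built models, a script may want to check what is available before calling them. The sketch below assumes each model occupies its own subdirectory under the models directory; that layout and the helper name `local_model_names` are assumptions, and `wmh list` remains the supported way to inspect models.

```python
from pathlib import Path


def local_model_names(root: Path = Path(".wmh")) -> list[str]:
    """Return the names of model directories under <root>/models, sorted."""
    models_dir = root / "models"
    if not models_dir.is_dir():
        return []  # nothing has been built under this root yet
    return sorted(p.name for p in models_dir.iterdir() if p.is_dir())


if __name__ == "__main__":
    names = local_model_names()
    print(names if names else "no local models; run `wmh build` first")
```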
CLI Reference
| Command | What it does |
| --- | --- |
| wmh build | Builds a named world model from OTel traces or a vendor trace pull. It ingests traces, normalizes them, splits train/held-out data, builds the retrieval index, runs GEPA prompt optimization, and writes the artifact to .wmh/models//. With no required inputs on a TTY, it opens the guided wizard. |
| wmh list | Lists world models found under the selected root's models/ directory, including provider, held-out score, rollout count, and frontier size when those metrics exist. By default, the selected root is .wmh/, so plain wmh list does not read committed example artifacts. |
| wmh eval | Scores reconstruction fidelity on one or more OTel trace files. It performs a deterministic train/held-out split, replays held-out steps through the base or supplied prompt, grades predicted observations against recorded observations, and prints per-file plus overall fidelity. |
| wmh eval list | Lists named eval suites from examples//evals/*.toml. Suites are example-local definitions for repeatable reconstruction-fidelity runs. |
| wmh eval run | Runs a named eval suite, using its configured trace files and split/scoring settings. Results are written as local JSON under .wmh/evals/// unless --out is supplied. The default suite for an example can be selected by task name, e.g. wmh eval run tau-bench. |
| wmh eval results [suite] | Summarizes locally saved named eval results from .wmh/evals/. These are generated artifacts and should not be committed. |
| wmh serve | Starts the local FastAPI backend on 127.0.0.1:8000 by default. It serves all locally built models, or only the repeated --name selections, through /world_models/... HTTP routes. |
| wmh demo | Runs a short demo against a built model. A throwaway LLM agent proposes an action from sampled trace examples, the world model predicts the environment observation, and the CLI prints the action, environment prompt, and observation. |
| wmh play | Opens an interactive REPL for a built model. You type tool calls or free-text actions, and the... |