LLMs as 5x Faster Sandboxes

GitHub - experientiallabs/world-model-harness: World-model-as-a-harness for simulating AI agent environments · GitHub



World Model Harness

Docker as an LLM. Simulate an agent environment from traces instead of standing up a sandbox.

A frontier LLM acts as the environment your agent steps against, reconstructed from OpenTelemetry traces. The harness ingests recorded (state, action) -> observation steps, builds a retrieval index, evolves the base environment prompt with GEPA, and serves the resulting world model locally.
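To make the recorded-step and retrieval-index ideas concrete, here is a minimal sketch in Python. It is not the harness's implementation: `Step`, `ReplayIndex`, and the token-overlap (Jaccard) similarity are hypothetical names and choices made for illustration, and the real index may use a different representation entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One recorded (state, action) -> observation step (hypothetical shape)."""
    state: str
    action: str
    observation: str


def tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens; a deliberately crude text representation."""
    return set(text.lower().split())


class ReplayIndex:
    """Toy retrieval index over recorded steps, keyed on state plus action."""

    def __init__(self, steps):
        self._steps = list(steps)
        self._keys = [tokens(s.state + " " + s.action) for s in self._steps]

    def nearest(self, state: str, action: str, k: int = 1) -> list[Step]:
        """Return the k recorded steps most similar to the query (Jaccard overlap)."""
        query = tokens(state + " " + action)

        def score(i: int) -> float:
            union = query | self._keys[i]
            return len(query & self._keys[i]) / len(union) if union else 0.0

        ranked = sorted(range(len(self._steps)), key=score, reverse=True)
        return [self._steps[i] for i in ranked[:k]]


steps = [
    Step("booking ABC123 confirmed", "get_reservation ABC123", "status: confirmed"),
    Step("no booking", "search_flights SFO JFK", "3 flights found"),
]
index = ReplayIndex(steps)
best = index.nearest("booking ABC123 confirmed", "get_reservation ABC123")[0]
```

Retrieved steps like `best` could then be supplied to the LLM as examples when it predicts the next observation; how the harness actually uses them is not specified here.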

How It Works

Build from OTel traces: ingest, normalize, split train/held-out, index the replay buffer, and optimize the environment prompt.

Serve or play the built model: agents call WorldModel.step(action) in-process or through the local HTTP backend.

Evaluate reconstruction fidelity with wmh eval against trace files.
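The in-process usage described above amounts to an agent loop that sends actions and reads back observations. The sketch below illustrates that loop against a stand-in object; `SteppableEnv`, `ScriptedWorldModel`, and `run_episode` are hypothetical, and the only detail taken from the README is that a world model exposes a `step(action)` call. The real `WorldModel` signature and return type may differ.

```python
from typing import Protocol


class SteppableEnv(Protocol):
    """Anything with a step(action) call (assumed to take and return text)."""

    def step(self, action: str) -> str: ...


class ScriptedWorldModel:
    """Stand-in for a built world model: returns canned observations.

    A real world model would ask an LLM to predict the observation instead.
    """

    def __init__(self, responses: dict[str, str]):
        self._responses = responses
        self.history: list[tuple[str, str]] = []

    def step(self, action: str) -> str:
        observation = self._responses.get(action, "error: unknown action")
        self.history.append((action, observation))
        return observation


def run_episode(env: SteppableEnv, actions: list[str]) -> list[str]:
    """Step the environment once per action and collect the observations."""
    return [env.step(action) for action in actions]


env = ScriptedWorldModel({"get_user": "user: alice"})
observations = run_episode(env, ["get_user", "cancel_flight"])
```

Because the agent only depends on `step`, the same loop could in principle run against a real sandbox or a simulated one.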

Quickstart

uv sync
wmh providers verify
wmh build --name airline --file examples/tau-bench/traces.otel.jsonl
wmh list
wmh eval examples/tau-bench/traces.otel.jsonl
wmh eval list
wmh eval run tau-bench
wmh eval results
wmh examples list
wmh examples run tau-bench -- --trace 0
wmh serve
wmh demo --name airline
wmh play --name airline

wmh build with no flags launches a guided creation wizard on an interactive terminal. Pass --file and related flags, or --no-interactive, for scriptable runs.

World models are named and stored under .wmh/models//. wmh list, wmh serve, wmh demo, and wmh play only use models built locally in that directory.

CLI Reference

| Command | What it does |
| --- | --- |
| wmh build | Builds a named world model from OTel traces or a vendor trace pull. It ingests traces, normalizes them, splits train/held-out data, builds the retrieval index, runs GEPA prompt optimization, and writes the artifact to .wmh/models//. With no required inputs on a TTY, it opens the guided wizard. |
| wmh list | Lists world models found under the selected root's models/ directory, including provider, held-out score, rollout count, and frontier size when those metrics exist. By default, the selected root is .wmh/, so plain wmh list does not read committed example artifacts. |
| wmh eval | Scores reconstruction fidelity on one or more OTel trace files. It performs a deterministic train/held-out split, replays held-out steps through the base or supplied prompt, grades predicted observations against recorded observations, and prints per-file plus overall fidelity. |
| wmh eval list | Lists named eval suites from examples//evals/*.toml. Suites are example-local definitions for repeatable reconstruction-fidelity runs. |
| wmh eval run | Runs a named eval suite, using its configured trace files and split/scoring settings. Results are written as local JSON under .wmh/evals/// unless --out is supplied. The default suite for an example can be selected by task name, e.g. wmh eval run tau-bench. |
| wmh eval results [suite] | Summarizes locally saved named eval results from .wmh/evals/. These are generated artifacts and should not be committed. |
| wmh serve | Starts the local FastAPI backend on 127.0.0.1:8000 by default. It serves all locally built models, or only the repeated --name selections, through /world_models/... HTTP routes. |
| wmh demo | Runs a short demo against a built model. A throwaway LLM agent proposes an action from sampled trace examples, the world model predicts the environment observation, and the CLI prints the action, environment prompt, and observation. |
| wmh play | Opens an interactive REPL for a built model. You type tool calls or free-text actions, and the... |
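The wmh eval description mentions a deterministic train/held-out split and grading predicted observations against recorded ones. The sketch below shows one way such a pipeline could work. It is an assumption, not the harness's method: hashing a trace ID into a bucket, the 20% held-out fraction, and whitespace-insensitive exact matching are all illustrative choices, and `is_held_out` and `fidelity` are hypothetical names.

```python
import hashlib


def is_held_out(trace_id: str, held_out_fraction: float = 0.2) -> bool:
    """Assign a trace to the held-out set based only on a hash of its ID.

    The same ID always lands on the same side, so reruns are repeatable.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1]
    return bucket < held_out_fraction


def normalize(text: str) -> str:
    """Collapse whitespace and case so trivial formatting differences match."""
    return " ".join(text.split()).lower()


def fidelity(predicted: list[str], recorded: list[str]) -> float:
    """Fraction of predicted observations that match the recorded ones."""
    if len(predicted) != len(recorded):
        raise ValueError("predicted and recorded must have the same length")
    if not recorded:
        return 0.0
    matches = sum(normalize(p) == normalize(r) for p, r in zip(predicted, recorded))
    return matches / len(recorded)


score = fidelity(["Status:  OK", "none"], ["status: ok", "error"])
```

Exact matching is the strictest possible grader; a real evaluator may well use a softer comparison, such as an LLM judge or a similarity threshold.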
