Webwright: A terminal is all you need for web agents


Terminal-native web agents

A terminal is all you need for web agents.

Webwright gives the model a terminal, a local workspace, and the freedom to write code that launches, inspects, and discards browser sessions. The output is not just a completed task, but a reusable program.


3 core modules

~1K lines of harness code

86.7% Online-Mind2Web accuracy

60.8% Odysseys score

Paradigm shift

In Webwright, the agent can launch multiple browser sessions from the terminal.

Traditional web agents keep one browser session alive and predict the next click, type, or scroll. Webwright separates the agent from that session: the browser can be launched, inspected, and discarded, while code, logs, screenshots, and outputs persist in the local workspace.

Disposable browsers

The agent can spawn fresh browser sessions, capture screenshots only when useful, inspect failures, and rerun scripts without being trapped in a single stateful page.
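The disposable-session idea can be sketched in a few lines. This is an illustration under stated assumptions, not Webwright's code: `run_with_fresh_sessions`, the `launch` callable, and the `attempts.log` file name are all hypothetical. With Playwright, `launch` might wrap a browser launch, but any object with a `close()` method works here.

```python
from pathlib import Path
from typing import Any, Callable


def run_with_fresh_sessions(
    task: Callable[[Any], Any],
    launch: Callable[[], Any],
    workspace: Path,
    attempts: int = 3,
) -> Any:
    """Run `task` against a brand-new session per attempt (hypothetical helper).

    The session is always closed; only the log in `workspace` persists.
    """
    workspace.mkdir(parents=True, exist_ok=True)
    log = workspace / "attempts.log"
    last_error = None
    for attempt in range(1, attempts + 1):
        session = launch()  # fresh state every time, nothing carried over
        try:
            result = task(session)
            with log.open("a") as fh:
                fh.write(f"attempt {attempt}: ok\n")
            return result
        except Exception as exc:
            last_error = exc
            with log.open("a") as fh:
                fh.write(f"attempt {attempt}: failed ({type(exc).__name__}: {exc})\n")
        finally:
            session.close()  # discard the session whether or not the task succeeded
    raise RuntimeError(f"all {attempts} attempts failed") from last_error
```

The point of the design is that a failed attempt costs nothing but a log line: the next attempt starts from a clean session rather than a half-broken page.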

Code composes actions

Date selection, form filling, filtering, comparison, and extraction can become loops and functions instead of long chains of primitive browser actions.
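As a rough illustration of the difference, a date comparison that would take dozens of click and type actions can collapse into one loop. The names below (`date_range`, `cheapest_by_date`, and the `fetch_price` callable that would wrap the actual browser steps) are hypothetical and not taken from Webwright.

```python
from datetime import date, timedelta
from typing import Callable, Iterable


def date_range(start: date, days: int) -> list[date]:
    """Return `days` consecutive dates beginning at `start`."""
    return [start + timedelta(days=i) for i in range(days)]


def cheapest_by_date(
    dates: Iterable[date],
    fetch_price: Callable[[date], float],
) -> tuple[date, float]:
    """Query every date once and return the (date, price) pair with the lowest price.

    `fetch_price` is assumed to hide the browser work: open the page,
    pick the date, read the price.
    """
    prices = {d: fetch_price(d) for d in dates}
    best = min(prices, key=lambda d: prices[d])
    return best, prices[best]
```

An agent that writes a function like this only has to get the single-date interaction right once; the loop handles the rest.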

Artifacts survive

The durable output is a workspace: exploratory scripts, action logs, screenshots, final outputs, and eventually a reusable task program.

Reported results

A small harness, competitive long-horizon performance.

The report evaluates Webwright on live, long-horizon web benchmarks while preserving the simple terminal interface. The same pipeline also records critical-point screenshots, action logs, and reusable command-line tools.

Charts: Odysseys (long-horizon browsing score) and Online-Mind2Web (accuracy on 300 live tasks).

Odysseys: 60.8%

Long-horizon browsing score, a 35.1% relative improvement over the previously reported state of the art.

Online-Mind2Web: 86.7%

GPT-5.4 accuracy on 300 live tasks across 136 sites with a 100-step budget.

Small model with tools: 66.2%

Qwen3.5-9B on the hard split of Online-Mind2Web when augmented with crafted reusable tools.
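To make the Odysseys "relative improvement" figure concrete: the page does not state the earlier score, but if the 35.1% is computed as (new - old) / old, the implied baseline can be backed out with one division. The function name is hypothetical and the result is an inference from the two published numbers, not a reported value.

```python
def implied_baseline(score: float, relative_gain: float) -> float:
    """Back out the earlier score, assuming gain = (score - baseline) / baseline."""
    return score / (1.0 + relative_gain)


# 60.8 is the reported Odysseys score; 0.351 is the reported relative improvement.
print(f"{implied_baseline(60.8, 0.351):.1f}")  # prints 45.0
```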

Minimal harness

One loop, three modules, no orchestration tower.

The implementation is deliberately small: a Runner, a Model Endpoint, and a terminal Environment. Each is a single module, totaling roughly 1K lines of harness code, with no multi-agent orchestration or complex planning hierarchy.

01. Send context: The runner sends the task, workspace state, and recent observations to the model.

02. Emit bash: The model returns a thinking block and a shell command, often writing Playwright-backed scripts to explore pages and collect data.

03. Return observations: The environment runs the command and returns terminal output, logs, screenshots, files, or error tracebacks.

04. Refine and finish: The loop continues until the agent produces a final script, reruns it in a fresh folder, and passes self-reflection.
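The four steps above can be sketched as a single loop. This is a minimal sketch under assumptions, not the actual harness: `Step`, `run_command`, and `agent_loop` are hypothetical names, the model is reduced to a callable that returns the next shell command (or `None` when it considers the task finished), and the thinking block, screenshots, and self-reflection check are omitted.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    command: str
    output: str


def run_command(command: str, cwd: str, timeout: int = 60) -> str:
    """Environment: run one shell command in the workspace and return its output."""
    try:
        # shell=True executes model-written text; only do this inside a sandbox.
        proc = subprocess.run(
            command, shell=True, cwd=cwd,
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return f"[timeout after {timeout}s]"
    return (proc.stdout + proc.stderr).strip()


def agent_loop(
    task: str,
    model: Callable[[str, list[Step]], Optional[str]],
    cwd: str,
    max_steps: int = 100,
) -> list[Step]:
    """Runner: send context, receive a command, execute it, feed the result back."""
    history: list[Step] = []
    for _ in range(max_steps):
        command = model(task, history)  # steps 01 and 02
        if command is None:             # assumed convention for "finished"
            break
        history.append(Step(command, run_command(command, cwd)))  # step 03
    return history
```

Errors are returned as ordinary output rather than raised, so a traceback from a failed script becomes the next observation the model can react to.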

workspace/run

$ python final_script.py
open browser
search live web pages
capture screenshots
write action log

$ python -m webwright.tools.self_reflection
evaluate critical points
Status: success

$ ls final_runs/run_1
final_script.py
final_script_log.txt
screenshots/
self_reflect_result.json

Workspace trace

Watch a long web task turn into files, commands, and a verified final run.

The trace below makes the terminal-native loop visible. The left panel shows the workspace growing as the agent creates plans, scripts, logs, screenshots, and final-run artifacts; the terminal transcript shows the generated command and command_output that produced each observation.

Webwright vs. a vision-based agent (both GPT-5.4)


Claude Code with vs. without Webwright skills


Browsing history becomes reusable code


Capability gallery

Webwright can craft tools for user tasks and convert them into Codex skills for repeated use, which saves tokens and time.

Generated CLI tools / skills

Four domain workflows demonstrate reusable task programs that can be packaged, selected, and compared against baseline runs.

Flights with skill

Generated CLI Tool

Skill-guided Google Flights comparison for a Hong Kong to Jeju trip.

Shows a generated flight skill being selected and reused as a task-specific tool.

Challenges handled

Open-ended terminal actions need verification, memory control, and reusable outputs.

Giving an agent a terminal is powerful, but it creates new failure modes. Webwright keeps the harness small while adding just enough structure around completion, context, and reuse.

Premature done gate

The agent must generate a final script, rerun it in a fresh folder, save logs and screenshots, and pass a self-reflection judgement before done is accepted.
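A gate like this can be approximated by checking the final-run folder before accepting done. The file names below follow the `final_runs/run_1` listing shown earlier; everything else is an assumption for illustration, in particular the function name `done_is_accepted` and the `{"status": "success"}` shape of the self-reflection JSON, which the page does not specify.

```python
import json
from pathlib import Path

REQUIRED_FILES = ("final_script.py", "final_script_log.txt", "self_reflect_result.json")


def done_is_accepted(run_dir: Path) -> tuple[bool, str]:
    """Return (accepted, reason) for a final-run folder (hypothetical check)."""
    for name in REQUIRED_FILES:
        if not (run_dir / name).is_file():
            return False, f"missing {name}"
    shots = run_dir / "screenshots"
    if not shots.is_dir() or not any(shots.iterdir()):
        return False, "no screenshots saved"
    try:
        verdict = json.loads((run_dir / "self_reflect_result.json").read_text())
    except json.JSONDecodeError:
        return False, "self-reflection result is not valid JSON"
    # Assumed schema: a JSON object with a "status" field.
    if not isinstance(verdict, dict) or verdict.get("status") != "success":
        return False, "self-reflection did not report success"
    return True, "ok"
```

Returning a reason alongside the verdict gives the agent something specific to fix when done is rejected.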

Context compaction

Long coding trajectories can exceed context limits, so history is periodically compacted into summaries while the workspace keeps the concrete artifacts.
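One simple way to picture compaction: once the history grows past a budget, older turns are replaced by a single summary while the most recent turns stay verbatim. The sketch below assumes a character budget and a `summarize` callable (in practice likely a model call); both, along with the function name and default values, are hypothetical rather than Webwright's actual policy.

```python
from typing import Callable


def compact_history(
    history: list[str],
    summarize: Callable[[list[str]], str],
    max_chars: int = 8000,
    keep_recent: int = 4,
) -> list[str]:
    """Replace older turns with one summary when the history exceeds `max_chars`."""
    total = sum(len(message) for message in history)
    if total <= max_chars or len(history) <= keep_recent:
        return history  # still within budget, or nothing old enough to fold
    split = len(history) - keep_recent
    old, recent = history[:split], history[split:]
    return ["[summary] " + summarize(old)] + recent
```

Because scripts, logs, and screenshots live on disk, the summary only has to carry what was learned, not the artifacts themselves.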

Reusable tools

Once...
