Webwright | Terminal-Native Web Agents
Terminal-native web agents
A terminal is all you need for web agents.
Webwright gives the model a terminal, a local workspace, and the freedom to write code that launches, inspects, and discards browser sessions. The output is not just a completed task, but a reusable program.
3<br>core modules
~1K<br>lines of harness code
86.7%<br>Online-Mind2Web accuracy
60.8%<br>Odysseys score
Paradigm shift
In Webwright, the agent can launch multiple browser sessions from the terminal.
Traditional web agents keep one browser session alive and predict the next click, type, or scroll. Webwright separates the agent from that session: the browser can be launched, inspected, and discarded, while code, logs, screenshots, and outputs persist in the local workspace.
Disposable browsers
The agent can spawn fresh browser sessions, capture screenshots only when useful, inspect failures, and rerun scripts without being trapped in a single stateful page.
Code composes actions
Date selection, form filling, filtering, comparison, and extraction can become loops and functions instead of long chains of primitive browser actions.
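As a rough illustration of this idea (not code from Webwright itself), the sketch below checks several dates in one loop and returns the cheapest, instead of issuing a separate click-and-read action per date. The `PageLike` protocol, the selectors, and the function names are hypothetical; the four method names mirror Playwright's synchronous `Page` API, so a real Playwright page could be passed in.

```python
from typing import Protocol


class PageLike(Protocol):
    """Subset of a browser page used here (mirrors Playwright's sync Page)."""

    def goto(self, url: str) -> object: ...
    def fill(self, selector: str, value: str) -> None: ...
    def click(self, selector: str) -> None: ...
    def inner_text(self, selector: str) -> str: ...


def parse_price(text: str) -> float:
    """Turn a display string such as '$1,234.50' into a float."""
    cleaned = "".join(ch for ch in text if ch.isdigit() or ch == ".")
    return float(cleaned)


def cheapest_date(page: PageLike, search_url: str, dates: list[str]) -> tuple[str, float]:
    """Query each date in a loop and return the (date, price) pair with the lowest price."""
    prices: dict[str, float] = {}
    for date in dates:
        page.goto(search_url)
        page.fill("#depart-date", date)  # hypothetical selector
        page.click("#search")            # hypothetical selector
        prices[date] = parse_price(page.inner_text(".best-price"))  # hypothetical selector
    best = min(prices, key=prices.get)
    return best, prices[best]
```

Because the whole comparison is one function, it can be rerun against a fresh browser session or extended with more dates without replaying a long action history.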
Artifacts survive
The durable output is a workspace: exploratory scripts, action logs, screenshots, final outputs, and eventually a reusable task program.
Reported results
A small harness, competitive long-horizon performance.
The report evaluates Webwright on live, long-horizon web benchmarks while preserving the simple terminal interface. The same pipeline also records critical-point screenshots, action logs, and reusable command-line tools.
Odysseys — long-horizon browsing
Online-Mind2Web — accuracy on 300 live tasks
60.8%<br>Odysseys
Long-horizon browsing score, a 35.1% relative improvement over the previously reported state of the art.
86.7%<br>Online-Mind2Web
GPT-5.4 accuracy on 300 live tasks across 136 sites with a 100-step budget.
66.2%<br>Small model tools
Qwen3.5-9B on the hard split of Online-Mind2Web when augmented with crafted reusable tools.
Minimal harness
One loop, three modules, no orchestration tower.
The implementation is deliberately small: a Runner, a Model Endpoint, and a terminal Environment. Each is a single module, and together they total roughly 1K lines of harness code, with no multi-agent orchestration or complex planning hierarchy.
01<br>Send context<br>The runner sends the task, workspace state, and recent observations to the model.
02<br>Emit bash<br>The model returns a thinking block and a shell command, often writing Playwright-backed scripts to explore pages and collect data.
03<br>Return observations<br>The environment runs the command and returns terminal output, logs, screenshots, files, or error tracebacks.
04<br>Refine and finish<br>The loop continues until the agent produces a final script, reruns it in a fresh folder, and passes self-reflection.
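The four steps above can be sketched as a single loop. This is a minimal approximation under stated assumptions, not the Webwright harness: `Step`, `run_command`, and `run_loop` are hypothetical names, the model is any callable that returns a thinking string plus a shell command, a `None` command stands in for the finish signal, and workspace state is left out of the context for brevity. The default of 100 steps simply mirrors the step budget quoted for Online-Mind2Web.

```python
import subprocess
import tempfile
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    thinking: str
    command: Optional[str]  # None signals that the model considers the task finished


def run_command(command: str, workspace: str, timeout: float = 30.0) -> str:
    """Environment: run one shell command in the workspace and return its output."""
    try:
        result = subprocess.run(
            command, shell=True, cwd=workspace,
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return f"[timeout after {timeout}s]"
    return f"[exit {result.returncode}]\n{result.stdout}{result.stderr}"


def run_loop(task: str, model: Callable[[str, list[str]], Step],
             workspace: str, max_steps: int = 100) -> list[str]:
    """Runner: send context, execute the emitted command, feed back the observation."""
    observations: list[str] = []
    for _ in range(max_steps):
        step = model(task, observations[-5:])  # task plus the five most recent observations
        if step.command is None:               # model signals completion
            break
        observations.append(run_command(step.command, workspace))
    return observations
```

A scripted stand-in for the model is enough to exercise the loop, for example `run_loop("demo", lambda task, recent: next(steps), tempfile.mkdtemp())` where `steps` is an iterator of `Step` objects ending with `Step("done", None)`.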
workspace/run
$ python final_script.py<br>open browser<br>search live web pages<br>capture screenshots<br>write action log
$ python -m webwright.tools.self_reflection<br>evaluate critical points<br>Status: success
$ ls final_runs/run_1<br>final_script.py<br>final_script_log.txt<br>screenshots/<br>self_reflect_result.json
Workspace trace
Watch a long web task turn into files, commands, and a verified final run.
The trace below makes the terminal-native loop visible. The left panel shows the workspace growing as the agent creates plans, scripts, logs, screenshots, and final-run artifacts; the terminal transcript shows the generated command and command_output that produced each observation.
Webwright vs. vision-based agent (both GPT-5.4)
Claude Code with vs. without Webwright skills
Browsing history becomes reusable code
Capability gallery
We show that Webwright can craft tools for user tasks and convert them into Codex skills for repeated use, which saves tokens and time.
Generated CLI tools / skills<br>Four domain workflows demonstrate reusable task programs that can be packaged, selected, and compared against baseline runs.
Flights with skill
Generated CLI Tool
Skill-guided Google Flights comparison for a Hong Kong to Jeju trip.
Shows a generated flight skill being selected and reused as a task-specific tool.
Challenges handled
Open-ended terminal actions need verification, memory control, and reusable outputs.
Giving an agent a terminal is powerful, but it creates new failure modes. Webwright keeps the harness small while adding just enough structure around completion, context, and reuse.
Premature done gate
The agent must generate a final script, rerun it in a fresh folder, save logs and screenshots, and pass a self-reflection judgement before done is accepted.
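One way such a gate could be checked is sketched below. The file names follow the `final_runs/run_1` listing shown earlier on this page; the JSON schema (`{"status": "success"}`) and the `done_gate` function are assumptions for illustration, and the sketch only inspects the artifacts of a finished run rather than rerunning the script itself.

```python
import json
from pathlib import Path


def done_gate(run_dir: str) -> tuple[bool, list[str]]:
    """Return (accepted, problems) for a final-run folder."""
    root = Path(run_dir)
    problems: list[str] = []

    # Required files, named after the final-run listing on this page.
    for name in ("final_script.py", "final_script_log.txt", "self_reflect_result.json"):
        if not (root / name).is_file():
            problems.append(f"missing {name}")

    # At least one screenshot must have been saved.
    shots = root / "screenshots"
    if not shots.is_dir() or not any(shots.iterdir()):
        problems.append("no screenshots saved")

    result_path = root / "self_reflect_result.json"
    if result_path.is_file():
        try:
            result = json.loads(result_path.read_text())
        except json.JSONDecodeError:
            problems.append("self-reflection result is not valid JSON")
        else:
            # Assumed schema: {"status": "success"}; the real file format may differ.
            if not isinstance(result, dict) or result.get("status") != "success":
                problems.append("self-reflection did not report success")

    return (not problems, problems)
```

Returning the list of problems, rather than a bare boolean, gives the agent something concrete to act on when the gate rejects a premature "done".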
Context compaction
Long coding trajectories can exceed context limits, so history is periodically compacted into summaries while the workspace keeps the concrete artifacts.
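A minimal sketch of this kind of compaction follows. It assumes a character budget as a stand-in for a token budget and takes the summarizer as a callable, since the page does not specify how summaries are produced; `compact_history` and its parameters are hypothetical names.

```python
from typing import Callable


def compact_history(history: list[str], max_chars: int, keep_recent: int,
                    summarize: Callable[[list[str]], str]) -> list[str]:
    """Replace older entries with one summary once the history exceeds the budget."""
    total = sum(len(entry) for entry in history)
    if total <= max_chars or len(history) <= keep_recent:
        return history  # still within budget, or nothing old enough to fold away

    split = len(history) - keep_recent
    old, recent = history[:split], history[split:]
    # Older steps collapse into one line; files on disk remain the source of truth.
    return ["[summary] " + summarize(old)] + recent
```

Only the conversational history shrinks here. Scripts, logs, and screenshots stay in the workspace, so the agent can reread a file whenever a summary is too coarse.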
Reusable tools
Once...