Webwright: A Terminal Is All You Need for Web Agents


Webwright: A Terminal Is All You Need For Web Agents - Microsoft Research


Published<br>May 4, 2026


Webwright GitHub repo

Webwright project page

By Yadong Lu¹, Lingrui Xu², Chao Huang², Ahmed Awadallah¹<br>¹Microsoft Research, ²The University of Hong Kong

Instead of solving web tasks by predicting one click at a time, we give the model only a terminal, where it is free to spawn browser sessions and explore websites by writing code. The final result is a reusable program for completing web tasks. We found this minimal harness to be surprisingly effective.

TL;DR

Existing web agents often drive a persistent browser session one action at a time. We instead reduce the web-agent harness to a deliberately minimal terminal-based setup: three modules, roughly 1K lines of code, one agent loop, and no multi-agent orchestration. The agent emits bash commands and controls the browser by writing Playwright code, reaching SOTA results on Odysseys and Online-Mind2Web with a 100-step budget.
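As a rough illustration of what "one agent loop" that emits bash commands could look like, here is a minimal sketch. It is not Webwright's actual code: the names `run_bash`, `agent_loop`, and `scripted_policy` are hypothetical, the model is replaced by a scripted stand-in, and the only detail taken from the text is the 100-step budget.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

# Hypothetical type: a policy maps the transcript so far to the next bash
# command, or returns None when it considers the task finished.
Transcript = List[Tuple[str, str]]
Policy = Callable[[Transcript], Optional[str]]


def run_bash(command: str, timeout: float = 60.0) -> str:
    """Run one bash command and return its stdout followed by its stderr."""
    result = subprocess.run(
        ["bash", "-c", command],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout + result.stderr


def agent_loop(policy: Policy, max_steps: int = 100) -> Transcript:
    """Single loop: ask for a command, run it, feed the output back."""
    transcript: Transcript = []
    for _ in range(max_steps):
        command = policy(transcript)
        if command is None:  # the policy signals that it is done
            break
        transcript.append((command, run_bash(command)))
    return transcript


def scripted_policy(transcript: Transcript) -> Optional[str]:
    """Stand-in for the LLM: replays a fixed list of commands."""
    commands = ["echo hello", "echo $((6 * 7))"]
    if len(transcript) < len(commands):
        return commands[len(transcript)]
    return None
```

In a real run, the policy would be a model call, and a typical command would write and execute a Playwright script rather than `echo`.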

Because actions are expressed as code, the agent can naturally chain many web interactions within a single step, and spawn multiple browser sessions, making execution far more efficient than predicting one primitive action at a time.
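To make the chaining concrete, the sketch below packs five browser interactions into a single function that an agent could run in one step. The method names follow Playwright's Python sync API (`goto`, `fill`, `press`, `wait_for_load_state`, `locator(...).all_inner_texts()`); the selectors, the function name `search_and_collect`, and the `RecordingPage` stand-in are assumptions made for illustration.

```python
from typing import Any, List, Tuple


def search_and_collect(page: Any, url: str, query: str) -> List[str]:
    """One agent step that chains several browser interactions.

    `page` is expected to behave like a Playwright sync `Page`. The
    selectors are hypothetical and would differ from site to site.
    """
    page.goto(url)
    page.fill("input[name='q']", query)
    page.press("input[name='q']", "Enter")
    page.wait_for_load_state("networkidle")
    return page.locator("h3").all_inner_texts()


class RecordingPage:
    """Tiny stand-in that records calls so the sketch runs without a browser."""

    def __init__(self, titles: List[str]) -> None:
        self.calls: List[Tuple[str, ...]] = []
        self._titles = titles

    def goto(self, url: str) -> None:
        self.calls.append(("goto", url))

    def fill(self, selector: str, value: str) -> None:
        self.calls.append(("fill", selector, value))

    def press(self, selector: str, key: str) -> None:
        self.calls.append(("press", selector, key))

    def wait_for_load_state(self, state: str) -> None:
        self.calls.append(("wait_for_load_state", state))

    def locator(self, selector: str) -> "RecordingPage":
        self.calls.append(("locator", selector))
        return self

    def all_inner_texts(self) -> List[str]:
        return list(self._titles)
```

A step-at-a-time agent would need a separate model call for each of these interactions. With a real Playwright browser, the same function could be called on pages from several `browser.new_context()` sessions to explore sites independently.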

We show the resulting script can be packaged as a reusable CLI with arguments. In a cost analysis, GPT-5.4 averages $2.37 per task, yielding a reusable RPA-style script. With our crafted tools, even a smaller model (Qwen3.5-9B) achieves strong performance on the hard split of Online-Mind2Web.
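One plausible shape for such a packaged script is an `argparse` wrapper around a Playwright routine, sketched below. The program name `search_task`, the flags, and the selector are all hypothetical; the post does not show the format of Webwright's generated CLIs.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Expose the task's variable parts as command-line arguments."""
    parser = argparse.ArgumentParser(
        prog="search_task",  # hypothetical script name
        description="Replay a scripted web task with new arguments.",
    )
    parser.add_argument("--url", required=True, help="Start page for the task")
    parser.add_argument("--query", required=True, help="Text to search for")
    parser.add_argument("--headed", action="store_true",
                        help="Show the browser window instead of running headless")
    return parser


def run_task(url: str, query: str, headed: bool = False) -> str:
    """Run the scripted interaction and return the resulting page title."""
    # Imported lazily so argument parsing works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=not headed)
        page = browser.new_page()
        page.goto(url)
        page.fill("input[name='q']", query)  # hypothetical selector
        page.press("input[name='q']", "Enter")
        page.wait_for_load_state("networkidle")
        title = page.title()
        browser.close()
        return title


def main(argv: Optional[List[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    print(run_task(args.url, args.query, args.headed))
    return 0
```

Calling `main(["--url", "https://example.com", "--query", "laptops"])` would launch a headless Chromium session, provided Playwright and its browsers are installed. Because the inputs are arguments rather than hard-coded values, the same script can be rerun on new inputs without another model call.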

Once a task script is crafted, it can be shared and reused across platforms—e.g., Codex, Claude Code, and OpenClaw.

Beyond step-by-step web interaction in a stateful browser

The dominant paradigm for web agents today treats the browser session itself as the agent’s workspace. At each step, the model receives the current page state, as a screenshot or as page-state text, and predicts the next operation to apply to that same session. This operation may be a low-level action such as click, type, or scroll; a structured command such as selecting a DOM element; or, more recently, a short code snippet executed through a CLI tool call. All of these variants share a common constraint: the agent must predict web actions one step at a time within a predefined interaction loop.

This design was useful when LLM agents had limited ability to reason, code, and recover from errors. A carefully engineered harness helped bridge the gap between what the model could reliably produce and what real web tasks required. But as models become stronger—especially at writing and debugging code—the same harness becomes a bottleneck, constraining the agent to a narrow interaction loop instead of letting it solve the task more flexibly.

Webwright builds upon this view. We separate the agent from the browser, and treat the browser as something the agent can launch, inspect, and discard while developing a program. The persistent artifact is not the browser session, but the code and logs in the local workspace. The agent can write exploratory scripts, spawn fresh browser sessions, and freely decide when to capture screenshots, inspect failures, and iteratively refine its code—much...
