Webwright: A Terminal Is All You Need for Web Agents

Published May 4, 2026

Webwright GitHub repo

Webwright project page

By Yadong Lu¹, Lingrui Xu², Chao Huang², Ahmed Awadallah¹

¹Microsoft Research, ²The University of Hong Kong

Instead of solving web tasks by predicting one click at a time, we give the model only a terminal, where it is free to spawn browser sessions and explore websites by writing code. The final result is a reusable program that completes the web task. We found this minimal harness to be surprisingly effective at solving web tasks.

TL;DR

Existing web agents often drive a persistent browser session one action at a time. We instead reduce the web-agent harness to a deliberately minimal terminal-based setup: three modules, roughly 1K lines of code, one agent loop, and no multi-agent orchestration. The agent emits bash commands and controls the browser by writing Playwright code, reaching state-of-the-art results on Odysseys and Online-Mind2Web with a 100-step budget.
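The single agent loop described above can be illustrated with a short sketch. This is not Webwright's implementation: `run_agent`, `Policy`, and `scripted_policy` are hypothetical names, the model call is replaced by a scripted stand-in, and only the 100-step budget comes from the text.

```python
import subprocess
from typing import Callable, Optional

# Hypothetical policy type: given the transcript so far, return the next
# shell command, or None when the agent considers the task finished.
Policy = Callable[[list[str]], Optional[str]]


def run_agent(policy: Policy, max_steps: int = 100, timeout_s: float = 60.0) -> list[str]:
    """Run one agent loop that executes a single shell command per step."""
    transcript: list[str] = []
    for _ in range(max_steps):
        command = policy(transcript)
        if command is None:  # the policy signals completion
            break
        try:
            # shell=True uses the system default shell, which may not be bash.
            result = subprocess.run(
                command, shell=True, capture_output=True, text=True, timeout=timeout_s
            )
            observation = (
                f"$ {command}\nexit={result.returncode}\n{result.stdout}{result.stderr}"
            )
        except subprocess.TimeoutExpired:
            observation = f"$ {command}\ntimed out after {timeout_s}s"
        # The command output becomes the observation for the next step.
        transcript.append(observation)
    return transcript


def scripted_policy(transcript: list[str]) -> Optional[str]:
    """Stand-in for a model call: replays two fixed commands, then stops."""
    commands = ["echo hello", "echo world"]
    return commands[len(transcript)] if len(transcript) < len(commands) else None
```

In a real harness the policy would send the transcript to a language model, and the commands would typically write and run browser-automation scripts rather than `echo`.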

Because actions are expressed as code, the agent can naturally chain many web interactions within a single step, and spawn multiple browser sessions, making execution far more efficient than predicting one primitive action at a time.
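To make the chaining concrete, here is a hedged sketch of what one agent step might contain when the action is a script. The function names, the URL argument, and the selectors (`input[name='q']`, `h3`) are assumptions for illustration and would need to match a real site. The Playwright calls are from its Python sync API, and the import is deferred so the chained helper can be exercised without a browser installed.

```python
from typing import Any


def search_and_collect(page: Any, url: str, query: str) -> list[str]:
    """Chain several interactions in one step: open, search, read results.

    `page` is expected to behave like a Playwright sync `Page`. The selectors
    below are hypothetical placeholders.
    """
    page.goto(url)
    page.fill("input[name='q']", query)
    page.press("input[name='q']", "Enter")
    page.wait_for_load_state("networkidle")
    return page.locator("h3").all_inner_texts()


def run_queries(url: str, queries: list[str]) -> dict[str, list[str]]:
    """Run each query in its own isolated browser context (a fresh session)."""
    # Imported lazily so the helper above does not require Playwright to load.
    from playwright.sync_api import sync_playwright

    results: dict[str, list[str]] = {}
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        for query in queries:
            context = browser.new_context()  # separate cookies and storage
            try:
                results[query] = search_and_collect(context.new_page(), url, query)
            finally:
                context.close()
        browser.close()
    return results
```

Five page operations run here inside what the agent would count as a single step. The sessions in `run_queries` are opened one after another for simplicity.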

We show that the resulting script can be packaged as a reusable CLI with arguments. In a cost analysis, GPT-5.4 averages $2.37 per task, and the output is a reusable script in the style of robotic process automation (RPA). With our crafted tools, even a smaller model (Qwen3.5-9B) achieves strong performance on the hard split of Online-Mind2Web.
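As an illustration of packaging a task script as a CLI with arguments, the sketch below uses Python's standard `argparse` module. The task name (`find_flights`), its flags, and the `run_task` stub are all hypothetical. A real script would contain the browser automation the agent wrote, which this example replaces with a placeholder.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Describe the task's inputs as command-line arguments."""
    parser = argparse.ArgumentParser(
        prog="find_flights",  # hypothetical task name
        description="Replay a previously crafted web task with new inputs.",
    )
    parser.add_argument("--origin", required=True, help="departure airport code")
    parser.add_argument("--destination", required=True, help="arrival airport code")
    parser.add_argument("--max-results", type=int, default=5, help="rows to print")
    parser.add_argument("--headed", action="store_true", help="show the browser window")
    return parser


def run_task(origin: str, destination: str, max_results: int, headed: bool) -> list[str]:
    """Placeholder for the browser automation the agent would have written."""
    # A real script would drive the browser here; this stub only echoes inputs.
    return [f"{origin}->{destination} option {i + 1}" for i in range(max_results)]


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Parse arguments, run the task, and print one result per line."""
    args = build_parser().parse_args(argv)
    for line in run_task(args.origin, args.destination, args.max_results, args.headed):
        print(line)
    return 0
```

Calling `main(["--origin", "SEA", "--destination", "HKG"])` reruns the same task with new inputs. Exposing the inputs as flags is what lets one crafted script serve many task instances.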

Once a task script is crafted, it can be shared and reused across platforms such as Codex, Claude Code, and OpenClaw.

Beyond step-by-step web interaction in a stateful browser

The dominant paradigm for web agents today treats the browser session itself as the agent’s workspace. At each step, the model receives the current page state, either as a screenshot or as a text representation of the page, and predicts the next operation to apply to that same session. This operation may be a low-level action such as click, type, or scroll; a structured command such as selecting a DOM element; or, more recently, a short code snippet executed through a CLI tool call. All of these variants share a common constraint: the agent must predict web actions one step at a time within a predefined interaction loop.
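The constraint can be seen in a toy version of that interaction loop. Everything here is a simplified assumption rather than any specific agent's design: `ToySession` is an in-memory stand-in for a browser session, `Action` covers only the three primitive kinds named above, and `fixed_plan_policy` replaces the model.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    kind: str         # "click", "type", or "scroll"
    target: str = ""  # element identifier, unused for scroll
    text: str = ""    # text to enter, used only by "type"


@dataclass
class ToySession:
    """In-memory stand-in for a persistent browser session."""
    fields: dict[str, str] = field(default_factory=dict)
    clicked: list[str] = field(default_factory=list)
    scroll_y: int = 0

    def observe(self) -> str:
        return f"fields={self.fields} clicked={self.clicked} scroll_y={self.scroll_y}"

    def apply(self, action: Action) -> None:
        if action.kind == "type":
            self.fields[action.target] = action.text
        elif action.kind == "click":
            self.clicked.append(action.target)
        elif action.kind == "scroll":
            self.scroll_y += 500
        else:
            raise ValueError(f"unknown action kind: {action.kind}")


def step_loop(
    session: ToySession,
    policy: Callable[[str, int], Optional[Action]],
    max_steps: int,
) -> int:
    """Apply one predicted action per step; return the number of policy calls."""
    calls = 0
    for step in range(max_steps):
        action = policy(session.observe(), step)
        calls += 1
        if action is None:  # the policy signals completion
            break
        session.apply(action)
    return calls


def fixed_plan_policy(observation: str, step: int) -> Optional[Action]:
    """Stand-in for a model: replays a fixed plan, one action per call."""
    plan = [Action("type", "search", "laptops"), Action("click", "submit"), Action("scroll")]
    return plan[step] if step < len(plan) else None
```

With this plan, `step_loop(ToySession(), fixed_plan_policy, 10)` makes four policy calls: one for each of the three primitive actions and one more to stop. The number of model calls grows with the number of interactions, which is the cost a script-based step avoids.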

This design was useful when LLM agents had limited ability to reason, code, and recover from errors. A carefully engineered harness helped bridge the gap between what the model could reliably produce and what real web tasks required. But as models become stronger—especially at writing and debugging code—the same harness becomes a bottleneck, constraining the agent to a narrow interaction loop instead of letting it solve the task more flexibly.

Webwright builds upon this view. We separate the agent from the browser, and treat the browser as something the agent can launch, inspect, and discard while developing a program. The persistent artifact is not the browser session, but the code and logs in the local workspace. The agent can write exploratory scripts, spawn fresh browser sessions, and freely decide when to capture screenshots, inspect failures, and iteratively refine its code—much...
