Fixing Failures in Browser-Use Models: Why More Data Isn't Enough

Yangyue Wang1,2, Harshvardhan Sikka1,2, Yash Mathur*2, Tony Zhou*2, Jinu Nyachhyon*2, Pranav Guruprasad1,2

* Equal contributions. 1Fig; 2Manifold Research Group.



TL;DR

We ran three LoRA fine-tuning experiments: varying perturbation type, data scale, and real vs. synthetic sources.

Counter-intuitively, augmentation degrades performance rather than improving it. We find that this points to problems in model representations and standard fine-tuning methodology rather than in the data itself.

We introduce a fine-tuned 7B GUI model, trained on GUI-DR generated data, to study how synthetic data affects a model's GUI grounding capability in supervised post-training.

Key Sections: GUI model skill gaps; Experimental setup; Three experiments, three surprises; Discussion; What's next

GUI Perturbation — Research Series

Part 1 · Previous report

Dataset Release & Data Augmentation Pipeline

GUI grounding failures under controlled UI perturbations. Data, tooling, and evaluation protocol.

Part 2 · Previous report

Baseline Evaluations

How leading CUA models perform across perturbation types. Structured failure analysis.

Part 3 · This report

Finetuning Experiments

Training on perturbation-augmented data. How does finetuning on perturbation-generated training data affect model failure modes?

Browser-Use & Computer Control Have Cognitive Behavior Gaps

The reflex for an unreliable computer-use agent (CUA) is to write a better prompt. Agent Skills, folders of instructions, scripts, and resources that an agent can discover and call, have made that approach both more capable and more popular [1]. The premise is reasonable: give the agent better instructions and it should behave better. But prompting cannot supply a behavior the model never learned.

Consider booking a flight. Without spatial-relation reasoning, the agent cannot tell whether seat 14A or 14C is the window seat. Without multi-region visual reading, it books May 21 instead of June 21 because it pulled the wrong cell from a dense calendar. Without instruction-ambiguity reasoning, it books the first flight in the list rather than asking which one you meant. Without self-reflection, it follows the wrong checkout flow all the way to the end. Without the ability to refute a premise, it loops forever hunting a menu item that no longer exists, or carries out a dangerous action because it was told to.

Figure 1: Sample 119 of 390, "Click on the button above 'June 19 2023'"

The limitations described above are training data problems, not prompting problems. A model picks up the behaviors needed to handle real software only when those behaviors appear in its training data. This post asks one question: can we train these behaviors into a model using GUI-Perturbed data? We find that the obvious approaches fail, and that the way they fail is the useful part.

Evaluation Gaps to Training Gaps

In Part 2 of this investigation, we found that state-of-the-art GUI models degrade sharply under two conditions: small visual perturbations, and instructions phrased as spatial relations. These models had seen millions of GUI screenshots, yet a change in zoom or a request for "the button above X" was enough to break them.

The cause is visible in how CUA training data is usually organized. Standard recipes sort data by surface category (platform, action type, application, UI element type) [3-5, 7] and try to maximize diversity along those axes. The gaps Part 2 exposed do not lie on those axes. They are gaps in cognitive behavioral coverage: spatial reasoning, instruction disambiguation, invariance to visual appearance. A dataset can be exhaustive across platforms and applications and still contain almost no examples that demand reasoning about where one element sits relative to another.
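To make the distinction between surface coverage and behavioral coverage concrete, here is a minimal sketch of a dataset audit. It is not the method used in this report: the keyword pattern, the `coverage_report` helper, the sample schema (`platform`, `instruction`), and the toy samples are all hypothetical, chosen only to show how a dataset can look balanced by platform while containing few spatial-relation instructions.

```python
import re
from collections import Counter

# Hypothetical keyword heuristic for spotting spatial-relation instructions.
# A real audit would need a far more careful classifier than this.
SPATIAL_PATTERN = re.compile(
    r"\b(above|below|beneath|under|left of|right of|next to|between|beside)\b",
    re.IGNORECASE,
)


def is_spatial(instruction: str) -> bool:
    """Return True if the instruction mentions a spatial relation keyword."""
    return SPATIAL_PATTERN.search(instruction) is not None


def coverage_report(samples):
    """Summarize one surface axis (platform) and one behavioral axis (spatial)."""
    platforms = Counter(s["platform"] for s in samples)
    spatial = sum(is_spatial(s["instruction"]) for s in samples)
    return {
        "platforms": dict(platforms),
        "spatial_fraction": spatial / len(samples) if samples else 0.0,
    }


# Toy samples, invented for illustration only (assumed schema).
SAMPLES = [
    {"platform": "web", "instruction": "Click the Submit button"},
    {"platform": "desktop", "instruction": "Open the File menu"},
    {"platform": "mobile", "instruction": "Tap the search icon"},
    {"platform": "web", "instruction": "Click on the button above 'June 19 2023'"},
]

report = coverage_report(SAMPLES)
print(report)
```

In this toy set every platform is represented, yet only one instruction in four requires locating an element relative to another. Diversity along the surface axis says nothing about the behavioral one.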


Figure 2: Failure modes identified in Part 2 vs. training interventions

That points to a direct test: If the gaps are behavioral, can we build training data that targets the missing behaviors and fills them? As a first step, we study how synthetic grounding data, generated to exercise exactly these behaviors, affects a state-of-the-art model.

Why GUI Training Data is Hard to Get Right

Collection is Expensive & Synthesis is Fragile

There are two ways to get more grounding data, and each has a characteristic failure mode.

Real trajectories are expensive. Collecting real interaction traces at scale is costly. OpenCUA [6] and the UI-TARS [2] pipeline show what is achievable, but the cost per trajectory stays high and the datasets stay narrow in behavioral diversity.

Synthetic data is fragile. Generating data synthetically is the obvious alternative, and it brings its own risk. The Jedi dataset is the cautionary case: synthetic trajectories can look plausible while encoding shortcuts and rendering artifacts that do not transfer to real use,...
