Fixing Failures in Browser-Use Models: Why More Data Isn't Enough
Yangyue Wang1, 2 ,<br>Harshvardhan Sikka1, 2 ,<br>Yash Mathur*2 ,<br>Tony Zhou*2 ,<br>Jinu Nyachhyon*2 ,<br>Pranav Guruprasad1, 2
* Equal contributions.<br>1Fig;<br>2Manifold Research Group.
TL;DR
We ran three LoRA fine-tuning experiments: varying perturbation type, data scale, and real vs. synthetic sources.
Counterintuitively, augmentation degrades performance rather than improving it. This points to issues in model representations and standard fine-tuning methodology rather than in the data itself.
We introduce a fine-tuned 7B GUI model, trained on GUI-DR-generated data, to study how synthetic data affects the model's GUI grounding capability in supervised post-training.
Key Sections<br>GUI model skill gaps<br>Experimental setup<br>Three experiments, three surprises<br>Discussion<br>What's next
GUI Perturbation — Research Series
Part 1 · Previous report
Dataset Release & Data Augmentation Pipeline
GUI grounding failures under controlled UI perturbations. Data, tooling, and evaluation protocol.
Part 2 · Previous report
Baseline Evaluations
How leading CUA models perform across perturbation types. Structured failure analysis.
Part 3 · This report
Finetuning Experiments
Training on perturbation-augmented data. How does finetuning on training data generated via perturbation affect model failure modes?
Browser-Use & Computer Control Have Cognitive Behavior Gaps
The reflex for an unreliable computer-use agent (CUA) is to write a better prompt. Agent Skills, folders of instructions, scripts, and resources that an agent can discover and call, have made that approach both more capable and more popular [1]. The premise is reasonable: give the agent better instructions and it should behave better.<br>Prompting cannot supply a behavior the model never learned. Consider booking a flight. Without spatial-relation reasoning, the agent cannot tell whether seat 14A or 14C is the window seat. Without multi-region visual reading, it books May 21 instead of June 21 because it pulled the wrong cell from a dense calendar. Without instruction-ambiguity reasoning, it books the first flight in the list rather than asking which one you meant. Without self-reflection, it follows the wrong checkout flow all the way to the end. Without the ability to refute a premise, it loops forever hunting a menu item that no longer exists, or carries out a dangerous action because it was told to.<br>Figure 1: Sample 119 of 390, "Click on the button above 'June 19 2023'"<br>The limitations described above are training data problems, not prompting problems. A model picks up the behaviors needed to handle real software only when those behaviors appear in its training data. This post asks one question: can we train these behaviors into a model using GUI-Perturbed data? We find that the obvious approaches fail, and that the way they fail is the useful part.<br>Evaluation Gaps to Training Gaps<br>In Part 2 of this investigation, we found that state-of-the-art GUI models degrade sharply under two conditions: small visual perturbations, and instructions phrased as spatial relations. These models had seen millions of GUI screenshots, yet a change in zoom or a request for "the button above X" was enough to break them.<br>The cause is visible in how CUA training data is usually organized. 
Standard recipes sort data by surface category: platform, action type, application, UI element type [3-5, 7], and try to maximize diversity along those axes. The gaps Part 2 exposed do not lie on those axes. They are gaps in cognitive behavioral coverage: spatial reasoning, instruction disambiguation, invariance to visual appearance. A dataset can be exhaustive across platforms and applications and still contain almost no examples that demand reasoning about where one element sits relative to another.
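To make the distinction between surface coverage and behavioral coverage concrete, here is a minimal sketch of a dataset audit. Everything in it is an assumption for illustration: the sample schema (`platform`, `instruction`), the keyword list in `SPATIAL_PATTERN`, and the toy samples are hypothetical and are not the tooling or data used in this series. A keyword match is also only a rough proxy for whether an instruction truly requires spatial reasoning.

```python
import re
from collections import Counter

# Hypothetical keyword list: a crude proxy for "instruction uses a spatial relation".
SPATIAL_PATTERN = re.compile(
    r"\b(above|below|left of|right of|next to|between|beneath)\b",
    re.IGNORECASE,
)


def is_spatial(instruction: str) -> bool:
    """Return True if the instruction mentions one of the spatial keywords."""
    return SPATIAL_PATTERN.search(instruction) is not None


def coverage_report(samples: list[dict]) -> dict:
    """Tally one surface axis (platform) and one behavioral axis (spatial relations)."""
    platforms = Counter(s["platform"] for s in samples)
    spatial_count = sum(is_spatial(s["instruction"]) for s in samples)
    return {
        "platforms": dict(platforms),
        "spatial_fraction": spatial_count / len(samples) if samples else 0.0,
    }


# Toy samples (hypothetical): diverse across platforms, sparse in spatial instructions.
samples = [
    {"platform": "web", "instruction": "Click the Submit button"},
    {"platform": "desktop", "instruction": "Open the File menu"},
    {"platform": "mobile", "instruction": "Tap the search icon"},
    {"platform": "web", "instruction": "Click on the button above 'June 19 2023'"},
]

report = coverage_report(samples)
print(report)
```

In this toy set every platform is represented, yet only one instruction in four involves a spatial relation. A dataset can look balanced on the first tally and still be thin on the second.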
Figure 2: Failure modes identified in Part 2 vs. training interventions
That points to a direct test: If the gaps are behavioral, can we build training data that targets the missing behaviors and fills them?<br>As a first step, we study how synthetic grounding data, generated to exercise exactly these behaviors, affects a state-of-the-art model.<br>Why GUI Training Data is Hard to Get Right<br>Collection is Expensive & Synthesis is Fragile<br>There are two ways to get more grounding data, and each has a characteristic failure mode.<br>Real trajectories are expensive. Collecting real interaction traces at scale is costly. OpenCUA [6] and the UI-TARS [2] pipeline show what is achievable, but the cost per trajectory stays high and the datasets stay narrow in behavioral diversity.<br>Synthetic data is fragile. Generating data synthetically is the obvious alternative, and it brings its own risk. The Jedi dataset is the cautionary case: synthetic trajectories can look plausible while encoding shortcuts and rendering artifacts that do not transfer to real use,...