Breaking Browser-Use Models Using Domain Randomization

GUI-Perturbed: Breaking Browser-use Models using Domain Randomization

Yangyue Wang1, 2 , Harshvardhan Sikka1, 2 , Yash Mathur*2 , Tony Zhou*2 , Jinu Nyachhyon*2 , Pranav Guruprasad1, 2

* Equal contributions. 1Fig; 2Manifold Research Group.


TL;DR

We present a baseline study of 7B GUI models that uses GUI-Perturbed to stress-test CUA models and to understand where agents fail and why.

A detailed failure-mode analysis covers common GUI failure modes, from spatial reasoning and false visual heuristics to the effect of chain-of-thought (CoT) reasoning.

Key Sections: How do GUI models fail · The triple alignment problem · Experimental setup · Results · Discussion · Model failure modes


GUI Perturbation — Research Series

Part 1 · Previous report

Data Augmentation Pipeline

GUI grounding failures under controlled UI perturbations. Data, tooling, and evaluation protocol.

Part 2 · This report

Dataset Release & Baseline Evaluations

How leading CUA models perform across perturbation types. Structured failure analysis.

Part 3 · Coming soon

Fine-tuning Experiments & Model Checkpoint

Training on perturbation-augmented data. How fine-tuning on perturbation-generated training data affects model failure modes.

GUI models scoring above 90% on ScreenSpot-v2 fail to find the target element when you set web page zoom to 70% [1]. Same website, same layout, same UI elements. Just smaller.

These models were trained on hundreds of thousands to millions of GUI screenshots through supervised fine-tuning and reinforcement learning stages, yet they still cannot adapt to a change in zoom level.
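To make the zoom perturbation concrete, here is a minimal sketch of a point-in-box grounding check, assuming a click counts as correct when it lands inside the target element's bounding box. The `Box` type, the `scale_box` helper, the pixel values, and the simplification that zoom scales coordinates uniformly about the top-left origin are all illustrative assumptions, not the GUI-Perturbed implementation. Real browser zoom can also reflow the layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in screenshot pixels (hypothetical type)."""
    left: float
    top: float
    width: float
    height: float

    def contains(self, x: float, y: float) -> bool:
        """Return True if the point (x, y) lies inside the box."""
        return (self.left <= x <= self.left + self.width
                and self.top <= y <= self.top + self.height)


def scale_box(box: Box, zoom: float) -> Box:
    """Approximate where a box ends up after browser zoom.

    Assumption: zoom scales every coordinate uniformly about the
    top-left origin. This is a first-order model only.
    """
    return Box(box.left * zoom, box.top * zoom,
               box.width * zoom, box.height * zoom)


def is_hit(click: tuple[float, float], target: Box) -> bool:
    """A click is counted correct if it lands inside the target box."""
    return target.contains(*click)


# Hypothetical button position at 100% zoom (made-up numbers).
original = Box(left=800, top=400, width=120, height=40)
zoomed = scale_box(original, 0.7)

# A click aimed at the 100% location no longer hits the button at 70%.
stale_click = (860, 420)
print(is_hit(stale_click, original))  # True
print(is_hit(stale_click, zoomed))    # False
```

Under this simplified model the target also shrinks to roughly half its original area (0.7 × 0.7 = 0.49), so a prediction that is only slightly off has less room for error.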

Original · Model Result: ✓ Correct

Precision Variant (70% zoom) · Model Result: ✗ Clicked on fake 'View Deal' ad

Figure 1: Sample 21 of 390, "Click on 'View Deal' button for flight '#2125, #2126'" (Direct Instruction, No Reasoning). After the page was zoomed to 70%, UI-TARS1.5-7B clicked the fake 'View Deal' button in the ads.
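Pairs like the one in Figure 1 suggest comparing results per sample rather than as two independent accuracy numbers. The sketch below assumes each sample has a boolean correctness result on both the original and the perturbed screenshot. It computes the accuracy drop in points and the share of originally correct samples that stay correct. The function name `stability_report` and the toy results are hypothetical and are not taken from the GUI-Perturbed data.

```python
def stability_report(original: list[bool], perturbed: list[bool]) -> dict[str, float]:
    """Summarize how much accuracy survives a perturbation.

    Both lists hold per-sample correctness, aligned by sample index.
    """
    if len(original) != len(perturbed) or not original:
        raise ValueError("need two non-empty, equal-length result lists")

    n = len(original)
    acc_original = 100.0 * sum(original) / n
    acc_perturbed = 100.0 * sum(perturbed) / n

    # Samples solved on the original that are still solved after perturbation.
    solved = sum(original)
    kept = sum(o and p for o, p in zip(original, perturbed))
    retained = 100.0 * kept / solved if solved else 0.0

    return {
        "acc_original": acc_original,
        "acc_perturbed": acc_perturbed,
        "drop_points": acc_original - acc_perturbed,
        "retained_pct": retained,
    }


# Toy data (made up): 10 samples, 9 correct on the original screenshot.
orig_results = [True] * 9 + [False]
# After perturbation: 2 formerly correct samples fail, 1 formerly wrong one passes.
pert_results = [True] * 7 + [False, False] + [True]

report = stability_report(orig_results, pert_results)
print(f"drop: {report['drop_points']:.1f} points")
print(f"retained: {report['retained_pct']:.1f}% of originally correct samples")
```

In this toy case the aggregate drop is 10 points, but only 7 of the 9 originally correct samples survive, because one newly correct sample masks part of the loss. The paired view separates answers that are stable from answers that only hold on the frozen screenshot.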

GUI grounding looks solved. On fixed-scene benchmarks like ScreenSpot-v2, 7B models now score above 90%, and it is tempting to read those numbers as evidence that perception is no longer the bottleneck for computer-use agents (CUAs). The numbers are real, but they are measured on screens that never move. Real users zoom, restyle, and resize, and production websites are redesigned constantly. The question the industry should care about is not how well a model does on a frozen screenshot, but how much of that performance survives contact with ordinary variation.

So we ask one question: how much of a GUI grounding model's benchmark accuracy is stable under perturbation, and how much of it is memorized? Because the three models we study share a base checkpoint but differ in post-training, we can ask a sharper version too: does each additional stage of GUI-specialized post-training buy real robustness, or does it only raise the fixed-scene score? Our goal here is to separate those two things.

In Part 1, we introduced GUI-DR, a data augmentation pipeline that varies visual scenes and instructions along controlled axes to stress-test CUA models' GUI grounding capability. In this post, we use it to build a dataset of visual scene and instruction variations along those controlled axes. This dataset serves as a benchmark for evaluating three state-of-the-art models that share the same base checkpoint but differ in their post-training recipes, and we report where they break.

We find that:

Visual perturbations degrade models that benchmarks call production-ready. A change as small as setting browser zoom to 70% drops accuracy by 2 to 6 points across all three models.

Spatial relational instructions are the weakest point. Asking for "the button above X" instead of "the submit button" costs 27 to 56 points, the largest single effect we see.

Reasoning is not uniformly good. A chain of thought helps on hard relational tasks and hurts on easy direct ones, and a model post-trained for direct coordinate prediction is harmed by it everywhere.

More GUI-specialized post-training does not fix any of this. The same weaknesses persist from the base model through two further stages of GUI training.

The Triple Alignment Problem

Grounding a GUI instruction is harder than it looks, because the model has to align three different things at once and a benchmark score collapses all three into one number.

Visual alignment: identifying an element's appearance in pixel space, its shape, color, size, and boundaries.

Functional alignment: knowing what the element does, telling an input field from a display label or a clickable button from a static icon.

Geometric alignment: resolving spatial relationships between elements, "above," "next to," "the one between X and Y."

Figure 2: Triple alignment in GUI agent perception

Most benchmarks test the three entangled together [2, 3], so when a model fails the score cannot say which alignment broke. GUI-Perturbed is built to stress visual and geometric...
