Domain Randomization for Computer Control
Yangyue Wang1, 2 ,<br>Harshvardhan Sikka1, 2 ,<br>Yash Mathur*2 ,<br>Tony Zhou*2 ,<br>Jinu Nyachhyon*2 ,<br>Pranav Guruprasad1, 2
* Equal contributions.<br>1Fig;<br>2Manifold Research Group.
GUI-DR restyles, repositions, and removes DOM elements on real webpages
TL;DR
GUI models scoring 90%+ on standard benchmarks fail under basic visual variations like a 70% browser zoom. Current benchmarks can't detect this because they evaluate on fixed scenes with fixed instructions. We quantify how far performance drops in Part 2.
Our stress-testing framework for GUI grounding applies domain randomization from robotics, varying visual scenes and instructions along controlled axes to expose fragile model behaviors.
We introduce GUI-DR, an open-source data augmentation pipeline for generating perturbation variants from real web pages.
Key Sections<br>The White Rectangle Problem<br>Building GUI-Perturbed<br>Dataset at a Glance<br>Get involved
The White Rectangle Problem
Modern GUI grounding models can locate a “Submit” button with high precision, identify form fields from natural-language instructions, and navigate complex web interfaces. Yet they confuse a browser's search bar with the formula bar in Google Sheets. Both are white rectangles near the top of the screen. Mistakes like these make up the demo-to-production gap that keeps GUI models stuck in the lab.<br>Figure 1: OpenAI's Operator confuses the browser search bar with the Google Sheets formula bar mid-task. Both are white rectangles near the top of the screen.<br>This is a systematic failure: models ground to visual primitives like shape, position, and color rather than functional semantics [17]. To such a model, a white rectangle at the top of the screen means “text input,” regardless of whether it is a search bar, a formula bar, or a URL field. The model ends up with a skewed representation of what the element actually does.<br>Current evaluation datasets can't tell us how widespread the white rectangle problem is [1,3-12]. They evaluate on fixed scenes with fixed instructions: a specific screenshot, a referring expression, a single correct answer. That measures peak performance under curated conditions. It does not measure how models degrade when layout, zoom, or wording shift, which is much closer to what happens in production.<br>The question is whether we can measure grounding robustness systematically:<br>Instead of only measuring peak accuracy on a fixed scene, can we measure how models hold up as scenes and instructions vary?<br>In this technical report, we introduce GUI-Perturbed, a dataset built on domain randomization principles that varies visual scenes and instructions along controlled axes to expose fragile grounding. We describe the dataset, the perturbation methodology, and the design decisions behind it.<br>Fixed Scenes Hide Fragile Models<br>Existing computer-using agent (CUA) evaluation datasets share a common structure: a fixed screenshot, a fixed instruction, and a fixed ground-truth target. 
Benchmarks like OSWorld [3], ScreenSpot-v2 [5], ScreenSpot-Pro [6], and OSWorld-G [4] each contribute valuable coverage of specific scenarios and applications. But they all evaluate under the same assumption: that the test set’s visual scene and instruction distribution is representative of real-world usage.<br>In production, this assumption breaks constantly. Websites ship new themes. Browser zoom levels vary across users. Dark mode inverts color relationships. Users describe the same element in different ways depending on context. A model that scores 90% on a fixed test set may score far lower once any of these variables shift.<br>Figure 2: GUI agent dataset comparison [1,3-12]. Scene variability: Fixed = no variation; Live = uncontrolled real-world changes; Perturbed = controlled variation. GUI-Perturbed† is web-only; cross-platform is left for future work.<br>What we need is evaluation data that varies these conditions systematically, so we can measure robustness, not only peak performance. For this we borrow a technique from robotics: domain randomization.
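To make the difference between peak accuracy and robustness concrete, here is a minimal Python sketch of per-condition scoring. It is an illustration under stated assumptions, not the GUI-Perturbed evaluation protocol: the `Condition` axes (zoom and theme only), the `Sample` layout, the `scale_box` approximation of zoom, and the toy `fixed_position_model` are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom), pixels


@dataclass(frozen=True)
class Condition:
    """One point on the perturbation axes (hypothetical axes, for illustration)."""
    zoom: float = 1.0     # browser zoom factor; 0.7 means 70%
    theme: str = "light"  # "light" or "dark"


@dataclass(frozen=True)
class Sample:
    """A grounding example rendered under one condition (screenshot omitted)."""
    condition: Condition
    instruction: str
    target: Box


def scale_box(box: Box, zoom: float) -> Box:
    """Assumption: zoom scales the target box about the page origin."""
    return tuple(v * zoom for v in box)


def is_hit(point: Point, box: Box) -> bool:
    """A prediction counts as correct if the click lands inside the target box."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def accuracy_by_condition(
    samples: List[Sample], predict: Callable[[Sample], Point]
) -> Dict[Condition, float]:
    """Score each condition separately instead of pooling into one number."""
    hits: Dict[Condition, int] = {}
    totals: Dict[Condition, int] = {}
    for s in samples:
        totals[s.condition] = totals.get(s.condition, 0) + 1
        hits[s.condition] = hits.get(s.condition, 0) + int(is_hit(predict(s), s.target))
    return {c: hits[c] / totals[c] for c in totals}


def robustness_report(acc: Dict[Condition, float], baseline: Condition) -> Dict[str, float]:
    """Compare the unperturbed condition against the worst perturbed one."""
    worst = min(acc.values())
    return {"baseline": acc[baseline], "worst": worst, "max_drop": acc[baseline] - worst}


# Toy data: the same target rendered under three conditions.
BASE_BOX: Box = (150.0, 30.0, 350.0, 70.0)
BASELINE = Condition()
conditions = [BASELINE, Condition(zoom=0.7), Condition(theme="dark")]
samples = [
    Sample(c, "Click the search bar", scale_box(BASE_BOX, c.zoom)) for c in conditions
]


def fixed_position_model(sample: Sample) -> Point:
    """Toy model that grounds to position only: it always clicks the same pixel."""
    return (250.0, 50.0)


acc = accuracy_by_condition(samples, fixed_position_model)
report = robustness_report(acc, BASELINE)
print(report)
```

In this toy setup the position-only model is perfect on the unperturbed scene and misses entirely at 70% zoom, so a single pooled accuracy would hide exactly the failure the per-condition breakdown exposes.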
GUI Perturbation — Research Series
Part 1 · This report
Data Augmentation Pipeline
GUI grounding failures under controlled UI perturbations. Data, tooling, and evaluation protocol.
Part 2 · Read Now
Dataset Release & Baseline Evaluations
How leading CUA models perform across perturbation types. Structured failure analysis.
Part 3 · Coming soon
Fine-tuning Experiments & Model Checkpoint
Training on perturbation-augmented data. How fine-tuning on perturbation-generated training data affects model failure modes.
Sim-to-Real to Demo-to-Production<br>Domain randomization is a standard technique for bridging the gap between simulation and the real world [13]. During training, we randomize visual properties of the simulator (textures, lighting, object colors, camera angles) so the policy is forced to learn features that are invariant to surface-level variation. A robot that has...