I built an Android-like OS that runs in the browser

haozaz1 pts0 comments

MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research



Gestures<br>or click

Back<br>Swipe in from left or right edge

Home<br>Swipe up from bottom edge

Recents<br>Swipe up & hold mid-screen


Switch pages<br>Swipe or drag horizontally on the desktop

State Builder · Live Injection

Patch runtime state without restarting the device

Snapshot · time-travel · inject cross-app data while the simulator keeps running.

Session

Language

Device

WeChat

Alipay

SMS

12306

Weather

Reset<br>Save<br>Restore

Reset clears the simulator and reloads a fresh session. Saved snapshots stay in this browser.

Reset simulator

Snapshot name

Save snapshot

Snapshot<br>No snapshots yet

Restore snapshot<br>Delete snapshot

Sets the simulated phone's language. Also clears per-app overrides (Alipay, Bilibili, RedBook, Map) so every app follows the new system locale.

中文<br>Simplified Chinese

English<br>United States

Apply language

Time<br>Battery<br>Location

Mode

Real<br>Simulated

Flow<br>Flow

Device time

Apply time

Battery

Status

Unplugged<br>Charging<br>Fast charge

Battery saver

Off<br>On

Apply battery

Preset

Beijing<br>Shanghai<br>Guangzhou<br>Shenzhen<br>Hangzhou<br>Chengdu<br>Wuhan<br>Nanjing<br>New York<br>London<br>Tokyo<br>Custom coordinates

Latitude

Longitude

Apply location

Message<br>Contact

Chat<br>陈静

Message time

Incoming message<br>Meeting at 7 tonight; remember to bring the materials.

Insert message

New contact

WeChat ID

Avatar

Signature

Add contact

Balance<br>Bill

Account balance

Set balance

Bill amount

Type

Expense<br>Income

Bill title

Note

Add bill

Contact<br>SMS

Contact name

Phone

Save contact

SMS sender

Message content<br>Your verification code is 482913, for the MobileGym scenario demo. Do not share it.

Receive SMS

Ticket order

Train

Status

Paid<br>Unpaid<br>Cancelled

From station

To station

Date

Depart

Arrive

Passenger

Price

Seat type

Business class<br>First class<br>Second class<br>Deluxe soft sleeper<br>Soft sleeper<br>Hard sleeper<br>Soft seat<br>Hard seat<br>Standing (no seat)

Seat

Add ticket/order

Conditions

Patches the located page's current weather. Switch city via the Device → Location panel.

Condition

晴 (Sunny)<br>多云 (Partly cloudy)<br>阴 (Overcast)<br>小雨 (Light rain)<br>大雨 (Heavy rain)<br>雷阵雨 (Thunderstorm)<br>小雪 (Light snow)<br>雾 (Fog)<br>霾 (Haze)

Temp °C

High °C

Low °C

AQI (0–500)

Apply weather

Power on the phone, then apply a scenario.

One setState patch per action · persists across reloads
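As a rough illustration of what "one patch per action" plus snapshots could look like, here is a minimal TypeScript sketch. It assumes a JSON-like state tree and a deep-merge patch; the names `State`, `applyPatch`, and `SnapshotStore` are hypothetical and are not taken from MobileGym's code.

```typescript
// Hypothetical state shape; the real MobileGym schema is not shown in the source.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };
type State = { [key: string]: Json };

function isPlainObject(v: Json | undefined): v is { [key: string]: Json } {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Merge `patch` into a copy of `base`. Nested objects merge recursively;
// arrays and scalars are replaced wholesale. `base` is never mutated.
function applyPatch(base: State, patch: State): State {
  const next: State = { ...base };
  for (const key of Object.keys(patch)) {
    const incoming = patch[key];
    if (incoming === undefined) continue;
    const current = next[key];
    next[key] =
      isPlainObject(current) && isPlainObject(incoming)
        ? applyPatch(current, incoming)
        : incoming;
  }
  return next;
}

// Named snapshots held in memory. A browser build might persist the same
// strings to localStorage so they survive reloads (an assumption).
class SnapshotStore {
  private snapshots = new Map<string, string>();

  save(name: string, state: State): void {
    this.snapshots.set(name, JSON.stringify(state));
  }

  restore(name: string): State | undefined {
    const raw = this.snapshots.get(name);
    return raw === undefined ? undefined : (JSON.parse(raw) as State);
  }
}

const initial: State = {
  device: { battery: { level: 80, status: "unplugged" }, language: "en-US" },
  alipay: { balance: 120.5 },
};

const store = new SnapshotStore();
store.save("before-patch", initial);

// One action, one patch: only the battery status changes.
const patched = applyPatch(initial, { device: { battery: { status: "charging" } } });
```

Serializing snapshots to strings keeps a saved state isolated from later patches, which is one simple way to get the "time-travel" behaviour the panel describes.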


TL;DR

MobileGym is a verifiable and highly parallel simulation platform for mobile GUI agent research. It is the first to make online RL training and deterministic evaluation feasible on real-world daily apps, long a structural blind spot of real-device pipelines. It covers 28 mobile apps (12 daily + 16 system) in the browser. Across the released validation suite, programmatic state judges show no false accept/reject cases over 416 parameterized task templates (vs. 10.2% misjudgment when the same real-device trajectories are scored by a VLM), giving a clean RL reward signal. Structured state replication (∼400 MB per browser instance) makes single-machine batch-parallel GRPO cheap. Sim-to-Real: GRPO fine-tuning of Qwen3-VL-4B lifts the overall simulation success rate (SR) by +12.8 pt (9.4%→22.2%); on the 59-task real-device-runnable signal-bucket subset, the +42.8 pt simulation gain is preserved as +40.7 pt on the real device, a 95.1% retention.
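To make the idea of a programmatic state judge concrete, the sketch below scores a money-transfer task by inspecting the final app state rather than the screen. Everything here is an assumption for illustration: the `AlipayState` shape, the field names, and `judgeTransfer` are hypothetical and are not the released MobileGym judges.

```typescript
// Hypothetical final-state slice for a transfer task; field names are illustrative only.
interface Transfer {
  recipientId: string;
  amountCents: number;
}

interface AlipayState {
  balanceCents: number;
  transfers: Transfer[];
}

interface TransferTask {
  recipientId: string;
  amountCents: number;
  startBalanceCents: number;
}

// Returns reward 1 only when the state shows exactly one transfer, to the
// intended recipient ID, for the intended amount, with a consistent balance.
function judgeTransfer(task: TransferTask, finalState: AlipayState): 0 | 1 {
  const matching = finalState.transfers.filter(
    (t) => t.recipientId === task.recipientId && t.amountCents === task.amountCents,
  );
  const exactlyOne = matching.length === 1 && finalState.transfers.length === 1;
  const balanceOk =
    finalState.balanceCents === task.startBalanceCents - task.amountCents;
  return exactlyOne && balanceOk ? 1 : 0;
}

const task: TransferTask = {
  recipientId: "u_mom_1",
  amountCents: 20000,
  startBalanceCents: 50000,
};

const goodState: AlipayState = {
  balanceCents: 30000,
  transfers: [{ recipientId: "u_mom_1", amountCents: 20000 }],
};

// Same amount, but sent to a different contact that shares the nickname.
const wrongRecipient: AlipayState = {
  balanceCents: 30000,
  transfers: [{ recipientId: "u_mom_2", amountCents: 20000 }],
};

const rewardGood = judgeTransfer(task, goodState); // 1
const rewardWrong = judgeTransfer(task, wrongRecipient); // 0
```

Because the judge compares recipient IDs rather than displayed nicknames, two contacts that look identical on screen still produce different rewards.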

28

Apps simulated<br>12 daily + 16 system

416

Parameterized task templates<br>256 test + 160 train

False accept/reject<br>released checks vs. 10.2% VLM judge error

+40.7pt

Real-device gain<br>Qwen3-VL-4B trained on sim
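The headline numbers can be checked with a few lines of arithmetic. This sketch assumes "retention" means the real-device gain divided by the simulation gain on the same 59-task subset, which matches the figures quoted in the TL;DR; the function name is illustrative.

```typescript
// Share of a simulation gain (in percentage points) that survives on the real device.
function retentionPercent(simGainPt: number, realGainPt: number): number {
  return (realGainPt / simGainPt) * 100;
}

// Overall simulation success rate before and after GRPO fine-tuning.
const overallGainPt = 22.2 - 9.4;

// 59-task subset: +42.8 pt in simulation, +40.7 pt on the real device.
const retained = retentionPercent(42.8, 40.7);

const overallGainLabel = overallGainPt.toFixed(1); // "12.8"
const retainedLabel = retained.toFixed(1); // "95.1"
```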

Inside the Sandbox: 28 Apps

Each app is a faithful in-browser re-implementation in React/TypeScript, with Android-style task stacks, Intent routing, ContentProviders, and permission flows.
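The Android concepts named above can be pictured with a small TypeScript sketch of intent routing onto a back stack. This is only an illustration under assumed names (`Intent`, `Activity`, `IntentRouter`, and the action strings are hypothetical); it mirrors the Android idea and does not reproduce MobileGym's actual API.

```typescript
// Hypothetical shapes modelled loosely on Android concepts.
interface Intent {
  action: string;
  data?: string;
}

interface Activity {
  app: string;
  name: string;
  intent: Intent;
}

type Handler = (intent: Intent) => Activity;

class IntentRouter {
  private handlers = new Map<string, Handler>();
  private backStack: Activity[] = [];

  // An app declares which action it can handle.
  register(action: string, handler: Handler): void {
    this.handlers.set(action, handler);
  }

  // Resolve the intent to a handler and push the resulting activity on the back stack.
  start(intent: Intent): Activity {
    const handler = this.handlers.get(intent.action);
    if (!handler) {
      throw new Error(`No app handles action ${intent.action}`);
    }
    const activity = handler(intent);
    this.backStack.push(activity);
    return activity;
  }

  // Back gesture: pop the top activity and return the one now on top, if any.
  back(): Activity | undefined {
    this.backStack.pop();
    return this.backStack[this.backStack.length - 1];
  }
}

const router = new IntentRouter();
router.register("SEND_MESSAGE", (intent) => ({
  app: "Messages",
  name: "ComposeActivity",
  intent,
}));
router.register("VIEW_LOCATION", (intent) => ({
  app: "Maps",
  name: "MapActivity",
  intent,
}));

router.start({ action: "SEND_MESSAGE", data: "10086" });
router.start({ action: "VIEW_LOCATION", data: "39.9042,116.4074" });

// Going back from Maps returns to the Messages compose screen.
const afterBack = router.back();
```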

Daily<br>12

WeChat

Alipay

RedNote

Bilibili

Railway

Maps

Reddit

Spotify

eBay

WeRead

Meeting

System<br>16

Launcher

Settings

Phone

Messages

Calendar

Clock

Weather

Gallery

Calculator

Calc (AOSP)

Compass

Notes

Files

Themes

Browser

AnswerSheet

All apps are registered via manifest auto-discovery, so adding a new app needs zero changes to the OS or benchmark layer. Each daily app costs roughly 3–4 person-days.
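One way manifest auto-discovery can work is sketched below: each app ships a manifest object, and the OS layer builds its registry from whatever manifests it finds. The `AppManifest` fields and `buildRegistry` are assumptions for illustration; in a real bundler setup the manifest list might come from a glob import of every app directory, which is also an assumption here.

```typescript
// Hypothetical manifest shape; MobileGym's real manifest format is not shown in the source.
interface AppManifest {
  id: string;
  label: string;
  category: "daily" | "system";
  intents: string[];
}

// Build the registry from discovered manifests. The OS layer only sees this map,
// so adding an app means adding a manifest, not editing the OS.
function buildRegistry(manifests: AppManifest[]): Map<string, AppManifest> {
  const registry = new Map<string, AppManifest>();
  for (const manifest of manifests) {
    if (registry.has(manifest.id)) {
      throw new Error(`Duplicate app id: ${manifest.id}`);
    }
    registry.set(manifest.id, manifest);
  }
  return registry;
}

function countByCategory(
  registry: Map<string, AppManifest>,
  category: AppManifest["category"],
): number {
  let count = 0;
  registry.forEach((manifest) => {
    if (manifest.category === category) count += 1;
  });
  return count;
}

// Stand-in for the discovered manifests (illustrative entries only).
const discovered: AppManifest[] = [
  { id: "wechat", label: "WeChat", category: "daily", intents: ["SEND_MESSAGE"] },
  { id: "maps", label: "Maps", category: "daily", intents: ["VIEW_LOCATION"] },
  { id: "settings", label: "Settings", category: "system", intents: ["OPEN_SETTINGS"] },
];

const registry = buildRegistry(discovered);
const dailyCount = countByCategory(registry, "daily"); // 2
const systemCount = countByCategory(registry, "system"); // 1
```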

Why daily apps stay out of reach

Real-world apps are<br>unreadable,<br>unresettable,<br>and unforgiving.

That's why benchmarks quietly avoid WeChat, Alipay, and 12306 — and why online RL on the apps users actually live in has barely been attempted at scale. Three structural walls in the real-device pipeline:

01

Can't read it

The screen is a summary, not the record

Did the transfer go to the right "Mom" — or the other contact with the same nickname? Did the cart settle on the SKU the user wanted, or the lookalike variant? Is the post actually live on the...
