MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
Gestures (or click)
Back: swipe in from the left or right edge
Home: swipe up from the bottom edge
Recents: swipe up and hold mid-screen
Switch pages: swipe or drag horizontally on the desktop
State Builder · Live Injection
Patch runtime state without restarting the device
Snapshot · time-travel · inject cross-app data while the simulator keeps running.
Panels: Session · Language · Device · Alipay · SMS · 12306 · Weather
Session: Reset · Save · Restore
Reset clears the simulator and reloads a fresh session. Saved snapshots stay in this browser.
Reset: Reset simulator
Save: Snapshot name → Save snapshot
Restore: Snapshot (No snapshots yet) → Restore snapshot · Delete snapshot
Language: sets the simulated phone's language and clears per-app overrides (Alipay, Bilibili, RedNote, Maps) so every app follows the new system locale.
Options: 中文 (Simplified Chinese) · English (United States) → Apply language
Device: Time · Battery · Location
Time: Mode (Real / Simulated) · Flow · Device time → Apply time
Battery: Battery · Status (Unplugged / Charging / Fast charge) · Battery saver (Off / On) → Apply battery
Location: Preset · Latitude · Longitude → Apply location
Location presets: Beijing · Shanghai · Guangzhou · Shenzhen · Hangzhou · Chengdu · Wuhan · Nanjing · New York · London · Tokyo · Custom coordinates
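Message · Contact
Message: Chat (陈静, Chen Jing) · Message time · Incoming message → Insert message
Sample incoming message: "Meeting at 7 tonight, remember to bring the materials."
Contact: New contact · WeChat ID · Avatar · Signature → Add contact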
Alipay: Balance · Bill
Balance: Account balance → Set balance
Bill: Bill amount · Type (Expense / Income) · Bill title · Note → Add bill
SMS: Contact · SMS
Contact: Contact name · Phone → Save contact
SMS: SMS sender · Message content → Receive SMS
Sample message content: "Your verification code is 482913, for the MobileGym scenario demo. Do not share it."
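12306: Ticket order
Fields: Train · Status (Paid / Unpaid / Cancelled) · From station · To station · Date · Depart · Arrive · Passenger · Price · Seat type · Seat → Add ticket/order
Seat types: 商务座 (Business class) · 一等座 (First class) · 二等座 (Second class) · 高级软卧 (Deluxe soft sleeper) · 软卧 (Soft sleeper) · 硬卧 (Hard sleeper) · 软座 (Soft seat) · 硬座 (Hard seat) · 无座 (Standing, no seat)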
Weather: Conditions
The Weather panel patches the located page's current weather. Switch city via the Device → Location panel.
Fields: Condition · Temp °C · High °C · Low °C · AQI (0–500) → Apply weather
Conditions: 晴 (Sunny) · 多云 (Partly cloudy) · 阴 (Overcast) · 小雨 (Light rain) · 大雨 (Heavy rain) · 雷阵雨 (Thunderstorm) · 小雪 (Light snow) · 雾 (Fog) · 霾 (Haze)
One setState patch per action · persists across reloads
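The "one patch per action, persisted across reloads" idea can be sketched as follows. This is a minimal illustration, not MobileGym's actual code: the state shape, the `applyPatch` helper, the `STORAGE_KEY` value, and the `MemoryStore` stand-in for browser storage are all assumptions made for the example.

```typescript
// Hypothetical state shape; field names are illustrative, not MobileGym's schema.
interface SimState {
  device: { battery: { level: number; status: "unplugged" | "charging" | "fast_charge" } };
  sms: { sender: string; content: string }[];
}

// A patch is a pure function from the previous state to the next one.
type StatePatch = (state: SimState) => SimState;

// Minimal storage interface so the sketch runs outside a browser.
// In a browser, window.localStorage has the same two methods.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

class MemoryStore implements KeyValueStore {
  private data: Record<string, string> = {};
  getItem(key: string): string | null {
    const value = this.data[key];
    return value === undefined ? null : value;
  }
  setItem(key: string, value: string): void {
    this.data[key] = value;
  }
}

const STORAGE_KEY = "sim-state"; // hypothetical key name

// Load the persisted state (or fall back to `initial`), apply exactly one
// patch, and write the result back so it survives a reload.
function applyPatch(store: KeyValueStore, initial: SimState, patch: StatePatch): SimState {
  const raw = store.getItem(STORAGE_KEY);
  const current: SimState = raw ? (JSON.parse(raw) as SimState) : initial;
  const next = patch(current);
  store.setItem(STORAGE_KEY, JSON.stringify(next));
  return next;
}

// Two example actions, each expressed as a single patch.
const receiveSms = (sender: string, content: string): StatePatch => (s) => ({
  ...s,
  sms: [...s.sms, { sender, content }],
});

const setBatteryLevel = (level: number): StatePatch => (s) => ({
  ...s,
  device: { ...s.device, battery: { ...s.device.battery, level } },
});
```

Keeping each action as a pure function over a serializable state is what makes snapshots and restores cheap in a design like this: a snapshot is just a copy of the stored JSON.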
TL;DR
MobileGym is a verifiable and highly parallel simulation platform for mobile GUI agent research. It is the first to make online RL training and deterministic evaluation feasible on real-world daily apps, long a structural blind spot of real-device pipelines. It covers 28 mobile apps (12 daily + 16 system) in the browser. Across the released validation suite, programmatic state judges show no false accept/reject cases over 416 parameterized task templates (vs. 10.2% misjudgment when the same real-device trajectories are scored by a VLM), giving a clean RL reward signal. Structured state replication (~400 MB per browser instance) makes single-machine batch-parallel GRPO cheap. Sim-to-Real: GRPO fine-tuning of Qwen3-VL-4B lifts the overall simulation success rate (SR) by +12.8 pt (9.4% → 22.2%); on the 59-task real-device-runnable signal-bucket subset, the +42.8 pt simulation gain is preserved as +40.7 pt on the real device, a 95.1% retention.
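To illustrate what a programmatic state judge could look like, here is a minimal sketch that scores a money-transfer task by inspecting structured app state rather than a screenshot. The `Transfer`, `AppState`, and `TransferTask` types and the `judgeTransfer` function are hypothetical names for this example; the sketch also assumes the transfer list is append-only.

```typescript
// Hypothetical structured state for a payment app.
interface Transfer {
  payeeId: string;   // unique account id, unlike the display name
  payeeName: string; // display name; may collide across contacts
  amount: number;
}

interface AppState {
  transfers: Transfer[];
}

interface TransferTask {
  payeeId: string;
  amount: number;
}

// Returns reward 1 only if the episode added exactly one transfer and it
// matches the task on payee id and amount. Assumes transfers are append-only,
// so new entries are the tail of the list.
function judgeTransfer(before: AppState, after: AppState, task: TransferTask): 0 | 1 {
  const added = after.transfers.slice(before.transfers.length);
  const [transfer] = added;
  if (added.length !== 1 || transfer === undefined) return 0;
  return transfer.payeeId === task.payeeId && transfer.amount === task.amount ? 1 : 0;
}
```

Because the check reads the payee id, two contacts who share a display name are told apart, which a judge that only sees the rendered screen may not be able to do.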
28 apps simulated (12 daily + 16 system)
416 parameterized task templates (256 test + 160 train)
No false accept/reject cases in released checks, vs. 10.2% VLM judge error
+40.7 pt real-device gain (Qwen3-VL-4B trained on sim)
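The headline numbers can be checked with a short calculation. The definition of retention used here (real-device gain divided by simulation gain on the same 59-task subset) is an assumption, but it reproduces the reported figure:

```latex
\begin{align*}
\Delta_{\text{sim, overall}} &= 22.2 - 9.4 = 12.8 \text{ pt} \\
\text{retention} &= \frac{\Delta_{\text{real}}}{\Delta_{\text{sim}}}
                  = \frac{40.7}{42.8} \approx 0.951 = 95.1\%
\end{align*}
```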
Inside the Sandbox: 28 Apps
Each app is a faithful in-browser re-implementation in React/TypeScript, with Android-style task stacks, Intent routing, ContentProviders, and permission flows.
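As a rough picture of how Android-style Intent routing and a task stack might be modelled in TypeScript, consider the sketch below. The `IntentRouter` class, its method names, and the action strings are illustrative assumptions, not MobileGym's API.

```typescript
interface Intent {
  action: string;
  data: string | null;
}

// An app declares which action one of its activities can handle.
interface IntentFilter {
  appId: string;
  action: string;
  activity: string;
}

interface StackEntry {
  appId: string;
  activity: string;
  data: string | null;
}

class IntentRouter {
  private filters: IntentFilter[] = [];
  private taskStack: StackEntry[] = [];

  register(filter: IntentFilter): void {
    this.filters.push(filter);
  }

  // Resolve the intent to the first registered filter with a matching action
  // and push that activity onto the task stack. Returns false if none match.
  startActivity(intent: Intent): boolean {
    const target = this.filters.filter((f) => f.action === intent.action)[0];
    if (target === undefined) return false;
    this.taskStack.push({ appId: target.appId, activity: target.activity, data: intent.data });
    return true;
  }

  // Current foreground activity, or null when the stack is empty.
  top(): StackEntry | null {
    const entry = this.taskStack[this.taskStack.length - 1];
    return entry === undefined ? null : entry;
  }

  // Back navigation: pop the foreground activity and return the new top.
  back(): StackEntry | null {
    this.taskStack.pop();
    return this.top();
  }
}
```

In this model, a Back gesture is simply a pop on the stack, and cross-app jumps (for example, a shopping app handing off to a payment app) are intents resolved against another app's filters.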
Daily (12)
Alipay
RedNote
Bilibili
Railway
Maps
Spotify
eBay
WeRead
Meeting
System (16)
Launcher
Settings
Phone
Messages
Calendar
Clock
Weather
Gallery
Calculator
Calc (AOSP)
Compass
Notes
Files
Themes
Browser
AnswerSheet
All apps are registered via manifest auto-discovery, so adding a new app needs zero changes to the OS or benchmark layer. Each daily app takes about 3–4 person-days.
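A minimal sketch of manifest-based registration follows. The `AppManifest` fields and the `buildRegistry` and `countByCategory` helpers are hypothetical; how MobileGym actually discovers manifest files (for example, through a bundler feature) is not stated in the source and is left out here.

```typescript
// Hypothetical manifest shape; each app would ship one of these.
interface AppManifest {
  id: string;
  name: string;
  category: "daily" | "system";
}

// Build a lookup table from whatever manifests were discovered.
// The OS layer only depends on this table, never on a specific app.
function buildRegistry(manifests: AppManifest[]): Record<string, AppManifest> {
  const registry: Record<string, AppManifest> = {};
  for (const manifest of manifests) {
    if (Object.prototype.hasOwnProperty.call(registry, manifest.id)) {
      throw new Error("duplicate app id: " + manifest.id);
    }
    registry[manifest.id] = manifest;
  }
  return registry;
}

// Example query the OS or benchmark layer might run over the registry.
function countByCategory(registry: Record<string, AppManifest>, category: "daily" | "system"): number {
  return Object.keys(registry)
    .map((id) => registry[id])
    .filter((m) => m !== undefined && m.category === category).length;
}
```

Under this design, adding an app means adding one more manifest to the discovered set; `buildRegistry` and its callers stay unchanged, which is the property the paragraph above describes.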
Why daily apps stay out of reach
Real-world apps are unreadable, unresettable, and unforgiving.
That's why benchmarks quietly avoid WeChat, Alipay, and 12306 — and why online RL on the apps users actually live in has barely been attempted at scale. Three structural walls in the real-device pipeline:
01
Can't read it
The screen is a summary, not the record
Did the transfer go to the right "Mom" — or the other contact with the same nickname? Did the cart settle on the SKU the user wanted, or the lookalike variant? Is the post actually live on the...