PhoneBuddy: Training Open Models for Agentic Phone Use
Open Phone-Use Agents
PhoneBuddy studies how to train open phone-use agents with both real-app execution and scalable mock-app environments. The core result is simple: real-app RL provides realism, while PhoneWorld-style mock-app training adds a resettable, automatically verified interaction signal.
📄 Paper PDF<br>📝 arXiv · Coming Soon<br>🤗 HuggingFace Paper · Coming Soon<br>🤗 HuggingFace Model<br>💻 GitHub
🤗 PhoneBuddy-4B Real+Mock<br>🤗 PhoneBuddy-4B-RealApp Real-only<br>🤗 PhoneBuddy-0.8B Real+Mock
4B open phone-use model line
150 real-phone human-evaluation tasks
83.2% AndroidWorld success rate after Real+Mock RL
+5.0 average gain over real-app RL alone
Phone-Agent Research Line
PhoneBuddy sits in a broader phone-agent stack: environments for scalable interaction, training recipes for open models, runtime harnesses for real execution, and deployment boundaries for privacy and safety.
Phone-Agent Stack
Training: PhoneBuddy
Training open phone-use models with real-app RL and PhoneWorld-style mock-app training, showing that realism and scalable verified interaction are complementary.
Paper<br>Code<br>Models
Environment: PhoneWorld
A scalable pipeline that turns real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, verifiers, and training rollouts.
Paper<br>Chinese Blog
Runtime: PhoneHarness
A mixed-action phone-agent harness and benchmark that routes across CLI, GUI, and MCP tools with trace-backed verification.
Project<br>Paper<br>Code<br>Dataset<br>Chinese Blog
Privacy: PhonePrivacy
A verifiable benchmark for privacy behavior in mobile agents, auditing permissioned access, minimal disclosure, and user-controlled memory.
Paper<br>Chinese Blog
Safety: PhoneSafety
Safety evaluation for phone-use agents, separating true safety behavior from simple incapability and checking risky mobile side effects.
Paper<br>Code
Results
PhoneBuddy-4B-Real+Mock improves the same open-model line across real-phone and AndroidWorld settings. The visual summary below separates the main capability comparison from the cross-app limitation.
PhoneBuddy-4B-Real+Mock (PB) summary: AndroidWorld 83.2<br>Average 54.8<br>Mini-App 56.0<br>Cross-App 18.0<br>Single-App 62.0
Capability Profile
The radar compares PhoneBuddy-4B-Real+Mock against GPT-5.4, Gemini 3.1 Pro, and Seed 2.0 across the same five axes. PhoneBuddy is strongest on AndroidWorld and single-app tasks, while cross-app transfer remains the visible weak axis.
Radar legend: PhoneBuddy, GPT-5.4, Gemini 3.1 Pro, Seed 2.0
Average Success Rate

| Model | Success rate (%) |
| --- | --- |
| PhoneBuddy SFT | 42.6 |
| GPT-5.4 | 48.2 |
| PhoneBuddy Real | 49.8 |
| Seed 2.0 | 51.4 |
| PhoneBuddy Real+Mock | 54.8 |
| Gemini 3.1 Pro | 59.1 |

Mixed real+mock training yields the strongest PhoneBuddy checkpoint (54.8 average), ahead of GPT-5.4 (48.2) and Seed 2.0 (51.4) and behind only Gemini 3.1 Pro (59.1) among the larger proprietary agents.
AndroidWorld

| Model | Success rate (%) |
| --- | --- |
| PhoneBuddy SFT | 60.3 |
| GPT-5.4 | 70.7 |
| Seed 2.0 | 71.5 |
| PhoneBuddy Real | 77.2 |
| Gemini 3.1 Pro | 80.2 |
| PhoneBuddy Real+Mock | 83.2 |

AndroidWorld shows a clean progression from SFT (60.3) to real-app RL (77.2) to mixed real+mock RL (83.2).
Single-App Tasks

| Model | Success rate (%) |
| --- | --- |
| PhoneBuddy SFT | 34.0 |
| Seed 2.0 | 44.0 |
| GPT-5.4 | 50.0 |
| Gemini 3.1 Pro | 50.0 |
| PhoneBuddy Real | 54.0 |
| PhoneBuddy Real+Mock | 62.0 |

GPT-5.4 and Gemini 3.1 Pro are shown separately; both score 50.0 on this slice.
WeChat Mini-App Tasks

| Model | Success rate (%) |
| --- | --- |
| GPT-5.4 | 40.0 |
| PhoneBuddy Real | 48.0 |
| PhoneBuddy SFT | 54.0 |
| PhoneBuddy Real+Mock | 56.0 |
| Gemini 3.1 Pro | 58.0 |
| Seed 2.0 | 60.0 |

Mock-app practice helps recover mini-app reliability after real-only training: success drops from 54.0 (SFT) to 48.0 (Real) and recovers to 56.0 with Real+Mock.
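| Model | Single-App | Cross-App | WeChat Mini-App | AndroidWorld | Avg. |
| --- | --- | --- | --- | --- | --- |
| PhoneBuddy-4B-SFT | 34.0 | 22.0 | 54.0 | 60.3 | 42.6 |
| PhoneBuddy-4B-Real | 54.0 | 20.0 | 48.0 | 77.2 | 49.8 |
| PhoneBuddy-4B-Real+Mock | 62.0 | 18.0 | 56.0 | 83.2 | 54.8 |
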
Evaluation covers real-phone Single-App, Cross-App, and WeChat Mini-App tasks, plus AndroidWorld.
Delta view of real-app RL over SFT and mixed real+mock RL over real-app RL.
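The delta view can be recomputed from the reported scores. The sketch below is illustrative only: the numbers are copied from the results table, while the helper names (`row`, `average`, `delta`) are hypothetical. It assumes the Avg. column is an unweighted mean of the four task columns, which is consistent with the reported values but is not stated on this page.

```python
from decimal import Decimal, ROUND_HALF_UP

# Success rates (%) copied from the results table.
COLUMNS = ("Single-App", "Cross-App", "WeChat Mini-App", "AndroidWorld")
SCORES = {
    "SFT": ("34.0", "22.0", "54.0", "60.3"),
    "Real": ("54.0", "20.0", "48.0", "77.2"),
    "Real+Mock": ("62.0", "18.0", "56.0", "83.2"),
}


def row(name):
    """Return one checkpoint's scores as exact decimals."""
    return [Decimal(value) for value in SCORES[name]]


def average(name):
    """Unweighted mean over the four columns, rounded to one decimal place.

    Assumption: this is how the Avg. column is formed.
    """
    values = row(name)
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def delta(new, old):
    """Per-column gain of checkpoint `new` over checkpoint `old`."""
    gains = {c: a - b for c, a, b in zip(COLUMNS, row(new), row(old))}
    gains["Avg."] = average(new) - average(old)
    return gains


if __name__ == "__main__":
    for new, old in (("Real", "SFT"), ("Real+Mock", "Real")):
        print(f"{new} over {old}:")
        for column, value in delta(new, old).items():
            print(f"  {column}: {value:+}")
```

Under that assumption, the Real+Mock over Real average gain works out to +5.0, and Cross-App is the only column that declines in both steps.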
Method
PhoneBuddy compares a shared SFT checkpoint, a real-app RL checkpoint, and a mixed real+mock RL checkpoint under the same backbone, action interface, and evaluation protocol.
Real-App Environment<br>Real devices and authentic apps expose account state, app logic, timing variation, permission flows, and real side effects.
PhoneWorld<br>Runnable mock apps reconstructed from real GUI usage structure provide resettable training tasks and automatic verification.
Real + Mock RL<br>The final branch keeps real execution in the loop while adding scalable mock-app interaction for broader and cheaper training signal.
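One way to picture the mixed branch is as a batch planner that decides how many rollouts run on real devices and how many run in mock apps. This is a minimal sketch under stated assumptions, not the PhoneBuddy training code: `plan_batch`, `mock_fraction`, and `min_real` are hypothetical names, and the actual mixing ratio and schedule are not given on this page.

```python
import random
from collections import Counter


def plan_batch(num_rollouts, mock_fraction, min_real=1, seed=0):
    """Assign each rollout in a batch to a "real" or "mock" environment.

    mock_fraction is a hypothetical knob, not a reported hyperparameter.
    min_real keeps at least that many real-device rollouts in every batch,
    mirroring the idea of keeping real execution in the loop.
    """
    if not 0.0 <= mock_fraction <= 1.0:
        raise ValueError("mock_fraction must be in [0, 1]")
    # Cap the mock share so that min_real real rollouts always remain.
    max_mock = max(num_rollouts - min_real, 0)
    num_mock = min(int(num_rollouts * mock_fraction), max_mock)
    num_real = num_rollouts - num_mock
    batch = ["real"] * num_real + ["mock"] * num_mock
    # Shuffle so the two sources are interleaved within the batch.
    random.Random(seed).shuffle(batch)
    return batch


if __name__ == "__main__":
    print(Counter(plan_batch(num_rollouts=8, mock_fraction=0.75)))
```

The `min_real` floor is a design choice for the sketch: even with `mock_fraction=1.0`, the batch still contains a real-device rollout.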
Real and Mock Environments
The real-app environment anchors training to authentic behavior; the mock-app environment contributes scale, reset, and automatic checking.
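The reset and automatic-checking properties of a mock app can be illustrated with a toy example. Everything below is hypothetical (`MockNotesApp`, its actions, and `verify_note_added` are invented for illustration and are not the PhoneWorld API); it only shows why explicit app state makes episodes resettable and outcomes checkable without a human judge.

```python
from dataclasses import dataclass, field


@dataclass
class MockNotesApp:
    """Toy mock app whose full state is an in-memory list of notes."""

    notes: list = field(default_factory=list)

    def reset(self):
        """Return the app to a known empty state before each episode."""
        self.notes = []

    def step(self, action, text=""):
        """Apply one agent action; unknown or invalid actions are ignored."""
        if action == "add_note" and text:
            self.notes.append(text)
        elif action == "delete_last" and self.notes:
            self.notes.pop()


def verify_note_added(app, expected):
    """Automatic verifier: the task succeeds if the expected note exists."""
    return expected in app.notes


if __name__ == "__main__":
    app = MockNotesApp()
    app.reset()
    app.step("add_note", "buy milk")
    print(verify_note_added(app, "buy milk"))  # checked against app state
    app.reset()
    print(verify_note_added(app, "buy milk"))  # state cleared by reset
```

A real app offers neither guarantee in general: account state persists across episodes, and success often has to be judged from screenshots or side effects.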
Qualitative Examples
Representative successful trajectories show structured workflows where mock-app training improves execution reliability.
Limitation & Future Work: Cross-App Generalization
PhoneWorld-style mock tasks currently emphasize single-app interaction rather than cross-app handoff. The ablation therefore compares only PhoneBuddy checkpoints and shows that...