PhoneBuddy: Training Open Models for Agentic Phone Use

Open Phone-Use Agents

PhoneBuddy studies how to train open phone-use agents with both real-app execution and scalable mock-app environments. The core result is simple: real-app RL provides realism, while PhoneWorld-style mock-app training adds resettable and automatically verified interaction signal.


Models: PhoneBuddy-4B (Real+Mock), PhoneBuddy-4B-RealApp (Real-only), PhoneBuddy-0.8B (Real+Mock)

4B open phone-use model line

150 real-phone human-evaluation tasks

83.2% AndroidWorld success rate after Real+Mock RL

+5.0 average gain over real-app RL alone

Phone-Agent Research Line

PhoneBuddy sits in a broader phone-agent stack: environments for scalable interaction, training recipes for open models, runtime harnesses for real execution, and deployment boundaries for privacy and safety.

Phone-Agent Stack

Training: PhoneBuddy

Training open phone-use models with real-app RL and PhoneWorld-style mock-app training, showing that realism and scalable verified interaction are complementary.


Environment: PhoneWorld

A scalable pipeline that turns real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, verifiers, and training rollouts.


Runtime: PhoneHarness

A mixed-action phone-agent harness and benchmark that routes across CLI, GUI, and MCP tools with trace-backed verification.


Privacy: PhonePrivacy

A verifiable benchmark for privacy behavior in mobile agents, auditing permissioned access, minimal disclosure, and user-controlled memory.


Safety: PhoneSafety

Safety evaluation for phone-use agents, separating true safety behavior from simple incapability and checking risky mobile side effects.


Results

PhoneBuddy-4B-Real+Mock improves on the SFT and real-only checkpoints of the same open-model line on AndroidWorld and on real-phone single-app and mini-app tasks. The visual summary below separates the main capability comparison from the cross-app limitation.

PhoneBuddy-4B-Real+Mock (PB) scores: AndroidWorld 83.2, Average 54.8, Mini-App 56.0, Cross-App 18.0, Single-App 62.0.

Capability Profile

The radar compares PhoneBuddy-4B-Real+Mock against GPT-5.4, Gemini 3.1 Pro, and Seed 2.0 across the same five axes. PhoneBuddy is strongest on AndroidWorld and single-app tasks, while cross-app transfer remains the visible weak axis.


Average Success Rate

PhoneBuddy SFT: 42.6
GPT-5.4: 48.2
PhoneBuddy Real: 49.8
Seed 2.0: 51.4
PhoneBuddy Real+Mock: 54.8
Gemini 3.1 Pro: 59.1

Mixed real+mock training yields the strongest PhoneBuddy checkpoint (54.8 average), ahead of GPT-5.4 (48.2) and Seed 2.0 (51.4) and behind only Gemini 3.1 Pro (59.1).

AndroidWorld

PhoneBuddy SFT: 60.3
GPT-5.4: 70.7
Seed 2.0: 71.5
PhoneBuddy Real: 77.2
Gemini 3.1 Pro: 80.2
PhoneBuddy Real+Mock: 83.2

AndroidWorld shows a clean progression from SFT (60.3) to real-app RL (77.2) to mixed real+mock RL (83.2).

Single-App Tasks

PhoneBuddy SFT: 34.0
Seed 2.0: 44.0
GPT-5.4: 50.0
Gemini 3.1 Pro: 50.0
PhoneBuddy Real: 54.0
PhoneBuddy Real+Mock: 62.0

GPT and Gemini are shown separately; both score 50.0 on this slice.

WeChat Mini-App Tasks

GPT-5.4: 40.0
PhoneBuddy Real: 48.0
PhoneBuddy SFT: 54.0
PhoneBuddy Real+Mock: 56.0
Gemini 3.1 Pro: 58.0
Seed 2.0: 60.0

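Mock-app practice recovers mini-app reliability after real-only training: success drops from 54.0 (SFT) to 48.0 with real-app RL alone, then rises to 56.0 with mixed real+mock RL.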

Model                      Single-App  Cross-App  WeChat Mini-App  AndroidWorld  Avg.
PhoneBuddy-4B-SFT          34.0        22.0       54.0             60.3          42.6
PhoneBuddy-4B-Real         54.0        20.0       48.0             77.2          49.8
PhoneBuddy-4B-Real+Mock    62.0        18.0       56.0             83.2          54.8

Evaluation covers real-phone Single-App, Cross-App, and WeChat Mini-App tasks, plus AndroidWorld.

Delta view of real-app RL over SFT and mixed real+mock RL over real-app RL.
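The deltas can be recomputed directly from the reported success rates. The sketch below is illustrative only: the dictionary layout and the function names `average` and `delta` are assumptions, not part of the PhoneBuddy code. It also assumes the Avg. column is the unweighted mean of the four task columns, which reproduces the reported averages.

```python
# Success rates (%) copied from the PhoneBuddy results on this page.
COLUMNS = ["Single-App", "Cross-App", "WeChat Mini-App", "AndroidWorld"]
SCORES = {
    "SFT": [34.0, 22.0, 54.0, 60.3],
    "Real": [54.0, 20.0, 48.0, 77.2],
    "Real+Mock": [62.0, 18.0, 56.0, 83.2],
}


def average(scores):
    """Unweighted mean over the four task columns, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


def delta(after, before):
    """Per-column gain of checkpoint `after` over checkpoint `before`."""
    gains = {
        column: round(a - b, 1)
        for column, a, b in zip(COLUMNS, SCORES[after], SCORES[before])
    }
    # The average delta is taken between the rounded averages, as reported.
    gains["Avg."] = round(average(SCORES[after]) - average(SCORES[before]), 1)
    return gains


if __name__ == "__main__":
    for after, before in [("Real", "SFT"), ("Real+Mock", "Real")]:
        print(f"{after} over {before}: {delta(after, before)}")
```

Under these assumptions, the mixed checkpoint gains on every column except Cross-App, where both RL stages lose 2.0 points relative to the previous checkpoint.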

Method

PhoneBuddy compares a shared SFT checkpoint, a real-app RL checkpoint, and a mixed real+mock RL checkpoint under the same backbone, action interface, and evaluation protocol.

Real-App Environment: Real devices and authentic apps expose account state, app logic, timing variation, permission flows, and real side effects.

PhoneWorld: Runnable mock apps reconstructed from real GUI usage structure provide resettable training tasks and automatic verification.

Real + Mock RL: The final branch keeps real execution in the loop while adding scalable mock-app interaction for broader and cheaper training signal.

Real and Mock Environments

The real-app environment anchors training to authentic behavior; the mock-app environment contributes scale, reset, and automatic checking.
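To make "reset" and "automatic checking" concrete, here is a minimal sketch of a mock-app episode. It is not the PhoneWorld implementation. The `MockNotesApp` class, its action dictionary format, and the `verify_note_saved` and `run_episode` helpers are all hypothetical, chosen only to show how a resettable state and a state-based verifier can yield a reward without a human judge.

```python
from dataclasses import dataclass, field


@dataclass
class MockNotesApp:
    """Hypothetical mock app: a notes list driven by a tiny action interface."""

    notes: list = field(default_factory=list)
    draft: str = ""

    def reset(self):
        """Restore the known initial state so every episode starts identically."""
        self.notes = []
        self.draft = ""
        return self.observe()

    def observe(self):
        """Return a copy of the visible state (stand-in for a screenshot)."""
        return {"notes": list(self.notes), "draft": self.draft}

    def step(self, action):
        """Apply one action; unrecognized actions leave the state unchanged."""
        if action["type"] == "type_text":
            self.draft += action["text"]
        elif action["type"] == "tap" and action.get("target") == "save":
            if self.draft:
                self.notes.append(self.draft)
                self.draft = ""
        return self.observe()


def verify_note_saved(app, expected_text):
    """Automatic verifier: inspects app state directly instead of asking a judge."""
    return expected_text in app.notes


def run_episode(app, actions, expected_text):
    """Reset, replay the actions, and return a binary reward from the verifier."""
    app.reset()
    for action in actions:
        app.step(action)
    return 1.0 if verify_note_saved(app, expected_text) else 0.0
```

A real device cannot be rolled back this cheaply, since account state and side effects persist, which is why the page treats the two environments as complementary rather than interchangeable.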

Qualitative Examples

Representative successful trajectories show structured workflows where mock-app training improves execution reliability.

Limitation & Future Work: Cross-App Generalization

PhoneWorld-style mock tasks currently emphasize single-app interaction rather than cross-app handoff. The ablation therefore compares only PhoneBuddy checkpoints and shows that...
