PhoneBuddy: Training Open Models for Agentic Phone Use

Open Phone-Use Agents

PhoneBuddy studies how to train open phone-use agents with both real-app execution and scalable mock-app environments. The core result is simple: real-app RL provides realism, while PhoneWorld-style mock-app training adds a resettable, automatically verified interaction signal.

📄 Paper PDF<br>📝 arXiv · Coming Soon<br>🤗 HuggingFace Paper · Coming Soon<br>🤗 HuggingFace Model<br>💻 GitHub

🤗 PhoneBuddy-4B Real+Mock<br>🤗 PhoneBuddy-4B-RealApp Real-only<br>🤗 PhoneBuddy-0.8B Real+Mock

4B open phone-use model line

150 real-phone human-evaluation tasks

83.2% AndroidWorld success rate after Real+Mock RL

+5.0 points average gain over real-app RL alone (49.8 to 54.8)

Phone-Agent Research Line

PhoneBuddy sits in a broader phone-agent stack: environments for scalable interaction, training recipes for open models, runtime harnesses for real execution, and deployment boundaries for privacy and safety.

Phone-Agent Stack

Training: PhoneBuddy

Training open phone-use models with real-app RL and PhoneWorld-style mock-app training, showing that realism and scalable verified interaction are complementary.

Paper<br>Code<br>Models

Environment: PhoneWorld

A scalable pipeline that turns real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, verifiers, and training rollouts.

Paper<br>Chinese Blog

Runtime: PhoneHarness

A mixed-action phone-agent harness and benchmark that routes across CLI, GUI, and MCP tools with trace-backed verification.

Project<br>Paper<br>Code<br>Dataset<br>Chinese Blog

Privacy: PhonePrivacy

A verifiable benchmark for privacy behavior in mobile agents, auditing permissioned access, minimal disclosure, and user-controlled memory.

Paper<br>Chinese Blog

Safety: PhoneSafety

Safety evaluation for phone-use agents, separating true safety behavior from simple incapability and checking risky mobile side effects.

Paper<br>Code

Results

PhoneBuddy-4B-Real+Mock improves over the SFT and real-app RL checkpoints of the same open-model line on most real-phone slices and on AndroidWorld. The visual summary below separates the main capability comparison from the cross-app limitation.

PhoneBuddy (PB) radar values:<br>AndroidWorld: 83.2<br>Average: 54.8<br>Mini-App: 56.0<br>Cross-App: 18.0<br>Single-App: 62.0

Capability Profile

The radar compares PhoneBuddy-4B-Real+Mock against GPT-5.4, Gemini 3.1 Pro, and Seed 2.0 across the same five axes. PhoneBuddy is strongest on AndroidWorld and single-app tasks, while cross-app transfer remains the visible weak axis.

Legend: PhoneBuddy, GPT-5.4, Gemini 3.1 Pro, Seed 2.0

Average Success Rate

PhoneBuddy SFT: 42.6<br>GPT-5.4: 48.2<br>PhoneBuddy Real: 49.8<br>Seed 2.0: 51.4<br>PhoneBuddy Real+Mock: 54.8<br>Gemini 3.1 Pro: 59.1

Mixed real+mock training yields the strongest PhoneBuddy checkpoint at 54.8 average, ahead of GPT-5.4 (48.2) and Seed 2.0 (51.4) and 4.3 points behind Gemini 3.1 Pro (59.1).

AndroidWorld

PhoneBuddy SFT: 60.3<br>GPT-5.4: 70.7<br>Seed 2.0: 71.5<br>PhoneBuddy Real: 77.2<br>Gemini 3.1 Pro: 80.2<br>PhoneBuddy Real+Mock: 83.2

AndroidWorld shows a clean progression: 60.3 after SFT, 77.2 after real-app RL, and 83.2 after mixed real+mock RL.

Single-App Tasks

PhoneBuddy SFT: 34.0<br>Seed 2.0: 44.0<br>GPT-5.4: 50.0<br>Gemini 3.1 Pro: 50.0<br>PhoneBuddy Real: 54.0<br>PhoneBuddy Real+Mock: 62.0

GPT-5.4 and Gemini 3.1 Pro are shown separately; both score 50.0 on this slice.

WeChat Mini-App Tasks

GPT-5.4: 40.0<br>PhoneBuddy Real: 48.0<br>PhoneBuddy SFT: 54.0<br>PhoneBuddy Real+Mock: 56.0<br>Gemini 3.1 Pro: 58.0<br>Seed 2.0: 60.0

Mock-app practice recovers mini-app reliability after real-only training: the score drops from 54.0 (SFT) to 48.0 (real-app RL) and returns to 56.0 with real+mock RL.

| Model | Single-App | Cross-App | WeChat Mini-App | AndroidWorld | Avg. |
|---|---|---|---|---|---|
| PhoneBuddy-4B-SFT | 34.0 | 22.0 | 54.0 | 60.3 | 42.6 |
| PhoneBuddy-4B-Real | 54.0 | 20.0 | 48.0 | 77.2 | 49.8 |
| PhoneBuddy-4B-Real+Mock | 62.0 | 18.0 | 56.0 | 83.2 | 54.8 |
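The page does not say how the Avg. column is computed. The reported values appear consistent with an unweighted mean of the four evaluation slices, and the short check below tests that reading. The dictionary keys are illustrative names chosen here, and the averaging rule is an inference from the numbers, not a statement from the authors.

```python
# Success rates (%) copied from the results table.
SCORES = {
    "PhoneBuddy-4B-SFT": {"single_app": 34.0, "cross_app": 22.0, "mini_app": 54.0, "android_world": 60.3},
    "PhoneBuddy-4B-Real": {"single_app": 54.0, "cross_app": 20.0, "mini_app": 48.0, "android_world": 77.2},
    "PhoneBuddy-4B-Real+Mock": {"single_app": 62.0, "cross_app": 18.0, "mini_app": 56.0, "android_world": 83.2},
}

# Avg. column as reported on the page.
REPORTED_AVG = {
    "PhoneBuddy-4B-SFT": 42.6,
    "PhoneBuddy-4B-Real": 49.8,
    "PhoneBuddy-4B-Real+Mock": 54.8,
}


def unweighted_mean(row):
    """Assumed rule: plain mean over the four slices, each weighted equally."""
    return sum(row.values()) / len(row)


for model, row in SCORES.items():
    print(f"{model}: computed {unweighted_mean(row):.3f}, reported {REPORTED_AVG[model]}")
```

Under this assumption, each computed mean rounds to the reported value at one decimal place.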

Evaluation covers real-phone Single-App, Cross-App, and WeChat Mini-App tasks, plus AndroidWorld.

Delta view of real-app RL over SFT and mixed real+mock RL over real-app RL.
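The two deltas named in this caption can be recomputed directly from the reported table values. The sketch below does only that arithmetic; the variable and function names are illustrative, and the numbers are the ones printed on this page.

```python
SLICES = ["Single-App", "Cross-App", "WeChat Mini-App", "AndroidWorld", "Avg."]

# Success rates (%) per checkpoint, in SLICES order, from the results table.
TABLE = {
    "SFT": [34.0, 22.0, 54.0, 60.3, 42.6],
    "Real": [54.0, 20.0, 48.0, 77.2, 49.8],
    "Real+Mock": [62.0, 18.0, 56.0, 83.2, 54.8],
}


def delta(after, before):
    """Per-slice difference in points, rounded to one decimal like the table."""
    return {s: round(a - b, 1) for s, a, b in zip(SLICES, TABLE[after], TABLE[before])}


real_over_sft = delta("Real", "SFT")
mock_over_real = delta("Real+Mock", "Real")

for name, d in [("Real - SFT", real_over_sft), ("Real+Mock - Real", mock_over_real)]:
    print(name, d)
```

Both steps lower Cross-App by 2.0 points, which matches the limitation discussed later on the page, while the Real+Mock step raises every other slice.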

Method

PhoneBuddy compares a shared SFT checkpoint, a real-app RL checkpoint, and a mixed real+mock RL checkpoint under the same backbone, action interface, and evaluation protocol.

Real-App Environment<br>Real devices and authentic apps expose account state, app logic, timing variation, permission flows, and real side effects.

PhoneWorld<br>Runnable mock apps reconstructed from real GUI usage structure provide resettable training tasks and automatic verification.

Real + Mock RL<br>The final branch keeps real execution in the loop while adding scalable mock-app interaction for broader and cheaper training signal.
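One way to picture "keeping real execution in the loop while adding mock-app interaction" is a rollout batch that draws tasks from both pools. The sketch below is purely illustrative: `Task`, `sample_batch`, and especially `mock_fraction` are hypothetical names, and the page does not state the mixing ratio or the sampling scheme PhoneBuddy actually uses.

```python
import random
from dataclasses import dataclass


@dataclass
class Task:
    source: str  # "real" or "mock"
    name: str


def sample_batch(real_tasks, mock_tasks, batch_size, mock_fraction, rng):
    """Draw a batch in which about mock_fraction of the tasks come from mock apps.

    mock_fraction is an assumed knob for illustration, not a reported setting.
    """
    n_mock = round(batch_size * mock_fraction)
    n_real = batch_size - n_mock
    # Sample with replacement so small task pools still fill the batch.
    batch = rng.choices(mock_tasks, k=n_mock) + rng.choices(real_tasks, k=n_real)
    rng.shuffle(batch)
    return batch


real_pool = [Task("real", f"real-task-{i}") for i in range(3)]
mock_pool = [Task("mock", f"mock-task-{i}") for i in range(10)]
batch = sample_batch(real_pool, mock_pool, batch_size=8, mock_fraction=0.75, rng=random.Random(0))
print([t.source for t in batch])
```

A fixed per-batch split like this keeps some real-app rollouts in every update; other schedules, such as alternating phases, would be equally compatible with the description above.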

Real and Mock Environments

The real-app environment anchors training to authentic behavior; the mock-app environment contributes scale, reset, and automatic checking.
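To make "reset and automatic checking" concrete, here is a toy mock app whose state lives in memory, so an episode can be reset instantly and scored by inspecting state instead of judging screenshots. PhoneWorld's actual environment API is not described on this page; `MockNotesApp`, `verify_note_added`, and `run_episode` are hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class MockNotesApp:
    """Toy stand-in for a mock app: the whole app state is a Python list."""
    notes: list = field(default_factory=list)

    def reset(self):
        # Resetting is cheap because no real account or device is involved.
        self.notes = []

    def step(self, action):
        # Only one action type is modeled; unknown actions are ignored.
        if action["type"] == "add_note":
            self.notes.append(action["text"])


def verify_note_added(app, expected_text):
    """Automatic verifier: checks the app state directly."""
    return expected_text in app.notes


def run_episode(app, actions, expected_text):
    """Reset, replay the agent's actions, and return a binary reward."""
    app.reset()
    for action in actions:
        app.step(action)
    return 1.0 if verify_note_added(app, expected_text) else 0.0


app = MockNotesApp()
print(run_episode(app, [{"type": "add_note", "text": "buy milk"}], "buy milk"))
print(run_episode(app, [], "buy milk"))
```

The second call scores 0.0 because `reset` clears the note left by the first episode. Real apps generally cannot be reset or inspected this way, which is the trade-off the paragraph above describes.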

Qualitative Examples

Representative successful trajectories show structured workflows where mock-app training improves execution reliability.

Limitation & Future Work: Cross-App Generalization

PhoneWorld-style mock tasks currently emphasize single-app interaction rather than cross-app handoff. The ablation therefore compares only PhoneBuddy checkpoints and shows that...
