An Android UI dump for LLMs (10× fewer tokens, same actions) - Handsets
elliotgao2/handsets
An Android UI dump for LLMs (10× fewer tokens, same actions)
When an LLM agent drives an Android device, the loop looks like:
1. Take a screenshot (or screen description)
2. Decide what to tap
3. Tap it
4. Repeat
For step 1, the canonical answer is uiautomator dump: the XML hierarchy that uiautomator2, Appium, and most Android automation tools return. On the Settings home screen of an emulator I just measured, that XML is 22.3 KB / 5,762 GPT-4 tokens.
The same screen rendered through Handsets' hs ui -i is 3.3 KB / 729 tokens: about 8× fewer tokens, and 10–13× on simpler screens. The agent's decision quality doesn't change.
Here's how we got there, and why every byte that disappeared is a byte the LLM didn't need.
Three screens, two formats
| screen | uiautomator dump (XML) | hs ui -i | ratio (tokens) |
|---|---|---|---|
| Launcher home | 12.0 KB / 3,153 tok | 1.1 KB / 246 tok | 12.8× |
| Settings home | 22.3 KB / 5,762 tok | 3.3 KB / 729 tok | 7.9× |
| Settings → Apps | 15.2 KB / 4,050 tok | 0.9 KB / 320 tok | 12.7× |
Token counts are from tiktoken with the GPT-4 encoding; reproducer at the bottom. The ratio is bigger on screens where the layout tree is deeper than the labeled content, and smaller on screens like Settings home where almost every label is a real TextView with a real id.
A typical agent loop step now carries ~1k tokens of UI dump instead of ~5k. Across a 50-step trajectory that is roughly 200k fewer tokens of context, which is real money once you're paying per token.
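As a rough check on the arithmetic, the sketch below recomputes the token ratios from the table and estimates the context saved over a 50-step trajectory. It assumes every step carries exactly one UI dump and, purely for illustration, that every step looks like the Settings home screen; real trajectories will mix screens.

```python
# Token counts copied from the table above: (XML tokens, hs ui -i tokens).
SCREENS = {
    "Launcher home": (3153, 246),
    "Settings home": (5762, 729),
    "Settings -> Apps": (4050, 320),
}


def ratio(xml_tokens: int, flat_tokens: int) -> float:
    """How many times smaller the flat dump is, measured in tokens."""
    return xml_tokens / flat_tokens


def trajectory_savings(xml_tokens: int, flat_tokens: int, steps: int = 50) -> int:
    """Tokens saved over a trajectory, assuming one dump per step."""
    return (xml_tokens - flat_tokens) * steps


for name, (xml_tok, flat_tok) in SCREENS.items():
    print(f"{name}: {ratio(xml_tok, flat_tok):.1f}x")

# Illustrative assumption: all 50 steps resemble Settings home.
print(trajectory_savings(*SCREENS["Settings home"]))  # 251650
```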
The XML you start with
The first ~1.2 KB of uiautomator dump from the launcher home, which covers exactly the outer three layout nodes of the tree:
<hierarchy rotation="0">
  <node index="0" text="" resource-id="" class="android.widget.FrameLayout"
        package="com.google.android.apps.nexuslauncher" content-desc=""
        checkable="false" checked="false" clickable="false" enabled="true"
        focusable="false" focused="false" scrollable="false"
        long-clickable="false" password="false" selected="false"
        bounds="[0,0][1440,3120]">
    <node index="0" text="" resource-id="" class="android.widget.LinearLayout"
          package="com.google.android.apps.nexuslauncher" content-desc=""
          checkable="false" ... >
      <node index="0" text="" resource-id="android:id/content"
            class="android.widget.FrameLayout" ... >
These three nodes are 100% noise for an agent. They have no text, no content description, and no interactivity. They exist because Android renders surfaces by nesting FrameLayout inside LinearLayout inside FrameLayout. A clickable="false" attribute is a string the LLM has to read in order to learn that nothing is happening here.
The rest of the dump is the same pattern, deeper. By the time the XML reaches an actual tappable widget (say, a TextView with text="Phone") it has accumulated a dozen ancestors and about 600 bytes of structural padding.
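To see how much of a dump is this kind of padding, a small script can count the nodes that carry no label and no positive behavior flag. This is a minimal sketch using Python's standard xml.etree.ElementTree. The choice of flags in POSITIVE_FLAGS is an assumption about what counts as "interactive", and SAMPLE is a trimmed, hypothetical hierarchy, not a real dump.

```python
import xml.etree.ElementTree as ET

# Assumption: these flags are what make a node interesting to an agent.
POSITIVE_FLAGS = ("clickable", "long-clickable", "scrollable", "checkable")


def is_noise(node: ET.Element) -> bool:
    """True when a node has no label and no positive behavior flag."""
    has_label = bool(node.get("text") or node.get("content-desc"))
    has_flag = any(node.get(flag) == "true" for flag in POSITIVE_FLAGS)
    return not (has_label or has_flag)


def noise_report(xml_text: str) -> tuple:
    """Return (noise node count, total node count) for a uiautomator dump."""
    root = ET.fromstring(xml_text)
    nodes = list(root.iter("node"))
    return sum(1 for n in nodes if is_noise(n)), len(nodes)


# Hypothetical, heavily trimmed hierarchy for illustration only.
SAMPLE = """<hierarchy rotation="0">
  <node text="" content-desc="" class="android.widget.FrameLayout" clickable="false">
    <node text="" content-desc="" class="android.widget.LinearLayout" clickable="false">
      <node text="Phone" content-desc="" class="android.widget.TextView"
            clickable="true" long-clickable="true" />
    </node>
  </node>
</hierarchy>"""

print(noise_report(SAMPLE))  # (2, 3)
```

On a real dump, pass the contents of the XML file to noise_report instead of SAMPLE.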
The flat table you want
Here is hs ui -i for the entire launcher home screen:
@(720,383) long ViewPager #smartspace_card_pager desc="At a glance"
@(279,374) click TextView #date "Fri, May 22"
@(555,2063) click,long TextView "Gmail"
@(884,2063) click,long TextView "Photos"
@(1213,2063) click,long TextView "YouTube"
@(720,1590) View desc="Home"
@(226,2546) click,long TextView "Phone"
@(555,2546) click,long TextView "Messages"
@(884,2546) click,long TextView "Chrome"
@(1213,2546) click,long TextView "YouTube"
@(720,2862) click,long FrameLayout #search_container_hotseat desc="Google search"
@(218,2862) click ImageView #g_icon desc="Google app"
@(1054,2862) click ImageView #mic_icon desc="Voice search"
@(1222,2862) click ImageButton #lens_icon desc="Google Lens"
Fourteen lines, 246 tokens. Every line is a thing the agent can decide about. Every line has a coordinate to feed to tap, the action tags, and the label to match against. No closing tags, no namespace prefixes, no attributes whose value is "false".
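Because each line is self-contained, a harness can look up a tap target by label with a single regular expression. The sketch below is not part of Handsets: the pattern is inferred only from the sample output above and may not cover every line the tool can emit, and parse_line, find_by_label and Element are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the sample output; an assumption, not a format spec.
LINE_RE = re.compile(
    r'^@\((\d+),(\d+)\)'          # center coordinates
    r'(?:\s+([a-z,]+))?'          # optional behavior tags, e.g. click,long
    r'\s+([A-Z]\w*)'              # short class name
    r'(?:\s+#(\S+))?'             # optional short id
    r'(?:\s+(?:desc=)?"(.*)")?$'  # optional label (text or desc)
)


class Element(NamedTuple):
    x: int
    y: int
    tags: tuple
    cls: str
    res_id: Optional[str]
    label: Optional[str]


def parse_line(line: str) -> Optional[Element]:
    """Parse one flat-table line; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    x, y, tags, cls, res_id, label = m.groups()
    return Element(int(x), int(y), tuple(tags.split(",")) if tags else (),
                   cls, res_id, label)


def find_by_label(dump: str, label: str) -> Optional[Element]:
    """Return the first element whose label equals `label`."""
    for line in dump.splitlines():
        el = parse_line(line)
        if el is not None and el.label == label:
            return el
    return None


SAMPLE = '''@(720,1590) View desc="Home"
@(226,2546) click,long TextView "Phone"
@(218,2862) click ImageView #g_icon desc="Google app"
'''

phone = find_by_label(SAMPLE, "Phone")
print(phone.x, phone.y, phone.tags)  # 226 2546 ('click', 'long')
```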
The four columns, left to right:
1. Center coordinates: @(x,y). What you tap. Not the bounds rectangle.
2. Behavior tags: click, long, scroll, check, checked, password. What this widget responds to. Only the positive flags appear.
3. Class + id: short forms. android.widget.Button collapses to Button; com.android.settings:id/title collapses to #title.
4. Label: "text" or desc="content-description". The accessibility-curated string a human (and the LLM) actually reads.
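The four columns can be derived from a single uiautomator node's attributes. The following is a minimal sketch of that mapping, not the Handsets implementation. The flag-to-tag table follows the tag list above, but the attribute each tag comes from is an assumption, as is preferring text over content-desc when both are present. The node in the example is hypothetical.

```python
import re

# Assumed mapping from uiautomator attributes to the short behavior tags.
FLAG_TAGS = [
    ("clickable", "click"),
    ("long-clickable", "long"),
    ("scrollable", "scroll"),
    ("checkable", "check"),
    ("checked", "checked"),
    ("password", "password"),
]


def center(bounds: str) -> tuple:
    """Turn '[x1,y1][x2,y2]' into the integer center point (x, y)."""
    x1, y1, x2, y2 = map(int, re.findall(r"-?\d+", bounds))
    return (x1 + x2) // 2, (y1 + y2) // 2


def flat_line(attrs: dict) -> str:
    """Render one node's attributes as a flat-table line."""
    x, y = center(attrs["bounds"])
    parts = [f"@({x},{y})"]
    # Only flags whose value is "true" produce a tag.
    tags = [tag for key, tag in FLAG_TAGS if attrs.get(key) == "true"]
    if tags:
        parts.append(",".join(tags))
    # android.widget.Button -> Button
    parts.append(attrs.get("class", "").rsplit(".", 1)[-1])
    # com.android.settings:id/title -> #title
    res_id = attrs.get("resource-id", "")
    if res_id:
        parts.append("#" + res_id.rsplit("/", 1)[-1])
    # Assumption: prefer text, fall back to content-desc.
    if attrs.get("text"):
        parts.append(f'"{attrs["text"]}"')
    elif attrs.get("content-desc"):
        parts.append(f'desc="{attrs["content-desc"]}"')
    return " ".join(parts)


# Hypothetical node, for illustration only.
node = {
    "class": "android.widget.Button",
    "resource-id": "com.android.settings:id/title",
    "text": "Network",
    "clickable": "true",
    "enabled": "true",
    "focused": "false",
    "bounds": "[0,0][200,100]",
}
print(flat_line(node))  # @(100,50) click Button #title "Network"
```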
What we threw away
Six categories of node and attribute disappeared between the XML and the table.
1. Empty layout containers.
A FrameLayout / LinearLayout / ConstraintLayout with no text, no content-description, and no clickable/scrollable flag is a structural artifact of Android's renderer. Children carry the labels; a tap at a child's coordinates still reaches the parent's onClick (if any). We drop the whole chain of layout ancestors and keep the children that carry labels.
2. Attributes whose value is the default.
checkable="false", enabled="true", focused="false". XML serialises every...