Cross-platform desktop automation through accessibility APIs

crowecawcaw blog
2026-05-30

xa11y provides a Playwright-style API for driving desktop applications through their accessibility tree on Windows, macOS, and Linux. It's a Rust library with additional Python and JavaScript bindings.

The library provides a foundation for desktop testing, automation, and accessibility software. Additionally, these accessibility APIs are a more robust mechanism for building computer use agents, which until recently have relied primarily on running vision models on screenshots (flaky, slow, token heavy).

The simple interface is modeled after Playwright and CSS selectors. For example, to drive the macOS Calculator app:

use xa11y::*;
use std::time::Duration;

let calc = App::by_name("Calculator", Duration::from_secs(5))?;
calc.locator("button[name='7']").press()?;
calc.locator("button[name='+']").press()?;
calc.locator("button[name='3']").press()?;
calc.locator("button[name='=']").press()?;

let display = calc.locator("static_text").first().element()?;
assert_eq!(display.data().value.as_deref(), Some("10"));
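
To make the selector syntax concrete, here is a minimal sketch of how a selector such as button[name='7'] could be parsed and matched against an accessibility tree. It is not xa11y's selector engine: Node, Selector, parse_selector, node_matches, find_all, leaf, and sample_tree are hypothetical names for illustration, and only the two forms role and role[name='value'] are handled.

```rust
// Hypothetical types for illustration; these are not xa11y's real data structures.
#[derive(Debug)]
struct Node {
    role: String,
    name: Option<String>,
    children: Vec<Node>,
}

/// A parsed selector of the form `role` or `role[name='value']`.
#[derive(Debug, PartialEq)]
struct Selector {
    role: String,
    name: Option<String>,
}

/// Parses `role` or `role[name='value']`; returns None for anything else.
fn parse_selector(input: &str) -> Option<Selector> {
    let input = input.trim();
    let (role, name) = match input.find('[') {
        None => (input, None),
        Some(open) => {
            // Everything after the role must be exactly [name='...'].
            let value = input[open..].strip_prefix("[name='")?.strip_suffix("']")?;
            (&input[..open], Some(value.to_string()))
        }
    };
    if role.is_empty() {
        return None;
    }
    Some(Selector { role: role.to_string(), name })
}

/// True when the node has the selector's role and, if requested, its name.
fn node_matches(node: &Node, sel: &Selector) -> bool {
    let name_ok = match &sel.name {
        Some(wanted) => node.name.as_deref() == Some(wanted.as_str()),
        None => true,
    };
    node.role == sel.role && name_ok
}

/// Collects every matching node in document (pre-order) order.
fn find_all<'a>(root: &'a Node, sel: &Selector) -> Vec<&'a Node> {
    let mut hits = Vec::new();
    let mut stack = vec![root];
    while let Some(node) = stack.pop() {
        if node_matches(node, sel) {
            hits.push(node);
        }
        // Push children in reverse so the first child is visited first.
        stack.extend(node.children.iter().rev());
    }
    hits
}

fn leaf(role: &str, name: &str) -> Node {
    Node { role: role.to_string(), name: Some(name.to_string()), children: Vec::new() }
}

/// A small hand-built tree shaped like a calculator window.
fn sample_tree() -> Node {
    let buttons: Vec<Node> = ["7", "8", "9", "+", "3", "="]
        .iter()
        .map(|label| leaf("button", label))
        .collect();
    let group = Node { role: "group".to_string(), name: None, children: buttons };
    let window = Node {
        role: "window".to_string(),
        name: Some("Calculator".to_string()),
        children: vec![leaf("static_text", "10"), group],
    };
    Node {
        role: "application".to_string(),
        name: Some("Calculator".to_string()),
        children: vec![window],
    }
}
```

With this sketch, matching button[name='7'] against sample_tree() yields the single button named 7, while the bare selector button yields all six buttons in the sample tree. A real engine also has to cope with the tree changing between the query and the action, which this sketch ignores.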

Under the hood, the Cargo workspace splits along platform lines: xa11y-core holds the shared types and selector engine; xa11y-windows, xa11y-macos, and xa11y-linux each wrap one platform's FFI (COM via the windows crate, Core Foundation via core-foundation, and D-Bus via zbus); and the top-level xa11y crate conditionally compiles the right backend per target. The Python and Node packages are separate crates layered on top via pyo3 and napi-rs. Isolating each FFI surface in its own crate keeps unsafe and platform cfgs out of xa11y-core and lets each backend evolve against its own native idioms.

The library wrangles a lot of complexity to produce this simple interface. Each platform has its own unique accessibility system - UIA on Windows, AXUIElement on macOS, and AT-SPI2 on Linux - which have different semantics, query patterns, and performance. Accessibility trees change as an interface rerenders, making it challenging to route an action to the right element.

Cross-platform differences

Every platform has accessibility APIs that are conceptually similar at a high level: desktop UIs are represented by trees of elements which have various properties and which can accept updates and actions. The Calculator snippet above, for instance, walks a tree that looks roughly like this:

application "Calculator"
└── window "Calculator"
    ├── static_text "10"
    └── group
        ├── button "7"
        ├── button "8"
        ├── button "9"
        ├── button "+"
        ├── button "3"
        ├── button "="
        └── ... (other digits and operators)

Up close though, the APIs diverge.

Windows UIA has the most structured data model. Each UI element is assigned a role from a fixed list and supports a set of actions ("control patterns") also from a fixed list. Because roles and actions are standardized enums, the elements are easier to programmatically interpret.
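
As a rough illustration of why fixed enums are easier to work with, the sketch below models a handful of UIA-style control types and control patterns as Rust enums and picks the pattern a generic "press" would use. The names Button, CheckBox, ComboBox, Edit, Invoke, Toggle, ExpandCollapse, and Value correspond to real UIA concepts, but the Element struct, press_pattern, is_clickable, and the preference order are assumptions made for this example, not xa11y's actual Windows backend.

```rust
/// A few UIA-style control types; the real list is much longer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ControlType {
    Button,
    CheckBox,
    ComboBox,
    Edit,
}

/// A few UIA-style control patterns (the actions an element supports).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Pattern {
    Invoke,
    Toggle,
    ExpandCollapse,
    Value,
}

/// Hypothetical snapshot of one element's typed accessibility data.
struct Element {
    control_type: ControlType,
    patterns: Vec<Pattern>,
}

/// Picks the pattern a generic "press" should use, in an assumed preference order.
/// Returns None when the element supports none of the press-like patterns.
fn press_pattern(el: &Element) -> Option<Pattern> {
    [Pattern::Invoke, Pattern::Toggle, Pattern::ExpandCollapse]
        .iter()
        .copied()
        .find(|p| el.patterns.contains(p))
}

/// Because the enum is closed, the compiler checks that every control type is handled.
fn is_clickable(control_type: ControlType) -> bool {
    match control_type {
        ControlType::Button | ControlType::CheckBox | ControlType::ComboBox => true,
        ControlType::Edit => false,
    }
}
```

Adding a new variant to ControlType makes the match in is_clickable fail to compile until the new case is handled, which is the kind of guarantee untyped role strings cannot give.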
For reading accessibility data, Windows can prefetch a whole subtree in a single call (FindAllBuildCache + CacheRequest), which is quick and efficient.

On macOS, the AXUIElement data model is more flexible. UI elements are identified by role and subrole strings (e.g. role "AXButton" with subrole "AXDisclosureTriangle"). There are conventions for what these strings should be (e.g. most start with "AX", buttons are usually "AXButton"), but the conventions are not enforced. As a result, when looking at the accessibility data we can ask, "Is this element a button?" and evaluate some heuristics, but ultimately the data is untyped and can contain any value the desktop application chooses to report.

AXUIElements accept actions in two ways. Some actions, like updating the text in an input box, are done as a property update on the element. Others, like pressing a button, are done by invoking an action. Like roles, action names are untyped strings with conventions but no rules.

When reading data, macOS does not have an API for reading the entire tree. Instead, each element's attributes need to be read individually. Fortunately, these calls are relatively quick, and the attributes for a single element can be read in a single batched AXUIElementCopyMultipleAttributeValues call.

Linux's AT-SPI2 system sits in between Windows UIA's structure and macOS's flexibility. AT-SPI2 has strong conventions for element roles, which are modeled as an enum (with a registration mechanism for custom roles), while action names are untyped strings with conventional values like "click" or "press", but no enforced enum. Most UIs use standard role and action names, but we still need a way to handle custom ones. The main challenge I found with AT-SPI2 is its performance. Windows UIA supports reading a whole a11y tree at once and macOS's AXUIElement API allows batch reads, but AT-SPI2 requires individual API...
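
One way to handle both conventional and custom role strings is to map the known ones onto a shared enum and keep anything unrecognized as raw text. The sketch below does this for a few macOS role strings. AXButton, AXStaticText, AXWindow, and AXGroup are standard macOS role names, but the Role enum, role_from_ax, and selector_name are hypothetical and are not taken from xa11y.

```rust
/// Hypothetical cross-platform role type with an escape hatch for custom roles.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Role {
    Button,
    StaticText,
    Window,
    Group,
    /// Any role string the table below does not recognize, kept verbatim.
    Other(String),
}

/// Maps a macOS role string onto the shared enum.
/// Unknown strings are preserved rather than rejected, because applications
/// can report any value they like.
fn role_from_ax(role: &str) -> Role {
    match role {
        "AXButton" => Role::Button,
        "AXStaticText" => Role::StaticText,
        "AXWindow" => Role::Window,
        "AXGroup" => Role::Group,
        other => Role::Other(other.to_string()),
    }
}

/// The name a selector would use for this role, e.g. `button` or `static_text`.
/// Custom roles fall back to their raw string.
fn selector_name(role: &Role) -> &str {
    match role {
        Role::Button => "button",
        Role::StaticText => "static_text",
        Role::Window => "window",
        Role::Group => "group",
        Role::Other(raw) => raw.as_str(),
    }
}
```

Keeping the raw string in Role::Other means a selector can still target a custom widget by its reported role, at the cost of that selector being platform-specific.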
