Hi HN, I'm Smyan and I enjoy building agents. Modern multimodal LLMs are great at vision and perception but quite poor at localization. This creates a massive problem when we take our RPA frameworks and hand them to agents to perform computer use tasks.

For browsers, we have been able to solve this by using the DOM tree to supply the LLM with structural hints. More recently, browser use frameworks have adopted Set-Of-Marks prompting, which takes the structural information of the webpage and converts it into labeled visual bounding boxes. This lets the LLM use its strong vision and perception and convert them accurately into a form of localization. Functionally, the LLM now only needs to say "click 4" instead of "click 443 213".

This methodology, however, fails horribly when we try to apply it to native OS automation. The accessibility tree, which often exists for native apps, is usually quite brittle, exposes non-deterministic selectors, and is often stripped by developers, all of which makes it hard to localize elements. Fuzzy matching can help, but it is nonetheless very hard to get right.

This is exactly why I made SoMatic. SoMatic is a purely vision-based framework that uses a fine-tuned YOLO model (heavily inspired by OmniParser v2) to identify text and interactable elements in a UI. SoMatic draws the bounding boxes and labels, then maps the id of each bounding box to the coordinates of that box's center. This enables Set-Of-Marks prompting for, in principle, ANY user interface.

I ran an ablation benchmark using the framework with GPT-5.5 (high) and got ~20% higher accuracy than with the raw model alone. What surprised me, however, was that the model performed slightly better when it knew only the locations of the bounding boxes (without actually seeing them). This could be due to the threshold tuning for the YOLO model drawing either too many or too few boxes (I'm not entirely sure).

Either way, if you have been wanting to give your AI agents full autonomy over your computer (Windows, Mac and Linux), you can download the CLI with

  npm install -g somatic-cli/cli

and the corresponding skill with

  npx skills add Smyan1909/SoMatic

The CLI also comes with a stdio MCP server, in case you want the model to parse the screenshots (base64-encoded) directly from the chosen API instead of having to read the image after each screenshot.

I'd love to get your feedback on the vision-only approach. Are we at the point where we can finally abandon the mess that is the OS accessibility tree for automation?
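
For anyone curious, the id-to-center mapping described above can be sketched roughly as follows. This is an illustrative sketch only, not SoMatic's actual code: the (x1, y1, x2, y2) pixel box format and the names build_mark_map and resolve_click are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Assumed box format: (x1, y1, x2, y2) in screen pixels, as a detector might return.
Box = Tuple[int, int, int, int]
Point = Tuple[int, int]


def build_mark_map(boxes: List[Box]) -> Dict[int, Point]:
    """Give each detected box a numeric id (starting at 1) and map it to the box center."""
    marks: Dict[int, Point] = {}
    for mark_id, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        marks[mark_id] = ((x1 + x2) // 2, (y1 + y2) // 2)
    return marks


def resolve_click(marks: Dict[int, Point], mark_id: int) -> Point:
    """Translate the model's "click <id>" into pixel coordinates for the click."""
    if mark_id not in marks:
        raise KeyError(f"no bounding box with id {mark_id}")
    return marks[mark_id]


if __name__ == "__main__":
    # Two hypothetical detections; the second one is centered at (443, 213).
    detections = [(10, 10, 50, 30), (400, 200, 486, 226)]
    marks = build_mark_map(detections)
    print(resolve_click(marks, 2))
```

With a table like this, the model only has to name an id, and the lookup supplies the pixel coordinates that it would otherwise have to guess.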