Writing a browser-use agent from scratch - Part 1/3 - The Capture
Paul Dufour
There is no better way to learn how a system works than rebuilding it from the ground up. In this article series I tackle a solved problem (browser-use) but re-engineer the entire stack to run in WASM.
Jun 16, 2026
There are a few steps to creating an AI browser-use agent. It gets extra complicated if you do everything in the browser (without sending images to a server). At first I thought this might not even be possible, which made it all the more fun to come up with a solution.<br>Part of the fun was the novelty: people had built the parts separately before (there were LLM grounding models and there were webpage-capturing libraries), but as far as I could find, no one had brought everything together.<br>The solution I’ve developed is at https://github.com/pdufour/browser-use-wasm and in the next few articles I want to cover the core components that make up a browser-use agent and what I learned along the way. Follow me on LinkedIn (https://www.linkedin.com/in/pauldufour/) to hear about my next post.
The core parts of a browser-use agent
As seen above, the core parts of browser-use are the capture, ground, and act steps. For now, let’s stick to capture.<br>You have three options for the capture step:<br>1. Send the entire DOM tree (all the HTML markup) to a text-based LLM, which responds with the coordinates for the actions you want to take so they can be executed accordingly.
2. Don’t send any HTML to the LLM. Instead, have the LLM query the DOM with CSS selectors (e.g. the user says “click the Order button” and the LLM looks for buttons with “Order” in their text).
3. Send an image of the page to a vision-language-action (VLA) model (the preferred option).
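Before weighing these options, it helps to see how capture fits with the other two steps. Here is a minimal sketch of one agent step, assuming each stage is just an async function you pass in. The names runAgentStep, capture, ground, and act are made up for illustration and are not taken from the browser-use-wasm code.

```javascript
// Hypothetical agent step: capture -> ground -> act.
// The three stages are injected so the loop does not depend on any
// particular screenshot library or model.
async function runAgentStep(instruction, { capture, ground, act }) {
  // 1. Capture: turn the current page into an image the model can read.
  const screenshot = await capture();
  // 2. Ground: ask the model where on the image the instruction applies.
  const target = await ground(screenshot, instruction);
  // 3. Act: perform the action at the returned coordinates.
  return act(target);
}

// Stand-in stages so the sketch runs without a browser or a model.
const stages = {
  capture: async () => ({ width: 1280, height: 720, data: "<pixels>" }),
  ground: async (image) => ({ action: "click", x: image.width / 2, y: image.height / 2 }),
  act: async (target) => `${target.action}@${target.x},${target.y}`,
};

runAgentStep("click the center", stages).then(console.log); // logs "click@640,360"
```

Whichever capture option you pick only changes what the capture stage returns (markup, selectors, or pixels). The rest of the loop stays the same.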
Option 1 is not practical, though, because of the huge size of DOM trees. It also ignores the fact that for an LLM to truly “understand” a page, it would need the styles as well, which adds even more context. The following table compares the DOM and screenshot options for a few sample webpages.
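A back-of-the-envelope estimate points the same way. The sketch below compares the token cost of sending markup as text against sending a screenshot. Both constants are assumptions for illustration: roughly 4 characters per text token is a common rule of thumb, and the 28-pixel patch size is an assumed value that varies by vision model. The 500,000-character page is a made-up example, not one of the sample webpages.

```javascript
// Rough context-size estimates. Both constants are assumptions for
// illustration, not values measured for any specific model.
const CHARS_PER_TEXT_TOKEN = 4;    // common rule of thumb for text tokenizers
const PIXELS_PER_IMAGE_TOKEN = 28; // assumed side length of one image patch

// Estimate the tokens needed to send serialized HTML as text.
function estimateHtmlTokens(html) {
  return Math.ceil(html.length / CHARS_PER_TEXT_TOKEN);
}

// Estimate the tokens needed to send a screenshot of the given size.
function estimateImageTokens(width, height) {
  const cols = Math.ceil(width / PIXELS_PER_IMAGE_TOKEN);
  const rows = Math.ceil(height / PIXELS_PER_IMAGE_TOKEN);
  return cols * rows;
}

// Hypothetical page: 500,000 characters of markup vs. a 1280x720 screenshot.
const htmlTokens = estimateHtmlTokens("x".repeat(500_000));
const imageTokens = estimateImageTokens(1280, 720);
console.log({ htmlTokens, imageTokens }); // { htmlTokens: 125000, imageTokens: 1196 }
```

The point of the sketch is that the image cost depends only on the screenshot resolution, while the text cost grows with the size of the page's markup.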
As you can see, the screenshot option is a lot more practical with regard to context size for a browser-use library. Context size is very limited on WebGPU. Running on an M4 Max gives these:
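The exact numbers depend on the device. If you want to check your own machine, here is a minimal sketch that reads two WebGPU adapter limits. It assumes the buffer-size limits are the relevant constraint, since model weights and the context cache have to fit into GPU buffers. The function name readGpuLimits and the choice of which limits to report are mine.

```javascript
// Read two WebGPU adapter limits that bound how large GPU buffers can be.
// `gpu` defaults to the browser's navigator.gpu. It is a parameter so the
// function can also be exercised with a stand-in object outside a browser.
async function readGpuLimits(gpu = globalThis.navigator?.gpu) {
  if (!gpu) return null;                      // WebGPU is not available
  const adapter = await gpu.requestAdapter(); // resolves to null if no adapter
  if (!adapter) return null;
  const { maxBufferSize, maxStorageBufferBindingSize } = adapter.limits;
  const toMiB = (bytes) => Math.round(bytes / (1024 * 1024));
  return {
    maxBufferMiB: toMiB(maxBufferSize),
    maxStorageBufferBindingMiB: toMiB(maxStorageBufferBindingSize),
  };
}
```

In a browser with WebGPU enabled, calling readGpuLimits().then(console.log) prints the values for the current adapter.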
Constraints are good for end users though - it means their machines won’t grind to a halt because of a client-side LLM running.<br>For option 2 - having the LLM generate DOM query selectors - I didn’t even attempt it. Most likely the performance would be so bad that it would not be worth it. I can think of many edge cases:<br>Iframes - the code required to traverse all iframes and include their content in the context would be a) very difficult and b) likely to run into security restrictions, since scripts cannot read the DOM of cross-origin iframes. A vision model handles this elegantly because it sees what the user sees, iframe or not.
Canvas / WebGL / video - vision-based models can actually “see” this content, so you could ask things like “click the video that has a panda in it”.
“Click the green button” - instructions based on appearance are natural for VLAs.
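To make the last edge case concrete, here is a toy sketch of what a text-matching version of option 2 might do. It is a hypothetical stand-in, not code from any library: the page is a plain array of objects, and findByText is a name I made up.

```javascript
// Hypothetical stand-in for option 2: pick elements by matching the words
// of the instruction against each element's visible text.
function findByText(elements, instruction) {
  const words = instruction.toLowerCase().split(/\W+/).filter(Boolean);
  return elements.filter((el) => {
    const textWords = el.text.toLowerCase().split(/\W+/);
    return words.some((word) => textWords.includes(word));
  });
}

// A toy page. The color is part of the element's styling, not its text,
// so a text matcher never looks at it.
const page = [
  { tag: "button", text: "Order", color: "green" },
  { tag: "button", text: "Cancel", color: "gray" },
];

console.log(findByText(page, "click the Order button").map((el) => el.text)); // [ 'Order' ]
console.log(findByText(page, "click the green button").map((el) => el.text)); // []
```

The second query finds nothing because “green” never appears in any element’s text. You could extend the matcher to read styles, but every visual property you want to support becomes more code and more context, while a vision model gets it from the pixels.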
The capturing implementation
Now that we’ve discussed why we are using a vision-based approach, let’s talk about the actual capturing implementation. By capturing, I mean the browser-use agent taking the page you are on and converting it to a screenshot that a VLA can read.<br>There are a couple of options which we are going to look at:
html2canvas (https://github.com/niklasvh/html2canvas) is probably one of the first libraries to do this and has existed for years. It has not been maintained for some time though, which led to forks being developed, html2canvas-pro (https://github.com/yorickshan/html2canvas-pro) being one of them.<br>However, newer options that follow a different methodology have become available in recent years. One of these libraries is SnapDOM, which quickly became popular, as seen below:
Let’s compare the different options and go into why SnapDOM is preferable for a browser-use task.<br>html2canvas
html2canvas and its fork operate on the same methodology: walk the DOM, gather computed styles, and redraw the result to a canvas using canvas draw commands. Let’s take an example:<br>Live DOM<br>Submit
html2canvas will look at this and execute roughly the following:
// Pseudocode — what the library effectively generates<br>const ctx = canvas.getContext('2d');<br>// background + border-radius (no compositor — you draw the pill yourself)<br>ctx.save();<br>ctx.fillStyle = '#22c55e';<br>roundRect(ctx, 120, 48, 88, 36, 8); // x,y,w,h,r from layout math<br>ctx.fill();<br>ctx.restore();<br>// "Submit" label (font metrics parsed and re-measured)<br>ctx.save();<br>ctx.fillStyle = '#ffffff';<br>ctx.font = '600 14px Inter';<br>ctx.textBaseline = 'middle';<br>ctx.fillText('Submit', 132,...