What does your user agent do?


ua-tracer: what does a user agent actually do? | Modern Web Development with Chrome

The work that led to this is a research project that I will soon publish on aifoc.us, where I am analysing whether the presence of a URL in a prompt influences the output, based on the latent "knowledge" the model has about that URL. While doing this project I needed to test a heap of URLs to see whether their data was in the model, and I hit a heap of problems. The more time I spent in that data, the more I realised I couldn't answer a basic question about a large share of the traffic fetching those URLs: when one of the many agents, indexers, and scrapers loads a page, what does it actually do? Download the HTML and stop? Parse the CSS? Follow the font linked from inside that CSS? Run the JavaScript, or just fetch the .js file and move on?

So I built ua-tracer to answer it, and it's a tool that I think will be useful for many web developers trying to understand what any browser or bot does when it accesses your site.

Click through any row in that list and you land on the trace detail, a request-by-request account of what that one user agent did:

How it works

Every time a user agent loads https://uatracer.com/, the site mints a unique trace id and a per-request secret, and renders a page whose every asset (stylesheet, script, images, font, manifest) carries both in its path. The secret is private to that one page load, so only the agent that received the HTML can fetch the probe assets: no guessing, no replay by anyone else.

    /r/{id}/{secret}/style.css    the real stylesheet
    /r/{id}/{secret}/main.js      the real script
    /r/{id}/{secret}/photo.png    a real PNG
    /r/{id}/{secret}/font.woff2   a real woff2 font

Because the id is unique per page load, every later asset request can be tied back to the exact homepage hit and the user agent that made it. I also added other probes that come from what those assets themselves reference:

| Probe | Referenced from | Hitting it proves… |
| --- | --- | --- |
| /r/{id}/{secret}/css-bg.png | a background-image: inside style.css | the UA parsed the CSS and followed a URL inside it |
| /r/{id}/{secret}/css-font.woff2 | an @font-face { src: } inside style.css | the UA resolved a CSS font source |
| /r/{id}/{secret}/manifest-icon.png | icons[].src inside manifest.json | the UA parsed the manifest and followed an icon |
| /r/{id}/{secret}/js-ran.gif | new Image().src = … in main.js, at runtime | the UA executed classic JS |
| /r/{id}/{secret}/module-ran.gif | a runtime beacon in an ES module | the UA executed an ES module |
| /r/{id}/{secret}/timing | a POST of performance.getEntriesByType('resource') | a real engine ran and produced a client-side waterfall |
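To make the scheme concrete, here is a minimal Python sketch of minting a trace and recognising a probe hit. It is not ua-tracer's actual code: the function names, the token lengths, and the in-memory `known_secrets` dict are all assumptions made for illustration. Only the path shape `/r/{id}/{secret}/{probe}` and the probe names come from the description above.

```python
import hmac
import secrets

# Probe names as listed in the post; everything else here is hypothetical.
PROBES = [
    "style.css", "main.js", "photo.png", "font.woff2",
    "css-bg.png", "css-font.woff2", "manifest-icon.png",
    "js-ran.gif", "module-ran.gif", "timing",
]


def mint_trace():
    """Create a fresh (trace_id, secret) pair for one page load.

    The token lengths are an assumption, not ua-tracer's real sizes.
    """
    return secrets.token_urlsafe(8), secrets.token_urlsafe(16)


def probe_paths(trace_id, secret):
    """Build the trace-scoped path for every probe."""
    return {name: f"/r/{trace_id}/{secret}/{name}" for name in PROBES}


def parse_probe_path(path, known_secrets):
    """Return (trace_id, probe) for a valid probe hit, else None.

    known_secrets maps trace_id -> secret; a real service would keep this
    in a datastore rather than a dict.
    """
    parts = path.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "r":
        return None
    _, trace_id, secret, probe = parts
    expected = known_secrets.get(trace_id)
    if expected is None or probe not in PROBES:
        return None
    # Constant-time comparison, so the check itself leaks nothing about the secret.
    if not hmac.compare_digest(expected.encode(), secret.encode()):
        return None
    return trace_id, probe
```

Because the secret is part of the path, a request that guesses an id but not the secret is simply rejected and never attributed to the trace.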

A plain downloader (think curl) normally just fetches the HTML and stops. A CSS-aware fetcher additionally hits css-bg.png and css-font.woff2. A UA that parses the manifest reaches manifest-icon.png. Social unfurlers (Twitterbot, Discordbot, etc.) fetch the social-card image. Finally, only a user agent that runs JavaScript will ever touch js-ran.gif or post to /timing.
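That ladder of behaviours can be written down as a small classifier. This is a rough Python sketch of the idea, not the tool's real logic: the capability labels and the `classify` function are made up for illustration, and the social-card image is left out because its path isn't given here.

```python
# Probe names come from the post; the labels and grouping are assumptions.
DIRECT_ASSETS = {"style.css", "main.js", "photo.png", "font.woff2"}
CSS_NESTED = {"css-bg.png", "css-font.woff2"}


def classify(hits):
    """Map the set of probe names a trace hit to coarse capability labels."""
    hits = set(hits)
    caps = []
    if hits & DIRECT_ASSETS:
        caps.append("fetches linked assets")
    if hits & CSS_NESTED:
        caps.append("parses CSS")
    if "manifest-icon.png" in hits:
        caps.append("parses manifest")
    if "js-ran.gif" in hits:
        caps.append("runs classic JS")
    if "module-ran.gif" in hits:
        caps.append("runs ES modules")
    if "timing" in hits:
        caps.append("posts resource timing")
    # No probe hit at all: the agent only downloaded the HTML.
    return caps or ["HTML only"]


print(classify([]))                           # a curl-style fetch
print(classify(["style.css", "css-bg.png"]))  # a CSS-aware fetcher
```

The labels are independent rather than strictly ordered, since an agent can, for example, fetch main.js without ever running it.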

What a trace looks like

Open any trace and you get a request-by-request account of how far the agent went: which asset types it fetched, whether it followed the CSS-linked resources, whether it executed JavaScript, and the client-side resource-timing waterfall it posted back if it did.

This particular trace is synthetic: I generated it by pointing a made-up user agent at the site and walking it through the assets by hand:

    curl -A "Mozilla/5.0 (compatible; DemoBot/1.0; +https://example.com)" https://uatracer.com/
    # then fetch the trace-scoped assets it references
    # (substitute the real {id} and {secret} from the returned HTML):
    curl https://uatracer.com/r/{id}/{secret}/style.css
    curl https://uatracer.com/r/{id}/{secret}/js-ran.gif # the JS-execution beacon

The trace detail is intentionally public: share /trace/{id} as a link and anyone can read the result, which makes it easy to pass a finding along ("look, this agent doesn't run JS").

Some early analysis on what the bots actually do

This site hasn't been live that long so it's not got heaps of data yet (hence this post to try and raise some awareness), but I've got some insights that I think are interesting.

One caveat first: a User-Agent string is trivially spoofable, so the tool also checks each request's source IP against the CIDR ranges that the claimed operator publishes. Most major crawler operators now publish a list you can match against: Google, OpenAI's GPTBot and OAI-SearchBot, Bing, and Anthropic's ClaudeBot. So a "Googlebot" or "ClaudeBot" trace is either verified (the IP is in the operator's published range) or flagged as likely spoofed. The one exception in the data so far is OpenAI's ChatGPT-User, which is triggered by a person rather than run on a schedule and publishes no fixed ranges, so its User-Agent alone can't be confirmed.
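The check itself is simple to sketch with Python's standard `ipaddress` module. The bot name and ranges below are placeholders taken from the documentation-only address blocks, not any operator's real list, and `verify` is a hypothetical helper rather than the tool's actual function.

```python
import ipaddress

# Placeholder data: documentation-only ranges (RFC 5737 / RFC 3849),
# NOT real crawler ranges. A real check would load each operator's published list.
PUBLISHED_RANGES = {
    "ExampleBot": ["192.0.2.0/24", "2001:db8::/32"],
}


def verify(claimed_bot, source_ip, ranges=PUBLISHED_RANGES):
    """Return 'verified', 'likely spoofed', or 'unverifiable'."""
    cidrs = ranges.get(claimed_bot)
    if not cidrs:
        # The operator publishes no ranges, so the User-Agent can't be confirmed.
        return "unverifiable"
    ip = ipaddress.ip_address(source_ip)
    for cidr in cidrs:
        net = ipaddress.ip_network(cidr)
        # Only compare addresses and networks of the same IP version.
        if ip.version == net.version and ip in net:
            return "verified"
    return "likely spoofed"


print(verify("ExampleBot", "192.0.2.10"))    # inside the placeholder range
print(verify("ExampleBot", "198.51.100.7"))  # outside it
```

The third state matters: an agent whose operator publishes no ranges is neither verified nor spoofed, just unconfirmed.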

Bots from the same company do not behave alike. OpenAI seems to run at least three agents against the site (all IP-verifiable), and each does something...
