Every AI Visibility Tool Is Lying to You

I'm an experienced software engineer, and I've spent enough time building and debugging measurement systems to know when a dashboard is asking you to trust a number it cannot support. A new software category now promises to tell brands how visible they are inside ChatGPT, Claude, Gemini, Perplexity, and Google's AI answers. Then it turns that messy system into tidy claims like mention rate, citation rate, share of voice, or rank.

When a tool says you are number four in your category, moved up two spots this week, or sit at 17% visibility while a competitor sits at 31%, I do not think the signal is worthless; I think the precision is made up. These systems are noisy, personalized, geographic, nondeterministic, and constantly changing, so a clean leaderboard number hides the thing an engineer would actually want to inspect: the distribution, the methodology, the variance, and the raw evidence.

Most vendors are trying to measure something important, but the mechanism is usually weaker than the dashboard admits. If a tool claims to show "what customers see" in ChatGPT or Claude, it is probably scraping the consumer app or calling an API. A scrape captures one synthetic session, and an API call uses a different surface than your customer uses. Both can produce useful directional signal, especially when they reveal invisibility on commercial prompts or gaps in a geography, but neither should be sold as a precise, stable truth without showing its work.

The frontend scrape problem

Scraping the ChatGPT or Claude frontend sounds persuasive at first. The vendor can say, truthfully, that it opened the app, asked the question, and recorded what the product returned.

This is closer to the surface a real user sees. It still measures one controlled surface.

A scrape comes from one account, or a controlled account pool. That means one history state, one memory state, one subscription tier, one geography, one browser session, and one prompt. Change any of those and the answer can change. A real buyer asking "best CRM for a seed-stage startup" and a clean browser asking "best CRM software" from a datacenter IP are different instruments.

Mass scraping adds more bias. At any meaningful volume, the work has to run from somewhere: cloud machines, proxy routes, managed browsers, headless sessions, or another automation layer. That automation layer can bleed into the measurement. Concentrated IP patterns. Repeated logins. Odd session rhythms. Rate-limit pressure. Possible anti-abuse handling from the AI product itself.

The operator has to choose. Clean accounts are repeatable and unlike customers. Aged accounts have history and weaker controls. A benchmark account that asks thousands of category prompts also creates its own personalization trail. After a while, the account's whole life is benchmark traffic.

This matters most for local and commercial prompts. "Best commercial roofing company near me" changes by place. "Best AEO agency in NYC" changes by place. The answer depends on the user's location, the retrieval system, the account, and the sources pulled at that moment.

A single frontend answer is one lab sample.
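One way to keep that lab-sample framing honest is to store every captured answer together with the conditions that produced it. The sketch below is a minimal illustration, not any vendor's schema: the `SampleContext` fields, the `group_by_context` helper, and the brand names are all hypothetical, chosen to mirror the variables listed above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleContext:
    """Conditions under which one answer was captured (illustrative field names)."""
    surface: str            # e.g. "consumer_app" or "api"
    account_kind: str       # e.g. "clean", "aged", "benchmark"
    subscription_tier: str  # e.g. "free", "paid"
    geography: str          # where the request appeared to come from
    network: str            # e.g. "residential", "datacenter"
    prompt: str             # the exact wording that was sent


def group_by_context(samples):
    """Group captured brand lists by context so unlike instruments are never pooled."""
    groups = defaultdict(list)
    for context, brands in samples:
        groups[context].append(brands)
    return dict(groups)


# Hypothetical captures: same category, two different instruments.
clean_dc = SampleContext("consumer_app", "clean", "free", "us-east",
                         "datacenter", "best CRM software")
buyer_like = SampleContext("consumer_app", "aged", "paid", "nyc",
                           "residential", "best CRM for a seed-stage startup")

samples = [
    (clean_dc, ["Acme", "Globex"]),
    (clean_dc, ["Globex", "Initech"]),
    (buyer_like, ["Initech", "Acme"]),
]
groups = group_by_context(samples)
for context, answers in groups.items():
    print(context.account_kind, context.network, len(answers), "answers")
```

Because the context is part of the key, a number computed from the clean datacenter sessions cannot quietly be reported as what a buyer in another city sees.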

The same prompt changes across runs

The simplest defense of an AI visibility rank is this: we ask the same question every week and count whether you show up.

That defense only works if the same question has a stable answer. In practice, the same words often produce different answers.

Even temperature-zero LLM calls are not perfectly stable in production. Thinking Machines Lab explained one technical reason: inference kernels are often not batch-invariant, so the numerical result for one request can depend on which other requests share its batch, and batch composition shifts with production load. Their example showed identical temperature-zero requests producing multiple unique completions.
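A quick way to check this on your own prompts is to send the same request many times and count the distinct answers. The sketch below assumes the responses have already been collected as strings; the `completion_spread` helper and the sample texts are hypothetical, and it makes no provider call.

```python
from collections import Counter


def completion_spread(completions):
    """Summarize how many distinct answers a set of identical requests produced."""
    if not completions:
        raise ValueError("need at least one completion")
    # Collapse whitespace only, so cosmetic spacing is not counted as a new answer.
    normalized = [" ".join(text.split()) for text in completions]
    counts = Counter(normalized)
    _, modal_count = counts.most_common(1)[0]
    return {
        "runs": len(normalized),
        "unique": len(counts),
        "modal_share": modal_count / len(normalized),
    }


# Hypothetical responses to one prompt sent five times at temperature 0.
samples = [
    "Top picks: Acme, Globex, Initech.",
    "Top picks: Acme, Globex, Initech.",
    "Top picks: Acme, Initech, Umbrella.",
    "Top picks:  Acme, Globex, Initech.",
    "Top picks: Globex, Acme, Initech.",
]
summary = completion_spread(samples)
print(summary)
```

If `unique` is greater than one, a single weekly run is a draw from a set of possible answers, and `modal_share` shows how dominant the most common answer is.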

SparkToro and Gumshoe saw the marketing version of the same problem. They had volunteers run repeated commercial prompts through ChatGPT, Claude, and Google's AI products. Their research found that brand recommendations changed a lot across repeated runs.

This is the core measurement problem. If the next draw from the same system can name a different set of brands, then "you rank number four" becomes one sample from a distribution.

An honest dashboard would show the distribution.
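As a rough sketch of what showing the distribution could look like, the code below takes repeated runs of one prompt and reports, per brand, the mention rate, an approximate 95% Wilson score interval, and a histogram of the positions the brand held. The function names, the brand names, and the six runs are hypothetical, and treating runs as independent draws is an assumption that real sessions may not satisfy.

```python
import math
from collections import Counter, defaultdict


def wilson_interval(hits, n, z=1.96):
    """Approximate 95% Wilson score interval for the proportion hits / n."""
    if n == 0:
        raise ValueError("need at least one run")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def brand_distribution(runs):
    """Per brand: mention rate, its interval, and how often each position occurred."""
    n = len(runs)
    positions = defaultdict(Counter)
    for ranked in runs:
        seen = set()
        for pos, brand in enumerate(ranked, start=1):
            if brand in seen:  # count a brand once per run, at its first position
                continue
            seen.add(brand)
            positions[brand][pos] += 1
    report = {}
    for brand, hist in positions.items():
        hits = sum(hist.values())
        report[brand] = {
            "mention_rate": hits / n,
            "interval": wilson_interval(hits, n),
            "positions": dict(hist),
        }
    return report


# Hypothetical ordered brand lists from six runs of the same prompt.
runs = [
    ["Acme", "Globex", "Initech"],
    ["Globex", "Acme"],
    ["Acme", "Umbrella", "Globex"],
    ["Initech", "Acme", "Globex"],
    ["Globex", "Initech"],
    ["Acme", "Globex", "Umbrella"],
]
report = brand_distribution(runs)
for brand, stats in sorted(report.items()):
    low, high = stats["interval"]
    print(f"{brand}: rate={stats['mention_rate']:.2f} "
          f"interval=({low:.2f}, {high:.2f}) positions={stats['positions']}")
```

With only a handful of runs the intervals are wide, which is the point: a gap between two brands' percentages means little until the sample is large enough for their intervals to separate.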

Consumer apps and APIs behave differently

Some tools skip browser scraping and call provider APIs instead. The operational case is strong. API calls are easier to repeat, easier to audit, cheaper to run at scale, and less likely to break when a web app changes.

The tradeoff: the API and the consumer app behave differently.

The consumer product may have memory, account personalization, model routing, web retrieval, location inference, shopping modules, local modules, citations, and product-specific presentation. The API gives you a programmable model call with only the tools and parameters you enable. In OpenAI's API, for example, web search is a tool you have to turn on explicitly when you want grounded retrieval. Google's Gemini API has its own grounding and search configuration.

The gap...
