CAPTCHAs can still detect AI agents

timshell1 pts0 comments

CAPTCHAs can still detect AI agents | Roundtable Research

AI systems now match or exceed humans on many tasks, but they get there through measurably different cognitive processes. This gap can be exploited to detect AI agents and online bots.

This is a ~1000 word overview of our recent machine learning conference paper submission. To read the full preprint, click here.

"CAPTCHAs are broken these days." AI can easily identify all the traffic lights in a static grid. So CAPTCHAs don't provide a valuable human signal, right?

Yes and no.

Yes, because vision language models (VLMs) can recognize images like chimneys, fire hydrants, and traffic lights. Deep learning "solved" CAPTCHA-style image classification in the early 2010s.

No, because AI does not complete CAPTCHAs the way humans do. Looking across data from humans and AI completing CAPTCHAs, differences emerge in features like error patterns. Our recent paper found statistically significant differences in sequential click patterns, direction changes, and overselection behavior - features that describe how a participant, agent or human, solves the CAPTCHA. In other words, AI agents can solve CAPTCHAs, but they don't solve them like humans.

Figure 1: Humans and Claude/GPT/Gemini reach similar performance levels on the classic CAPTCHA, but there are statistically significant process differences across features like sequential score, direction change, and overselection.
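To make process features like these concrete, here is a minimal Python sketch that computes three of them from the order in which tiles were clicked on a grid CAPTCHA. The definitions below are illustrative assumptions, not the paper's exact formulas, and `process_features`, `Tile`, and the example click sequences are hypothetical names and made-up data.

```python
from typing import Dict, List, Set, Tuple

Tile = Tuple[int, int]  # (row, col) of a tile on the CAPTCHA grid


def _sign(x: int) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def process_features(clicks: List[Tile], targets: Set[Tile], n_cols: int) -> Dict[str, float]:
    """Summarise HOW a grid CAPTCHA was solved (assumed, illustrative definitions)."""
    # Row-major (reading-order) index of each clicked tile, in click order.
    order = [r * n_cols + c for r, c in clicks]
    steps = list(zip(order, order[1:]))
    # Sequential score: share of consecutive clicks that move forward in reading order.
    sequential = sum(b > a for a, b in steps) / len(steps) if steps else 0.0
    # Coarse direction of each move: (sign of row change, sign of column change).
    moves = [(_sign(r2 - r1), _sign(c2 - c1))
             for (r1, c1), (r2, c2) in zip(clicks, clicks[1:])]
    # Direction changes: how often a move's direction differs from the previous move's.
    direction_changes = sum(m1 != m2 for m1, m2 in zip(moves, moves[1:]))
    # Overselection: distinct clicked tiles that are not target tiles.
    overselection = len(set(clicks) - targets)
    return {
        "sequential_score": sequential,
        "direction_changes": float(direction_changes),
        "overselection": float(overselection),
    }


# Made-up example: two solvers that both cover every target tile.
targets = {(0, 0), (0, 2), (1, 1), (2, 0)}
scan = process_features([(0, 0), (0, 2), (1, 1), (2, 0)], targets, n_cols=3)
wander = process_features([(1, 1), (0, 0), (2, 0), (0, 2), (2, 2)], targets, n_cols=3)
print(scan)
print(wander)
```

Both click sequences select all four targets, so an output-only check treats them as nearly the same, while the process features separate a strict reading-order scan from a wandering path.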

The Turing Test - originally proposed in 1950 by Alan Turing - offers a simple criterion for machine intelligence. If a judge cannot reliably distinguish a machine's responses from a human's, the machine can be considered intelligent.

Turing understood that this behavioral criterion was a concession, not the end-all-be-all of human vs. machine intelligence. He had to concede that the underlying question, "Can machines think?", is too difficult, abstract, and loaded. Behavioral indistinguishability provided a more tractable condition, and one that seemed like a good North Star in the 1950s.

Following in Turing's footsteps in defining an adversarially robust discriminator that can separate humans from bots, we designed CogCAPTCHA30. It goes one level deeper than the Turing Test, moving from output (what humans and agents can do) to process (how they do it). CogCAPTCHA30 combines the original CAPTCHA with 29 classic cognitive psychology tasks for a 30-task battery.

Figure 2: CogCAPTCHA30 measures human and agent process behavior across decision-making, memory, perception, and reasoning.

We recruited human participants and deployed AI agents to perform these tasks. The CAPTCHA experiment demonstrated that humans and agents can reach similar performance (output) levels through different processes. We then measured output equivalence (how similar their answers were) and process equivalence (how similarly they arrived at their answers) across the whole 30-task paradigm and found that the two were uncorrelated:

Figure 3: We measured how similar humans and agents are across output (Cohen's d) and process (AUC). Across the task set, these measures are uncorrelated, suggesting output equivalence does not equal process equivalence.
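The two measures named in Figure 3 can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the paper's pipeline: it assumes Cohen's d with a pooled standard deviation, and it computes AUC for a single feature as the probability that a value from one group exceeds a value from the other, whereas the paper's AUC may come from a classifier over many process features. The function names and the example numbers are made up.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Standardised mean difference between two groups (pooled SD)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def auc(a: Sequence[float], b: Sequence[float]) -> float:
    """P(value from a > value from b), counting ties as half.

    0.5 means the feature cannot tell the groups apart; 0 or 1 means
    it separates them perfectly.
    """
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    return wins / (len(a) * len(b))


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation, e.g. between per-task |d| and per-task AUC."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up numbers for one task: accuracy (output) and one process feature.
human_accuracy, agent_accuracy = [0.90, 0.85, 0.95], [0.92, 0.88, 0.90]
human_feature, agent_feature = [0.40, 0.55, 0.50], [0.95, 1.00, 0.90]
print(cohens_d(human_accuracy, agent_accuracy))  # output gap for this task
print(auc(agent_feature, human_feature))         # process separability for this task
```

Repeating this per task gives one output score and one process score for each of the 30 tasks; passing those two lists to `pearson_r` is one way to check whether the measures move together.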

While the classic Turing test measures whether a machine produces output indistinguishable from a human, we propose a Process Turing Test measuring whether machines produce a process indistinguishable from humans.

Our results raise two questions: what types of language models - if any - are like humans, and how adversarially robust is this discrimination process?

To answer the first question, we compared the distance between humans and state-of-the-art frontier models (OpenAI's GPT, Anthropic's Claude, Google DeepMind's Gemini) as well as Qwen (an open-source 1.5B-parameter foundation model) and Centaur (an open-source 70B-parameter foundation model of human cognition).

Figure 4: The process features of state-of-the-art frontier models (Claude, GPT, Gemini) are less similar to humans' than those of smaller models (Qwen, Centaur).

We found that the process features of state-of-the-art frontier models (Claude, GPT, Gemini) are less humanlike than those of smaller models (Qwen, Centaur). As we argued in "AI Capability isn't Humanness", while frontier models are becoming more powerful over time, they are not necessarily becoming more human. Contemporary progress in artificial intelligence is independent of progress in human simulation.

Qwen, a smaller open-source model, is more humanlike than the larger Claude, GPT, and Gemini. And, as a nice validation, Centaur comes closest of all the models to humans in process feature space. We hypothesize this is due to its large-scale fine-tuning on human outputs, specifically 10M+ human choices across 160 cognitive experiments.
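One simple way to picture "distance from humans in process feature space" is sketched below. The metric is an assumption chosen for illustration, since this overview does not specify the one used in the paper: each feature is standardised by the human mean and standard deviation, and a model's distance is the Euclidean norm of its mean feature vector in those units. `human_distance` and all numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def human_distance(humans: Sequence[Sequence[float]],
                   model: Sequence[Sequence[float]]) -> float:
    """Distance of a model's mean process features from the human centroid.

    Rows are participants (or agent runs), columns are process features.
    Needs at least two human rows and non-constant human features,
    because each feature is scaled by the human standard deviation.
    """
    n_features = len(humans[0])
    total = 0.0
    for j in range(n_features):
        human_col = [row[j] for row in humans]
        model_col = [row[j] for row in model]
        # Gap between model and human means, in human standard deviations.
        z = (mean(model_col) - mean(human_col)) / stdev(human_col)
        total += z * z
    return sqrt(total)


# Made-up feature rows: [sequential_score, direction_changes]
humans = [[0.4, 3.0], [0.6, 5.0], [0.5, 4.0]]
model_x = [[0.55, 4.5], [0.65, 5.5]]
model_y = [[1.0, 0.0], [1.0, 1.0]]
print(human_distance(humans, model_x))
print(human_distance(humans, model_y))
```

A smaller value means the model's average process profile sits closer to the human average; ranking models by this number is the kind of comparison Figure 4 summarises.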

This introduces the second question: how adversarially robust is the process to discriminate humans from agents? Any behavioral feature used to distinguish the two may itself become a...
