Text AI watermarks will always be trivial to remove

The European Union AI Act will begin to be enforceable in August 2026, one month from now[1]. One of the biggest new requirements is Article 50, which requires all AI outputs to be “detectable as artificially generated”. In other words, if LLM providers want to do business in the EU, they will have to apply a watermark to their outputs[2]: some hidden signature that can be used to identify AI content.

LLM text watermarking is a fascinating problem. Like the best engineering problems, it is theoretically hard to solve perfectly, but has multiple partial solutions: for instance, Google’s SynthID, and (as I’ll argue) some quiet Unicode trickery from OpenAI and Anthropic. It will be interesting to see how the AI labs navigate these tradeoffs before the end of the year.

Why text watermarking is hard

I wrote about AI watermarking at the end of last year in “AI detection tools cannot prove that text is AI-generated”. It’s easy to watermark an image, because digital images contain lots of noise that the human eye can’t really see. For instance, you could apply a watermark like “these twenty pixels in these exact spots will always share a color”. Text is much, much harder. Unlike images, text is a very compressed medium: you cannot make any change to a sentence that a human wouldn’t notice (with one exception, which we’ll get to later). So how are you supposed to watermark it?

It’s basically a text steganography problem (concealing a secret code), made more difficult because the plaintext cannot be arbitrarily manipulated. Any changes you make to apply the watermark will compromise the quality of the output. For instance, “every fifth letter is an ‘e’” would be a good watermark, but applied naively would make the AI output full of typos. Could you just let the model figure out how to fit the watermark? Strong AI models are smart enough to juggle this kind of constraint[3], but it’d still consume reasoning time that would be better spent on the user’s problem, and make the model sound much less capable than it is[4].

Do we need watermarks to detect AI content?

Do you really need a watermark? If you’re Anthropic, and you’re required to be able to verify whether your models produced a particular block of text, can’t you simply run the text through each model, measuring as you go how closely the model’s predicted tokens match each token from the text?

Not really. The space of “all possible Claude Sonnet answers to a question” is way larger than the space of “all possible watermarked answers to a question”. In other words, you’d get too many false positives for human text that reads like it was AI-written. It’s way more likely for a human to accidentally write like Claude than it is for a human to accidentally reproduce a watermark.
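To make the “run the model” idea concrete, here is a minimal sketch. The bigram table, its probabilities, and the `FLOOR` value are all made up for illustration; a real check would need a full forward pass of each candidate model over the text.

```python
import math

# Hypothetical bigram "model": P(next token | previous token).
# These numbers are invented for the example, not taken from any real LLM.
TOY_MODEL = {
    "<s>": {"great": 0.7, "nice": 0.2, "ugh": 0.1},
    "great": {"question": 0.8, "idea": 0.2},
    "nice": {"question": 0.5, "idea": 0.5},
    "ugh": {"question": 0.5, "idea": 0.5},
}
FLOOR = 1e-6  # probability assumed for tokens the toy model never predicts


def mean_log_prob(tokens):
    """Average log-probability the toy model assigns to each token in turn."""
    prev, total = "<s>", 0.0
    for tok in tokens:
        total += math.log(TOY_MODEL.get(prev, {}).get(tok, FLOOR))
        prev = tok
    return total / len(tokens)


model_like = mean_log_prob(["great", "question"])  # phrasing the model favours
unusual = mean_log_prob(["ugh", "idea"])           # phrasing the model rarely picks
```

The score only says how model-like the text reads. A human who happens to type “great question” gets exactly the same high score as the model would, which is the false-positive problem described above.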

It would also be prohibitively expensive to run every Anthropic model against a piece of text in order to check whether one of them wrote it. The EU AI Act will eventually require labs like Anthropic to offer free watermarking services to every EU citizen (see Commitment 2). You couldn’t do that with the “run the model” approach.

How SynthID works

As far as I know, the only AI provider to say they watermark text output is Google, who use a tool called SynthID. Here’s how it works.

When an LLM generates text, it’s generating a series of tokens (words or chunks of words). At each step, the model itself doesn’t output a single token, but instead outputs a full list of all (say) 100,000 tokens in its vocabulary, each annotated with the probability that that token will be the next one. Tools like ChatGPT or Claude Code will pick semi-randomly from the most likely options in order to get their outputs. This semi-random sampling process can be influenced in a detectable way.
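Here is a rough sketch of that sampling step, assuming a tiny four-token vocabulary. The distribution and the `sample_top_k` helper are hypothetical; real systems layer temperature and other tricks on top, but the shape is the same.

```python
import random


def sample_top_k(probs, k, rng):
    """Pick one token from the k most likely, weighted by probability."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [tok for tok, _ in top]
    weights = [p for _, p in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution after "The cat sat on the"
next_token_probs = {"mat": 0.55, "sofa": 0.25, "floor": 0.15, "moon": 0.05}

rng = random.Random(0)  # seeded so the run is repeatable
picks = [sample_top_k(next_token_probs, k=3, rng=rng) for _ in range(1000)]
```

With `k=3`, “moon” is never chosen and “mat” is chosen most often. The choice among the remaining candidates is the slack a watermark can exploit.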

For instance, we could choose a sampling strategy like “we pick the second most likely token, then the first, then the second, then the first, and so on”. That would still produce high-quality output, but you’d be able to re-run the model against the generated text to verify that the pattern holds. However, that’d make verification really expensive, and any slight tweaks to the output would break the pattern and thus break the fingerprint. Is there a better way?
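A toy version of that alternating strategy shows both problems. The `ranked_candidates` function below is a made-up stand-in for a model (it ranks a ten-word vocabulary by hashing the recent context), so only the pattern logic is meaningful.

```python
import hashlib

VOCAB = ["the", "a", "cat", "dog", "sat", "ran", "on", "under", "mat", "rug"]


def ranked_candidates(prefix):
    """Toy stand-in for a model: rank VOCAB deterministically from the prefix."""
    context = " ".join(prefix[-2:])

    def key(tok):
        return hashlib.sha256(f"{context}|{tok}".encode()).hexdigest()

    return sorted(VOCAB, key=key)


def expected_rank(i):
    # Second most likely, then first, then second, and so on.
    return 1 if i % 2 == 0 else 0


def generate(n_tokens):
    out = []
    for i in range(n_tokens):
        out.append(ranked_candidates(out)[expected_rank(i)])
    return out


def matches_pattern(tokens):
    """Verification needs one 'model call' per token, which is the expensive part."""
    return all(
        ranked_candidates(tokens[:i])[expected_rank(i)] == tok
        for i, tok in enumerate(tokens)
    )


text = generate(12)
edited = text.copy()
edited[5] = next(t for t in VOCAB if t != text[5])  # change a single word
```

`matches_pattern(text)` holds, but `matches_pattern(edited)` fails after a one-word change, and checking either requires re-running the model at every position.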

Yes. SynthID is a process for assigning each token a “score” based on its previous tokens (for instance, sum the token’s ID with the IDs of its previous three tokens then take mod 5)[5]. To apply the watermark, the model adopts a sampling strategy like “out of the top five most likely tokens, pick the one with the top SynthID score”[6]. The watermark can then be detected by calculating the aggregate SynthID score of a block of text. If it’s suspiciously high, it’s very likely to have been AI-generated.
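The simplified scheme above can be sketched in a few lines. To be clear, this follows the toy description in this post (sum of IDs mod 5, top five candidates), not Google’s actual implementation, and `top_five` is a hypothetical stand-in that just draws five random token IDs instead of calling a model.

```python
import random

VOCAB_SIZE = 1000
MOD = 5


def score(token_id, previous_ids):
    """Toy score: token ID plus its previous three token IDs, mod 5."""
    return (token_id + sum(previous_ids[-3:])) % MOD


def top_five(rng):
    """Stand-in for a model: five candidate token IDs, most likely first."""
    return rng.sample(range(VOCAB_SIZE), 5)


def generate(n_tokens, watermark, seed):
    rng = random.Random(seed)
    out = []
    for _ in range(n_tokens):
        candidates = top_five(rng)
        if watermark:
            # Among the top five, prefer the highest-scoring token.
            choice = max(candidates, key=lambda t: score(t, out))
        else:
            # No watermark: just take the most likely candidate.
            choice = candidates[0]
        out.append(choice)
    return out


def mean_score(token_ids):
    """Detection: average score over the text, using only the scoring rule."""
    scores = [score(t, token_ids[:i]) for i, t in enumerate(token_ids)]
    return sum(scores) / len(scores)


plain = generate(400, watermark=False, seed=1)
marked = generate(400, watermark=True, seed=1)
```

Unwatermarked text should average around 2 (scores are roughly uniform over 0 to 4), while the watermarked text averages well above 3. Note that `mean_score` never calls the model, which is what makes detection cheap compared with the alternating-rank approach.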

This is basically a version of the common advice that you can identify LLMs by use of the em-dash, except that instead of a list of keywords, it relies on subtle mathematical relationships between words that humans can’t identify. Because the...
