The Pirated Corpus Was Always a Balance-Sheet Item | Abhishek Shankar's Blog
×<br>Posts<br>The Map<br>About<br>Subscribe
The number people are reacting to is $1.5 billion. The number is the wrong thing to react to.<br>Anthropic mirrored a set of shadow libraries — LibGen, Books3, Pirate Library Mirror — and used the books in them to train Claude. The model exists. The settlement does not delete the model. It deletes the inputs and pays the bill. The bill is roughly $3,000 per qualifying book across about 500,000 titles, and the question worth asking is not whether Anthropic should have paid it. The question is what the payment actually closes — and what, by deliberate omission, it leaves open.<br>What it closes is an arbitrage. What it leaves open is the asset the arbitrage produced.<br>The frame here is not "Anthropic got caught." Plenty of labs scraped plenty of dubious corpora; getting caught is, structurally, a function of timing, plaintiffs' counsel, and discovery. The frame is that this bill — payable in 2026, against a 2022 decision — establishes a settlement-cap on shadow-library training that wasn't previously legible. Until this case closed, the cost was theoretical. Now it is priced. And the price, against the asset that was built with the input, is small.<br>That is the part that gets missed. The settlement is being read as a deterrent. It is much closer to a tariff.<br>The shortcut had a measurable size<br>To see why the trade worked, look at what the alternative cost.<br>Reddit's data licensing deal with Google is approximately $60 million per year, with the OpenAI deal estimated at around $70 million annually. News Corp's deal with OpenAI is reported at over $250 million across a five-year window. Springer Nature licensed its previously published academic papers to Google in a one-time $23 million deal; Wiley closed a similar agreement at $23 million; Taylor & Francis took $10 million upfront with recurring payments running through 2027. The whole licensed training-data market was valued at roughly $3.4 billion in 2025 — and Anthropic's single settlement, at $1.5 billion, represents 44% of that total.<br>These are 2024–2025 prices. They are also 2024–2025 corpora — high-velocity user-generated text, real-time API access, structured academic archives. None of them are five hundred thousand books, end to end.<br>What would licensing five hundred thousand books have cost in 2022? There was no market price. There was no clearinghouse, no template deal, no protocol like the Really Simple Licensing scheme modeled on ASCAP and BMI that emerged in late 2025. There was a publisher landscape that did not understand what AI training was, a long tail of out-of-print authors with no way to negotiate, and a transaction-cost wall that — even assuming infinite goodwill on both sides — would have taken years to clear. The implicit cost of the "do it properly" path in 2022 was not the per-book price. It was the timeline.<br>That is the arbitrage Anthropic captured. Not "free books." Time. The shadow-library shortcut converted a multi-year licensing program into a single afternoon's wget. The model that came out the other side was Claude — shipped, monetizing, embedded in deployments, defining the company's market position.<br>Claude generated meaningful revenue across 2023, 2024, and 2025. Anthropic's enterprise pipeline, its Claude Code product, its position in the foundation-model market — all of it sits on a model trained, in part, on the corpus the settlement is now compensating. Discount that future cash flow back to 2022 and the math is roughly this: the company traded an unbounded liability for the option to build the asset early, and three years later paid 44% of an annual licensing-market tally to make the liability bounded.<br>You could call that a mistake. You could also call it the trade.<br>The deeper point is what books, specifically, gave the model. Long-form coherence. Narrative structure across tens of thousands of tokens. Citation density. The implicit grammar of how educated argument is built — paragraph to paragraph, chapter to chapter, claim to evidence. Web text trains a model to sound right. Books train a model to think across a horizon. The two are not substitutable, and there is a reason every frontier lab pursued book corpora in some form between 2020 and 2023, before the licensing market was a real market. The pirated corpus was not an opportunistic add-on. It was load-bearing for the capability that distinguished early Claude from the field. That is a hard thing to say in court and a clarifying thing to say in an editorial.<br>The downstream effect on competitor behavior is visible. The WebText2 dataset — Reddit posts filtered for high engagement — has been documented as receiving roughly 5x weighting in GPT training, an explicit acknowledgement that not all training tokens are worth the same. Books, in the same logic, are not interchangeable with crawled web text. They are a higher-density input. Paying $3,000 per book after...