US publishers tell Common Crawl to stop scraping and delete archive - Press Gazette
Digital news publishers in the US have raised “significant legal concerns” over the scraping of their content by the Common Crawl Foundation.
Trade body Digital Content Next (DCN), which represents many major US publishers, has sent a cease and desist letter via its lawyer to the web archive creator.
They called on Common Crawl to immediately stop “scraping, retaining, or sharing copyrighted, paywalled, subscriber-only, or otherwise protected content from DCN member companies in its datasets”.
They also requested that publisher content already in the Common Crawl datasets is removed.
Since 2008 Common Crawl has scraped billions of pages on the internet each month to create a free public archive that is often cited in academic research.
The database has been widely used to train major AI models, proving controversial because it gave them access to swathes of publisher articles including, allegedly, paywalled content.
Its CCBot is now one of the AI scrapers most often blocked by news websites, many of which do not see the value exchange in allowing their content to be crawled.
Common Crawl accused of potentially ‘inaccurate or misleading’ statements to publishers
Common Crawl publishes a registry of all the website owners that have asked to opt out of being scraped, including major news publishers such as the BBC, The Guardian, the Financial Times, The Washington Post, News Corp, DMG Media, Advance Publications, Associated Press, Le Monde, Reuters and Hearst Newspapers. More than 900 news websites are included under an entry submitted by US trade association News/Media Alliance.
The DCN legal letter, seen by Press Gazette, shared concerns about whether Common Crawl is complying with opt-out instructions and whether it is removing content that had previously been scraped when instructed to do so.
“For example, DCN understands that Common Crawl has in some instances confirmed that it was complying with such instructions only to claim later, after significant delays, that the costs needed to address technical challenges prevented it from doing so,” the letter said.
DCN’s lawyers are looking at whether statements made by Common Crawl such as these “may have been inaccurate or misleading, thus potentially constituting legally actionable fraudulent or negligent misrepresentations”.
The copyright lawsuit filed by The New York Times against ChatGPT creator OpenAI at the end of 2023 cited Common Crawl as 60% of the training mix for the GPT-3 model. Common Crawl has since agreed to remove NYT content from its archives, and has confirmed a separate request from publishers represented by the Danish Rights Alliance. But The Atlantic reported in November that content from both was still available.
Common Crawl executive director Rich Skrenta denied “lying to publishers” following The Atlantic’s reporting, saying: “When a publisher asks us to remove previously crawled material, we respond promptly and initiate a removal process that reflects the technical design of our dataset.”
He added: “No one at Common Crawl has ever claimed this work was instantaneous or complete; rather, we have been open about its complexity and ongoing nature.”
Skrenta also denied that CCBot goes “behind paywalls” to scrape websites.
He declined to comment specifically in response to the DCN legal letter.
Common Crawl ‘flagrantly infringed’ copyrighted publisher content
The DCN letter claimed Common Crawl has “flagrantly infringed” copyrighted content by creating and distributing its datasets and by sharing them with AI companies knowing that they “are actively engaged in the reproduction of that protected content”.
The letter also argued that “copyright law is not an opt-out regime” so the system was working the wrong way round.
It said: “Common Crawl has undermined copyright owners’ right to control the use of their content by creating and distributing datasets that DCN understands to contain substantial volumes of original, protected content created by DCN members at significant cost.
“Such conduct would be legally problematic in and of itself. But Common Crawl has exacerbated this misappropriation by actively marketing its datasets ‘for free’ to for-profit entities for commercial purposes, such as developing AI tools or training AI large language models.
“In other words, Common Crawl is not only creating datasets containing digital content creators’ and owners’ original, protected content without permission or compensation, but is knowingly using its datasets to help for-profit AI companies develop competing or substitutive products and services.”
DCN chief executive Jason Kint said in a blog post that the legal notice “challenges a growing assumption that content created through substantial investment can be collected, stored, repurposed, and monetised simply because it is technically...