Lecturing Common Crawl: Publishers Tell Nonprofit To Stop Unauthorized Scraping 05/05/2026
Commentary
by Ray Schultz, Columnist, May 4, 2026
Common Crawl, the historical web archive, is facing pressure from publishers to stop its alleged scraping and storage of content without permission.

The News/Media Alliance (NMA) sent a letter to the nonprofit last week, urging it to stop using publishers’ content.

This is a little surprising given that Common Crawl provides an open repository of web crawl data that can be used by anyone for free, as it says on its website. It’s a worthy service, right? But that’s not exactly how NMA sees it.

“Common Crawl is blatantly taking our content without our permission and failing to honor our opt outs to remove content already taken,” says Danielle Coffey, president and CEO of the News/Media Alliance. “We encourage them to act like the good actor they claim to be, honor these requests, and make clear to their users that the content they scrape is not authorized for commercial use unless expressly permitted.”
The Atlantic alleges that Common Crawl’s archive has been a primary source used to train commercial AI models without authorization by publishers.

Moreover, while Common Crawl now allows copyright holders to put their names on an “opt-out” list to prevent future web scraping, it has failed to remove content it has already scraped from its archives or to confirm it will do so, NMA charges.

NMA demands that Common Crawl:

Add a clear warning on its opt-out registry that users are not allowed to use the content for unauthorized purposes and that such use is a breach of Common Crawl’s terms.

Revise these terms to state that use of the repository is prohibited for AI purposes.

Upon a publisher’s request, remove content from its repository.

Add a clear statement to its website stating that Common Crawl doesn’t own and can’t authorize use of scraped content in its repository; prohibits unauthorized use of such content for AI purposes; respects the IP of news publications to prohibit such use; will remove content from its archive upon publisher request; and will add publisher licensing contact information to the registry upon request.
About the Author
Ray Schultz is the former editor of DM News, Chief Marketer, Direct, Circulation Management and other marketing titles.