Publishers Tell Common Crawl to Stop Unauthorized Scraping

Lecturing Common Crawl: Publishers Tell Nonprofit To Stop Unauthorized Scraping 05/05/2026

Commentary

by Ray Schultz, Columnist

May 4, 2026

Common Crawl, the historical web archive, is facing pressure from publishers to stop its alleged scraping and storage of content without permission. The News/Media Alliance (NMA) sent a letter to the nonprofit last week, urging it to stop using publishers’ content.

This is a little surprising given that Common Crawl provides an open repository of web crawl data that can be used by anyone for free, as it says on its website. It’s a worthy service, right? But that’s not exactly how NMA sees it.

“Common Crawl is blatantly taking our content without our permission and failing to honor our opt outs to remove content already taken,” says Danielle Coffey, president and CEO of the News/Media Alliance. “We encourage them to act like the good actor they claim to be, honor these requests, and make clear to their users that the content they scrape is not authorized for commercial use unless expressly permitted.”

The Atlantic alleges that Common Crawl’s archive has been a primary source used to train commercial AI models without authorization by publishers. Moreover, while Common Crawl now allows copyright holders to put their names on an “opt-out” list to prevent future web scraping, it has failed to remove content it has scraped from its archives or to confirm it will do so, NMA charges.

NMA demands that Common Crawl:

Add a clear warning on its opt-out registry that users are not allowed to use the content for unauthorized uses and that such use is a breach of Common Crawl’s terms.

Revise these terms to state that use of the repository is prohibited for AI purposes.

Remove content from its repository upon a publisher’s request.

Add a clear statement to its website stating that Common Crawl doesn’t own and can’t authorize use of scraped content in its repository; prohibits unauthorized use of such content for AI purposes; respects the IP of news publications to prohibit such use; will remove content from its archive upon publisher request; and will add publisher licensing contact info to its registry upon request.

About the Author Ray Schultz is the former editor of DM News, Chief Marketer, Direct, Circulation Management and other marketing titles.
