SEO · Industry Guide · 12 min read · Published June 12, 2026

Six escalation stages · immutable archives · AI-affiliated donors reportedly gave 60%+ of 2024 donations

Publishers vs Common Crawl: The AI Training-Data Showdown

On June 3, 2026, Digital Content Next sent Common Crawl a cease-and-desist on behalf of AP, the New York Times, NBC, Bloomberg, NPR, and Fox. The legal argument is sharp — copyright is not an opt-out regime. The technical reality is sharper still: the archive format makes deletion structurally slow, and blocking CCBot today does nothing about what was already taken.

Digital Applied Team
Senior strategists · Published Jun 12, 2026
Sources: Press Gazette, SEJ, Mozilla
Cease-and-desist sent: Jun 3 (DCN to Common Crawl, 2026)
Publishers in opt-out registry: 900+ (via News/Media Alliance)
Dutch articles removed (BREIN): 2M+ (Nov 2025 action)
Danish removal after request: ~50% (after six-plus months)

The publishers versus Common Crawl fight stopped being a quiet standards dispute on June 3, 2026, when Digital Content Next sent the web-archiving nonprofit a cease-and-desist on behalf of the Associated Press, the New York Times, NBC Universal, Bloomberg, NPR, and Fox. The letter does two things at once: it demands Common Crawl stop scraping protected content, and it demands removal of member content already sitting in the datasets that AI labs train on.

What makes this more than another opt-out skirmish is the argument underneath it. DCN's position is that copyright is not an opt-out regime — publishers should not have to ask to be excluded; Common Crawl should have to ask before it includes them. That reframes the entire crawler-access debate, and it lands at the same moment the technical reality of removal is becoming impossible to ignore.

This is a news analysis and a rights-landscape readout, not a blocking how-to. If you want the mechanics — the exact robots.txt and llms.txt syntax for stopping CCBot and other AI crawlers — our AI crawler access-control decision matrix covers that ground. Here we explain why the cease-and-desist matters, why deletion is structurally slow, and what a publisher can realistically expect at each rung of the escalation ladder.

Key takeaways
  1. A trade body went legal on behalf of six major newsrooms. DCN's June 3, 2026 cease-and-desist represents AP, NYT, NBC, Bloomberg, NPR, and Fox. It demands Common Crawl both stop scraping protected content and remove member content already in its datasets, including paywalled articles.
  2. Removal is hard by architecture, not by stubbornness. Common Crawl's archives use the immutable WARC file format. Editing a published archive without breaking its integrity is structurally difficult, which is why Danish publishers reportedly waited six-plus months for roughly 50% removal.
  3. Blocking CCBot is necessary but not sufficient. A robots.txt disallow stops future crawls, but historical crawls already in the archive are unaffected. Content taken before you blocked stays available to anyone who downloads the dataset.
  4. The funding conflict is the underreported story. Per a 2026 industry analysis, more than 60% of Common Crawl's 2024 donations came from AI-affiliated entities. The same nonprofit that arbitrates publishers' removal requests is substantially funded by the firms that benefit from inclusion.
  5. The crawl-to-referral math is the commercial case. One analysis found AI crawlers retrieving tens of thousands of pages for every visit they send back. When training crawls return almost no traffic, the value exchange that justified open crawling collapses.

01 · What Happened · A cease-and-desist, and the argument that reframes everything.

Digital Content Next sent its cease-and-desist letter to Common Crawl on June 3, 2026, and CEO Jason Kint announced it publicly the following day. The letter accuses Common Crawl of flagrantly infringing copyrighted content by creating and distributing datasets, then sharing them with AI companies knowing that those companies are actively reproducing protected material. It asks for two things: stop scraping, and remove what has already been taken — including paywalled and subscriber-only news articles.

The members standing behind the letter are not marginal players. DCN represents the Associated Press, the New York Times, NBC Universal, Bloomberg, NPR, and Fox. And the letter signals that DCN's lawyers are examining whether Common Crawl's prior statements to publishers were inaccurate or misleading — pointing to cases where the nonprofit reportedly confirmed it would comply with removal requests, then later cited technical costs as the reason full removal never happened.

"Copyright law is not an opt-out regime." (Digital Content Next, cease-and-desist letter to Common Crawl, June 3, 2026)

That single line is the whole thesis. The prevailing internet norm — crawl by default, honor opt-outs when asked — treats inclusion as the baseline and exclusion as the exception you have to request. DCN's argument inverts the burden: permission first, inclusion second. Whether that holds up legally is a separate question, but as a framing, it moves the fight from technical compliance to consent.

This was not the first shot
DCN's cease-and-desist followed an earlier collective demand. The News/Media Alliance sent its own letter to Common Crawl on April 29, 2026, reported publicly the next day, roughly five weeks before DCN. The NMA represents publishers ranging from NBCUniversal and CNN to McClatchy, Vox Media, Ziff Davis, USA Today, Boston Globe Media Partners, and hundreds of regional outlets — more than 900 news websites are covered through its membership.

Common Crawl has not stayed silent. Executive Director Rich Skrenta denied misleading publishers and described a removal process that, in his telling, reflects how the dataset is actually built. He also declined to comment specifically on the DCN letter. The gap between his account and the publishers' experience is not really about good faith — it is about file formats, which is where the next section goes.

02 · The Removal Paradox · Why deletion is structurally slow.

Most coverage frames this as Common Crawl refusing to remove content. The more accurate framing is that the archive was never designed to support removal. Common Crawl stores its captures in the WARC format — Web ARChive, the same standard libraries and archivists use worldwide. A WARC file is closer to a printed book than to an editable database: once a crawl is written, editing individual records after the fact risks compromising the integrity of the whole archive.

So when a publisher asks for removal, Common Crawl's stated approach is to filter the affected URLs out of subsequent crawls and make them inaccessible through its public tools and indices — but it does not delete the existing archive files. The captured content stays inside the WARC; what changes is its discoverability through Common Crawl's own interface. Anyone who already downloaded that crawl, or downloads it later in bulk, still has the original data.

"When a publisher asks us to remove previously crawled material, we respond promptly and initiate a removal process that reflects the technical design of our dataset." (Rich Skrenta, Executive Director, Common Crawl Foundation)

That technical design is exactly why the timelines look the way they do. After the Danish Rights Alliance filed its removal request in July 2024, Common Crawl's attorney reportedly wrote in December 2024 (more than six months later) that approximately 50% of the content had been removed. The pace is not a stall tactic on its face; it is what filtering an immutable archive at scale looks like. But it is also a pace that serves the interests of the AI firms downstream, which is the uncomfortable part.

What removal actually means
When Common Crawl "removes" your content, it generally means the URLs stop appearing in future crawls and drop out of public indices. It does not mean the bytes already written into a historical WARC file are deleted. A November 2025 investigation by The Atlantic reportedly found New York Times and Danish publisher content still accessible after removal had been agreed, with file-system logs showing no content modifications since 2016 — a claim Common Crawl disputes. Treat the "no edits since 2016" detail as a reported finding, not settled fact.
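The distinction between hiding content and deleting it can be sketched as a toy model. This is not Common Crawl's implementation: `ToyArchive`, `hide`, `lookup`, and `bulk_download` are hypothetical names, and the model assumes only what is described above, namely that stored records are never edited while the public lookup path can filter URLs out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ArchiveRecord:
    """One captured page; frozen to mimic a record that is never edited."""
    url: str
    payload: str


@dataclass
class ToyArchive:
    """Toy model only: immutable records plus a mutable 'hidden' list."""
    records: tuple
    hidden_urls: set = field(default_factory=set)

    def hide(self, url: str) -> None:
        # "Removal" here only changes what the lookup tool will return.
        self.hidden_urls.add(url)

    def lookup(self, url: str) -> Optional[ArchiveRecord]:
        # Public index/tool path: hidden URLs are filtered out.
        if url in self.hidden_urls:
            return None
        return next((r for r in self.records if r.url == url), None)

    def bulk_download(self) -> tuple:
        # Bulk path: returns the stored records untouched.
        return self.records


archive = ToyArchive(records=(
    ArchiveRecord("https://example.com/a", "article A text"),
    ArchiveRecord("https://example.com/b", "article B text"),
))
archive.hide("https://example.com/a")

visible = archive.lookup("https://example.com/a")
still_stored = [r.url for r in archive.bulk_download()]
print(visible, still_stored)
```

In this sketch the hidden URL disappears from `lookup` but is still present in `bulk_download`, which is the gap between "removed from the index" and "deleted from the archive" that publishers are objecting to.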

Enforcement actions have moved bigger numbers than voluntary requests. In late November 2025, the Dutch anti-piracy organization BREIN forced Common Crawl to remove more than two million news articles after showing the database held Dutch news content published without permission. The lesson publishers are drawing: collective legal pressure shifts more content, faster, than a polite removal email ever has.

03 · The Escalation Ladder · Six stages, and what each one actually achieves.

Publishers have not jumped straight to lawyers. The fight has climbed a ladder, from settings a single site owner can change to demands a trade association makes on behalf of hundreds of newsrooms. The problem is that the rungs that are easiest to reach are the rungs that do nothing about archives that already exist. Map where you are on this ladder before you decide your content is "protected."

Publisher escalation ladder against Common Crawl and AI crawlers, grouped into technical, collective, and legal stages, with each stage's coverage, speed, enforceability, and impact on historical archives. Sources: BuzzStream survey, Cloudflare press materials, Common Crawl FAQ and blog, Press Gazette, the Danish Rights Alliance timeline, and the NMA and DCN letters, retrieved June 11, 2026.
Stage | Coverage | Time to effect | Enforceability | Archive impact

Technical controls: you act, no permission needed
robots.txt disallow CCBot | Future crawls only | Next crawl cycle | Voluntary | None
Opt-out registry listing | Future crawls only | Weeks to months | Voluntary | None guaranteed
Server-side block (Cloudflare-level) | Future crawls only | Immediate | Enforced at the edge | None

Collective leverage: many publishers, one demand
Formal removal request letter | Existing archives too | Six-plus months | Voluntary | Partial, slow
Trade-association collective demand | Future and existing | Open-ended | Reputational pressure | Unresolved

Legal escalation: lawyers, letters, and liability
Cease-and-desist / litigation threat | Future and existing | Legal timeline | Legal | Disputed

Read the "Archive impact" column twice. Everything in the technical tier — the controls a publisher can deploy unilaterally and immediately — does nothing to existing archives. The actions that can touch historical content are the slow, contested, lawyer-driven ones at the bottom. That inversion is the whole frustration: the levers you control move the least, and the levers that move archives are the ones you cannot pull alone.
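The first rung of the ladder can at least be verified locally. The sketch below uses Python's standard `urllib.robotparser` to confirm that a robots.txt file disallows CCBot while leaving other crawlers alone. The robots.txt content, the example URL, and the `is_allowed` helper are illustrative assumptions, not a recommended policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: block Common Crawl's CCBot,
# leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ARTICLE_URL = "https://example.com/news/story.html"
ccbot_allowed = is_allowed(ROBOTS_TXT, "CCBot", ARTICLE_URL)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", ARTICLE_URL)
print(ccbot_allowed, other_allowed)
```

A passing check only tells you what a compliant crawler would do on its next visit. It says nothing about captures made before the rule existed, which is the "None" in the archive-impact column.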

04 · How CCBot Reaches Content · The paywall gap is architectural, not malicious.

A persistent claim in this debate is that CCBot "bypasses paywalls." Common Crawl denies it, and the denial is technically fair — but the outcome publishers worry about is real, and the reason is worth understanding precisely. CCBot identifies itself in its user agent and officially respects robots.txt disallow directives. The wrinkle is that CCBot never executes JavaScript.

Most modern paywalls are client-side: the full article HTML is delivered to the browser, and JavaScript then runs to hide the body and show a subscribe prompt. Because CCBot retrieves the raw HTML before any of that client-side logic executes, it captures the full article text without ever "defeating" the paywall in the traditional sense. It is not hacking anything; it is reading what the server sent before the gate closes. The effect, though, is that subscriber-only content can land in the archive.

The precise framing
Common Crawl does not break or bypass paywalls in the active sense. CCBot's no-JavaScript architecture means it captures full article HTML before client-side subscription checks run. The fix is server-side: if the protected content never reaches the client until a subscription is verified, a non-rendering crawler never sees it. Client-side-only paywalls are the exposure.

This is the single most actionable insight for a publisher reading this. If your premium content matters, the durable protection is not a robots.txt line — it is moving the paywall enforcement server-side so the body text is never sent to an unauthenticated request. A crawler that does not run JavaScript cannot reveal what the server never delivered.
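To make the difference concrete, here is a minimal sketch that contrasts the two paywall designs from the point of view of a crawler that reads raw HTML and never runs scripts. Everything in it is hypothetical: the article text, the `hideArticleUnlessSubscribed` script name, and the helper functions are stand-ins, and the tag stripping is deliberately crude.

```python
import re

ARTICLE_BODY = "Full subscriber-only article text."
TEASER = "First paragraph only. Subscribe to read more."


def client_side_paywall_response() -> str:
    # The full body is already in the HTML; a script hides it in the browser.
    return (
        "<html><body>"
        f"<div id='article'>{ARTICLE_BODY}</div>"
        "<script>hideArticleUnlessSubscribed()</script>"
        "</body></html>"
    )


def server_side_paywall_response(is_subscriber: bool) -> str:
    # The server decides what to send; non-subscribers get only the teaser.
    body = ARTICLE_BODY if is_subscriber else TEASER
    return f"<html><body><div id='article'>{body}</div></body></html>"


def non_rendering_crawler_text(html: str) -> str:
    """Mimic a crawler that reads raw HTML and never executes scripts."""
    without_scripts = re.sub(r"<script.*?</script>", "", html, flags=re.S)
    return re.sub(r"<[^>]+>", "", without_scripts).strip()


seen_client_side = non_rendering_crawler_text(client_side_paywall_response())
seen_server_side = non_rendering_crawler_text(server_side_paywall_response(False))
print(seen_client_side)
print(seen_server_side)
```

With the client-side design the crawler's view contains the full article; with the server-side design it contains only the teaser, because the body was never sent to the unauthenticated request.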

05 · The Funding Conflict · Who pays for the referee?

Common Crawl was founded in 2007 by Gil Elbaz in San Francisco and began publishing crawls publicly in 2011. For most of its life it was a modest operation. Its 2022 tax return reported $451,447 in revenue against $170,140 in expenses — funded almost entirely by a $450,000 donation from the Elbaz Family Foundation. This was a small archive project, not an industrial pipeline.

Then the money changed. Common Crawl began receiving significant donations from AI companies in 2023; reporting indicates Anthropic and OpenAI each donated $250,000, with Andreessen Horowitz contributing $100,000 separately. Per a 2026 industry analysis, by 2024 more than 60% of Common Crawl's donations came from entities affiliated with generative AI companies. The most recent figures confirmed from an audited tax return are still the 2022 numbers; treat the 2024 donation share as an analysis estimate, not an audited fact.

Why this is the real story
No mainstream SEO or marketing publication has mapped the funding flow plainly, so here it is: the nonprofit that arbitrates publishers' content-removal requests is substantially funded by the AI firms that benefit when that content stays in the dataset. Critics call the arrangement "data laundering" — AI companies donate to a nonprofit that archives the web, then use those archives while keeping a layer between themselves and direct copyright liability. We are describing the structural tension, not alleging intent.

The scale on the other side of that funding is enormous. Common Crawl had archived roughly 9.5 petabytes of data as of mid-2023 and captures more than two billion web pages per monthly crawl, generating around 250 terabytes per cycle. That archive is load-bearing for the AI industry: a February 2024 Mozilla Foundation study found that 64% of the large language models it reviewed used at least one filtered version of Common Crawl for pre-training. GPT-3's training data was, by various accounts, approximately 60 to 80 percent derived from Common Crawl. That is a range, not a precise figure, because the published sources disagree.

06 · The Cost Asymmetry · When crawlers take everything and send back almost nothing.

The deepest reason publishers stopped tolerating open crawling is not legal — it is arithmetic. In the open-web bargain that held for two decades, search engines crawled your pages and sent you traffic in return. AI training crawls break that exchange. One 2026 analysis put the imbalance starkly: in July 2025, Anthropic reportedly crawled tens of thousands of pages for every single visit it referred, while OpenAI sent roughly one referral for every thousand-plus pages crawled. The same analysis estimated that training-related crawling accounts for close to 80% of all AI bot activity.

Numbers at that scale are worth a caveat: they come from third-party analysis of bot traffic, not from the labs' own disclosures, so treat the exact ratios as directional. But the direction is the point. When the crawl-to-referral ratio runs into the thousands, the value that historically justified letting bots in has effectively gone to zero. This is the accelerant behind the zero-click crisis: content gets consumed, answers get synthesized, and the originating site never sees the visitor.
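The ratio itself is simple to compute from your own server logs, which is a better basis for a decision than anyone's published averages. The sketch below assumes you have already counted pages fetched and referral visits per bot. The bot names and counts are invented placeholders, not figures from the analysis cited above.

```python
def crawl_to_referral_ratio(pages_crawled: int, referrals: int) -> float:
    """Pages fetched by a crawler per visit it sends back."""
    if referrals == 0:
        return float("inf")  # crawled, but referred nothing
    return pages_crawled / referrals


# Hypothetical monthly log counts for two unnamed bots; not real data.
LOG_COUNTS = {
    "bot_a": {"pages_crawled": 50_000, "referrals": 1},
    "bot_b": {"pages_crawled": 4_500, "referrals": 3},
}

ratios = {
    bot: crawl_to_referral_ratio(c["pages_crawled"], c["referrals"])
    for bot, c in LOG_COUNTS.items()
}
for bot, ratio in ratios.items():
    print(f"{bot}: {ratio:,.0f} pages crawled per referral")
```

Returning infinity for a bot with zero referrals is a design choice for this sketch: it keeps such bots at the top of any sorted list instead of raising a division error.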

How aggressively top news sites now block AI crawlers

Source: BuzzStream publisher survey and multiple SEO-industry trackers, 2025
  • Top news sites blocking AI training bots (robots.txt disallow on training crawlers): 79%
  • Top news sites blocking AI retrieval bots (blocking answer-engine retrieval, not just training): 71%
  • Top news sites blocking PerplexityBot (named answer-engine crawler): 67%
  • Reputable sites blocking AI, May 2025 (up from 23% in September 2023): ~60%

The blocking wave is real and accelerating: reputable sites blocking AI crawlers rose from 23% in September 2023 to roughly 60% by May 2025, with the most defensive sites forbidding an average of more than fifteen distinct AI user agents. Infrastructure followed. On July 1, 2025, Cloudflare became the first major internet infrastructure provider to flip AI scraping to opt-in — block by default — and reported that more than a million customers subsequently chose to block AI crawlers, with hundreds of billions of AI bot requests blocked since.

07 · Blocking vs Visibility · The uncomfortable tradeoff nobody wants to name.

Blocking is not free. Research published in early 2026 found that publishers who blocked AI crawlers via robots.txt experienced a reported 23.1% decline in monthly visits, with no corresponding drop in AI citations of their content. Read that carefully: correlation is not causation. Publishers who choose to block may differ systematically from those who do not, and the figure should not be read as proof that blocking caused the decline. But it does describe a real tension publishers are living with.

And the opt-out standard everyone is leaning on is itself eroding. As of Q1 2025, an estimated 12.9% of bots ignored robots.txt entirely, up from 3.3% in the prior period. A directive that a growing share of crawlers simply disregards is a weak foundation for a content-rights strategy. This is precisely the bind covered in our analysis of whether blocking AI responses helps or hurts referral traffic — the decision is rarely as clean as "block and protect."
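When a crawler does not honor robots.txt, the remaining technical option is to refuse the request on the server. A minimal sketch of that check follows, assuming a user-agent substring match. The blocklist contents and the `response_status` helper are hypothetical, and real edge products rely on more signals than the user-agent header.

```python
# Hypothetical blocklist of lowercase user-agent tokens; adjust to your policy.
BLOCKED_AGENT_TOKENS = ("ccbot", "gptbot")


def response_status(user_agent: str) -> int:
    """Return the HTTP status a server would send for this user agent."""
    ua = user_agent.lower()
    if any(token in ua for token in BLOCKED_AGENT_TOKENS):
        return 403  # refused by the server, regardless of robots.txt
    return 200


# Example header values, used here only as test inputs.
crawler_status = response_status("CCBot/2.0 (https://commoncrawl.org/faq/)")
browser_status = response_status("Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0")
print(crawler_status, browser_status)
```

The limitation is that this trusts the header. A bot that ignores robots.txt may also misreport its identity, in which case a user-agent match alone would let it through.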

The block case: Stop the bleed
robots.txt + server-side enforcement

Blocking ends future training crawls and signals intent for any eventual licensing negotiation. For premium and subscriber content, server-side paywall enforcement is the durable control. Necessary if your content is the product.

Protects future crawls only

The visibility case: Stay discoverable
selective allow + retrieval bots

Answer engines increasingly mediate discovery. Blanket-blocking retrieval bots can cut citations and the referral trickle that still flows. Some publishers allow retrieval while blocking training, a split that needs per-bot rules.

Keeps answer-engine presence
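A training-versus-retrieval split can be expressed as a small policy table that generates the robots.txt and is then checked with Python's standard `urllib.robotparser`. This is a sketch under stated assumptions: which user agents count as "training" or "retrieval" is a judgment call rather than an official classification, and `POLICY` and `build_robots_txt` are hypothetical names.

```python
from urllib.robotparser import RobotFileParser

# Assumed classification, for illustration only; review each bot's own
# documentation before relying on it.
POLICY = {
    "CCBot": "block",          # Common Crawl, feeds training datasets
    "GPTBot": "block",         # assumed training crawler
    "PerplexityBot": "allow",  # assumed answer-engine retrieval crawler
}


def build_robots_txt(policy: dict) -> str:
    """Emit one robots.txt group per user agent, then a permissive default."""
    groups = []
    for agent, decision in policy.items():
        rule = "Disallow: /" if decision == "block" else "Allow: /"
        groups.append(f"User-agent: {agent}\n{rule}")
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + "\n"


robots_txt = build_robots_txt(POLICY)
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

URL = "https://example.com/news/story.html"
decisions = {agent: parser.can_fetch(agent, URL) for agent in POLICY}
print(robots_txt)
print(decisions)
```

Keeping the policy in one table and generating the file from it makes the split reviewable: a reader can see which bots are blocked and why without parsing directives by eye.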

08 · What To Do · A realistic posture, by content type.

There is no single correct stance — the right move depends on what your content is worth and how you make money from it. What follows is a decision frame, not a directive. The constant across all four cases: blocking governs the future, server-side enforcement governs premium exposure, and nothing on the technical tier touches what is already archived.

Subscriber / paywalled content: Move enforcement server-side

A robots.txt line will not protect premium text from a non-rendering crawler that reads the HTML before client-side checks run. Gate the body on the server so unauthenticated requests never receive it. This is the single highest-leverage fix.

Server-side paywall first

Open editorial / news: Split training vs retrieval

Block training crawlers to preserve licensing leverage while selectively allowing retrieval bots that still drive discovery and the residual referral. This needs per-user-agent rules, not a blanket disallow.

Per-bot policy

Already-archived content: Set realistic expectations

Accept that anything crawled before you blocked may persist in historical WARC files. Registry listing and removal requests help at the margins and slowly; collective legal pressure moves more, as the BREIN action showed.

Collective leverage

Rights and licensing strategy: Treat this as a business question

The endgame is licensing, not blocking. Blocking establishes the baseline that says inclusion requires permission. Track the DCN and NMA letters as precedent, and decide whether you negotiate, litigate, or license.

Plan for licensing

For most organizations, the practical sequence is: fix server-side enforcement on anything premium, set per-bot crawler rules deliberately rather than reflexively, and treat already-archived content as a known loss you manage rather than a problem you solve overnight. If you want help translating that into a concrete crawler policy and content-protection posture, our agentic SEO engagements build exactly this kind of AI-era visibility and rights strategy, and our content engine keeps the publishing operation aligned with it.

09 · Conclusion · The fight is about consent, not just crawlers.

The shape of the rights landscape, June 2026

Blocking governs the future; the archive governs the past; consent is the question underneath both.

The DCN cease-and-desist is the moment the publisher-versus-Common-Crawl dispute stopped being a standards conversation and became a rights fight. The legal argument, that copyright is not an opt-out regime, is the part that could outlast any single letter. The technical reality, that immutable archives make removal structurally slow, is the part publishers most often misunderstand when they assume a robots.txt line has protected them.

Our reading is that the blocking debate, while necessary, is a sideshow to the real one. The structural tension worth naming is the funding: a nonprofit that adjudicates removal requests while reportedly drawing most of its donations from firms that profit from inclusion. That is not an accusation of bad faith; it is a conflict that exists regardless of anyone's intentions, and it is why publishers have stopped trusting voluntary processes and started reaching for lawyers.

Looking forward, expect the center of gravity to keep moving from technical controls toward collective and legal leverage — and, eventually, toward licensing. The crawl-to-referral math has made the old open-web bargain untenable for content businesses, and the opt-out standard is eroding as more crawlers ignore it. The publishers who navigate this well will be the ones who fix server-side enforcement now, set deliberate per-bot policy, accept what is already archived, and treat the whole question as a business-rights strategy rather than a robots.txt edit.

FAQ · Publishers vs Common Crawl

The questions publishers ask every week.

On June 3, 2026, Digital Content Next (DCN) sent Common Crawl a cease-and-desist letter on behalf of members including the Associated Press, the New York Times, NBC Universal, Bloomberg, NPR, and Fox. The letter demands two things: that Common Crawl stop scraping protected publisher content, and that it remove member content already in its datasets, including paywalled and subscriber-only news articles. DCN CEO Jason Kint announced the letter publicly on June 4. Importantly, as of mid-June 2026 this is a cease-and-desist letter, not a lawsuit — no litigation has been filed against Common Crawl by DCN. The core legal argument is that copyright is not an opt-out regime, meaning Common Crawl should need permission before including content rather than publishers needing to request exclusion.