The publishers-versus-Common-Crawl fight stopped being a quiet standards dispute on June 3, 2026, when Digital Content Next sent the web-archiving nonprofit a cease-and-desist on behalf of the Associated Press, the New York Times, NBC Universal, Bloomberg, NPR, and Fox. The letter does two things at once: it demands Common Crawl stop scraping protected content, and it demands removal of member content already sitting in the datasets that AI labs train on.
What makes this more than another opt-out skirmish is the argument underneath it. DCN's position is that copyright is not an opt-out regime — publishers should not have to ask to be excluded; Common Crawl should have to ask before it includes them. That reframes the entire crawler-access debate, and it lands at the same moment the technical reality of removal is becoming impossible to ignore.
This is a news analysis and a rights-landscape readout, not a blocking how-to. If you want the mechanics — the exact robots.txt and llms.txt syntax for stopping CCBot and other AI crawlers — our AI crawler access-control decision matrix covers that ground. Here we explain why the cease-and-desist matters, why deletion is structurally slow, and what a publisher can realistically expect at each rung of the escalation ladder.
- 01. A trade body went legal on behalf of six major newsrooms. DCN's June 3, 2026 cease-and-desist represents AP, NYT, NBC, Bloomberg, NPR, and Fox. It demands Common Crawl both stop scraping protected content and remove member content already in its datasets, including paywalled articles.
- 02. Removal is hard by architecture, not by stubbornness. Common Crawl's archives use the immutable WARC file format. Editing a published archive without breaking its integrity is structurally difficult, which is why Danish publishers reportedly waited six-plus months for roughly 50% removal.
- 03. Blocking CCBot is necessary but not sufficient. A robots.txt disallow stops future crawls, but historical crawls already in the archive are unaffected. Content taken before you blocked stays available to anyone who downloads the dataset.
- 04. The funding conflict is the underreported story. Per a 2026 industry analysis, more than 60% of Common Crawl's 2024 donations came from AI-affiliated entities. The same nonprofit that arbitrates publishers' removal requests is substantially funded by the firms that benefit from inclusion.
- 05. The crawl-to-referral math is the commercial case. One analysis found AI crawlers retrieving tens of thousands of pages for every visit they send back. When training crawls return almost no traffic, the value exchange that justified open crawling collapses.
01. What Happened
A cease-and-desist, and the argument that reframes everything.
Digital Content Next sent its cease-and-desist letter to Common Crawl on June 3, 2026, and CEO Jason Kint announced it publicly the following day. The letter accuses Common Crawl of flagrantly infringing copyrighted content by creating and distributing datasets, then sharing them with AI companies knowing that those companies are actively reproducing protected material. It asks for two things: stop scraping, and remove what has already been taken — including paywalled and subscriber-only news articles.
The members standing behind the letter are not marginal players. DCN represents the Associated Press, the New York Times, NBC Universal, Bloomberg, NPR, and Fox. And the letter signals that DCN's lawyers are examining whether Common Crawl's prior statements to publishers were inaccurate or misleading — pointing to cases where the nonprofit reportedly confirmed it would comply with removal requests, then later cited technical costs as the reason full removal never happened.
"Copyright law is not an opt-out regime." (Digital Content Next, cease-and-desist letter to Common Crawl, June 3, 2026)
That single line is the whole thesis. The prevailing internet norm — crawl by default, honor opt-outs when asked — treats inclusion as the baseline and exclusion as the exception you have to request. DCN's argument inverts the burden: permission first, inclusion second. Whether that holds up legally is a separate question, but as a framing, it moves the fight from technical compliance to consent.
Common Crawl has not stayed silent. Executive Director Rich Skrenta denied misleading publishers and described a removal process that, in his telling, reflects how the dataset is actually built. He also declined to comment specifically on the DCN letter. The gap between his account and the publishers' experience is not really about good faith — it is about file formats, which is where the next section goes.
02. The Removal Paradox
Why deletion is structurally slow.
Most coverage frames this as Common Crawl refusing to remove content. The more accurate framing is that the archive was never designed to support removal. Common Crawl stores its captures in the WARC format — Web ARChive, the same standard libraries and archivists use worldwide. A WARC file is closer to a printed book than to an editable database: once a crawl is written, editing individual records after the fact risks compromising the integrity of the whole archive.
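A minimal sketch of why editing a written archive is fragile, under an assumption that goes beyond the text above: records are compressed one by one, concatenated into a single file, and located through a separately published index of byte offsets and lengths. The function names and example URLs are hypothetical, and this is a toy model, not Common Crawl's actual tooling.

```python
import gzip
import zlib

def build_archive(records):
    """Concatenate individually gzipped records; index each by (offset, length)."""
    blob = b""
    index = {}
    for url, body in records:
        member = gzip.compress(body.encode("utf-8"))
        index[url] = (len(blob), len(member))
        blob += member
    return blob, index

def read_record(blob, index, url):
    """Fetch one record by slicing the file at its indexed byte range."""
    offset, length = index[url]
    return gzip.decompress(blob[offset:offset + length]).decode("utf-8")

# Hypothetical captures, in the order they were written.
records = [
    ("https://a.example/1", "first article"),
    ("https://b.example/2", "second article"),
    ("https://c.example/3", "third article"),
]
blob, index = build_archive(records)

# Cut the middle record out of the bytes, but leave the published index alone.
off, length = index["https://b.example/2"]
edited = blob[:off] + blob[off + length:]

def stale_lookup_ok(url):
    """True if the old index still returns the right text from the edited file."""
    try:
        return read_record(edited, index, url) == dict(records)[url]
    except (OSError, EOFError, zlib.error):
        return False

for url, _ in records:
    print(url, stale_lookup_ok(url))
```

In this toy model, cutting one record out leaves the first record readable, but lookups for later records through the old index no longer return the right text. That is the kind of integrity problem an after-the-fact edit creates.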
So when a publisher asks for removal, Common Crawl's stated approach is to filter the affected URLs out of subsequent crawls and make them inaccessible through its public tools and indices — but it does not delete the existing archive files. The captured content stays inside the WARC; what changes is its discoverability through Common Crawl's own interface. Anyone who already downloaded that crawl, or downloads it later in bulk, still has the original data.
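The difference between filtering and deleting can be sketched as a toy model. The `Archive` class, its method names, and the example URLs are all hypothetical; the sketch only mirrors the behavior described above, where a removal request changes what the public lookup returns and what later crawls collect, while the stored captures stay as they were.

```python
from dataclasses import dataclass, field

@dataclass
class Archive:
    """Toy model: captures that are never edited, plus a mutable suppression list."""
    captures: dict                                   # url -> captured text
    suppressed: set = field(default_factory=set)     # URL prefixes under removal

    def request_removal(self, url_prefix):
        """Record a removal request; existing captures are left untouched."""
        self.suppressed.add(url_prefix)

    def _is_suppressed(self, url):
        return any(url.startswith(prefix) for prefix in self.suppressed)

    def lookup(self, url):
        """Public index lookup: suppressed URLs are no longer discoverable."""
        if self._is_suppressed(url):
            return None
        return self.captures.get(url)

    def crawl(self, url, text):
        """Subsequent crawls skip suppressed URLs."""
        if not self._is_suppressed(url):
            self.captures[url] = text

    def bulk_download(self):
        """A bulk copy of the archive files contains everything ever captured."""
        return dict(self.captures)

archive = Archive({"https://news.example/2023/story": "full article text"})
archive.request_removal("https://news.example/")
archive.crawl("https://news.example/2026/new-story", "newer text")

print(archive.lookup("https://news.example/2023/story"))
print(sorted(archive.bulk_download()))
```

After the request, the lookup comes back empty and the new story is never collected, but the 2023 capture is still present in the bulk copy.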
"When a publisher asks us to remove previously crawled material, we respond promptly and initiate a removal process that reflects the technical design of our dataset." (Rich Skrenta, Executive Director, Common Crawl Foundation)
That technical design is exactly why the timelines look the way they do. After the Danish Rights Alliance filed its removal request in July 2024, Common Crawl's attorney reportedly wrote in December 2024 (more than six months later) that approximately 50% of the content had been removed. The pace is not a stall tactic on its face; it is what filtering an immutable archive at scale looks like. But it is also a pace that serves the interests of the AI firms downstream, which is the uncomfortable part.
Enforcement actions have moved bigger numbers than voluntary requests. In late November 2025, the Dutch anti-piracy organization BREIN forced Common Crawl to remove more than two million news articles after showing the database held Dutch news content published without permission. The lesson publishers are drawing: collective legal pressure shifts more content, faster, than a polite removal email ever has.
03. The Escalation Ladder
Six stages, and what each one actually achieves.
Publishers have not jumped straight to lawyers. The fight has climbed a ladder, from settings a single site owner can change to demands a trade association makes on behalf of hundreds of newsrooms. The problem is that the rungs that are easiest to reach are the rungs that do nothing about archives that already exist. Map where you are on this ladder before you decide your content is "protected."
| Tier | Stage | Coverage | Time to effect | Enforceability | Archive impact |
|---|---|---|---|---|---|
| Technical controls (you act, no permission needed) | robots.txt disallow CCBot | Future crawls only | Next crawl cycle | Voluntary | None |
| Technical controls (you act, no permission needed) | Opt-out registry listing | Future crawls only | Weeks to months | Voluntary | None guaranteed |
| Technical controls (you act, no permission needed) | Server-side block (Cloudflare-level) | Future crawls only | Immediate | Enforced at the edge | None |
| Collective leverage (many publishers, one demand) | Formal removal request letter | Existing archives too | Six-plus months | Voluntary | Partial, slow |
| Collective leverage (many publishers, one demand) | Trade-association collective demand | Future and existing | Open-ended | Reputational pressure | Unresolved |
| Legal escalation (lawyers, letters, and liability) | Cease-and-desist / litigation threat | Future and existing | Legal timeline | Legal | Disputed |
Read the "Archive impact" column twice. Everything in the technical tier — the controls a publisher can deploy unilaterally and immediately — does nothing to existing archives. The actions that can touch historical content are the slow, contested, lawyer-driven ones at the bottom. That inversion is the whole frustration: the levers you control move the least, and the levers that move archives are the ones you cannot pull alone.
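To make the "future crawls only" limit concrete, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `is_allowed` helper, the example URL, and the `already_archived` set are illustrative assumptions, not a recommended policy. The point is only that a disallow rule answers "may this bot fetch now?" and has no handle on captures made earlier.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block CCBot, allow everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if a compliant crawler with this user agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

STORY = "https://news.example/2023/story"

# Hypothetical stand-in for captures made before the block went live.
already_archived = {STORY}

future_fetch_allowed = is_allowed(ROBOTS_TXT, "CCBot", STORY)
still_in_archive = STORY in already_archived

print("CCBot may fetch now:", future_fetch_allowed)
print("Earlier capture still archived:", still_in_archive)
```

The rule changes the first answer and leaves the second untouched, which is why every row in the technical tier shows no archive impact.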
04. How CCBot Reaches Content
The paywall gap is architectural, not malicious.
A persistent claim in this debate is that CCBot "bypasses paywalls." Common Crawl denies it, and the denial is technically fair — but the outcome publishers worry about is real, and the reason is worth understanding precisely. CCBot identifies itself in its user agent and officially respects robots.txt disallow directives. The wrinkle is that CCBot never executes JavaScript.
Most modern paywalls are client-side: the full article HTML is delivered to the browser, and JavaScript then runs to hide the body and show a subscribe prompt. Because CCBot retrieves the raw HTML before any of that client-side logic executes, it captures the full article text without ever "defeating" the paywall in the traditional sense. It is not hacking anything; it is reading what the server sent before the gate closes. The effect, though, is that subscriber-only content can land in the archive.
This is the single most actionable insight for a publisher reading this. If your premium content matters, the durable protection is not a robots.txt line — it is moving the paywall enforcement server-side so the body text is never sent to an unauthenticated request. A crawler that does not run JavaScript cannot reveal what the server never delivered.
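The contrast between client-side and server-side enforcement can be sketched with the standard library alone. Everything here is hypothetical (the page templates, the `crawl_text` helper, the sample article text); it models a crawler that reads raw HTML and never runs scripts, not any specific crawler's implementation.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect text from raw HTML the way a non-rendering crawler would."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>, whose contents are not text

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def crawl_text(html):
    """Return the text a crawler sees without executing any JavaScript."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

TEASER = "Markets moved sharply today."
BODY = "Subscriber-only analysis of what happened next."

def client_side_page():
    """Full body ships in the HTML; a script hides it after load."""
    return (f"<article><p>{TEASER}</p><p id='body'>{BODY}</p></article>"
            "<script>document.getElementById('body').hidden = true;</script>")

def server_side_page(is_subscriber):
    """Body is rendered into the response only for authenticated subscribers."""
    body = f"<p id='body'>{BODY}</p>" if is_subscriber else "<p>Subscribe to read on.</p>"
    return f"<article><p>{TEASER}</p>{body}</article>"

print(BODY in crawl_text(client_side_page()))
print(BODY in crawl_text(server_side_page(is_subscriber=False)))
```

With the client-side gate, the subscriber text is in the response and so in the crawl. With the server-side gate, an unauthenticated request receives only the teaser.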
05. The Funding Conflict
Who pays for the referee?
Common Crawl was founded in 2007 by Gil Elbaz in San Francisco and began publishing crawls publicly in 2011. For most of its life it was a modest operation. Its 2022 tax return reported $451,447 in revenue against $170,140 in expenses — funded almost entirely by a $450,000 donation from the Elbaz Family Foundation. This was a small archive project, not an industrial pipeline.
Then the money changed. Common Crawl began receiving significant donations from AI companies in 2023; reporting indicates Anthropic and OpenAI each donated $250,000, with Andreessen Horowitz contributing $100,000 separately. Per a 2026 industry analysis, by 2024 more than 60% of Common Crawl's donations came from entities affiliated with generative AI companies. The most recent figures confirmed from an audited tax return are still the 2022 numbers; treat the 2024 donation share as an analysis estimate, not an audited fact.
The scale on the other side of that funding is enormous. Common Crawl had archived roughly 9.5 petabytes of data as of mid-2023 and captures more than two billion web pages per monthly crawl, generating around 250 terabytes per cycle. That archive is load-bearing for the AI industry: a February 2024 Mozilla Foundation study found that 64% of the large language models it reviewed used at least one filtered version of Common Crawl for pre-training. GPT-3's training data was, by various accounts, approximately 60 to 80 percent derived from Common Crawl. That is a range, not a precise figure, because the published sources disagree.
06. The Cost Asymmetry
When crawlers take everything and send back almost nothing.
The deepest reason publishers stopped tolerating open crawling is not legal — it is arithmetic. In the open-web bargain that held for two decades, search engines crawled your pages and sent you traffic in return. AI training crawls break that exchange. One 2026 analysis put the imbalance starkly: in July 2025, Anthropic reportedly crawled tens of thousands of pages for every single visit it referred, while OpenAI sent roughly one referral for every thousand-plus pages crawled. The same analysis estimated that training-related crawling accounts for close to 80% of all AI bot activity.
Numbers at that scale are worth a caveat: they come from third-party analysis of bot traffic, not from the labs' own disclosures, so treat the exact ratios as directional. But the direction is the point. When the crawl-to-referral ratio runs into the thousands, the traffic that historically justified letting bots in has effectively gone to zero. This is the accelerant behind the zero-click crisis: content gets consumed, answers get synthesized, and the originating site never sees the visitor.
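For publishers who want to run the same arithmetic on their own logs, the ratio is simple to compute. The bot names and counts below are hypothetical placeholders, not figures from the analysis cited above; the sketch assumes you already have per-bot totals of pages fetched and visits referred.

```python
def crawl_to_referral_ratio(pages_crawled, referrals):
    """Pages fetched per referred visit; None when no referrals were observed."""
    if referrals <= 0:
        return None
    return pages_crawled / referrals

# Hypothetical monthly totals from a site's own logs: (pages crawled, visits referred).
observed = {
    "training_bot_a": (500_000, 10),
    "training_bot_b": (120_000, 100),
    "silent_bot_c": (40_000, 0),
}

ratios = {bot: crawl_to_referral_ratio(*counts) for bot, counts in observed.items()}

for bot, ratio in ratios.items():
    label = "no referrals at all" if ratio is None else f"{ratio:,.0f} pages per visit"
    print(f"{bot}: {label}")
```

Returning `None` for zero referrals keeps the "sent nothing back" case distinct from a merely large ratio, instead of dividing by zero or reporting infinity.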
[Chart: How aggressively top news sites now block AI crawlers. Source: BuzzStream publisher survey and multiple SEO-industry trackers, 2025]
The blocking wave is real and accelerating: the share of reputable sites blocking AI crawlers rose from 23% in September 2023 to roughly 60% by May 2025, with the most defensive sites forbidding an average of more than fifteen distinct AI user agents. Infrastructure followed. On July 1, 2025, Cloudflare became the first major internet infrastructure provider to flip AI scraping to opt-in (block by default) and reported that more than a million customers subsequently chose to block AI crawlers, with hundreds of billions of AI bot requests blocked since.
07. Blocking vs Visibility
The uncomfortable tradeoff nobody wants to name.
Blocking is not free. Research published in early 2026 found that publishers who blocked AI crawlers via robots.txt saw a 23.1% decline in monthly visits, with no corresponding drop in AI citations of their content. Read that carefully, because correlation is not causation: publishers who choose to block may differ systematically from those who do not, and the figure should not be read as proof that blocking caused the decline. But it does describe a real tension publishers are living with.
And the opt-out standard everyone is leaning on is itself eroding. As of Q1 2025, an estimated 12.9% of bots ignored robots.txt entirely, up from 3.3% in the prior period. A directive that a growing share of crawlers simply disregards is a weak foundation for a content-rights strategy. This is precisely the bind covered in our analysis of whether blocking AI responses helps or hurts referral traffic — the decision is rarely as clean as "block and protect."
Stop the bleed
Blocking ends future training crawls and signals intent for any eventual licensing negotiation. For premium and subscriber content, server-side paywall enforcement is the durable control. Necessary if your content is the product.
Stay discoverable
Answer engines increasingly mediate discovery. Blanket-blocking retrieval bots can cut citations and the referral trickle that still flows. Some publishers allow retrieval while blocking training — a split that needs per-bot rules.
08. What To Do
A realistic posture, by content type.
There is no single correct stance — the right move depends on what your content is worth and how you make money from it. What follows is a decision frame, not a directive. The constant across all four cases: blocking governs the future, server-side enforcement governs premium exposure, and nothing on the technical tier touches what is already archived.
Move enforcement server-side
A robots.txt line will not protect premium text from a non-rendering crawler that reads the HTML before client-side checks run. Gate the body on the server so unauthenticated requests never receive it. This is the single highest-leverage fix.
Split training vs retrieval
Block training crawlers to preserve licensing leverage while selectively allowing retrieval bots that still drive discovery and the residual referral. This needs per-user-agent rules, not a blanket disallow.
Set realistic expectations
Accept that anything crawled before you blocked may persist in historical WARC files. Registry listing and removal requests help at the margins and slowly; collective legal pressure moves more, as the BREIN action showed.
Treat this as a business question
The endgame is licensing, not blocking. Blocking establishes the baseline that says inclusion requires permission. Track the DCN and NMA letters as precedent, and decide whether you negotiate, litigate, or license.
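The training-versus-retrieval split described above comes down to per-user-agent rules, which can be generated and then audited before deployment. In this sketch, the grouping of bots into `TRAINING_BOTS` and `RETRIEVAL_BOTS` is an assumption for illustration; confirm each operator's current documentation before relying on it. The helper names and example URL are hypothetical, and the rules bind only crawlers that honor robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Assumed grouping for illustration only; verify against each operator's docs.
TRAINING_BOTS = ["CCBot", "GPTBot"]
RETRIEVAL_BOTS = ["OAI-SearchBot", "PerplexityBot"]

def build_robots_txt(blocked, allowed):
    """Emit one rule group per user agent instead of a blanket disallow."""
    lines = []
    for agent in blocked:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    for agent in allowed:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    return "\n".join(lines)

def audit(robots_txt, agents, url="https://news.example/story"):
    """Map each user agent to whether a compliant crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}

policy = build_robots_txt(TRAINING_BOTS, RETRIEVAL_BOTS)
result = audit(policy, TRAINING_BOTS + RETRIEVAL_BOTS)

print(policy)
print(result)
```

Auditing the generated file per agent catches the common mistake of a wildcard rule silently overriding the intended split.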
For most organizations, the practical sequence is: fix server-side enforcement on anything premium, set per-bot crawler rules deliberately rather than reflexively, and treat already-archived content as a known loss you manage rather than a problem you solve overnight. If you want help translating that into a concrete crawler policy and content-protection posture, our agentic SEO engagements build exactly this kind of AI-era visibility and rights strategy, and our content engine keeps the publishing operation aligned with it.
09. Conclusion
The fight is about consent, not just crawlers.
Blocking governs the future; the archive governs the past; consent is the question underneath both.
The DCN cease-and-desist is the moment the publisher-versus-Common-Crawl dispute stopped being a standards conversation and became a rights fight. The legal argument, that copyright is not an opt-out regime, is the part that could outlast any single letter. The technical reality, that immutable archives make removal structurally slow, is the part publishers most often misunderstand when they assume a robots.txt line has protected them.
Our reading is that the blocking debate, while necessary, is a sideshow to the real one. The structural tension worth naming is the funding: a nonprofit that adjudicates removal requests while, by one analysis's estimate, drawing most of its donations from entities affiliated with the firms that profit from inclusion. That is not an accusation of bad faith; it is a conflict that exists regardless of anyone's intentions, and it is why publishers have stopped trusting voluntary processes and started reaching for lawyers.
Looking forward, expect the center of gravity to keep moving from technical controls toward collective and legal leverage — and, eventually, toward licensing. The crawl-to-referral math has made the old open-web bargain untenable for content businesses, and the opt-out standard is eroding as more crawlers ignore it. The publishers who navigate this well will be the ones who fix server-side enforcement now, set deliberate per-bot policy, accept what is already archived, and treat the whole question as a business-rights strategy rather than a robots.txt edit.