SEO · Decision Matrix · 10 min read · Published June 4, 2026

Block training, keep citations · 8 crawlers · robots.txt + WAF

AI Crawler Access Control: The 2026 Decision Matrix

Eight major AI crawlers, eight different calls. The split that changes everything is training bot versus search indexer — GPTBot is not OAI-SearchBot, ClaudeBot is not Claude-SearchBot. This is the bot-by-bot matrix for blocking training crawlers without forfeiting the AI-search citations that increasingly drive qualified traffic.

Digital Applied Team
Senior strategists · Published June 4, 2026
Sources: Vendor docs + Cloudflare research
Anthropic crawl-to-referral: 70,900:1 (peak, June 2025; vs Googlebot ~5:1)
AI bot activity that is training: ~82% (up from 72% a year prior)
News publishers blocking a search bot: 71% (often by accident; lost citations)
Top sites actively blocking AI bots: 2.98% (of the top 1M, July 2024)

AI crawler access control is no longer a single switch you flip on or off. Eight major AI crawlers each demand a separate decision, and the distinction that changes everything is purpose: a crawler that harvests your pages for model training is a fundamentally different actor from one that indexes you for AI search answers. Block them as one bucket and you can quietly delete yourself from the fastest-growing referral channel of 2026.

The reason this matters now is that the major AI vendors have split their crawlers in two. OpenAI runs GPTBot for training and OAI-SearchBot for ChatGPT search. Anthropic runs three separate bots. Amazon, Google, and Apple each separate training access from search and assistant access. The robots.txt rule that blocks one no longer blocks the other — which means the old "block all AI bots" advice is now actively harmful to visibility.

This guide gives you the full bot-by-bot decision matrix, the economics behind why blocking training crawlers makes sense, the five control levers ranked by enforcement strength, and a copy-ready 2026 configuration. Every claim is sourced to the operator's own documentation or to Cloudflare's published network research.

Key takeaways
  1. Training and search are now separate bots. GPTBot ≠ OAI-SearchBot, ClaudeBot ≠ Claude-SearchBot, Amazonbot ≠ Amzn-SearchBot. Each has its own user-agent and can be controlled independently in robots.txt. Treating them as one bucket is the core mistake.
  2. Block training, keep search citations. The defensible default is to disallow training crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended, Applebot-Extended) while allowing search and retrieval crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot) that send referral traffic.
  3. Anthropic is the crawl-to-referral outlier. Cloudflare measured Anthropic crawling roughly 70,900 pages per referred visitor at its June 2025 peak, versus about 5:1 for traditional Googlebot. That asymmetry is the economic case for blocking ClaudeBot training.
  4. WAF rules override robots.txt. A Cloudflare or firewall rule is enforced before robots.txt is even read, so a WAF block beats any robots.txt Allow. For non-compliant crawlers like Bytespider, IP/WAF blocking is the only reliable defense.
  5. llms.txt is not a training opt-out. llms.txt is a Markdown guide that helps LLMs navigate your site at inference time. It does not control training or crawl permissions. robots.txt governs access; llms.txt governs navigation. Do not confuse the two.

01 · The Core Split
One word changes the whole decision: purpose.

Every AI crawler does one of three jobs. It collects pages for model training, it indexes pages for AI search answers, or it fetches a page in real time because a user asked the assistant a question right now. These are different commercial relationships, and as of 2026 the major vendors expose them as different bots with different user-agent strings.

OpenAI's documentation is explicit: GPTBot is the training crawler, and disallowing it "indicates a site's content should not be used in training generative AI foundation models." OAI-SearchBot is a separate crawler that builds the ChatGPT search index. Block OAI-SearchBot and your site will not appear in ChatGPT search answers, even though GPTBot and OAI-SearchBot are run by the same company. Anthropic goes further still, running three distinct bots — ClaudeBot for training, Claude-SearchBot for search indexing, and Claude-User for real-time user-initiated fetches — each independently controllable in robots.txt.

The practical consequence is that the instinctive move, a blanket Disallow: / for every AI user-agent, now does two things at once: it opts you out of training corpora (often the goal) and it removes you from AI search results (almost never the goal). Search Engine Journal's coverage of Anthropic's granular bot framework reports that roughly 71% of top news publishers block at least one retrieval or search bot, frequently while intending only to block training. That is the exact error this matrix is designed to prevent.

The distinction that does the work
A training crawler turns your content into model weights you are never credited for. A search crawler turns your content into a cited answer that can send a visitor back to you. Blocking the first while allowing the second is the entire strategy — and it is only possible because the vendors finally separated the two.
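
The difference between a blanket rule and a split rule can be checked with Python's standard-library robots.txt parser. This is a minimal sketch, assuming a compliant crawler that interprets robots.txt the way `urllib.robotparser` does; the two robots.txt bodies and the example.com URL are illustrative and are not taken from any vendor's documentation.

```python
from urllib.robotparser import RobotFileParser


def parse(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Blanket approach: one group that shuts out both OpenAI crawlers.
BLANKET = """\
User-agent: GPTBot
User-agent: OAI-SearchBot
Disallow: /
"""

# Split approach: block the training bot, keep the search bot.
SPLIT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

URL = "https://example.com/pricing"  # illustrative URL
BOTS = ("GPTBot", "OAI-SearchBot")

results = {
    name: {bot: parse(text).can_fetch(bot, URL) for bot in BOTS}
    for name, text in (("blanket", BLANKET), ("split", SPLIT))
}

for name, verdicts in results.items():
    print(name, verdicts)
```

Under the blanket file neither bot may fetch the page; under the split file only the search crawler may. Real crawlers may resolve edge cases differently from this parser, so treat it as a sanity check rather than a guarantee.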

02 · The Decision Matrix
Eight crawlers, eight different calls.

Below is the flagship asset: a bot-by-bot matrix that pairs each crawler's exact user-agent string with its purpose and the recommended 2026 default. The pattern is consistent — block the training bots, allow the search and retrieval bots — with two exceptions worth understanding before you copy anything into production.

Crawler · user-agent: GPTBot/1.3
Operator · purpose: OpenAI · model training
2026 default: BLOCK in robots.txt. Disallowing it signals content should not be used to train foundation models. Has no effect on ChatGPT search visibility. IP ranges at openai.com/gptbot.json.

Crawler · user-agent: OAI-SearchBot/1.3
Operator · purpose: OpenAI · search indexing
2026 default: ALLOW. This is the crawler behind ChatGPT search answers. Block it and you disappear from ChatGPT search citations. IP ranges at openai.com/searchbot.json.

Crawler · user-agent: ClaudeBot
Operator · purpose: Anthropic · model training
2026 default: BLOCK. Anthropic honors Disallow and Crawl-delay and will not bypass CAPTCHAs. Verify IPs at claude.com/crawling/bots.json. (The old anthropic-ai and Claude-Web agents are deprecated.)

Crawler · user-agent: Claude-SearchBot
Operator · purpose: Anthropic · search indexing
2026 default: ALLOW. Separate from ClaudeBot. This is the bot that powers Claude's web answers and sends the small amount of referral traffic Anthropic does return.

Crawler · user-agent: Google-Extended
Operator · purpose: Google · Gemini training (token)
2026 default: BLOCK if opting out of Gemini training. It is a robots.txt control token, not an HTTP user-agent. Google states it does not affect Search inclusion or ranking.

Crawler · user-agent: Applebot-Extended/1.0
Operator · purpose: Apple · foundation-model training
2026 default: BLOCK if opting out of Apple Intelligence training. Distinct from standard Applebot (Siri web results). Blocking it does not affect Apple Search or Spotlight.

Crawler · user-agent: CCBot/2.0
Operator · purpose: Common Crawl · open corpus
2026 default: BLOCK. Its archive has trained nearly every major LLM, so robots.txt is the primary opt-out. Now also an IP-dispute vector after the April 2026 News/Media Alliance demand letter.

Crawler · user-agent: Bytespider
Operator · purpose: ByteDance · model training
2026 default: BLOCK at the WAF/IP level, not just robots.txt. Independently reported as inconsistently compliant, with no official docs, IP-range file, or robots.txt policy published.

Crawler · user-agent: PerplexityBot · Perplexity-User
Operator · purpose: Perplexity · search + real-time
2026 default: ALLOW both. PerplexityBot builds the search index; Perplexity-User fetches in real time. Blocking either removes you from Perplexity answers, a channel growing fast in 2026.

Crawler · user-agent: Amazonbot/0.1 · Amzn-SearchBot
Operator · purpose: Amazon · training + search
2026 default: BLOCK Amazonbot (may train AI models). ALLOW Amzn-SearchBot, which improves Alexa/Rufus search and explicitly does not crawl for generative AI training.

Two exceptions to read twice
Most rows follow the rule cleanly. The exceptions: Bytespider is reported to comply with robots.txt only inconsistently, so it needs a WAF or IP block rather than a polite Disallow; and Google-Extended is a control token, not a real user-agent, so it never appears in your server logs as an HTTP request. It only governs whether Google may use already-crawled pages for Gemini.
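
One way to keep the matrix maintainable is to hold it as data and render robots.txt from it. The sketch below assumes the version suffixes (such as /1.3) are dropped, since robots.txt groups match on the product token; `MATRIX` and `render_robots_txt` are hypothetical names, and Bytespider still needs the WAF or IP block the matrix calls for, because a Disallow line alone is not enforcement.

```python
from urllib.robotparser import RobotFileParser

# Decisions copied from the matrix; "block" becomes Disallow, "allow" becomes Allow.
MATRIX = {
    "GPTBot": "block",
    "OAI-SearchBot": "allow",
    "ClaudeBot": "block",
    "Claude-SearchBot": "allow",
    "Google-Extended": "block",    # control token, never seen in server logs
    "Applebot-Extended": "block",
    "CCBot": "block",
    "Bytespider": "block",         # also needs a WAF/IP rule
    "PerplexityBot": "allow",
    "Perplexity-User": "allow",
    "Amazonbot": "block",
    "Amzn-SearchBot": "allow",
}


def render_robots_txt(matrix: dict[str, str]) -> str:
    """Render one robots.txt group per crawler, separated by blank lines."""
    groups = []
    for agent, decision in matrix.items():
        rule = "Disallow: /" if decision == "block" else "Allow: /"
        groups.append(f"User-agent: {agent}\n{rule}\n")
    return "\n".join(groups)


robots_txt = render_robots_txt(MATRIX)

# Round-trip the output through the standard-library parser as a sanity check.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(robots_txt)
```

Keeping the decisions in one dictionary means a policy change is a one-line edit, and the parser round-trip catches typos before the file is deployed.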

03 · The Economics
Why the crawl-to-referral gap makes the case.

The business argument for blocking training crawlers comes down to a single ratio: how many of your pages a bot crawls for every one visitor it sends back. Cloudflare publishes this crawl-to-referral ratio across its network, and the spread between vendors is extraordinary. Traditional Googlebot sits at roughly 5 pages crawled per referral. Anthropic, at its June 2025 peak, was crawling about 70,900 pages for every visitor referred — an asymmetry that reframes ClaudeBot training access as a one-way extraction of value.
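
The ratio itself is simple division, which makes it easy to reproduce from your own logs and analytics. A minimal sketch, assuming you already have a count of pages crawled by a bot and a count of visits that bot's product referred over the same window; the function name and the one-week site counts are hypothetical, while 70,900 and 5 are the figures quoted above.

```python
def crawl_to_referral(pages_crawled: int, referred_visits: int) -> float:
    """Pages crawled per referred visit; lower is fairer to the publisher."""
    if referred_visits <= 0:
        raise ValueError("at least one referred visit is needed to form a ratio")
    return pages_crawled / referred_visits


# Hypothetical one-week counts for a single site (not Cloudflare data).
site_ratio = crawl_to_referral(pages_crawled=12_000, referred_visits=40)

# Figures quoted in the text: Anthropic's June 2025 peak versus Googlebot.
gap_vs_googlebot = crawl_to_referral(70_900, 1) / crawl_to_referral(5, 1)

print(f"site ratio: {site_ratio:.0f}:1")
print(f"Anthropic peak vs Googlebot: {gap_vs_googlebot:,.0f}x")
```

Computed per crawler, the same function gives you a site-specific version of Cloudflare's network-wide comparison.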

Crawl-to-referral ratio · pages crawled per referred visitor · lower is fairer to publishers
Source: Cloudflare network research, 2025

Anthropic (peak), June 2025: 70,900:1
OpenAI, July 2025: 1,091:1
Perplexity, July 2025: 195:1
Traditional Googlebot, for scale: ~5:1

Two caveats keep this honest. First, the 70,900:1 figure is Anthropic's peak in the week of June 19-26, 2025; by July 2025 it had reportedly improved to roughly 38,000:1 after Anthropic shipped web-search features. Measured against the June peak that is a drop of about 46%; the roughly 87% improvement also reported implies a starting point near 290,000:1, so it cannot be measured from the June peak. The direction of travel matters, but even the improved ratio is orders of magnitude worse than Googlebot. Second, Cloudflare's own framing of the broader trend is blunt about where this is heading.

"The trend continues to be more crawls and fewer referrals when compared in relation to each other."— Cloudflare, network crawl-data research, July 2025

The other half of the economics is the upside you protect by not blocking search crawlers. Cloudflare reports that training now drives roughly 82% of all AI bot activity (up from about 72% a year earlier) while search-based crawling fell to around 15%. That is the macro signal: the volume hammering your servers is overwhelmingly training, not the search indexing that sends traffic back. Meanwhile AI-referred traffic is reportedly growing fast and tends to convert better than generic organic search — so the search crawlers are the cheap, high-value half of the equation that the blanket-block crowd is throwing away.

Training share of AI bots · where the load comes from: 82% (search ~15%)
Cloudflare measured training at roughly 82% of AI bot activity by July 2025, up from about 72% a year earlier, while search crawling fell to around 15%. Most of the burden is the half that gives nothing back.

AI referral growth · year-over-year, reportedly: 975% (Jan 2025 → Jan 2026)
AI referral traffic is reported to have grown roughly 975% from January 2025 to January 2026. The exact figure varies by source, but the trajectory is steep, which is why deleting yourself from AI search is costly.

Active blockers · of the top 1M sites: 2.98% (July 2024)
As of July 2024, only about 2.98% of the top million sites on Cloudflare's network actively blocked AI bot requests, even though AI bots accessed roughly 39% of those properties. Most sites had no policy at all.

04 · Control Levers
Five levers, ranked by enforcement strength.

robots.txt is the polite request layer — well-behaved crawlers honor it, but it has no teeth against bots that choose to ignore it. The second proprietary asset below ranks the five control mechanisms by how hard they actually enforce, because the critical detail is buried in Cloudflare's documentation: a WAF or firewall rule is evaluated before robots.txt is ever read, so a WAF block overrides any robots.txt Allow.

Control lever: robots.txt Disallow
Enforcement · scope: Voluntary · site-wide by user-agent
When to use it: The primary lever for compliant bots (GPTBot, ClaudeBot, CCBot, Google-Extended). SEO-safe and free. Limitation: zero enforcement against bots that ignore it.

Control lever: X-Robots-Tag: noai
Enforcement · scope: Voluntary · per-page or header
When to use it: Page-level signal (noai/noimageai) some vendors honor. Useful for granular opt-outs. Limitation: a DeviantArt community convention, not an IETF/W3C standard, so reliability varies.

Control lever: Cloudflare AI Crawl Control
Enforcement · scope: Hard block · per-crawler by purpose
When to use it: Dashboard-managed rules that block by purpose category and report robots.txt-violation metrics. Creates a WAF rule on the zone. The pragmatic default for non-engineers.

Control lever: WAF / firewall custom rule
Enforcement · scope: Hard block · enforced before robots.txt
When to use it: The real teeth. Required for Bytespider and any crawler ignoring robots.txt. A WAF block overrides a robots.txt Allow because it runs first. Risk: misconfiguration can block humans.

Control lever: Server-level IP block
Enforcement · scope: Hard block · granular by IP range
When to use it: Lowest level, highest certainty when vendors publish IP-range files (OpenAI, Anthropic, Amazon, Common Crawl). Limitation: brittle as IP ranges rotate; needs maintenance.

The detail that breaks naive configs
If you set Allow for OAI-SearchBot in robots.txt but a managed WAF rule is blocking "all AI crawlers," the WAF wins and you are still excluded from ChatGPT search. Order of evaluation matters: WAF first, robots.txt second. Always reconcile the two layers before assuming your search crawlers are getting through.
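
The order of evaluation can be modeled in a few lines to reconcile the two layers before deploying. This is a sketch under stated assumptions: the WAF is reduced to a set of user-agent substrings it blocks, the crawler is assumed to comply with robots.txt, and `Verdict` and `effective_access` are hypothetical names rather than part of any Cloudflare API.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser


@dataclass
class Verdict:
    reachable: bool
    decided_by: str  # "waf" or "robots.txt"


def effective_access(agent: str, url: str,
                     waf_blocked_agents: set[str],
                     robots_txt: str) -> Verdict:
    """Evaluate the WAF layer first, then robots.txt for a compliant crawler."""
    # Layer 1: a WAF block ends the request before robots.txt matters.
    if any(token.lower() in agent.lower() for token in waf_blocked_agents):
        return Verdict(False, "waf")
    # Layer 2: robots.txt is voluntary and only binds crawlers that honor it.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return Verdict(parser.can_fetch(agent, url), "robots.txt")


ROBOTS = "User-agent: OAI-SearchBot\nAllow: /\n"
URL = "https://example.com/docs"  # illustrative URL

# A broad managed rule that blocks every AI crawler, search bots included.
broad = effective_access("OAI-SearchBot/1.3", URL, {"GPTBot", "OAI-SearchBot"}, ROBOTS)
# A targeted rule that blocks only the non-compliant bot.
targeted = effective_access("OAI-SearchBot/1.3", URL, {"Bytespider"}, ROBOTS)

print(broad)
print(targeted)
```

With the broad rule the search crawler is stopped at the WAF despite the robots.txt Allow; with the targeted rule the decision falls through to robots.txt and the crawler gets in.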

For most teams the right combination is robots.txt for the compliant training bots, Cloudflare AI Crawl Control (or an equivalent managed ruleset) for purpose-level enforcement, and a targeted WAF rule for Bytespider specifically. Cloudflare's one-click "Block AI bots" managed rule, available on all plans including free since July 2024, is a reasonable floor — but verify that it is not silently blocking the search crawlers you want to keep. If you are auditing an existing setup, AI crawler directives belong as a dedicated category in any technical SEO audit checklist.
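
Where a vendor publishes an IP-range file, a request that claims a crawler's user-agent can be checked against those ranges before you trust or block it. A minimal sketch using the standard-library `ipaddress` module; it assumes you have already extracted the CIDR strings from the vendor's file, and the ranges shown are reserved documentation addresses used as placeholders, not any vendor's real ranges.

```python
import ipaddress


def ip_in_ranges(ip: str, cidr_ranges: list[str]) -> bool:
    """Return True when the address falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    for cidr in cidr_ranges:
        network = ipaddress.ip_network(cidr, strict=False)
        # Compare like with like: an IPv4 address is never inside an IPv6 range.
        if address.version == network.version and address in network:
            return True
    return False


# Placeholder ranges (RFC 5737 / RFC 3849 documentation blocks), NOT real crawler IPs.
PUBLISHED_RANGES = ["192.0.2.0/24", "2001:db8::/32"]

print(ip_in_ranges("192.0.2.44", PUBLISHED_RANGES))
print(ip_in_ranges("203.0.113.9", PUBLISHED_RANGES))
```

Because published ranges rotate, a check like this is only as good as the freshness of the list you feed it, which is the maintenance cost noted for the IP-block lever.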

05 · The llms.txt Myth
What llms.txt is not.

One of the most persistent misconceptions in this space is that adding an /llms.txt file gives you control over AI training. It does not. The llms.txt specification defines a Markdown file that helps an LLM efficiently navigate your site's content during a user session. It is an inference-time convenience, the equivalent of a curated sitemap written for a model rather than a search engine. It carries no access or training permissions whatsoever.

Keep the mental model clean: robots.txt governs crawl and access permissions; llms.txt governs how a model finds its way around once it is already reading your pages. Publishing llms.txt is a worthwhile move for AI-search visibility and answer quality — but if your goal is to opt out of training, llms.txt does nothing for you and robots.txt plus a WAF rule does everything. For the file format and how to structure it for inference-time navigation, see our companion guide to the llms.txt specification, and for the foundational access-control mechanics, the robots.txt and meta robots reference.

Say it plainly
Adding /llms.txt does not opt your site out of AI training. It is a navigation guide for inference, full stop. If a tool or vendor implies otherwise, treat that as a red flag.

Common Crawl's CCBot has historically been treated as a passive archiver — a non-profit whose corpus happens to underpin nearly every major LLM, from GPT-class models to LLaMA and Mistral. That framing changed on April 29, 2026, when the News/Media Alliance sent a formal demand letter to Common Crawl's executive director, calling for removal of publisher content, revised terms explicitly prohibiting AI training use, and enforceable opt-out mechanisms. Signatories included NBCUniversal, CNN, McClatchy, Vox Media, Ziff Davis, and USA Today.

The practical takeaway is that blocking CCBot is no longer purely a technical-hygiene decision. For publishers, it has become a precautionary intellectual-property position — increasingly taken on legal advice — because the corpus is now contested ground. If your content has commercial value as licensable IP, disallowing CCBot in robots.txt is the documented opt-out, and doing so early establishes a clear record of intent.

Original analysis · where this is heading
The CCBot dispute is a preview, not an outlier. As AI-search referral becomes a measurable revenue line and training corpora become litigated assets, expect the publisher posture to harden into a standard two-track policy: aggressively block training, deliberately court search. The sites that win the next two years will be the ones that drew that line cleanly in 2026 rather than the ones still running a blunt block-everything robots.txt — or, worse, no policy at all.

07 · The Configuration
The defensible 2026 default.

Here is the configuration that follows from the matrix: block the training crawlers in robots.txt, allow the search and retrieval crawlers, and back it with a WAF layer for the bots that do not respect the file. Adjust per-site — a documentation-heavy SaaS may weigh AI-search visibility more heavily than a paywalled publisher guarding licensable IP — but this is the sensible starting point.

Block · robots.txt · Training crawlers · Disallow: / per user-agent
GPTBot, ClaudeBot, CCBot, Google-Extended, Applebot-Extended, Amazonbot. These collect for model training and give nothing back. robots.txt is the documented opt-out for all of them.
Opt out of training corpora.

Allow · robots.txt · Search crawlers · Allow: / per user-agent
OAI-SearchBot, Claude-SearchBot, PerplexityBot, Perplexity-User, Amzn-SearchBot. These index you for AI answers and send referral traffic. Keep them in.
Preserve AI-search citations.

Hard block · WAF · Non-compliant bots · WAF rule + IP ranges
Bytespider, plus anything your logs show ignoring robots.txt. A WAF rule runs before robots.txt, so it actually enforces. Layer IP-range blocks where vendors publish them.
The only reliable defense.
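
Finding "anything your logs show ignoring robots.txt" can start with a simple scan of access logs for disallowed crawlers that keep requesting pages. The sketch below assumes combined-format log lines and a site-wide Disallow for the listed bots; the regular expression, the sample lines, and the user-agent strings in them are illustrative, and because user-agents can be spoofed a hit is a lead to investigate rather than proof.

```python
import re
from collections import Counter

# Crawlers the default config disallows site-wide (Google-Extended never appears in logs).
DISALLOWED_AGENTS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider", "Amazonbot")

# Captures the request path and the final quoted field (user-agent) of a combined-format line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def violations(log_lines):
    """Count page requests made by crawlers that robots.txt disallows."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        # Fetching robots.txt itself is expected behavior, not a violation.
        if not match or match["path"] == "/robots.txt":
            continue
        for token in DISALLOWED_AGENTS:
            if token.lower() in match["agent"].lower():
                hits[token] += 1
    return hits


# Hypothetical log lines for illustration only.
SAMPLE = [
    '203.0.113.5 - - [04/Jun/2026:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '203.0.113.5 - - [04/Jun/2026:10:00:02 +0000] "GET /pricing HTTP/1.1" 200 9043 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '198.51.100.7 - - [04/Jun/2026:10:01:00 +0000] "GET /pricing HTTP/1.1" 200 9043 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.3)"',
]

result = violations(SAMPLE)
print(dict(result))
```

Any crawler that shows up in this count after your Disallow has been live for a while is a candidate for the WAF layer.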

A few operational notes. Place the most specific user-agent rules first; some crawlers match the longest applicable directive, others the first. Keep the deprecated Anthropic agents (anthropic-ai, Claude-Web) out of your file: they are no longer active, and citing them gives readers and tools broken instructions. And remember that Pay Per Crawl, Cloudflare's HTTP 402 "pay-to-crawl" model that launched in private beta on July 1, 2025 and reportedly reached general availability in August 2025, is now a third path beyond block-or-allow: charge bots a micro-fee for access rather than refusing them outright.

08 · Implications
What this means for your team.

The crawler decision is not one-size-fits-all — it depends on what your content is worth as training data versus how much you stand to gain from AI-search referral. Four common profiles, four different calls.

Publishers · media · Licensable IP
Block all training crawlers including CCBot as a precautionary IP position, and consider Pay Per Crawl. Allow search bots so you keep citation visibility. This is the two-track posture the NMA letter is pushing toward.
Block training, court search.

SaaS · docs sites · AI-search visibility
Lean toward allowing search and retrieval crawlers aggressively, because AI answers are a discovery channel for documentation. Still block training bots, and publish an llms.txt to improve how models navigate your docs.
Maximize search access.

Ecommerce · Conversion-led
AI referral traffic reportedly converts well, so allow search crawlers and measure the channel in GA4. Block training bots that scrape catalog and pricing data without sending shoppers back.
Allow search, block training.

Any site · No policy today
If you have no AI crawler policy at all, you are in the large majority (roughly 97% of the top million sites were not actively blocking AI bots as of July 2024) and almost certainly being trained on by default. Start with the robots.txt block-training / allow-search config, then add a WAF layer for Bytespider.
Adopt the default now.

Whichever profile fits, the mechanics are the same: separate training from search, enforce at the right layer, and measure the AI-referral channel so the decision is data-led rather than reflexive. If you want this configured and monitored as part of a broader technical-SEO program — robots.txt, WAF rules, llms.txt, and AI-search measurement in one engagement — that is exactly the work our agentic SEO service is built around, and crawler governance fits naturally alongside an AI transformation roadmap.

09 · Conclusion
The blanket block is the expensive mistake.

AI crawler access control, June 2026

Block training crawlers, keep your AI-search citations — they are different bots now.

The single most consequential shift in AI crawler control is that the major vendors split training from search. GPTBot is not OAI-SearchBot. ClaudeBot is not Claude-SearchBot. Amazonbot is not Amzn-SearchBot. That split is what makes a precise policy possible — and what makes the old block-everything advice a quiet, self-inflicted loss of visibility.

The economics back the precise approach. Cloudflare's data shows training driving the overwhelming majority of AI bot load, with crawl-to-referral ratios that, for some vendors, run into the tens of thousands of pages per visitor. Blocking the training half costs you almost nothing in referral traffic; keeping the search half open preserves the fastest-growing discovery channel of the year. The two decisions are finally separable, so make them separately.

Set the policy now. The defensible 2026 default is straightforward: disallow the training crawlers in robots.txt, allow the search and retrieval crawlers, enforce Bytespider and other non-compliant bots at the WAF, and treat llms.txt as the inference-navigation aid it is, never as a training opt-out. As of Cloudflare's July 2024 measurement, roughly 97% of the top million sites were not actively blocking AI bots. Drawing the line cleanly this year is the cheap move that compounds.

FAQ · AI crawler control

The questions we get every week.

A training crawler collects your pages to build the dataset a model is trained on — your content becomes part of the model's weights, usually without attribution or referral. A search crawler indexes your pages so the AI assistant can cite you in answers and link visitors back to your site. They are now run as separate bots: GPTBot (training) versus OAI-SearchBot (search) at OpenAI, and ClaudeBot (training) versus Claude-SearchBot (search) at Anthropic. The key implication is that you can block one without blocking the other, which is the entire basis of a smart 2026 policy: opt out of training while staying eligible for AI-search citations.