AI crawler access control is no longer a single switch you flip on or off. Eight major operators now run AI crawlers, and each bot demands a separate decision. The distinction that changes everything is purpose: a crawler that harvests your pages for model training is a fundamentally different actor from one that indexes you for AI search answers. Block them as one bucket and you can quietly delete yourself from the fastest-growing referral channel of 2026.

The reason this matters now is that the major AI vendors have split their crawlers by purpose. OpenAI runs GPTBot for training and OAI-SearchBot for ChatGPT search. Anthropic runs three separate bots. Amazon, Google, and Apple each separate training access from search and assistant access. The robots.txt rule that blocks one no longer blocks the other, which means the old "block all AI bots" advice is now actively harmful to visibility.

This guide gives you the full bot-by-bot decision matrix, the economics behind why blocking training crawlers makes sense, the five control levers ranked by enforcement strength, and a copy-ready 2026 configuration. Every claim is sourced to the operator's own documentation or to Cloudflare's published network research.

Key takeaways

01
Training and search are now separate bots. GPTBot ≠ OAI-SearchBot, ClaudeBot ≠ Claude-SearchBot, Amazonbot ≠ Amzn-SearchBot. Each has its own user-agent and can be controlled independently in robots.txt. Treating them as one bucket is the core mistake.
02
Block training, keep search citations. The defensible default is to disallow training crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended, Applebot-Extended) while allowing search and retrieval crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot) that send referral traffic.
03
Anthropic is the crawl-to-referral outlier. Cloudflare measured Anthropic crawling roughly 70,900 pages per referred visitor at its June 2025 peak, versus about 5:1 for traditional Googlebot. That asymmetry is the economic case for blocking ClaudeBot training.
04
WAF rules override robots.txt. A Cloudflare or firewall rule is enforced before robots.txt is even read, so a WAF block beats any robots.txt Allow. For non-compliant crawlers like Bytespider, IP/WAF blocking is the only reliable defense.
05
llms.txt is not a training opt-out. llms.txt is a Markdown guide that helps LLMs navigate your site at inference time. It does not control training or crawl permissions. robots.txt governs access; llms.txt governs navigation. Do not confuse the two.

01 · The Core Split

One word changes the whole decision: purpose.

Every AI crawler does one of three jobs. It collects pages for model training, it indexes pages for AI search answers, or it fetches a page in real time because a user asked the assistant a question right now. These are different commercial relationships, and as of 2026 the major vendors expose them as different bots with different user-agent strings.

OpenAI's documentation is explicit: GPTBot is the training crawler, and disallowing it "indicates a site's content should not be used in training generative AI foundation models." OAI-SearchBot is a separate crawler that builds the ChatGPT search index. Block OAI-SearchBot and your site will not appear in ChatGPT search answers, even though GPTBot and OAI-SearchBot are run by the same company. Anthropic goes further still, running three distinct bots — ClaudeBot for training, Claude-SearchBot for search indexing, and Claude-User for real-time user-initiated fetches — each independently controllable in robots.txt.

The practical consequence is that the instinctive move — a blanket Disallow: / for every AI user-agent — now does two things at once: it opts you out of training corpora (often the goal) and it removes you from AI search results (almost never the goal). Search Engine Journal's coverage of Anthropic's granular bot framework reports that roughly 71% of top news publishers block at least one retrieval or search bot, frequently while intending only to block training. That is the exact error this matrix is designed to prevent.

The distinction that does the work

A training crawler turns your content into model weights you are never credited for. A search crawler turns your content into a cited answer that can send a visitor back to you. Blocking the first while allowing the second is the entire strategy — and it is only possible because the vendors finally separated the two.
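The independence of the two OpenAI bots can be checked locally with Python's standard `urllib.robotparser`. This is a minimal sketch, assuming a hypothetical robots.txt that disallows only GPTBot; the `example.com` URL and the `allowed_agents` helper are illustrative, and a vendor's own parser may differ from Python's in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: disallow the training crawler, leave the rest open.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Return {agent: bool} saying whether robots_txt lets each agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["GPTBot", "OAI-SearchBot"],
    "https://example.com/article",
)
print(result)  # {'GPTBot': False, 'OAI-SearchBot': True}
```

The rule for GPTBot leaves OAI-SearchBot untouched because the two tokens are matched separately, which is what makes the block-training, allow-search policy expressible at all.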

02 · The Decision Matrix

Eight operators, a separate call for each bot.

Below is the flagship asset: a bot-by-bot matrix that pairs each crawler's exact user-agent string with its purpose and the recommended 2026 default. The pattern is consistent — block the training bots, allow the search and retrieval bots — with two exceptions worth understanding before you copy anything into production.

Crawler · user-agent	Operator · purpose	2026 default
`GPTBot/1.3`	OpenAI · model training	BLOCK in robots.txt. Disallowing it signals content should not be used to train foundation models. Has no effect on ChatGPT search visibility. IP ranges at openai.com/gptbot.json.
`OAI-SearchBot/1.3`	OpenAI · search indexing	ALLOW. This is the crawler behind ChatGPT search answers. Block it and you disappear from ChatGPT search citations. IP ranges at openai.com/searchbot.json.
`ClaudeBot`	Anthropic · model training	BLOCK. Anthropic honors Disallow and Crawl-delay and will not bypass CAPTCHAs. Verify IPs at claude.com/crawling/bots.json. (The old anthropic-ai and Claude-Web agents are deprecated.)
`Claude-SearchBot`	Anthropic · search indexing	ALLOW. Separate from ClaudeBot. This is the bot that powers Claude's web answers and sends the small amount of referral traffic Anthropic does return.
`Google-Extended`	Google · Gemini training (token)	BLOCK if opting out of Gemini training. It is a robots.txt control token, not an HTTP user-agent. Google states it does not affect Search inclusion or ranking.
`Applebot-Extended/1.0`	Apple · foundation-model training	BLOCK if opting out of Apple Intelligence training. Distinct from standard Applebot (Siri web results). Blocking it does not affect Apple Search or Spotlight.
`CCBot/2.0`	Common Crawl · open corpus	BLOCK. Its archive has trained nearly every major LLM, so robots.txt is the primary opt-out. Now also an IP-dispute vector after the April 2026 News/Media Alliance demand letter.
`Bytespider`	ByteDance · model training	BLOCK at the WAF/IP level, not just robots.txt. Independently reported as inconsistently compliant, with no official docs, IP-range file, or robots.txt policy published.
`PerplexityBot · Perplexity-User`	Perplexity · search + real-time	ALLOW both. PerplexityBot builds the search index; Perplexity-User fetches in real time. Blocking either removes you from Perplexity answers, a channel growing fast in 2026.
`Amazonbot/0.1 · Amzn-SearchBot`	Amazon · training + search	BLOCK Amazonbot (may train AI models). ALLOW Amzn-SearchBot, which improves Alexa/Rufus search and explicitly does not crawl for generative AI training.

Two exceptions to read twice

Most rows follow the rule cleanly. The exceptions: Bytespider is reported to honor robots.txt only inconsistently, so it needs a WAF or IP block rather than a polite Disallow; and Google-Extended is a control token, not a real user-agent, so it never appears in your server logs as an HTTP request. It only governs whether Google may use already-crawled pages for Gemini.
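For log audits, the matrix can be turned into a small lookup that maps a raw User-Agent header to its recommended default. This is a sketch under stated assumptions: the `POLICY` labels and the `classify` helper are hypothetical names, the sample header strings are illustrative rather than copied from vendor documentation, and substring matching on the header is a simplification (it does not prove the request really came from that operator).

```python
# Defaults transcribed from the matrix. Google-Extended is left out on purpose:
# it is a robots.txt control token and never appears as a request user-agent.
POLICY = {
    "GPTBot": "block",
    "OAI-SearchBot": "allow",
    "ClaudeBot": "block",
    "Claude-SearchBot": "allow",
    "Applebot-Extended": "block",
    "CCBot": "block",
    "Bytespider": "block-at-waf",
    "PerplexityBot": "allow",
    "Perplexity-User": "allow",
    "Amazonbot": "block",
    "Amzn-SearchBot": "allow",
}


def classify(user_agent):
    """Map a raw User-Agent header to the matrix default, or 'unlisted'."""
    ua = user_agent.lower()
    # Try longer tokens first so a short token never shadows a longer one.
    for token in sorted(POLICY, key=len, reverse=True):
        if token.lower() in ua:
            return POLICY[token]
    return "unlisted"


# Illustrative header strings (assumed shapes, not authoritative).
print(classify("Mozilla/5.0 (compatible; GPTBot/1.3; +https://openai.com/gptbot)"))
print(classify("Mozilla/5.0 (compatible; OAI-SearchBot/1.3)"))
print(classify("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"))
```

Run over an access log, a function like this gives a quick count of how much traffic falls into each bucket before any rule is changed.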

03 · The Economics

Why the crawl-to-referral gap makes the case.

The business argument for blocking training crawlers comes down to a single ratio: how many of your pages a bot crawls for every one visitor it sends back. Cloudflare publishes this crawl-to-referral ratio across its network, and the spread between vendors is extraordinary. Traditional Googlebot sits at roughly 5 pages crawled per referral. Anthropic, at its June 2025 peak, was crawling about 70,900 pages for every visitor referred — an asymmetry that reframes ClaudeBot training access as a one-way extraction of value.

Crawl-to-referral ratio: pages crawled per referred visitor (lower is fairer to publishers). Source: Cloudflare network research, 2025.

Operator	Period	Ratio
Anthropic (peak)	June 2025	70,900:1
OpenAI	July 2025	1,091:1
Perplexity	July 2025	195:1
Traditional Googlebot	for scale	~5:1

Two caveats keep this honest. First, the 70,900:1 figure is Anthropic's peak in the week of June 19-26, 2025; by July 2025 the ratio had improved to roughly 38,000:1 after Anthropic shipped web-search features. That is a drop of about 46% from the June figure; the 87% improvement reported alongside it only fits a higher, earlier baseline, not 70,900:1. The direction of travel matters, but even the improved ratio is orders of magnitude worse than Googlebot. Second, Cloudflare's own framing of the broader trend is blunt about where this is heading.
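The arithmetic behind these ratios is easy to reproduce. The sketch below uses only figures quoted in this section; the function names are illustrative, and the same `crawl_to_referral` calculation could be applied to your own log counts.

```python
def crawl_to_referral(pages_crawled, referrals):
    """Pages crawled per referred visitor; higher means less traffic returned."""
    if referrals <= 0:
        raise ValueError("referrals must be positive to form a ratio")
    return pages_crawled / referrals


def percent_drop(before, after):
    """Percentage decrease from `before` to `after`."""
    return (before - after) / before * 100


# Ratios quoted in this section (pages crawled per referred visitor).
anthropic_june = 70_900
anthropic_july = 38_000
googlebot = 5

print(round(percent_drop(anthropic_june, anthropic_july), 1))  # 46.4
print(round(anthropic_june * (1 - 0.87)))                      # 9217
print(round(anthropic_july / googlebot))                       # 7600
```

The last line is the one that matters for policy: even at the improved July figure, the ratio is thousands of times Googlebot's.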

"The trend continues to be more crawls and fewer referrals when compared in relation to each other."— Cloudflare, network crawl-data research, July 2025

The other half of the economics is the upside you protect by not blocking search crawlers. Cloudflare reports that training now drives roughly 82% of all AI bot activity (up from about 72% a year earlier) while search-based crawling fell to around 15%. That is the macro signal: the volume hammering your servers is overwhelmingly training, not the search indexing that sends traffic back. Meanwhile AI-referred traffic is reportedly growing fast and tends to convert better than generic organic search — so the search crawlers are the cheap, high-value half of the equation that the blanket-block crowd is throwing away. For the full picture on bot volume — with bots now generating 57.5% of web requests — see our AI crawler traffic data reference.

Metric	Value	Period	Detail
Training share of AI bots (where the load comes from)	82% (search ~15%)	July 2025	Cloudflare measured training at roughly 82% of AI bot activity by July 2025, up from about 72% a year earlier, while search crawling fell to around 15%. Most of the burden is the half that gives nothing back.
AI referral growth (year-over-year, reportedly)	975%	Jan 2025 → Jan 2026	AI referral traffic is reported to have grown roughly 975% from January 2025 to January 2026. The exact figure varies by source, but the trajectory is steep, which is why deleting yourself from AI search is costly.
Active blockers (of the top 1M sites)	2.98%	July 2024	As of July 2024, only about 2.98% of the top million sites on Cloudflare's network actively blocked AI bot requests, even though AI bots accessed roughly 39% of those properties. Most sites had no policy at all.

04 · Control Levers

Five levers, ranked by enforcement strength.

robots.txt is the polite request layer: well-behaved crawlers honor it, but it has no teeth against bots that choose to ignore it. The table below ranks the five control mechanisms by how hard they actually enforce. The critical detail, buried in Cloudflare's documentation, is that a WAF or firewall rule acts on every request at the edge, before the crawler ever reads robots.txt, so a WAF block overrides any robots.txt Allow.

Control lever	Enforcement · scope	When to use it
`robots.txt Disallow`	Voluntary · site-wide by user-agent	The primary lever for compliant bots (GPTBot, ClaudeBot, CCBot, Google-Extended). SEO-safe and free. Limitation: zero enforcement against bots that ignore it.
`X-Robots-Tag: noai`	Voluntary · per-page or header	Page-level signal (noai/noimageai) some vendors honor. Useful for granular opt-outs. Limitation: a DeviantArt community convention, not an IETF/W3C standard, so reliability varies.
`Cloudflare AI Crawl Control`	Hard block · per-crawler by purpose	Dashboard-managed rules that block by purpose category and report robots.txt-violation metrics. Creates a WAF rule on the zone. The pragmatic default for non-engineers.
`WAF / firewall custom rule`	Hard block · enforced before robots.txt	The real teeth. Required for Bytespider and any crawler ignoring robots.txt. A WAF block overrides a robots.txt Allow because it runs first. Risk: misconfiguration can block humans.
`Server-level IP block`	Hard block · granular by IP range	Lowest level, highest certainty when vendors publish IP-range files (OpenAI, Anthropic, Amazon, Common Crawl). Limitation: brittle as IP ranges rotate; needs maintenance.
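The IP-level lever in the last row comes down to a CIDR membership test. The sketch below uses Python's standard `ipaddress` module and assumes you have already copied the operator's published ranges into a plain list; the ranges shown are reserved documentation prefixes (not real crawler addresses), and `PUBLISHED_RANGES` and `is_spoofed` are hypothetical names.

```python
import ipaddress


def ip_in_ranges(ip, cidr_ranges):
    """True if ip falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    networks = (ipaddress.ip_network(cidr, strict=False) for cidr in cidr_ranges)
    return any(address in network for network in networks)


# Placeholder ranges (RFC 5737 / RFC 3849 documentation prefixes). In practice
# these would come from the operator's published IP-range file.
PUBLISHED_RANGES = {"GPTBot": ["192.0.2.0/24", "2001:db8::/32"]}


def is_spoofed(user_agent, ip, published=PUBLISHED_RANGES):
    """True if the header claims a known crawler but the IP is outside its ranges."""
    ua = user_agent.lower()
    for token, ranges in published.items():
        if token.lower() in ua:
            return not ip_in_ranges(ip, ranges)
    return False  # no claim to a listed crawler, so nothing to verify


print(is_spoofed("Mozilla/5.0 (compatible; GPTBot/1.3)", "192.0.2.10"))    # False
print(is_spoofed("Mozilla/5.0 (compatible; GPTBot/1.3)", "198.51.100.7"))  # True
```

Because published ranges rotate, a check like this is only as good as the freshness of the list behind it, which is the maintenance cost the table warns about.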

The detail that breaks naive configs

If you set Allow for OAI-SearchBot in robots.txt but a managed WAF rule is blocking "all AI crawlers," the WAF wins and you are still excluded from ChatGPT search. Order of evaluation matters: WAF first, robots.txt second. Always reconcile the two layers before assuming your search crawlers are getting through.
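Reconciling the two layers can be modeled as a two-step check. This is a simplified sketch, not Cloudflare's actual evaluation logic: it assumes the WAF matches on user-agent tokens, uses Python's `urllib.robotparser` for the robots.txt layer, and the `WAF_BLOCKED` list stands in for a hypothetical "block all AI crawlers" rule.

```python
from urllib.robotparser import RobotFileParser


def effective_access(agent, path, robots_txt, waf_blocked_tokens):
    """Resolve access in enforcement order: WAF first, robots.txt second."""
    # Layer 1: the WAF sees the request before robots.txt plays any role.
    if any(token.lower() in agent.lower() for token in waf_blocked_tokens):
        return "blocked by WAF"
    # Layer 2: robots.txt is advisory and only matters past the WAF.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return "allowed" if parser.can_fetch(agent, path) else "disallowed by robots.txt"


def conflicts(agents, path, robots_txt, waf_blocked_tokens):
    """Agents that robots.txt allows but the WAF still blocks."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        agent for agent in agents
        if parser.can_fetch(agent, path)
        and effective_access(agent, path, robots_txt, waf_blocked_tokens) == "blocked by WAF"
    ]


ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""
# Hypothetical managed rule matching every AI crawler by user-agent token.
WAF_BLOCKED = ["GPTBot", "OAI-SearchBot", "ClaudeBot"]

print(effective_access("OAI-SearchBot", "/pricing", ROBOTS_TXT, WAF_BLOCKED))
print(conflicts(["GPTBot", "OAI-SearchBot"], "/pricing", ROBOTS_TXT, WAF_BLOCKED))
```

Any agent returned by `conflicts` is one where the robots.txt Allow is being silently overridden, which is exactly the mismatch worth auditing for.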

For most teams the right combination is robots.txt for the compliant training bots, Cloudflare AI Crawl Control (or an equivalent managed ruleset) for purpose-level enforcement, and a targeted WAF rule for Bytespider specifically. Cloudflare's one-click "Block AI bots" managed rule, available on all plans including free since July 2024, is a reasonable floor — but verify that it is not silently blocking the search crawlers you want to keep. If you are auditing an existing setup, AI crawler directives belong as a dedicated category in any technical SEO audit checklist.

05 · The llms.txt Myth

What llms.txt is not.

One of the most persistent misconceptions in this space is that adding an /llms.txt file gives you control over AI training. It does not. The llms.txt specification defines a Markdown file that helps an LLM efficiently navigate your site's content during a user session — it is an inference-time convenience, the equivalent of a curated sitemap written for a model rather than a search engine. It carries no access or training permissions whatsoever.

Keep the mental model clean: robots.txt governs crawl and access permissions; llms.txt governs how a model finds its way around once it is already reading your pages. Publishing llms.txt is a worthwhile move for AI-search visibility and answer quality — but if your goal is to opt out of training, llms.txt does nothing for you and robots.txt plus a WAF rule does everything. For the file format and how to structure it for inference-time navigation, see our companion guide to the llms.txt specification, and for the foundational access-control mechanics, the robots.txt and meta robots reference.

Say it plainly

Adding /llms.txt does not opt your site out of AI training. It is a navigation guide for inference, full stop. If a tool or vendor implies otherwise, treat that as a red flag.

06 · The Legal Front

CCBot is now an IP dispute vector.

Common Crawl's CCBot has historically been treated as a passive archiver — a non-profit whose corpus happens to underpin nearly every major LLM, from GPT-class models to LLaMA and Mistral. That framing changed on April 29, 2026, when the News/Media Alliance sent a formal demand letter to Common Crawl's executive director, calling for removal of publisher content, revised terms explicitly prohibiting AI training use, and enforceable opt-out mechanisms. Signatories included NBCUniversal, CNN, McClatchy, Vox Media, Ziff Davis, and USA Today.

The practical takeaway is that blocking CCBot is no longer purely a technical-hygiene decision. For publishers, it has become a precautionary intellectual-property position — increasingly taken on legal advice — because the corpus is now contested ground. If your content has commercial value as licensable IP, disallowing CCBot in robots.txt is the documented opt-out, and doing so early establishes a clear record of intent.

Original analysis · where this is heading

The CCBot dispute is a preview, not an outlier. As AI-search referral becomes a measurable revenue line and training corpora become litigated assets, expect the publisher posture to harden into a standard two-track policy: aggressively block training, deliberately court search. The sites that win the next two years will be the ones that drew that line cleanly in 2026 rather than the ones still running a blunt block-everything robots.txt — or, worse, no policy at all.

07 · The Configuration

The defensible 2026 default.

Here is the configuration that follows from the matrix: block the training crawlers in robots.txt, allow the search and retrieval crawlers, and back it with a WAF layer for the bots that do not respect the file. Adjust per-site — a documentation-heavy SaaS may weigh AI-search visibility more heavily than a paywalled publisher guarding licensable IP — but this is the sensible starting point.

Layer	Applies to	Rule	Detail
Block · robots.txt	Training crawlers	Disallow: / per user-agent	GPTBot, ClaudeBot, CCBot, Google-Extended, Applebot-Extended, Amazonbot. These collect for model training and give nothing back. robots.txt is the documented opt-out for all of them. Outcome: opt out of training corpora.
Allow · robots.txt	Search crawlers	Allow: / per user-agent	OAI-SearchBot, Claude-SearchBot, PerplexityBot, Perplexity-User, Amzn-SearchBot. These index you for AI answers and send referral traffic. Keep them in. Outcome: preserve AI-search citations.
Hard block · WAF	Non-compliant bots	WAF rule + IP ranges	Bytespider, plus anything your logs show ignoring robots.txt. A WAF rule runs before robots.txt, so it actually enforces. Layer IP-range blocks where vendors publish them. This is the only reliable defense.
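The robots.txt half of this configuration can be generated from the two lists above. The following is a minimal sketch rather than a definitive file: `build_robots_txt` is a hypothetical helper, the trailing `User-agent: *` group is an assumption about how you want every other crawler treated, and Bytespider is deliberately absent because it is handled at the WAF layer.

```python
# Lists transcribed from the configuration above.
BLOCK_TRAINING = [
    "GPTBot", "ClaudeBot", "CCBot",
    "Google-Extended", "Applebot-Extended", "Amazonbot",
]
ALLOW_SEARCH = [
    "OAI-SearchBot", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User", "Amzn-SearchBot",
]


def build_robots_txt(block, allow):
    """Emit one robots.txt group per user-agent, separated by blank lines."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in block]
    groups += [f"User-agent: {agent}\nAllow: /" for agent in allow]
    # Assumption: all other crawlers keep default access.
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + "\n"


ROBOTS_TXT = build_robots_txt(BLOCK_TRAINING, ALLOW_SEARCH)
print(ROBOTS_TXT)
```

Keeping one group per user-agent, instead of stacking several agents over one rule, makes it easy to flip a single bot later without touching the rest of the file.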

A few operational notes. Place the most specific user-agent rules first; some crawlers match the longest applicable directive, others the first. Keep the deprecated Anthropic agents (anthropic-ai, Claude-Web) out of your file: they are no longer active, and citing them gives readers and tools broken instructions. And remember that Pay Per Crawl, Cloudflare's HTTP 402 "pay-to-crawl" model that launched in private beta on July 1, 2025 and reached general availability in August 2025, is now a third path beyond block-or-allow: charge bots a micro-fee for access rather than refusing them outright.

08 · Implications

What this means for your team.

The crawler decision is not one-size-fits-all — it depends on what your content is worth as training data versus how much you stand to gain from AI-search referral. Four common profiles, four different calls.

Profile	Priority	Recommended call	In short
Publishers · media	Licensable IP	Block all training crawlers including CCBot as a precautionary IP position, and consider Pay Per Crawl. Allow search bots so you keep citation visibility. This is the two-track posture the NMA letter is pushing toward.	Block training, court search
SaaS · docs sites	AI-search visibility	Lean toward allowing search and retrieval crawlers aggressively, since AI answers are a discovery channel for documentation. Still block training bots, and publish an llms.txt to improve how models navigate your docs.	Maximize search access
Ecommerce	Conversion-led	AI referral traffic reportedly converts well, so allow search crawlers and measure the channel in GA4. Block training bots that scrape catalog and pricing data without sending shoppers back.	Allow search, block training
Any site	No policy today	If you have no AI crawler policy at all, you are in the ~97% majority and almost certainly being trained on by default. Start with the robots.txt block-training / allow-search config, then add a WAF layer for Bytespider.	Adopt the default now

Whichever profile fits, the mechanics are the same: separate training from search, enforce at the right layer, and measure the AI-referral channel so the decision is data-led rather than reflexive. If you want this configured and monitored as part of a broader technical-SEO program — robots.txt, WAF rules, llms.txt, and AI-search measurement in one engagement — that is exactly the work our agentic SEO service is built around, and crawler governance fits naturally alongside an AI transformation roadmap.
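Measuring the AI-referral channel starts with recognizing which sessions came from an AI assistant. The sketch below classifies referrer URLs by hostname; the domain list is an illustrative assumption (extend it from what your own analytics shows), and `ai_referral_source` and `ai_referral_share` are hypothetical helper names rather than part of GA4 or any analytics API.

```python
from typing import Optional
from urllib.parse import urlparse

# Illustrative, non-exhaustive mapping of referrer domains to assistants.
AI_REFERRER_DOMAINS = {
    "chatgpt.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "claude.ai": "Claude",
}


def ai_referral_source(referrer: str) -> Optional[str]:
    """Return the assistant name for a referrer URL, or None if not listed."""
    host = (urlparse(referrer).hostname or "").lower()
    for domain, name in AI_REFERRER_DOMAINS.items():
        # Match the bare domain or any subdomain, but not look-alike hosts.
        if host == domain or host.endswith("." + domain):
            return name
    return None


def ai_referral_share(referrers):
    """Fraction of sessions whose referrer is a listed AI assistant."""
    if not referrers:
        return 0.0
    hits = sum(1 for r in referrers if ai_referral_source(r) is not None)
    return hits / len(referrers)


sessions = [
    "https://chatgpt.com/",
    "https://www.google.com/search?q=widgets",
    "https://www.perplexity.ai/search/abc",
    "",  # direct visit, no referrer
]
print(ai_referral_share(sessions))  # 0.5
```

Tracking this share before and after a policy change is one way to confirm that a training block did not also cut off the search crawlers.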

09 · Conclusion

The blanket block is the expensive mistake.

AI crawler access control, June 2026

Block training crawlers, keep your AI-search citations — they are different bots now.

The single most consequential shift in AI crawler control is that the major vendors split training from search. GPTBot is not OAI-SearchBot. ClaudeBot is not Claude-SearchBot. Amazonbot is not Amzn-SearchBot. That split is what makes a precise policy possible — and what makes the old block-everything advice a quiet, self-inflicted loss of visibility.

The economics back the precise approach. Cloudflare's data shows training driving the overwhelming majority of AI bot load, with crawl-to-referral ratios that, for some vendors, run into the tens of thousands of pages per visitor. Blocking the training half costs you almost nothing in referral traffic; keeping the search half open preserves the fastest-growing discovery channel of the year. The two decisions are finally separable, so make them separately.

Set the policy now. The defensible 2026 default is straightforward: disallow the training crawlers in robots.txt, allow the search and retrieval crawlers, enforce Bytespider and other non-compliant bots at the WAF, and treat llms.txt as the inference-navigation aid it is — never as a training opt-out. Roughly 97% of top sites still have no policy at all. Drawing the line cleanly this year is the cheap move that compounds.

AI Crawler Access Control: The 2026 Decision Matrix
