AI crawler access control is no longer a single switch you flip on or off. Eight major AI crawlers each demand a separate decision, and the distinction that changes everything is purpose: a crawler that harvests your pages for model training is a fundamentally different actor from one that indexes you for AI search answers. Block them as one bucket and you can quietly delete yourself from the fastest-growing referral channel of 2026.
The reason this matters now is that the major AI vendors have split their crawlers in two. OpenAI runs GPTBot for training and OAI-SearchBot for ChatGPT search. Anthropic runs three separate bots. Amazon, Google, and Apple each separate training access from search and assistant access. The robots.txt rule that blocks one no longer blocks the other — which means the old "block all AI bots" advice is now actively harmful to visibility.
This guide gives you the full bot-by-bot decision matrix, the economics behind why blocking training crawlers makes sense, the five control levers ranked by enforcement strength, and a copy-ready 2026 configuration. Every claim is sourced to the operator's own documentation or to Cloudflare's published network research.
- 01 Training and search are now separate bots. GPTBot ≠ OAI-SearchBot, ClaudeBot ≠ Claude-SearchBot, Amazonbot ≠ Amzn-SearchBot. Each has its own user-agent and can be controlled independently in robots.txt. Treating them as one bucket is the core mistake.
- 02 Block training, keep search citations. The defensible default is to disallow training crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended, Applebot-Extended) while allowing search and retrieval crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot) that send referral traffic.
- 03 Anthropic is the crawl-to-referral outlier. Cloudflare measured Anthropic crawling roughly 70,900 pages per referred visitor at its June 2025 peak, versus about 5:1 for traditional Googlebot. That asymmetry is the economic case for blocking ClaudeBot training.
- 04 WAF rules override robots.txt. A Cloudflare or firewall rule is enforced before robots.txt is even read, so a WAF block beats any robots.txt Allow. For non-compliant crawlers like Bytespider, IP/WAF blocking is the only reliable defense.
- 05 llms.txt is not a training opt-out. llms.txt is a Markdown guide that helps LLMs navigate your site at inference time. It does not control training or crawl permissions. robots.txt governs access; llms.txt governs navigation. Do not confuse the two.
01 — The Core Split
One word changes the whole decision: purpose.
Every AI crawler does one of three jobs. It collects pages for model training, it indexes pages for AI search answers, or it fetches a page in real time because a user asked the assistant a question right now. These are different commercial relationships, and as of 2026 the major vendors expose them as different bots with different user-agent strings.
OpenAI's documentation is explicit: GPTBot is the training crawler, and disallowing it "indicates a site's content should not be used in training generative AI foundation models." OAI-SearchBot is a separate crawler that builds the ChatGPT search index. Block OAI-SearchBot and your site will not appear in ChatGPT search answers, even though GPTBot and OAI-SearchBot are run by the same company. Anthropic goes further still, running three distinct bots — ClaudeBot for training, Claude-SearchBot for search indexing, and Claude-User for real-time user-initiated fetches — each independently controllable in robots.txt.
The practical consequence is that the instinctive move, a blanket Disallow: / for every AI user-agent, now does two things at once: it opts you out of training corpora (often the goal) and it removes you from AI search results (almost never the goal). Search Engine Journal's coverage of Anthropic's granular bot framework reports that roughly 71% of top news publishers block at least one retrieval or search bot, frequently while intending only to block training. That is the exact error this matrix is designed to prevent.
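The split by purpose can be made concrete with a small lookup. The sketch below is illustrative only: the bot names come from the text above, while the sample user-agent strings in the usage note and the `classify` helper are assumptions, not vendor code.

```python
# Purpose of each crawler token, using the OpenAI and Anthropic bots named above.
CRAWLER_PURPOSE = {
    "GPTBot": "training",
    "OAI-SearchBot": "search",
    "ClaudeBot": "training",
    "Claude-SearchBot": "search",
    "Claude-User": "user-fetch",
}


def classify(user_agent: str) -> str:
    """Return the purpose of the first known crawler token found in a UA string."""
    ua = user_agent.lower()
    # Check longer tokens first so a longer name is never shadowed by a shorter one.
    for token in sorted(CRAWLER_PURPOSE, key=len, reverse=True):
        if token.lower() in ua:
            return CRAWLER_PURPOSE[token]
    return "unknown"
```

For example, `classify("Mozilla/5.0 (compatible; GPTBot/1.3; +https://openai.com/gptbot)")` returns `"training"`, while a string containing `OAI-SearchBot/1.3` returns `"search"`. A blanket rule treats both as one bucket; a lookup like this keeps the two decisions apart.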
02 — The Decision Matrix
Eight crawlers, eight different calls.
Below is the flagship asset: a bot-by-bot matrix that pairs each crawler's exact user-agent string with its purpose and the recommended 2026 default. The pattern is consistent — block the training bots, allow the search and retrieval bots — with two exceptions worth understanding before you copy anything into production.
| Crawler · user-agent | Operator · purpose | 2026 default |
|---|---|---|
| GPTBot/1.3 | OpenAI · model training | BLOCK in robots.txt. Disallowing it signals content should not be used to train foundation models. Has no effect on ChatGPT search visibility. IP ranges at openai.com/gptbot.json. |
| OAI-SearchBot/1.3 | OpenAI · search indexing | ALLOW. This is the crawler behind ChatGPT search answers. Block it and you disappear from ChatGPT search citations. IP ranges at openai.com/searchbot.json. |
| ClaudeBot | Anthropic · model training | BLOCK. Anthropic honors Disallow and Crawl-delay and will not bypass CAPTCHAs. Verify IPs at claude.com/crawling/bots.json. (The old anthropic-ai and Claude-Web agents are deprecated.) |
| Claude-SearchBot | Anthropic · search indexing | ALLOW. Separate from ClaudeBot. This is the bot that powers Claude's web answers and sends the small amount of referral traffic Anthropic does return. |
| Google-Extended | Google · Gemini training (token) | BLOCK if opting out of Gemini training. It is a robots.txt control token, not an HTTP user-agent. Google states it does not affect Search inclusion or ranking. |
| Applebot-Extended/1.0 | Apple · foundation-model training | BLOCK if opting out of Apple Intelligence training. Distinct from standard Applebot (Siri web results). Blocking it does not affect Apple Search or Spotlight. |
| CCBot/2.0 | Common Crawl · open corpus | BLOCK. Its archive has trained nearly every major LLM, so robots.txt is the primary opt-out. Now also an IP-dispute vector after the April 2026 News/Media Alliance demand letter. |
| Bytespider | ByteDance · model training | BLOCK at the WAF/IP level, not just robots.txt. Independently reported as inconsistently compliant, with no official docs, IP-range file, or robots.txt policy published. |
| PerplexityBot · Perplexity-User | Perplexity · search + real-time | ALLOW both. PerplexityBot builds the search index; Perplexity-User fetches in real time. Blocking either removes you from Perplexity answers, a channel growing fast in 2026. |
| Amazonbot/0.1 · Amzn-SearchBot | Amazon · training + search | BLOCK Amazonbot (may train AI models). ALLOW Amzn-SearchBot, which improves Alexa/Rufus search and explicitly does not crawl for generative AI training. |
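The matrix points to published IP-range files because a user-agent string alone can be spoofed. The sketch below shows one way to check a client address against such a list. It makes two loud assumptions: the `prefixes` / `ipv4Prefix` JSON layout is a guess at the file format, not a documented schema, and the ranges are reserved documentation networks rather than real crawler addresses.

```python
import ipaddress
import json

# Hypothetical payload: the "prefixes"/"ipv4Prefix" layout is an assumption,
# and these CIDR blocks are reserved documentation ranges, not crawler IPs.
SAMPLE_RANGES = json.loads("""
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv4Prefix": "198.51.100.0/28"}]}
""")


def load_networks(payload: dict) -> list:
    """Parse every CIDR prefix in the payload into an ip_network object."""
    return [ipaddress.ip_network(p["ipv4Prefix"]) for p in payload["prefixes"]]


def is_published_ip(client_ip: str, networks: list) -> bool:
    """True when the client IP falls inside any published range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)


NETWORKS = load_networks(SAMPLE_RANGES)
```

In practice the payload would be downloaded from the operator's published URL and refreshed on a schedule, since the ranges change over time. A request that claims to be GPTBot but arrives from outside the published ranges can then be treated as an impostor.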
03 — The Economics
Why the crawl-to-referral gap makes the case.
The business argument for blocking training crawlers comes down to a single ratio: how many of your pages a bot crawls for every one visitor it sends back. Cloudflare publishes this crawl-to-referral ratio across its network, and the spread between vendors is extraordinary. Traditional Googlebot sits at roughly 5 pages crawled per referral. Anthropic, at its June 2025 peak, was crawling about 70,900 pages for every visitor referred — an asymmetry that reframes ClaudeBot training access as a one-way extraction of value.
[Chart: Crawl-to-referral ratio · lower is fairer to publishers. Source: Cloudflare network research, 2025]
Two caveats keep this honest. First, the 70,900:1 figure is Anthropic's peak in the week of June 19-26, 2025; by July 2025 it had reportedly improved to roughly 38,000:1 after Anthropic shipped web-search features. That is a drop of about 46% from the June peak; the improvement of around 87% sometimes quoted alongside it does not follow from these two figures and presumably uses a different baseline. The direction of travel matters, but even the improved ratio is orders of magnitude worse than Googlebot. Second, Cloudflare's own framing of the broader trend is blunt about where this is heading.
"The trend continues to be more crawls and fewer referrals when compared in relation to each other."— Cloudflare, network crawl-data research, July 2025
The other half of the economics is the upside you protect by not blocking search crawlers. Cloudflare reports that training now drives roughly 82% of all AI bot activity (up from about 72% a year earlier) while search-based crawling fell to around 15%. That is the macro signal: the volume hammering your servers is overwhelmingly training, not the search indexing that sends traffic back. Meanwhile AI-referred traffic is reportedly growing fast and tends to convert better than generic organic search — so the search crawlers are the cheap, high-value half of the equation that the blanket-block crowd is throwing away.
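The ratio itself is simple arithmetic, and it is worth running against your own logs. The sketch below uses only figures quoted in this section; the function names are illustrative, and the log counts you would feed in are your own.

```python
def crawl_to_referral(pages_crawled: int, referred_visits: int) -> float:
    """Pages crawled per referred visit; infinite when nothing is referred."""
    if referred_visits == 0:
        return float("inf")
    return pages_crawled / referred_visits


def percent_drop(before: float, after: float) -> float:
    """Relative improvement from one ratio to a lower one, in percent."""
    return (before - after) / before * 100


# Figures quoted in this section (Cloudflare, 2025).
JUNE_PEAK = 70_900   # Anthropic, pages crawled per referral, June 2025 peak
JULY = 38_000        # Anthropic, reported July 2025 ratio
GOOGLEBOT = 5        # traditional Googlebot, approximate

print(round(percent_drop(JUNE_PEAK, JULY), 1))  # 46.4
print(round(JULY / GOOGLEBOT))                  # 7600
```

Even at the improved July figure, the ratio is roughly 7,600 times Googlebot's. Feeding `crawl_to_referral` with per-bot request counts and referral counts from your own analytics gives the same comparison for your site.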
Where the load comes from
Cloudflare measured training at roughly 82% of AI bot activity by July 2025, up from about 72% a year earlier, while search crawling fell to around 15%. Most of the burden is the half that gives nothing back.
Year-over-year, reportedly
AI referral traffic is reported to have grown roughly 975% from January 2025 to January 2026. The exact figure varies by source, but the trajectory is steep — which is why deleting yourself from AI search is costly.
Of the top 1M sites
As of July 2024, only about 2.98% of the top million sites on Cloudflare's network actively blocked AI bot requests, even though AI bots accessed roughly 39% of those properties. Most sites had no policy at all.
04 — Control Levers
Five levers, ranked by enforcement strength.
robots.txt is the polite request layer — well-behaved crawlers honor it, but it has no teeth against bots that choose to ignore it. The second proprietary asset below ranks the five control mechanisms by how hard they actually enforce, because the critical detail is buried in Cloudflare's documentation: a WAF or firewall rule is evaluated before robots.txt is ever read, so a WAF block overrides any robots.txt Allow.
| Control lever | Enforcement · scope | When to use it |
|---|---|---|
| robots.txt Disallow | Voluntary · site-wide by user-agent | The primary lever for compliant bots (GPTBot, ClaudeBot, CCBot, Google-Extended). SEO-safe and free. Limitation: zero enforcement against bots that ignore it. |
| X-Robots-Tag: noai | Voluntary · per-page or header | Page-level signal (noai/noimageai) some vendors honor. Useful for granular opt-outs. Limitation: a DeviantArt community convention, not an IETF/W3C standard, so reliability varies. |
| Cloudflare AI Crawl Control | Hard block · per-crawler by purpose | Dashboard-managed rules that block by purpose category and report robots.txt-violation metrics. Creates a WAF rule on the zone. The pragmatic default for non-engineers. |
| WAF / firewall custom rule | Hard block · enforced before robots.txt | The real teeth. Required for Bytespider and any crawler ignoring robots.txt. A WAF block overrides a robots.txt Allow because it runs first. Risk: misconfiguration can block humans. |
| Server-level IP block | Hard block · granular by IP range | Lowest level, highest certainty when vendors publish IP-range files (OpenAI, Anthropic, Amazon, Common Crawl). Limitation: brittle as IP ranges rotate; needs maintenance. |
If you set Allow for OAI-SearchBot in robots.txt but a managed WAF rule is blocking "all AI crawlers," the WAF wins and you are still excluded from ChatGPT search. Order of evaluation matters: WAF first, robots.txt second. Always reconcile the two layers before assuming your search crawlers are getting through.
For most teams the right combination is robots.txt for the compliant training bots, Cloudflare AI Crawl Control (or an equivalent managed ruleset) for purpose-level enforcement, and a targeted WAF rule for Bytespider specifically. Cloudflare's one-click "Block AI bots" managed rule, available on all plans including free since July 2024, is a reasonable floor, but verify that it is not silently blocking the search crawlers you want to keep. If you are auditing an existing setup, AI crawler directives belong as a dedicated category in any technical SEO audit checklist.
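The evaluation order described above can be modeled in a few lines. This is a toy simulation, not Cloudflare's implementation: `WAF_BLOCKED_SUBSTRINGS` stands in for a hypothetical firewall rule, and Python's standard `urllib.robotparser` stands in for a compliant crawler reading robots.txt. Note that `can_fetch` expects the bot's product token (for example `GPTBot`), not a full browser-style user-agent string.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""

# Hypothetical WAF rule: bot tokens the firewall rejects outright.
# OAI-SearchBot is listed on purpose, to show the WAF beating a robots.txt Allow.
WAF_BLOCKED_SUBSTRINGS = ["Bytespider", "OAI-SearchBot"]


def effective_access(bot_token: str, url: str) -> str:
    """Model the evaluation order: firewall first, robots.txt second."""
    if any(s.lower() in bot_token.lower() for s in WAF_BLOCKED_SUBSTRINGS):
        return "blocked by WAF"  # the request never gets as far as robots.txt
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    if parser.can_fetch(bot_token, url):
        return "allowed"
    return "disallowed by robots.txt (voluntary)"
```

With this setup, `effective_access("OAI-SearchBot", "https://example.com/page")` returns `"blocked by WAF"` even though robots.txt says Allow, which is exactly the silent exclusion to audit for. `GPTBot` is only disallowed voluntarily, and a bot with no matching group, such as `PerplexityBot`, is allowed.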
05 — The llms.txt Myth
What llms.txt is not.
One of the most persistent misconceptions in this space is that adding an /llms.txt file gives you control over AI training. It does not. The llms.txt specification defines a Markdown file that helps an LLM efficiently navigate your site's content during a user session. It is an inference-time convenience, the equivalent of a curated sitemap written for a model rather than a search engine. It carries no access or training permissions whatsoever.
Keep the mental model clean: robots.txt governs crawl and access permissions; llms.txt governs how a model finds its way around once it is already reading your pages. Publishing llms.txt is a worthwhile move for AI-search visibility and answer quality — but if your goal is to opt out of training, llms.txt does nothing for you and robots.txt plus a WAF rule does everything. For the file format and how to structure it for inference-time navigation, see our companion guide to the llms.txt specification, and for the foundational access-control mechanics, the robots.txt and meta robots reference.
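To see why llms.txt cannot act as an opt-out, it helps to look at what such a file contains. The sketch below renders a minimal file in the general shape of the llms.txt proposal (a title, a short summary, and sections of links). The helper name, site name, and example.com URLs are all hypothetical.

```python
def render_llms_txt(title: str, summary: str, sections: dict) -> str:
    """Render a minimal llms.txt-style Markdown document."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines)


# Hypothetical site content, for illustration only.
DOC = render_llms_txt(
    "Example Docs",
    "Product documentation for a hypothetical SaaS.",
    {"Guides": [("Quickstart", "https://example.com/quickstart.md", "setup in five minutes")]},
)
```

The output is a title, a summary, and a link list. There is no user-agent field and no allow or disallow directive anywhere in it, so nothing in the file can express a crawl or training permission.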
/llms.txt does not opt your site out of AI training. It is a navigation guide for inference, full stop. If a tool or vendor implies otherwise, treat that as a red flag.
06 — The Legal Front
CCBot is now an IP dispute vector.
Common Crawl's CCBot has historically been treated as a passive archiver — a non-profit whose corpus happens to underpin nearly every major LLM, from GPT-class models to LLaMA and Mistral. That framing changed on April 29, 2026, when the News/Media Alliance sent a formal demand letter to Common Crawl's executive director, calling for removal of publisher content, revised terms explicitly prohibiting AI training use, and enforceable opt-out mechanisms. Signatories included NBCUniversal, CNN, McClatchy, Vox Media, Ziff Davis, and USA Today.
The practical takeaway is that blocking CCBot is no longer purely a technical-hygiene decision. For publishers, it has become a precautionary intellectual-property position — increasingly taken on legal advice — because the corpus is now contested ground. If your content has commercial value as licensable IP, disallowing CCBot in robots.txt is the documented opt-out, and doing so early establishes a clear record of intent.
07 — The Configuration
The defensible 2026 default.
Here is the configuration that follows from the matrix: block the training crawlers in robots.txt, allow the search and retrieval crawlers, and back it with a WAF layer for the bots that do not respect the file. Adjust per-site — a documentation-heavy SaaS may weigh AI-search visibility more heavily than a paywalled publisher guarding licensable IP — but this is the sensible starting point.
Training crawlers
GPTBot, ClaudeBot, CCBot, Google-Extended, Applebot-Extended, Amazonbot. These collect for model training and give nothing back. robots.txt is the documented opt-out for all of them.
Search crawlers
OAI-SearchBot, Claude-SearchBot, PerplexityBot, Perplexity-User, Amzn-SearchBot. These index you for AI answers and send referral traffic. Keep them in.
Non-compliant bots
Bytespider, plus anything your logs show ignoring robots.txt. A WAF rule runs before robots.txt, so it actually enforces. Layer IP-range blocks where vendors publish them.
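The three groups above translate into a robots.txt file fairly mechanically. The sketch below generates one and checks it with Python's standard `urllib.robotparser`. Treat it as a starting point under stated assumptions: the bot lists mirror this guide, Bytespider is included in the file only as a courtesy signal since it still needs a WAF rule, and the `build_robots_txt` and `allowed` helpers are illustrative names.

```python
from urllib.robotparser import RobotFileParser

# Tokens as listed in this guide; adjust per site.
BLOCK_TRAINING = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended",
                  "Applebot-Extended", "Amazonbot", "Bytespider"]
ALLOW_SEARCH = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
                "Perplexity-User", "Amzn-SearchBot"]


def build_robots_txt() -> str:
    """Emit one group per bot: Disallow for training, Allow for search."""
    groups = [f"User-agent: {bot}\nDisallow: /\n" for bot in BLOCK_TRAINING]
    groups += [f"User-agent: {bot}\nAllow: /\n" for bot in ALLOW_SEARCH]
    # Every other crawler, including regular search engines, keeps default access.
    groups.append("User-agent: *\nAllow: /\n")
    return "\n".join(groups)


def allowed(robots_txt: str, bot: str, url: str = "https://example.com/") -> bool:
    """Ask a standards-following parser whether the bot may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


ROBOTS_TXT = build_robots_txt()
```

Running `allowed(ROBOTS_TXT, "GPTBot")` gives `False` and `allowed(ROBOTS_TXT, "OAI-SearchBot")` gives `True`, which is the training/search split in executable form. The same check is worth running after every robots.txt change, before deploying it.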
A few operational notes. Place the most specific user-agent rules first; some crawlers match the longest applicable directive, others the first. Keep the deprecated Anthropic agents (anthropic-ai, Claude-Web) out of your file: they are no longer active, and citing them gives readers and tools broken instructions. And remember that Pay Per Crawl, Cloudflare's HTTP 402 "pay-to-crawl" model, which Cloudflare announced in private beta on July 1, 2025 and which reportedly reached general availability in August 2025, is now a third path beyond block-or-allow: charge bots a micro-fee for access rather than refusing them outright.
08 — Implications
What this means for your team.
The crawler decision is not one-size-fits-all — it depends on what your content is worth as training data versus how much you stand to gain from AI-search referral. Four common profiles, four different calls.
Licensable IP
Block all training crawlers including CCBot as a precautionary IP position, and consider Pay Per Crawl. Allow search bots so you keep citation visibility. This is the two-track posture the News/Media Alliance (NMA) letter is pushing toward.
AI-search visibility
Lean toward allowing search and retrieval crawlers aggressively — AI answers are a discovery channel for documentation. Still block training bots, and publish an llms.txt to improve how models navigate your docs.
Conversion-led
AI referral traffic reportedly converts well, so allow search crawlers and measure the channel in GA4. Block training bots that scrape catalog and pricing data without sending shoppers back.
No policy today
If you have no AI crawler policy at all, you are in the ~97% majority and almost certainly being trained on by default. Start with the robots.txt block-training / allow-search config, then add a WAF layer for Bytespider.
Whichever profile fits, the mechanics are the same: separate training from search, enforce at the right layer, and measure the AI-referral channel so the decision is data-led rather than reflexive. If you want this configured and monitored as part of a broader technical-SEO program — robots.txt, WAF rules, llms.txt, and AI-search measurement in one engagement — that is exactly the work our agentic SEO service is built around, and crawler governance fits naturally alongside an AI transformation roadmap.
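Measuring the AI-referral channel can start with nothing more than classifying referrer hostnames from your logs or analytics export. The sketch below is one way to do that. The host list is an assumption and deliberately short; confirm which referrers actually appear in your own data before relying on it.

```python
from urllib.parse import urlparse

# Assumed referrer hosts for AI assistants; verify against your own analytics.
AI_REFERRER_HOSTS = {
    "chatgpt.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "claude.ai": "Claude",
}


def ai_source(referrer: str):
    """Return the assistant name for an AI referrer URL, else None."""
    host = (urlparse(referrer).hostname or "").lower()
    for domain, name in AI_REFERRER_HOSTS.items():
        # Match the domain itself or any subdomain such as www.
        if host == domain or host.endswith("." + domain):
            return name
    return None


def count_ai_referrals(referrers: list) -> dict:
    """Tally visits per assistant, ignoring non-AI and empty referrers."""
    counts = {}
    for ref in referrers:
        name = ai_source(ref)
        if name:
            counts[name] = counts.get(name, 0) + 1
    return counts
```

Comparing these counts with per-bot crawl volume from the same period gives a site-specific crawl-to-referral picture, which is the data a block-or-allow decision should rest on.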
09 — Conclusion
The blanket block is the expensive mistake.
Block training crawlers, keep your AI-search citations — they are different bots now.
The single most consequential shift in AI crawler control is that the major vendors split training from search. GPTBot is not OAI-SearchBot. ClaudeBot is not Claude-SearchBot. Amazonbot is not Amzn-SearchBot. That split is what makes a precise policy possible — and what makes the old block-everything advice a quiet, self-inflicted loss of visibility.
The economics back the precise approach. Cloudflare's data shows training driving the overwhelming majority of AI bot load, with crawl-to-referral ratios that, for some vendors, run into the tens of thousands of pages per visitor. Blocking the training half costs you almost nothing in referral traffic; keeping the search half open preserves the fastest-growing discovery channel of the year. The two decisions are finally separable, so make them separately.
Set the policy now. The defensible 2026 default is straightforward: disallow the training crawlers in robots.txt, allow the search and retrieval crawlers, enforce Bytespider and other non-compliant bots at the WAF, and treat llms.txt as the inference-navigation aid it is — never as a training opt-out. Roughly 97% of top sites still have no policy at all. Drawing the line cleanly this year is the cheap move that compounds.