Most of what gets written about AI crawlers is downstream of vendor blog posts and rumor. The actual behavior — how often GPTBot revisits, where ClaudeBot prefers to go, what PerplexityBot does when nobody is asking it anything — lives in your server access logs. So we pulled the logs.

Twelve production sites participated in this study: four B2B SaaS properties, three ecommerce stores, three agency websites, and two publishers. Site sizes ranged from 380 indexed pages to roughly 48,000. The window was thirty consecutive days from late March to late April 2026. We tracked eleven canonical AI user-agents, normalized for prefetch and back-fill bursts, and compared the traffic against the corresponding Googlebot baseline on each site.

The findings are consistent enough across verticals that they should change how SEO and platform teams plan crawl-budget, content-refresh, and edge-cache strategy in the second half of 2026. The single biggest surprise is not the volume — it is the divergence in behavior between bots that all look similar from the outside. GPTBot, ClaudeBot, and PerplexityBot crawl your site for fundamentally different reasons, on different cadences, with different path preferences. That divergence is what this post documents.

Key takeaways

01
GPTBot is the most aggressive AI crawler in the wild — by a 2-4× margin.Median 4,200 hits per site per day across the 12-site sample. ClaudeBot trails at 1,800, PerplexityBot at 980, Google-Extended at 540. GPTBot revisits high-traffic pages every 2.4 days; everyone else is slower. If you publish often, GPTBot will see it first.
02
Each bot has a distinct crawl shape — they are not interchangeable.GPTBot crawls breadth-first and prefers /blog/, /docs/, /about/. ClaudeBot crawls depth-first (avg depth 5.2 vs GPTBot 3.8) and prefers /docs/ and /api/. PerplexityBot crawls only when a user query references your domain — quiet baseline, 200+ requests/min in viral bursts. Google-Extended quietly mirrors Googlebot's index footprint.
03
Robots.txt compliance is 100% across the four major frontier bots.GPTBot, ClaudeBot, Google-Extended, PerplexityBot, and OAI-SearchBot all respected robots.txt directives in 30 days of logs across 12 sites. Bytespider and cohere-ai showed 96-99% compliance — minor disregard during back-fill crawls. The compliance story is materially better than two years ago.
04
AI crawlers consume real server CPU on dynamically-rendered sites.GPTBot alone accounts for 14% of total server CPU on small sites (<5K pages); the four major AI bots combined account for 21-37% of server CPU on small/medium sites. Static-export Next.js / Astro deployments behind Vercel-class edge caches see zero meaningful CPU impact. Legacy WordPress with no edge cache sees 2-4× CPU spikes during AI crawl bursts.
05
Shadow crawl — ChatGPT-User and Perplexity-User — bypasses robots.txt and is not blockable.These user-agents fetch at request-time on behalf of a real user query. Robots.txt does not apply (the user is the entity making the request). Median ~690 hits per site per day across the two combined. You cannot block these without breaking real-user functionality. Treat them as part of the read traffic from your AI-mediated audience.

01 — The ThesisServer logs are the missing data layer for agentic SEO.

Search Console, Bing Webmaster, and the Yandex equivalent all report the legacy crawler footprint clearly. None of them report on AI crawlers. The vendors themselves publish little or no usage data, and the third-party panels that exist are too small and too US-east-coast-publisher-skewed to translate to production agency workloads.

That leaves server access logs — the one ground-truth source that sees every user-agent string, every path, every status code, every byte transferred. They are also the source most teams have stopped paying attention to. Modern logging stacks (Vercel, Cloudflare, Datadog, Axiom) keep the data; few SEO teams query it. The thesis here is simple: if you want to plan for agentic search in 2026, the access log is the tool that already tells you what matters.

"The vendors won't tell you. The panels can't tell you. Your access log already knows."— Internal SEO operations note, Apr 2026

02 — MethodologyTwelve sites, thirty days, eleven user-agents.

The sample frame was selected for vertical mix and infrastructure mix, not for traffic size. Four B2B SaaS sites, three ecommerce stores, three agency websites, and two publishers participated. Indexed-page counts ranged from 380 (a niche SaaS landing site) to roughly 48,000 (a mid-size publisher). All twelve sites are real production properties; none are honeypots.

Identification ran on the user-agent string with secondary verification by reverse DNS for GPTBot, ClaudeBot, Google-Extended, and PerplexityBot (the four bots that publish verifiable IP ranges). User-triggered fetches were separated from scheduled crawls by user-agent (ChatGPT-User and Perplexity-Userare explicit) and corroborated by request-rate signature. Bot bursts during a single back-fill were normalized into per-day medians to avoid skewing one site's numbers based on a one-time event.

Sites

Production properties

4 B2B SaaS, 3 ecommerce, 3 agencies, 2 publishers. Indexed page counts from 380 to ~48,000. Mixed infrastructure: Next.js on Vercel, WordPress on managed hosts, Shopify, custom Astro, custom backends.

Vertical mix

Window

30days

Mar 24 – Apr 23, 2026

Thirty consecutive days. Excluded the seven days bracketing major OpenAI / Anthropic / Perplexity product announcements to avoid one-week back-fill spikes overwhelming baseline.

Mar–Apr 2026

Bots

User-agents tracked

GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-Web, Google-Extended, PerplexityBot, Perplexity-User, Bytespider, cohere-ai, DiffBot. Identification by user-agent + reverse-DNS where available.

Verified UAs

03 — VolumeDaily hits per site, by crawler.

The volume picture is the headline finding. GPTBot is the most aggressive single bot on every site we measured. ClaudeBot is a distant second on B2B SaaS and agency sites; on ecommerce, the ranking flips and Bytespider becomes the highest-volume bot of all (with very aggressive product-page coverage). PerplexityBot looks small in the median but is the most variable — quiet for days, then spiking on a viral query.

Daily hits per site, by crawler · 30-day median

Source: 30-day median across 12 production sites · Mar 24 – Apr 23, 2026

GPTBotOpenAI training & index · breadth-first

4,200

most aggressive

BytespiderByteDance / TikTok AI · spikes on ecommerce

2,100

ClaudeBotAnthropic training · depth-first

1,800

OAI-SearchBotOpenAI search-time fetch

1,400

PerplexityBotPerplexity index · on-demand bursts

980

Google-ExtendedGoogle AI training opt-in · steady baseline

540

Perplexity-UserUser-triggered, robots-exempt

410

Claude-WebAnthropic search & citations

320

ChatGPT-UserUser-triggered, robots-exempt

280

cohere-aiCohere · low-volume background

A few notes on what the chart hides. The Bytespider line is the median across all twelve sites; on the three ecommerce stores the number is closer to 6,500/day, with intense product-page coverage. The Google-Extended line is steady and small because Google-Extended explicitly opts into training only; the full AI footprint from Google reaches you through Googlebot and the AI Overviews path. And the PerplexityBot median understates the spike risk — the upper-decile minute saw 240 requests per minute when a viral query referenced one of the publisher sites.

04 — BehaviorFour bots, four crawl personalities.

The behavior split is sharper than the volume split. Volume tells you who is showing up; behavior tells you why. The four major bots cluster cleanly into four crawl personalities, each with distinct implications for what you should publish, what you should cache, and what you should write into your robots.txt and llms.txt files.

Bot 1

GPTBot — aggressive breadth-first indexer

4,200 hits/day · revisit 2.4d · avg depth 3.8 · respects robots 100%

The most aggressive crawler in the wild. Prefers /blog/, /docs/, /about/ — anything text-heavy and cite-able. Revisits high-traffic pages every 2.4 days median, faster (47% faster revisit) for pages with a fresh last-modified header. Will index fast. Treat GPTBot as your primary AI-search distribution channel — keep llms.txt fresh, return clean canonical headers, and serve fast.

Breadth-first · cite-able content

Bot 2

ClaudeBot — patient depth-first specialist

1,800 hits/day · revisit 6.8d · avg depth 5.2 · respects robots 100%

Slower, deeper, more selective. Crawls deeper paths than GPTBot (avg depth 5.2 vs 3.8) and prefers /docs/ and /api/ paths. Revisits roughly every 6.8 days — patient cadence. Strong signal that ClaudeBot is optimized for high-quality technical content, not freshness coverage.

Depth-first · technical content

Bot 3

PerplexityBot — on-demand, query-driven

980 hits/day · revisit minutes · spikes 200+/min on viral queries

Almost no scheduled background crawl. Fetches the site only when a user query references the domain. Quiet baseline, then burst — we observed up to 240 requests per minute on a publisher site when a query went viral. Edge-cache and rate-limit defenses matter for PerplexityBot in a way they do not for GPTBot.

On-demand · burst-prone

Bot 4

Google-Extended — steady AI training opt-in

540 hits/day · revisit 14d · mirrors Googlebot footprint

Low and steady. Mostly fetches URLs that Googlebot has already indexed, on a much slower cadence (median 14 days). Google-Extended is opt-in only for AI training — the full AI integration with Google's search footprint reaches you through Googlebot itself, then surfaces in AI Overviews. Treat Google-Extended as a separate, smaller signal.

Steady · opt-in training

"GPTBot will see your new post inside 48 hours. ClaudeBot will see it inside a week. PerplexityBot will only see it when somebody asks."— Field note, Apr 2026

05 — Server loadWhat AI crawlers cost in CPU.

The cost story depends almost entirely on architecture. On statically-exported Next.js or Astro sites behind Vercel, Cloudflare, or Fastly edge caches, AI crawlers add zero meaningful CPU cost — every fetch hits the cache. On dynamically-rendered sites with no edge cache (a large fraction of the WordPress, Magento, and custom-PHP installed base), the same crawl traffic shows up as a 2-4× CPU spike during peak crawler hours.

Small sites

14%

GPTBot share of CPU

On sites under 5,000 indexed pages, GPTBot alone accounts for 14% of total server CPU on average. Combined with ClaudeBot, PerplexityBot, and Google-Extended, the four major bots account for 21–37% of CPU on the small/medium sites in our sample.

<5K pages

Large sites

GPTBot share on big sites

On sites over 30,000 indexed pages, GPTBot share drops to roughly 3% of CPU because the bot does not crawl proportionally faster on bigger sites. Crawl rate scales sub-linearly with site size — you do not get punished for being large.

>30K pages

Spike risk

2-4×

Dynamic-render spike

On WordPress / Magento / custom-PHP sites with no edge cache, AI crawler bursts produce 2–4× CPU spikes vs baseline. Static-export sites behind Vercel/Cloudflare see zero meaningful CPU impact — the cache absorbs everything.

Dynamic vs static

The cheap fix

The single highest-leverage change for any team running on a dynamically-rendered stack is putting an edge cache in front of the document HTML. Vercel, Cloudflare, Fastly — any of them. The cost story for AI crawlers collapses to zero once a CDN is serving cached HTML for the routes that bots actually crawl. If you cannot move to a static export, at least cache the HTML for your blog, docs, and high-traffic landing pages.

06 — Paths & cadenceWhat each bot actually fetches — and how often.

Path-preference data is the lever most teams ignore. Each bot has a distinct preference distribution across your URL space, and the preferences map cleanly onto each bot's end goal. GPTBot wants long-form text it can summarize; ClaudeBot wants technical reference material it can answer questions from; PerplexityBot wants commercial pages users compare; Google-Extended wants whatever Googlebot has already touched.

GPTBot

Prefers /blog/, /docs/, /about/ — text-heavy and cite-able

Crawls breadth-first across the site. Heavily favors long-form text content with stable canonical URLs and freshness signals (last-modified header, sitemap lastmod). Revisits high-traffic pages every 2.4 days median, dropping to 1.6 days when a fresh last-modified is detected. Treat /blog/ and /docs/ as the GPTBot first-class surface.

Refresh weekly

ClaudeBot

Prefers /docs/, /api/, technical paths

Patient depth-first crawler. Goes deeper into the site than any other bot — average crawl depth 5.2 vs GPTBot 3.8. Strongly biased toward /docs/, /api/, and reference material. Revisit cadence is slower (median 6.8 days), so treat content for ClaudeBot as a stock investment, not a flow.

Refresh monthly

PerplexityBot

Prefers /, /pricing/, comparison pages

Crawls only when a user query references your domain. The path mix skews to homepage, /pricing/, /vs/ comparison pages, and any URL that surfaces in a comparative answer. Cadence is on-demand — minutes from query to fetch. Burst-prone: 200+ requests/minute on viral queries.

Cache aggressively

Google-Extended

Mirrors Googlebot's existing index footprint

Steady, slow, opt-in. Fetches URLs that Googlebot has already indexed, on a 14-day median revisit. Path preferences mirror Googlebot. Easy operational profile — if you are already optimized for Googlebot, Google-Extended takes care of itself.

No special action

The cadence ranking is its own finding. PerplexityBot has the fastest fetch (minutes from query to crawl, on-demand only). GPTBot revisits high-traffic pages every 2.4 days. Bytespider crawls commercial pages every 1.8 days on retail sites. ClaudeBot sits at 6.8 days, and Google-Extended at 14 days. The implication for content strategy is direct: fresh content gets distributed through GPTBot first, through ClaudeBot inside two weeks, and through Google-Extended on a quarterly cycle. Plan publishing cadence accordingly.

07 — Shadow crawlThe traffic you cannot block.

Two of the eleven user-agents we tracked are not subject to robots.txt: ChatGPT-User and Perplexity-User. Both fetch on behalf of a real user query — somebody asking ChatGPT or Perplexity a question, and the assistant fetching your page to answer it. Because the user is the entity making the request, robots.txt does not apply, and any attempt to block these user-agents will break real-user functionality without stopping training data collection (which the assistant already has from the scheduled bots).

The shadow-crawl problem

Median ChatGPT-User + Perplexity-User combined volume is roughly 690 hits per site per day in this sample. That is real read traffic from your AI-mediated audience — the audience asking an assistant about your category. Blocking these user-agents removes you from the answer surface; treating them as a normal read class (with appropriate caching and rate-limiting on the edge) is the only viable posture. They are part of the modern read pipeline, not a crawler problem to eliminate.

The operational read is to stop thinking of AI traffic as "crawler" and start thinking of it as a two-tier system: scheduled bots that you can shape with robots.txt and llms.txt, plus shadow-crawl traffic that you treat exactly like human read traffic — fast caches, sane rate-limits, structured answers in the page source so the assistant has something good to quote.

08 — PlaybookWhat to do on Monday.

The findings translate into a small number of concrete moves. None of them are revolutionary — most of them are the same disciplines you would apply for traditional crawl-budget management. The difference is the prioritization: with AI crawlers in the picture, the cost of getting the basics wrong is higher because the surface that depends on the basics is larger and growing.

Move 1

Keep llms.txt and AGENTS.md fresh

GPTBot revisits high-traffic pages every 2.4 days and respects last-modified. Treat llms.txt as a publishable artifact — update when you ship a new product, a new pricing page, or a positioning shift. Stale llms.txt becomes the wrong cite faster than you would expect.

Weekly cadence

Move 2

Cache HTML at the edge

If your stack does not already serve HTML from a CDN, this is the highest-leverage single change. AI crawler load disappears entirely behind a Vercel/Cloudflare/Fastly cache. Static export is the cleanest answer; Cache-Control headers on dynamic responses are an acceptable second.

Vercel / Cloudflare

Move 3

Treat AI crawl as a budget

Every site has a finite share of bot attention. Allocate it deliberately: surface the URLs that should be indexed (canonical, sitemap, structured data), de-emphasize the URLs that should not (faceted-search, parameter sprawl, archive duplicates), and keep your sitemap lastmod honest. The same hygiene that helped Googlebot helps GPTBot and ClaudeBot.

Sitemap discipline

Move 4

Don't try to block shadow crawl

ChatGPT-User and Perplexity-User are user-triggered fetches that bypass robots.txt by design. Blocking them removes you from the answer surface. Instead, make sure those fetches return clean, structured, fast responses. They are read traffic from your AI-mediated audience.

Optimize, don't block

"Optimize for the bots that respect robots, cache for the bots that do not, and stop trying to block the user-triggered ones."— Crawler-strategy summary, Apr 2026

09 — ConclusionServer logs are the new search console.

Crawler operations, April 2026

Pull the logs. Read the bots. Plan the answer.

Two years ago, the right SEO operating loop ran from Search Console down to a sitemap, with a robots.txt as a guard rail. That loop still works for Googlebot, but it does not capture the AI surface. The new operating loop runs from server logs down to a llms.txt, with edge caching as the guard rail and shadow crawl treated as a first-class read tier.

The data in this study should reset some priors. GPTBot is more aggressive than most teams assume; ClaudeBot is more patient than the volume suggests; PerplexityBot is quieter than its share-of-voice would predict; and Google-Extended is so steady that it disappears from most monitoring dashboards. All four respect robots.txt. None of them respect the assumption that AI crawl is the same problem as Googlebot crawl.

For agency and platform teams, the practical move is to stand up a recurring server-log dashboard against these eleven user-agents, watch the cadence and path-preference deltas month-over-month, and tune llms.txt + edge cache + content calendar accordingly. The teams that do this in 2026 will own the agentic-search surface for their categories. The teams that do not will keep being told what AI crawlers do by vendors who have a reason not to tell them everything.

Agentic Crawler Behavior: 30-Day Site Log Study 2026