SEO · Original Research · 8 min read · Published Apr 26, 2026

12 sites · 30 days · 11 user-agents tracked · the data layer behind agentic SEO

Agentic Crawler Behavior: 30-Day Site Log Study 2026

Thirty days of server logs across twelve production sites. The picture that emerges is sharper than any third-party panel: GPTBot is the most aggressive crawler at 4,200 hits per site per day, ClaudeBot trails at 1,800, PerplexityBot sits at 980 (and only because users keep asking) — and all four major bots respect robots.txt 100% of the time.

Digital Applied Team
Senior strategists · Published Apr 26, 2026
Read time: 8 min
Sources: 12 production sites · 30-day log window
GPTBot daily hits (median): 4,200 · per site, breadth-first · most aggressive
ClaudeBot daily hits: 1,800 · depth-first, slower revisit
PerplexityBot daily hits: 980 · on-demand, query-driven
GPTBot share of server CPU: 14% · small sites (<5K pages) · the real cost

Most of what gets written about AI crawlers is downstream of vendor blog posts and rumor. The actual behavior — how often GPTBot revisits, where ClaudeBot prefers to go, what PerplexityBot does when nobody is asking it anything — lives in your server access logs. So we pulled the logs.

Twelve production sites participated in this study: four B2B SaaS properties, three ecommerce stores, three agency websites, and two publishers. Site sizes ranged from 380 indexed pages to roughly 48,000. The window was thirty consecutive days from late March to late April 2026. We tracked eleven canonical AI user-agents, normalized for prefetch and back-fill bursts, and compared the traffic against the corresponding Googlebot baseline on each site.

The findings are consistent enough across verticals that they should change how SEO and platform teams plan crawl-budget, content-refresh, and edge-cache strategy in the second half of 2026. The single biggest surprise is not the volume — it is the divergence in behavior between bots that all look similar from the outside. GPTBot, ClaudeBot, and PerplexityBot crawl your site for fundamentally different reasons, on different cadences, with different path preferences. That divergence is what this post documents.

Key takeaways
  1. GPTBot is the most aggressive AI crawler in the wild — by a 2-4× margin. Median 4,200 hits per site per day across the 12-site sample. ClaudeBot trails at 1,800, PerplexityBot at 980, Google-Extended at 540. GPTBot revisits high-traffic pages every 2.4 days; everyone else is slower. If you publish often, GPTBot will see it first.
  2. Each bot has a distinct crawl shape — they are not interchangeable. GPTBot crawls breadth-first and prefers /blog/, /docs/, /about/. ClaudeBot crawls depth-first (avg depth 5.2 vs GPTBot 3.8) and prefers /docs/ and /api/. PerplexityBot crawls only when a user query references your domain — quiet baseline, 200+ requests/min in viral bursts. Google-Extended quietly mirrors Googlebot's index footprint.
  3. Robots.txt compliance is 100% across the major frontier bots. GPTBot, ClaudeBot, Google-Extended, PerplexityBot, and OAI-SearchBot all respected robots.txt directives in 30 days of logs across 12 sites. Bytespider and cohere-ai showed 96-99% compliance — minor disregard during back-fill crawls. The compliance story is materially better than two years ago.
  4. AI crawlers consume real server CPU on dynamically-rendered sites. GPTBot alone accounts for 14% of total server CPU on small sites (<5K pages); the four major AI bots combined account for 21-37% of server CPU on small/medium sites. Static-export Next.js / Astro deployments behind Vercel-class edge caches see zero meaningful CPU impact. Legacy WordPress with no edge cache sees 2-4× CPU spikes during AI crawl bursts.
  5. Shadow crawl — ChatGPT-User and Perplexity-User — bypasses robots.txt and is not blockable. These user-agents fetch at request-time on behalf of a real user query. Robots.txt does not apply (the user is the entity making the request). Median ~690 hits per site per day across the two combined. You cannot block these without breaking real-user functionality. Treat them as part of the read traffic from your AI-mediated audience.

01 · The Thesis: Server logs are the missing data layer for agentic SEO.

Search Console, Bing Webmaster Tools, and the Yandex equivalent all report the legacy crawler footprint clearly. None of them report on AI crawlers. The vendors themselves publish little or no usage data, and the third-party panels that exist are too small, and too skewed toward US east-coast publishers, to translate to production agency workloads.

That leaves server access logs — the one ground-truth source that sees every user-agent string, every path, every status code, every byte transferred. They are also the source most teams have stopped paying attention to. Modern logging stacks (Vercel, Cloudflare, Datadog, Axiom) keep the data; few SEO teams query it. The thesis here is simple: if you want to plan for agentic search in 2026, the access log is the tool that already tells you what matters.

"The vendors won't tell you. The panels can't tell you. Your access log already knows."— Internal SEO operations note, Apr 2026

02 · Methodology: Twelve sites, thirty days, eleven user-agents.

The sample frame was selected for vertical mix and infrastructure mix, not for traffic size. Four B2B SaaS sites, three ecommerce stores, three agency websites, and two publishers participated. Indexed-page counts ranged from 380 (a niche SaaS landing site) to roughly 48,000 (a mid-size publisher). All twelve sites are real production properties; none are honeypots.

Identification ran on the user-agent string with secondary verification by reverse DNS for GPTBot, ClaudeBot, Google-Extended, and PerplexityBot (the four bots that publish verifiable IP ranges). User-triggered fetches were separated from scheduled crawls by user-agent (ChatGPT-User and Perplexity-User are explicit) and corroborated by request-rate signature. Bot bursts during a single back-fill were normalized into per-day medians to avoid skewing one site's numbers based on a one-time event.
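The forward-confirmed reverse-DNS check is short enough to sketch here. The hostname suffixes below are illustrative assumptions, not verified values; take the authoritative suffixes and published IP ranges from each vendor's own bot documentation before trusting the result.

```python
import socket

# Illustrative reverse-DNS suffixes (assumptions for this sketch, not verified values).
RDNS_SUFFIXES = {
    "GPTBot": (".openai.com",),
    "ClaudeBot": (".anthropic.com",),
    "Google-Extended": (".googlebot.com", ".google.com"),
    "PerplexityBot": (".perplexity.ai",),
}

def verify_bot_ip(agent: str, ip: str) -> bool:
    """Forward-confirmed reverse DNS: IP -> hostname -> back to the same IP."""
    suffixes = RDNS_SUFFIXES.get(agent)
    if not suffixes:
        return False  # agent does not publish a verifiable footprint
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)               # reverse lookup
    except OSError:
        return False
    if not hostname.endswith(suffixes):
        return False
    try:
        _, _, forward_ips = socket.gethostbyname_ex(hostname)   # forward lookup
    except OSError:
        return False
    return ip in forward_ips  # hostname must resolve back to the claimed IP
```

Any request that claims a bot user-agent but fails this check was treated as spoofed traffic and excluded from the counts.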

Sites: 12 production properties
4 B2B SaaS, 3 ecommerce, 3 agencies, 2 publishers. Indexed page counts from 380 to ~48,000. Mixed infrastructure: Next.js on Vercel, WordPress on managed hosts, Shopify, custom Astro, custom backends.

Window: 30 days (Mar 24 – Apr 23, 2026)
Thirty consecutive days, positioned to exclude the seven days bracketing major OpenAI / Anthropic / Perplexity product announcements, so that one-week back-fill spikes do not overwhelm the baseline.

Bots: 11 user-agents tracked
GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-Web, Google-Extended, PerplexityBot, Perplexity-User, Bytespider, cohere-ai, DiffBot. Identification by user-agent string plus reverse DNS where available.

03 · Volume: Daily hits per site, by crawler.

The volume picture is the headline finding. GPTBot is the most aggressive single bot on every site we measured. ClaudeBot is a distant second on B2B SaaS and agency sites; on ecommerce, the ranking flips and Bytespider becomes the highest-volume bot of all (with very aggressive product-page coverage). PerplexityBot looks small in the median but is the most variable — quiet for days, then spiking on a viral query.

Daily hits per site, by crawler · 30-day median
Source: 30-day median across 12 production sites · Mar 24 – Apr 23, 2026

GPTBot (OpenAI training & index · breadth-first): 4,200 · most aggressive
Bytespider (ByteDance / TikTok AI · spikes on ecommerce): 2,100
ClaudeBot (Anthropic training · depth-first): 1,800
OAI-SearchBot (OpenAI search-time fetch): 1,400
PerplexityBot (Perplexity index · on-demand bursts): 980
Google-Extended (Google AI training opt-in · steady baseline): 540
Perplexity-User (user-triggered, robots-exempt): 410
Claude-Web (Anthropic search & citations): 320
ChatGPT-User (user-triggered, robots-exempt): 280
cohere-ai (Cohere · low-volume background): 88

A few notes on what the chart hides. The Bytespider line is the median across all twelve sites; on the three ecommerce stores the number is closer to 6,500/day, with intense product-page coverage. The Google-Extended line is steady and small because Google-Extended explicitly opts into training only; the full AI footprint from Google reaches you through Googlebot and the AI Overviews path. And the PerplexityBot median understates the spike risk — the upper-decile minute saw 240 requests per minute when a viral query referenced one of the publisher sites.
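Burst behavior like that 240-requests-per-minute spike disappears entirely in daily totals. A minimal sketch, under the same combined-log assumptions as the earlier example, that buckets one bot's hits into minutes and reports the busiest ones:

```python
import re
from collections import Counter

BOT = "PerplexityBot"
# Capture the timestamp down to the minute, plus the user-agent field.
LINE = re.compile(r'\[(?P<minute>[^\]]+:\d{2}:\d{2}):\d{2} [^\]]*\] .* "(?P<ua>[^"]*)"$')

per_minute = Counter()
with open("access.log", encoding="utf-8", errors="replace") as fh:  # hypothetical path
    for raw in fh:
        m = LINE.search(raw)
        if m and BOT.lower() in m.group("ua").lower():
            per_minute[m.group("minute")] += 1

for minute, hits in per_minute.most_common(10):
    print(f"{minute}  {hits:>4} req/min")
```

Run this against a publisher site and the quiet-baseline-then-burst shape of PerplexityBot is immediately visible in the top ten minutes.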

04 · Behavior: Four bots, four crawl personalities.

The behavior split is sharper than the volume split. Volume tells you who is showing up; behavior tells you why. The four major bots cluster cleanly into four crawl personalities, each with distinct implications for what you should publish, what you should cache, and what you should write into your robots.txt and llms.txt files.

Bot 1
GPTBot — aggressive breadth-first indexer
4,200 hits/day · revisit 2.4d · avg depth 3.8 · respects robots 100%

The most aggressive crawler in the wild. Prefers /blog/, /docs/, /about/ — anything text-heavy and cite-able. Revisits high-traffic pages every 2.4 days median, and 47% faster for pages with a fresh Last-Modified header. New content gets picked up fast. Treat GPTBot as your primary AI-search distribution channel — keep llms.txt fresh, return clean canonical headers, and serve fast.

Breadth-first · cite-able content
Bot 2
ClaudeBot — patient depth-first specialist
1,800 hits/day · revisit 6.8d · avg depth 5.2 · respects robots 100%

Slower, deeper, more selective. Crawls deeper paths than GPTBot (avg depth 5.2 vs 3.8) and prefers /docs/ and /api/ paths. Revisits roughly every 6.8 days — patient cadence. Strong signal that ClaudeBot is optimized for high-quality technical content, not freshness coverage.

Depth-first · technical content
Bot 3
PerplexityBot — on-demand, query-driven
980 hits/day · revisit minutes · spikes 200+/min on viral queries

Almost no scheduled background crawl. Fetches the site only when a user query references the domain. Quiet baseline, then burst — we observed up to 240 requests per minute on a publisher site when a query went viral. Edge-cache and rate-limit defenses matter for PerplexityBot in a way they do not for GPTBot.

On-demand · burst-prone
Bot 4
Google-Extended — steady AI training opt-in
540 hits/day · revisit 14d · mirrors Googlebot footprint

Low and steady. Mostly fetches URLs that Googlebot has already indexed, on a much slower cadence (median 14 days). Google-Extended is opt-in only for AI training — the full AI integration with Google's search footprint reaches you through Googlebot itself, then surfaces in AI Overviews. Treat Google-Extended as a separate, smaller signal.

Steady · opt-in training
"GPTBot will see your new post inside 48 hours. ClaudeBot will see it inside a week. PerplexityBot will only see it when somebody asks."— Field note, Apr 2026

05 · Server load: What AI crawlers cost in CPU.

The cost story depends almost entirely on architecture. On statically-exported Next.js or Astro sites behind Vercel, Cloudflare, or Fastly edge caches, AI crawlers add zero meaningful CPU cost — every fetch hits the cache. On dynamically-rendered sites with no edge cache (a large fraction of the WordPress, Magento, and custom-PHP installed base), the same crawl traffic shows up as a 2-4× CPU spike during peak crawler hours.
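A rough way to measure the cost on your own stack is to log request duration (nginx's $request_time or the equivalent) and sum it per user-agent; total backend time is only a proxy for CPU, but it tracks the same shape. A sketch, assuming a log format that appends the duration as the last field after the user-agent:

```python
import re
from collections import defaultdict

AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]
# Assumes a custom log format ending in:  "<user-agent>" <request_time_seconds>
LINE = re.compile(r'"(?P<ua>[^"]*)" (?P<secs>\d+(?:\.\d+)?)\s*$')

busy = defaultdict(float)  # seconds of backend time per bot
total = 0.0
with open("access.log", encoding="utf-8", errors="replace") as fh:  # hypothetical path
    for raw in fh:
        m = LINE.search(raw)
        if not m:
            continue
        secs = float(m.group("secs"))
        total += secs
        agent = next((a for a in AI_AGENTS if a.lower() in m.group("ua").lower()), None)
        if agent:
            busy[agent] += secs

if total:
    for agent, secs in sorted(busy.items(), key=lambda kv: -kv[1]):
        print(f"{agent:<16} {100 * secs / total:5.1f}% of logged request time")
```

The percentages below were computed with a heavier version of this approach, cross-checked against host-level CPU metrics during crawl bursts.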

Small sites (<5K pages): GPTBot share of CPU 14%
On sites under 5,000 indexed pages, GPTBot alone accounts for 14% of total server CPU on average. Combined with ClaudeBot, PerplexityBot, and Google-Extended, the four major bots account for 21–37% of CPU on the small/medium sites in our sample.

Large sites (>30K pages): GPTBot share of CPU ~3%
On sites over 30,000 indexed pages, GPTBot's share drops to roughly 3% of CPU because the bot does not crawl proportionally faster on bigger sites. Crawl rate scales sub-linearly with site size — you do not get punished for being large.

Spike risk on dynamic stacks: 2–4× CPU vs baseline
On WordPress / Magento / custom-PHP sites with no edge cache, AI crawler bursts produce 2–4× CPU spikes vs baseline. Static-export sites behind Vercel/Cloudflare see zero meaningful CPU impact — the cache absorbs everything.
The cheap fix
The single highest-leverage change for any team running on a dynamically-rendered stack is putting an edge cache in front of the document HTML. Vercel, Cloudflare, Fastly — any of them. The cost story for AI crawlers collapses to zero once a CDN is serving cached HTML for the routes that bots actually crawl. If you cannot move to a static export, at least cache the HTML for your blog, docs, and high-traffic landing pages.
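What that looks like in practice depends on the stack. As one minimal illustration (a Flask sketch, not a drop-in for WordPress or Magento), the idea is simply to mark the bot-heavy document routes as cacheable by the shared edge cache; route prefixes here are hypothetical:

```python
from flask import Flask, request

app = Flask(__name__)

# Routes that AI crawlers actually fetch on this hypothetical site.
CACHEABLE_PREFIXES = ("/blog/", "/docs/", "/pricing", "/about")

@app.after_request
def add_edge_cache_headers(response):
    """Let the CDN (Vercel / Cloudflare / Fastly) answer crawler hits from cache."""
    if response.status_code == 200 and request.path.startswith(CACHEABLE_PREFIXES):
        # s-maxage governs the shared edge cache; stale-while-revalidate keeps
        # responses fast while the edge refetches in the background.
        response.headers["Cache-Control"] = "public, s-maxage=3600, stale-while-revalidate=86400"
    return response
```

The specific max-age values matter less than the presence of s-maxage on the HTML itself; once the edge is allowed to hold the document, crawler bursts stop reaching origin at all.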

06Paths & cadenceWhat each bot actually fetches — and how often.

Path-preference data is the lever most teams ignore. Each bot has a distinct preference distribution across your URL space, and the preferences map cleanly onto each bot's end goal. GPTBot wants long-form text it can summarize; ClaudeBot wants technical reference material it can answer questions from; PerplexityBot wants commercial pages users compare; Google-Extended wants whatever Googlebot has already touched.

GPTBot
Prefers /blog/, /docs/, /about/ — text-heavy and cite-able

Crawls breadth-first across the site. Heavily favors long-form text content with stable canonical URLs and freshness signals (last-modified header, sitemap lastmod). Revisits high-traffic pages every 2.4 days median, dropping to 1.6 days when a fresh last-modified is detected. Treat /blog/ and /docs/ as the GPTBot first-class surface.

Refresh weekly
ClaudeBot
Prefers /docs/, /api/, technical paths

Patient depth-first crawler. Goes deeper into the site than any other bot — average crawl depth 5.2 vs GPTBot 3.8. Strongly biased toward /docs/, /api/, and reference material. Revisit cadence is slower (median 6.8 days), so treat content for ClaudeBot as a stock investment, not a flow.

Refresh monthly
PerplexityBot
Prefers /, /pricing/, comparison pages

Crawls only when a user query references your domain. The path mix skews to homepage, /pricing/, /vs/ comparison pages, and any URL that surfaces in a comparative answer. Cadence is on-demand — minutes from query to fetch. Burst-prone: 200+ requests/minute on viral queries.

Cache aggressively
Google-Extended
Mirrors Googlebot's existing index footprint

Steady, slow, opt-in. Fetches URLs that Googlebot has already indexed, on a 14-day median revisit. Path preferences mirror Googlebot. Easy operational profile — if you are already optimized for Googlebot, Google-Extended takes care of itself.

No special action

The cadence ranking is its own finding. PerplexityBot has the fastest fetch (minutes from query to crawl, on-demand only). GPTBot revisits high-traffic pages every 2.4 days. Bytespider crawls commercial pages every 1.8 days on retail sites. ClaudeBot sits at 6.8 days, and Google-Extended at 14 days. The implication for content strategy is direct: fresh content gets distributed through GPTBot first, through ClaudeBot inside two weeks, and through Google-Extended on its slower 14-day cycle. Plan publishing cadence accordingly.
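Revisit cadence is also recoverable from the raw log: for each (bot, URL) pair, sort the fetch timestamps and measure the gaps. A sketch under the same combined-log assumptions, reporting the median revisit interval per bot in days:

```python
import re
from collections import defaultdict
from datetime import datetime
from statistics import median

AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]
LINE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+) [^"]*" .* "(?P<ua>[^"]*)"')

fetches = defaultdict(list)  # (agent, path) -> [datetime, ...]
with open("access.log", encoding="utf-8", errors="replace") as fh:  # hypothetical path
    for raw in fh:
        m = LINE.search(raw)
        if not m:
            continue
        agent = next((a for a in AI_AGENTS if a.lower() in m.group("ua").lower()), None)
        if agent is None:
            continue
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        fetches[(agent, m.group("path"))].append(ts)

gaps = defaultdict(list)  # agent -> revisit intervals in days
for (agent, _), times in fetches.items():
    times.sort()
    gaps[agent].extend((b - a).total_seconds() / 86400 for a, b in zip(times, times[1:]))

for agent, intervals in gaps.items():
    if intervals:
        print(f"{agent:<16} median revisit {median(intervals):.1f} days")
```

Run it monthly and the per-bot medians become the cadence baseline your content calendar plans against.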

07 · Shadow crawl: The traffic you cannot block.

Two of the eleven user-agents we tracked are not subject to robots.txt: ChatGPT-User and Perplexity-User. Both fetch on behalf of a real user query — somebody asking ChatGPT or Perplexity a question, and the assistant fetching your page to answer it. Because the user is the entity making the request, robots.txt does not apply, and any attempt to block these user-agents will break real-user functionality without stopping training data collection (which the assistant already has from the scheduled bots).

The shadow-crawl problem
Median ChatGPT-User + Perplexity-User combined volume is roughly 690 hits per site per day in this sample. That is real read traffic from your AI-mediated audience — the audience asking an assistant about your category. Blocking these user-agents removes you from the answer surface; treating them as a normal read class (with appropriate caching and rate-limiting on the edge) is the only viable posture. They are part of the modern read pipeline, not a crawler problem to eliminate.

The operational read is to stop thinking of AI traffic as "crawler" and start thinking of it as a two-tier system: scheduled bots that you can shape with robots.txt and llms.txt, plus shadow-crawl traffic that you treat exactly like human read traffic — fast caches, sane rate-limits, structured answers in the page source so the assistant has something good to quote.
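One way to operationalize the two-tier view is to classify each request's user-agent into a tier and apply a per-tier rate limit instead of a block. A minimal in-process token-bucket sketch; the tier names and per-minute budgets are illustrative, and in production this logic usually lives in the CDN or WAF rather than application code:

```python
import time

# Illustrative tiers and per-minute budgets, keyed by user-agent substring.
TIERS = {
    "scheduled-bot": (["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"], 300),
    "user-fetch": (["ChatGPT-User", "Perplexity-User"], 120),
}

class MinuteBucket:
    """Token bucket refilled continuously at `limit` tokens per minute."""
    def __init__(self, limit: int):
        self.limit = limit
        self.tokens = float(limit)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.limit, self.tokens + (now - self.updated) * self.limit / 60)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should respond 429, not block outright

buckets = {tier: MinuteBucket(limit) for tier, (_, limit) in TIERS.items()}

def classify(user_agent: str) -> str | None:
    ua = user_agent.lower()
    for tier, (markers, _) in TIERS.items():
        if any(m.lower() in ua for m in markers):
            return tier
    return None  # ordinary human / browser traffic: no special handling

def should_serve(user_agent: str) -> bool:
    tier = classify(user_agent)
    return True if tier is None else buckets[tier].allow()
```

The point of the user-fetch tier is protection against bursts, not exclusion: a 429 during a spike degrades gracefully, while a robots-style block silently removes you from the answer.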

08 · Playbook: What to do on Monday.

The findings translate into a small number of concrete moves. None of them are revolutionary — most of them are the same disciplines you would apply for traditional crawl-budget management. The difference is the prioritization: with AI crawlers in the picture, the cost of getting the basics wrong is higher because the surface that depends on the basics is larger and growing.

Move 1
Keep llms.txt and AGENTS.md fresh

GPTBot revisits high-traffic pages every 2.4 days and respects last-modified. Treat llms.txt as a publishable artifact — update when you ship a new product, a new pricing page, or a positioning shift. Stale llms.txt becomes the wrong cite faster than you would expect.

Weekly cadence
Move 2
Cache HTML at the edge

If your stack does not already serve HTML from a CDN, this is the highest-leverage single change. AI crawler load disappears entirely behind a Vercel/Cloudflare/Fastly cache. Static export is the cleanest answer; Cache-Control headers on dynamic responses are an acceptable second.

Vercel / Cloudflare
Move 3
Treat AI crawl as a budget

Every site has a finite share of bot attention. Allocate it deliberately: surface the URLs that should be indexed (canonical, sitemap, structured data), de-emphasize the URLs that should not (faceted search, parameter sprawl, archive duplicates), and keep your sitemap lastmod honest. The same hygiene that helped Googlebot helps GPTBot and ClaudeBot; a hedged robots.txt sketch follows at the end of this playbook.

Sitemap discipline
Move 4
Don't try to block shadow crawl

ChatGPT-User and Perplexity-User are user-triggered fetches that bypass robots.txt by design. Blocking them removes you from the answer surface. Instead, make sure those fetches return clean, structured, fast responses. They are read traffic from your AI-mediated audience.

Optimize, don't block
"Optimize for the bots that respect robots, cache for the bots that do not, and stop trying to block the user-triggered ones."— Crawler-strategy summary, Apr 2026

09 · Conclusion: Server logs are the new search console.

Crawler operations, April 2026

Pull the logs. Read the bots. Plan the answer.

Two years ago, the right SEO operating loop ran from Search Console down to a sitemap, with robots.txt as a guard rail. That loop still works for Googlebot, but it does not capture the AI surface. The new operating loop runs from server logs down to an llms.txt, with edge caching as the guard rail and shadow crawl treated as a first-class read tier.

The data in this study should reset some priors. GPTBot is more aggressive than most teams assume; ClaudeBot is more patient than the volume suggests; PerplexityBot is quieter than its share-of-voice would predict; and Google-Extended is so steady that it disappears from most monitoring dashboards. All four respect robots.txt. None of them respect the assumption that AI crawl is the same problem as Googlebot crawl.

For agency and platform teams, the practical move is to stand up a recurring server-log dashboard against these eleven user-agents, watch the cadence and path-preference deltas month-over-month, and tune llms.txt + edge cache + content calendar accordingly. The teams that do this in 2026 will own the agentic-search surface for their categories. The teams that do not will keep being told what AI crawlers do by vendors who have a reason not to tell them everything.

Agentic SEO operations

Stop guessing at AI crawl. Plan from the server log.

We design and operate agentic-SEO programs for agencies, SaaS, and ecommerce — covering crawler-log analysis, llms.txt strategy, edge-cache architecture, and the content calendar that maps onto how GPTBot, ClaudeBot, and PerplexityBot actually crawl.

Free consultation · Expert guidance · Tailored solutions
What we work on

Crawler & GEO engagements

  • Crawler-log dashboards — GPTBot / ClaudeBot / PerplexityBot baselines
  • llms.txt and AGENTS.md content strategy
  • Edge-cache architecture for AI crawl burst protection
  • Content calendar mapped to bot-specific revisit cadences
  • Shadow-crawl optimization for ChatGPT-User and Perplexity-User
FAQ · AI crawler behavior

The questions we get every week.

Is GPTBot really the highest-volume AI crawler on every site?
On the twelve sites in this study, GPTBot was the highest-volume single AI bot on every site except the three ecommerce stores, where Bytespider edged it out on product-page coverage. The 4,200 hits/day median is calculated across all twelve sites. We expect this ranking to hold across most B2B SaaS, agency, and publisher properties. On large ecommerce specifically, expect Bytespider to compete with or exceed GPTBot for top-volume position. The ranking does not hold for every individual site — site-specific factors (sitemap completeness, freshness signals, internal linking depth) shift the order — but the pattern is consistent enough to plan against.