SEO · Methodology · 11 min read · Published May 30, 2026

What crawlers really do · GPTBot +305% · the training-vs-retrieval split

Log File Analysis for SEO: The 2026 Crawl-Budget Guide

Server logs are the only record of what crawlers actually did on your site — not a simulation of what they might do. This 2026 guide covers crawl-budget mechanics, how to read the combined log format, and the new three-way split between indexation, training, and AI-search retrieval bots that every robots.txt now has to account for.

Digital Applied Team
Technical SEO · Published May 30, 2026
Sources: Google, Cloudflare, OpenAI

  • GPTBot crawler traffic: +305%, May 2024 → May 2025 (Cloudflare)
  • Googlebot share of bot traffic: 30% → 50% of all crawler traffic, 2025
  • Top domains with AI robots.txt rules: 14% of top 10,000, mid-2025
  • AI bots' share of HTML requests: 4.2% average across network, 2025 (peak 6.4%)

Log file analysis is the only SEO technique that shows what crawlers actually did on your site, request by request, rather than what a third-party tool simulates they might do. Every other source — Search Console, crawl emulators, rank trackers — is a model of crawler behaviour. The server log is the ground truth.

That distinction matters more in 2026 than it has at any point in the discipline's history. The bot population hitting a typical site has fractured: Googlebot and Bingbot still crawl for the index, but they now share the access log with training crawlers like GPTBot and ClaudeBot, and a separate class of retrieval bots — OAI-SearchBot, Claude-SearchBot, PerplexityBot — that fetch pages live to answer questions inside AI assistants. Each class wants a different robots.txt decision, and the log is the only place you can see which ones are actually reaching you.

This guide is a working reference. It explains how Google defines and spends crawl budget, how to read the combined log format field by field, the three-way AI crawler split that is unique to this era, how to verify that a request claiming to be Googlebot really is, and how to translate all of that into crawl-budget recovery. Everything is sourced to primary documentation — Google Search Central, the bot operators' own docs, and Cloudflare's network-scale data.

Key takeaways
  1. Logs are the only record of real crawler behaviour. Search Console shows what Google indexed; it does not record which URLs Googlebot requested or when. Server logs are the only data source for request-level bot activity; every other tool is a simulation.
  2. Crawl budget is capacity plus demand. Google defines it as the URLs it can and wants to crawl: a crawl-capacity limit (connections plus fetch delay) and crawl demand (popularity, staleness, perceived inventory). It only matters for large or fast-changing sites.
  3. AI crawlers now need their own accounting. Per Cloudflare, GPTBot raw requests grew +305% year over year and AI bots averaged 4.2% of all HTML requests in 2025. A log that lumps them in with Googlebot hides the decision you actually have to make.
  4. Training, retrieval, and indexation are three buckets. Blocking GPTBot opts out of training; blocking OAI-SearchBot removes you from ChatGPT search results; blocking Googlebot removes you from the index. Same access log, three completely different robots.txt strategies.
  5. User-agent strings can be spoofed, so verify with IPs. Cloudflare documented Perplexity running an undeclared crawler behind a generic Chrome user-agent after its declared bot was blocked. Reverse DNS or IP-list cross-check is the only reliable way to confirm a bot is what it claims.

01 · Why Logs · What Search Console can't tell you.

Google Search Console is indispensable, but it answers a narrow question: what is the indexed state of my URLs? Its Coverage report and URL Inspection tool tell you whether a page is indexed, excluded, or discovered-not-crawled — but they do not record which URLs Googlebot actually requested, in what order, or how often. For that request-level picture, the server log is the only source.

A log file is, in the plainest terms, the web server's own record of every request it received. Each line captures who asked for what, when, and how the server responded. Because it is written by the server at the moment of the request, it cannot be gamed by a crawl emulator's assumptions or a sampling window — it is the complete, unedited account of crawler activity on the origin.

"Search Console shows what Google has indexed, but it does not show what Google actually does on your website. For that, you need a log file analysis." (Visively, Log File Analysis for Technical SEO)

The gap is widest exactly where it hurts most. A page that is crawled but never indexed, an orphan page that gets bot traffic despite having no internal links, a section quietly throwing 5xx errors to Googlebot only — none of these are visible in the indexing reports, and all of them are obvious in the log. Treating the log as a complement to Search Console rather than a replacement is the correct framing: the two answer different questions.

02 · Crawl Budget · Capacity, demand, and who actually needs to care.

Google defines crawl budget as the set of URLs that Google can and wants to crawl, and it is governed by two components. The crawl-capacity limit is how much Google can fetch without overloading your server: the maximum number of simultaneous connections and the delay between fetches, adjusted up or down based on your server's observed response health. Crawl demand is how much Google wants to crawl, driven by a URL's popularity, its staleness, and Google's perception of your overall inventory.

The honest framing — and Google's own — is that crawl budget is not a problem most sites need to manage. Google's guidance flags it as a priority for large sites (roughly a million-plus unique pages with weekly changes), medium-to-large sites (in the order of ten thousand-plus pages changing daily), or any site with a meaningful pile of "Discovered — currently not indexed" URLs in Search Console. Those page counts are rough estimates, not hard thresholds. If you run a 200-page brochure site, crawl budget is not your bottleneck.

Google's own warning
Do not use noindex to manage crawl budget. Google will still request the page to read the directive, so the crawl quota is spent either way. To keep URLs out of the crawl queue entirely, use robots.txt to block the path, or return a 404 / 410 for content that is genuinely gone.

This is the most common crawl-budget mistake we see in audits: teams sprinkle noindex across faceted URLs or thin archive pages expecting it to relieve crawl pressure, then wonder why the log still shows Googlebot hammering those paths. The directive controls indexation, not crawling. If the goal is to stop the crawl, the decision belongs in robots.txt or in the response status code, and the log is where you confirm the change actually landed.
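To make the crawl-versus-index distinction concrete, here is a minimal sketch that uses Python's standard `urllib.robotparser` to confirm a path is actually closed to crawling. The `/search/` rule, the `example.com` URLs, and the `is_crawlable` helper are hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep internal search results out of the crawl queue.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""

def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# A blocked path is never requested, so no crawl budget is spent on it.
blocked = is_crawlable(ROBOTS_TXT, "Googlebot", "https://example.com/search/?q=shoes")
# An unblocked path stays crawlable; a noindex tag on it would not change that.
allowed = is_crawlable(ROBOTS_TXT, "Googlebot", "https://example.com/products/shoe-1")
print(blocked, allowed)
```

Note that `urllib.robotparser` does plain prefix matching and does not implement the wildcard extensions some crawlers honour, so treat this as a sanity check and not as a simulation of Googlebot's own parser. The log remains the place to confirm that requests to the blocked path actually stopped.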

03 · Reading the Log · Every field in the combined log, mapped to an SEO use.

The Apache/Nginx combined log format is the de-facto standard for SEO analysis, and the W3C Extended format used by Microsoft IIS carries the same information. Once you can read one line, you can read the whole file. Each request records the client IP, the timestamp, the request line, the HTTP status, the bytes sent, the referrer, and the user-agent. The table below is our field-to-insight reference: what each field is, and the SEO diagnosis it unlocks.

| Log field | What it captures | SEO diagnosis it enables |
| --- | --- | --- |
| $remote_addr | Client IP address | Bot verification. The IP is what you cross-check against the operator's published range or via reverse DNS, the only reliable way to confirm a self-declared Googlebot is genuine. |
| $time_local | Request timestamp | Crawl frequency and recency. Shows how often a URL or template is crawled, whether a new page was discovered, and how quickly Googlebot returns after a change. |
| $request | Method + URL + protocol | Crawl allocation by template. Group the request paths to see how much budget goes to product pages vs faceted navigation, internal search, or pagination. |
| $status | HTTP response code | Error and waste detection. Spikes in 4xx/5xx served to bots, redirect chains via 3xx, and 304 (Not Modified) responses that preserve budget all surface here. |
| $body_bytes_sent | Response size in bytes | Payload bloat. Unusually large responses to bots can flag uncompressed pages or rendered bloat that slows crawl rate and wastes capacity. |
| $http_referer | Referring URL | Internal-link path discovery. For bot requests this is sparse, but for human traffic it helps separate organic from referral when correlating logs with analytics. |
| $http_user_agent | Self-declared client string | Bot classification: the starting point, never the end. The string tells you which crawler claims to be visiting; it must be verified by IP because it is trivially spoofed. |
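
The field layout above maps directly onto a parser. The following is a rough sketch, assuming the default combined format with no custom fields; the sample line, the `parse_line` helper, and the documentation-range IP are invented for illustration. The regex does not handle escaped quotes inside the user-agent.

```python
import re
from datetime import datetime
from typing import Optional

# Combined format: ip ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<remote_addr>\S+) \S+ \S+ \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\S+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)

def parse_line(line: str) -> Optional[dict]:
    """Parse one combined-format line into a dict, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    rec = match.groupdict()
    rec["status"] = int(rec["status"])
    # Apache writes "-" when no response body was sent.
    size = rec["body_bytes_sent"]
    rec["body_bytes_sent"] = 0 if size == "-" else int(size)
    rec["time"] = datetime.strptime(rec["time_local"], "%d/%b/%Y:%H:%M:%S %z")
    # Split "GET /path HTTP/1.1" into method and path when well formed.
    parts = rec["request"].split()
    rec["method"], rec["path"] = (parts[0], parts[1]) if len(parts) >= 2 else ("", "")
    return rec

# Invented sample line for illustration only.
SAMPLE = (
    '192.0.2.10 - - [30/May/2026:06:25:13 +0000] "GET /products/shoe-1 HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
record = parse_line(SAMPLE)
print(record["remote_addr"], record["path"], record["status"])
```

Returning `None` for non-matching lines lets a caller count and inspect malformed entries separately instead of aborting the whole run.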

Two status codes deserve their own note because they are routinely misread. An HTTP 304 (Not Modified) is a good sign: Googlebot asked whether cached content had changed, the server said no, and the body was not re-fetched — budget preserved for pages that did change. An HTTP 410 (Gone) is processed faster than a 404 for URLs you have intentionally retired, dropping them from the index more quickly. Persistent 503 errors are the opposite of harmless: sustained over days or weeks, they cause Google to reduce crawl frequency and eventually drop the affected URLs.
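
One way to keep these codes from being misread is to bucket them explicitly before charting. The sketch below is illustrative only: the bucket names and the list of statuses are hypothetical, standing in for the statuses served to one verified crawler over a log window.

```python
from collections import Counter

def status_bucket(status: int) -> str:
    """Map an HTTP status to the crawl-health buckets discussed above."""
    if status == 304:
        return "not_modified"   # revalidated, body not re-sent
    if status == 410:
        return "gone"           # intentionally retired URL
    if status == 503:
        return "unavailable"    # a problem when sustained over days
    if 300 <= status < 400:
        return "redirect"
    if 400 <= status < 500:
        return "client_error"
    if status >= 500:
        return "server_error"
    return "ok"

# Hypothetical statuses served to one verified crawler.
bot_statuses = [200, 200, 304, 304, 304, 301, 404, 410, 503, 503]
summary = Counter(status_bucket(s) for s in bot_statuses)
print(dict(summary))
```

Keeping 304, 410, and 503 out of the generic 3xx, 4xx, and 5xx buckets makes the healthy signals and the harmful ones visible at a glance.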

04 · The AI Split · Three kinds of bot, three different decisions.

The defining change to log analysis in this era is that "bot traffic" is no longer one thing. A pre-2024 guide could treat every crawler as an indexing bot. That assumption is now wrong. The modern access log contains three functionally distinct classes, and conflating them hides the only decision that matters.

Indexation · Googlebot & Bingbot · Crawl → index → rank
The classic search crawlers. Blocking them removes you from the traditional search index. This is the bucket every crawl-budget guide has always been about, and the one you almost never want to restrict.
Block = invisible in search

Training · GPTBot & ClaudeBot · Fetch → model training data
These collect content to train foundation models. Per OpenAI, disallowing GPTBot signals that your content should not be used in training. Blocking them is a data-rights decision, not a visibility one.
Block = opt out of training

Retrieval · OAI-SearchBot & PerplexityBot · Live fetch → AI answer citation
These fetch pages in real time to answer questions inside ChatGPT search, Claude search, and Perplexity. Blocking them removes you from AI-search results, increasingly a real referral channel.
Block = invisible in AI search
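
As a first pass, the three classes above can be tagged from the user-agent string. This is a minimal sketch: the `BOT_CLASSES` mapping and `classify` helper are hypothetical names, the token list covers only the bots named in this guide, and the sample user-agent strings are simplified. The result is the class a request claims to belong to, not a verified identity.

```python
# Substring tokens per class, limited to the crawlers named in this guide.
BOT_CLASSES = {
    "indexation": ("Googlebot", "bingbot"),
    "training": ("GPTBot", "ClaudeBot"),
    "retrieval": ("OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"),
}

def classify(user_agent: str) -> str:
    """Return the claimed crawler class for a user-agent, or 'other'."""
    ua = user_agent.lower()
    for bot_class, tokens in BOT_CLASSES.items():
        if any(token.lower() in ua for token in tokens):
            return bot_class
    return "other"

# Simplified, illustrative user-agent strings.
samples = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0 Safari/537.36",
]
labels = [classify(ua) for ua in samples]
print(labels)
```

Counting log lines per label gives the per-class accounting described here; the counts only become trustworthy once each line has also passed an IP check.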

The interpretation we draw from the network-scale data is that this split is not a future concern — it is already material. Per Cloudflare, AI bots (excluding Googlebot) averaged 4.2% of all HTML requests across its network in 2025, peaking at 6.4% in late June. Within the AI category the growth has been wildly uneven: GPTBot raw requests grew +305% year over year, raising its share of crawler traffic from 2.2% to 7.7%, while ClaudeBot fell −46% over the same window. PerplexityBot grew +157,490% — but from near zero, which is why that figure is misleading on its own; in absolute terms PerplexityBot still crawled only a tiny fraction of sampled pages against Googlebot's share.

Looking forward, the trend line that matters is the one most teams are not watching. Only about 14% of the top 10,000 domains had any AI-specific robots.txt rules as of mid-2025, which means the overwhelming majority of sites are making the training-vs-retrieval decision by accident. As AI-search referral grows into a measurable channel, the sites that have already separated these buckets in their logs will be the ones that can make the allow/block call deliberately, per bot, instead of discovering the consequences after the fact.

AI crawler traffic, year over year · Cloudflare network data
Source: Cloudflare, From Googlebot to GPTBot (May 2024–May 2025)

  • Googlebot (indexation): share of all crawler traffic rose 30% → 50%
  • GPTBot (training): +305% YoY; share 2.2% → 7.7%
  • ChatGPT-User (user-triggered retrieval): +2,825% YoY
  • ClaudeBot (training): −46% YoY; share fell 11.7% → 5.4%
  • AI bots overall (excluding Googlebot): 4.2% average share of all HTML requests

05 · Decision Matrix · The AI crawler decision matrix, 2026 edition.

This is the reference we wish existed when we started doing this work: every major AI crawler in one place, with the one thing most guides omit — what blocking it actually does to you — set beside the verification method and the robots.txt token you would use. Every user-agent token and JSON URL below is taken from the operator's own documentation; confirm the current values before you commit a rule, because operators do update them.

| Crawler · purpose | Effect of blocking | Verify · robots.txt token |
| --- | --- | --- |
| Googlebot · Google · indexation | Removed from Google Search index; almost never what you want. | Reverse DNS (googlebot.com) + IP-range JSON · token Googlebot |
| Google-Extended · Google · Gemini/Vertex training | Opts out of Gemini/Vertex AI training. No impact on Google Search rankings. | Same Googlebot fetcher · token Google-Extended (separate from Googlebot) |
| GPTBot · OpenAI · training | Opts your content out of OpenAI foundation-model training data. | IP JSON openai.com/gptbot.json · token GPTBot |
| OAI-SearchBot · OpenAI · retrieval | Removes your site from ChatGPT search results. | IP JSON openai.com/searchbot.json · token OAI-SearchBot |
| ChatGPT-User · OpenAI · user-triggered | Blocks fetches a user explicitly asked ChatGPT to make on your page. | Published IP ranges · token ChatGPT-User |
| ClaudeBot · Anthropic · training | Opts your content out of Anthropic model training. | IP list claude.com/crawling/bots.json · no reverse-DNS pattern · token ClaudeBot |
| Claude-SearchBot · Anthropic · retrieval | Removes you from Claude search results. | IP list (no PTR pattern) · token Claude-SearchBot |
| PerplexityBot · Perplexity · retrieval | Removes you from Perplexity answers. Retrieval only, not used for training. | IP JSON perplexity.com/perplexitybot.json · token PerplexityBot |
| Perplexity-User · Perplexity · user-triggered | Per Perplexity's own docs, this agent generally ignores robots.txt; block at the edge if required. | IP JSON perplexity.com/perplexity-user.json · robots.txt unreliable |

The verification gap to remember
Google and OpenAI both support IP-based verification, and Google publishes a reverse-DNS pattern (googlebot.com). Anthropic publishes an IP list but no reverse-DNS PTR pattern — so for ClaudeBot, matching against claude.com/crawling/bots.json is the only programmatic verification available. A request claiming to be ClaudeBot with an IP outside that list is a spoof.
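
To show how per-bot rules play out, here is a small sketch that evaluates one possible policy (opt out of OpenAI training, stay in ChatGPT search and Google Search) with Python's standard `urllib.robotparser`. The policy, the `example.com` URL, and the `decisions` variable are hypothetical; this is not a recommendation, and the tokens should be checked against each operator's current documentation before use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block the training crawler, allow retrieval and indexation.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/guide"
# Evaluate the same URL for each token to see the per-bot outcome.
decisions = {
    bot: parser.can_fetch(bot, url)
    for bot in ("GPTBot", "OAI-SearchBot", "Googlebot")
}
print(decisions)
```

Because each token gets its own group, the training opt-out does not touch the retrieval or indexation crawlers. A check like this only confirms what the file says; whether a given bot honours it still has to be confirmed in the log.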

06 · Verification · Trust the IP, never the user-agent.

The user-agent string is self-declared, which means it is trivially forged. Any scraper can send a header that reads like Googlebot. Counting those requests as genuine search-engine crawl inflates your numbers and corrupts every downstream conclusion. The fix is to verify the originating IP, and there are two official methods.

Reverse-DNS verification is Google's documented two-step process. First, run a reverse lookup on the request IP with host [IP]; for a genuine Googlebot the resulting domain should be googlebot.com, google.com, or googleusercontent.com. Second, run a forward lookup on that domain and confirm it resolves back to the same IP. Both steps must pass; a one-way match is not sufficient.
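
The two-step check can be sketched with Python's standard `socket` module. This is an illustrative, IPv4-only version: `is_verified_googlebot` and the injectable `reverse` and `forward` resolvers are hypothetical names, and the demonstration uses stand-in lookup tables with documentation-range IPs so it runs without network access.

```python
import socket

# Domains Google documents for genuine Googlebot hostnames.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")

def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP to hostname."""
    return socket.gethostbyaddr(ip)[0]

def forward_lookup(host: str) -> list:
    """A-record lookup: hostname to IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]

def is_verified_googlebot(ip, reverse=reverse_lookup, forward=forward_lookup) -> bool:
    """Both steps must pass: the PTR is in a Google domain and resolves back to ip."""
    try:
        host = reverse(ip)
    except OSError:  # no PTR record or lookup failure
        return False
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        return ip in forward(host)
    except OSError:
        return False

# Offline demonstration with stand-in resolvers (invented data).
fake_ptr = {
    "192.0.2.1": "crawl-192-0-2-1.googlebot.com",
    "192.0.2.5": "fake.googlebot.com",   # PTR looks right, forward lookup will not match
    "192.0.2.9": "host.example.net",
}
fake_a = {"crawl-192-0-2-1.googlebot.com": ["192.0.2.1"]}
check = lambda ip: is_verified_googlebot(ip, lambda i: fake_ptr[i], lambda h: fake_a.get(h, []))
print(check("192.0.2.1"), check("192.0.2.5"), check("192.0.2.9"))
```

In real use the default resolvers perform live DNS queries, so results are worth caching per IP; a DNS lookup per log line does not scale.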

IP-list verification is the at-scale method. Google, OpenAI, and Anthropic each publish machine-readable JSON files of their crawler IP ranges, so you can validate a request programmatically against the published ranges without a DNS query per line. For ClaudeBot this is the only option, since Anthropic does not publish a reverse-DNS pattern.
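
A range check of this kind needs only the standard `ipaddress` and `json` modules. The sketch below assumes the file uses a `prefixes` array with `ipv4Prefix` and `ipv6Prefix` keys, which is the layout of Google's published file; other operators' files may be shaped differently, so inspect each one before relying on it. The ranges shown are documentation placeholders, not real crawler ranges, and `load_networks` and `ip_in_ranges` are hypothetical helper names.

```python
import ipaddress
import json

# Placeholder document; real files must be downloaded from the operator.
SAMPLE_JSON = """
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}
"""

def load_networks(doc: str) -> list:
    """Parse a published-ranges document into ip_network objects."""
    networks = []
    for entry in json.loads(doc).get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            networks.append(ipaddress.ip_network(prefix))
    return networks

def ip_in_ranges(ip: str, networks: list) -> bool:
    """True if ip falls inside any published range."""
    addr = ipaddress.ip_address(ip)
    # A v4 address tested against a v6 network (or vice versa) is simply False.
    return any(addr in net for net in networks)

networks = load_networks(SAMPLE_JSON)
print(ip_in_ranges("192.0.2.44", networks), ip_in_ranges("198.51.100.7", networks))
```

Loading the ranges once and checking every log line against them in memory is what makes this the at-scale method; refresh the file on a schedule, since operators update their ranges.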

Why this stopped being optional
In August 2025 Cloudflare documented Perplexity running an undeclared crawler behind a generic Chrome user-agent after its declared PerplexityBot was blocked, on the order of millions of requests a day, observed across tens of thousands of domains. Cloudflare de-listed Perplexity as a verified bot in response. The lesson is concrete: user-agent matching alone cannot tell you who is really on your site. IP cross-checking is the floor.

Practically, your verification pass should run before any analysis. Tag every bot line in the log as verified or unverified by IP, then do all crawl-budget math on the verified set only. Unverified "Googlebot" traffic is its own finding — often a scraper or a spoofed AI crawler — and it belongs in the security review, not the SEO crawl-allocation chart. For the broader workflow of separating real bots from impersonators across a site, our agentic SEO service builds this verification step into ongoing monitoring rather than one-off audits.

07 · Finding Waste · Where the crawl leaks.

Once the log is filtered to verified search crawlers, the crawl-allocation question is simple to state and revealing to answer: what share of Googlebot's requests landed on URLs you actually want indexed? Group the requests by template and the leaks tend to announce themselves. The usual suspects are faceted-navigation permutations, internal search-result pages, deep pagination, session-ID and tracking parameters, and orphan pages that no internal link points to yet still draw bot traffic.
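
Grouping by template can be as simple as a handful of URL rules and a counter. In this rough sketch the template rules, the `FACET_PARAMS` set, and the request paths are all hypothetical; a real site would substitute its own URL patterns and feed in paths from verified Googlebot lines.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Hypothetical query parameters that mark faceted-navigation URLs.
FACET_PARAMS = {"color", "size", "sort"}

def template_for(path_with_query: str) -> str:
    """Assign a request path to a coarse page template."""
    parts = urlsplit(path_with_query)
    params = set(parse_qs(parts.query))
    if parts.path.startswith("/search"):
        return "internal_search"
    if params & FACET_PARAMS:
        return "faceted_navigation"
    if "page" in params:
        return "pagination"
    if parts.path.startswith("/products/"):
        return "product"
    return "other"

def crawl_share(paths: list) -> dict:
    """Percentage of requests per template, rounded to one decimal."""
    counts = Counter(template_for(p) for p in paths)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: round(100 * n / total, 1) for name, n in counts.items()}

# Invented request paths standing in for verified Googlebot lines.
paths = [
    "/products/shoe-1", "/products/shoe-2", "/products/shoe-1",
    "/products?color=red&size=9", "/products?color=blue&sort=price",
    "/search?q=boots", "/products?page=14", "/about",
]
share = crawl_share(paths)
print(share)
```

In this invented sample only 37.5% of requests reach product pages, which is the kind of ratio the crawl-allocation question is meant to surface.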

Industry estimates put the scale of this on large sites in the range of 30–50% of crawl budget consumed by non-essential pages, though the exact figure varies widely by site and the specific sourcing is inconsistent — treat it as an order-of-magnitude expectation, not a benchmark. The point is directional: on a big, parameter-heavy site, a substantial slice of Googlebot's effort is routinely spent on URLs that will never earn a ranking, and that slice is invisible to any tool that is not reading the real log.

Faceted navigation · Combinatorial URL blowup
Filter-and-sort permutations multiply into thousands of low-value URLs. In log analyses these often dominate Googlebot's requests while contributing nothing to the index. Restrict via robots.txt and canonicals, then re-check the log.
Top crawl-waste source

Orphan pages · Crawled but unlinked
Pages a crawler reaches that have no internal links pointing to them. The log is the only place orphans with bot traffic show up; crawl emulators that follow links never find them.
Log-only discovery

Status-code drift · Errors served to bots only
A section throwing intermittent 5xx or redirect chains to Googlebot can be invisible to users and to Search Console's sampled reports. Sustained 503s reduce crawl frequency and eventually drop URLs.
Crawl-rate suppressor

Read case-study numbers as illustrative
Vendor write-ups cite dramatic recoveries — one e-commerce case where Googlebot spent the majority of its budget on never-indexed faceted URLs until they were restricted, and a retailer found to be wasting the overwhelming bulk of its crawl allocation, discoverable only in the logs. These are vendor-stated, single-site figures, not independently verified benchmarks. The durable takeaway is the mechanism, not the percentage: the waste was invisible to every external simulation and visible only in the server log.

The remediation loop is the same regardless of scale. Identify the wasteful template in the log, restrict it at the correct layer (robots.txt for crawl, a status code for retired content, canonicals and internal-linking changes for consolidation), then return to the log a few weeks later to confirm Googlebot redistributed its effort toward the pages you care about. Crawl frequency is also a function of internal link density, which is why log work pairs so naturally with an internal linking strategy for large sites — and why slow server response, the kind you would chase in a Core Web Vitals optimization pass, can suppress crawl rate by signalling capacity limits.

08 · Tooling · From a spreadsheet to enterprise scale.

The right tool depends on log volume and how often you need to do this. The decision is less about features than about how many log events you are processing and whether crawl monitoring is continuous or occasional.

Small site · occasional · Spreadsheet or scripts
For modest logs and one-off audits, a filtered spreadsheet or a short parsing script does the job. You handle bot verification manually against the published IP ranges. Free, but it does not scale past a few thousand lines.
Start here

Mid-size · recurring · Dedicated log analyser
A purpose-built tool like Screaming Frog's Log File Analyser handles Apache, IIS, and Nginx formats plus ELB custom formats, auto-verifies bot legitimacy, and surfaces orphan pages and response-code inconsistencies. Free up to 1,000 events; an inexpensive annual licence above that.
The practical default

Enterprise · continuous · Botify or Lumar
Platforms that process millions of log events at scale, correlate logs with crawl data and analytics, and segment by page template, giving site-wide crawl-allocation views a spreadsheet cannot produce. Justified when crawl budget is genuinely a constraint.
For large sites

Whatever the tool, the workflow is identical: collect a representative log window, parse it into the standard fields, verify every bot line by IP, filter to the crawler class you are analysing, group by template, and compare against your intended crawl priorities. The tool changes; the method does not. Log analysis is one station in a larger technical-SEO pipeline, and it slots cleanly into a technical SEO audit checklist — and it pairs especially well with our companion 30-day site log study, which puts the methodology in this guide against real-world AI crawler data.

"Log file analysis provides better insight than any other external crawl tool available." (Builtvisible, The ultimate guide to log file analysis for SEO)

09 · Conclusion · The log is the ground truth.

Technical SEO, May 2026

Server logs are where crawler reality and SEO intent finally meet.

Log file analysis has always been the technique that separates what you think crawlers do from what they actually do. In 2026 that gap is wider than ever, because the bots reaching your origin no longer share a single purpose. The same access log now carries indexation crawlers, training crawlers, and AI-search retrieval crawlers — and the right robots.txt decision is different for each one.

The practical sequence is unchanged in shape and richer in detail. Verify every bot by IP, because user-agent strings are forged and the Perplexity stealth-crawler episode proved the cost of trusting them. Filter to the crawler class you care about. Map crawl allocation against intent, and recover the budget leaking into faceted navigation, orphan pages, and error-throwing templates. Then return to the log to confirm the change landed — the one verification step no simulation can give you.

The forward-looking move is to start accounting for the AI buckets now, while only a small minority of sites do. AI-search referral is becoming a measurable channel, training opt-outs are a real data-rights lever, and both decisions are only legible in the log. The teams that separate these three buckets today will make deliberate, per-bot calls tomorrow — instead of discovering, after the fact, that an accidental block cost them visibility in the surfaces their customers are starting to search from.

FAQ · Log file analysis

The questions we get every week.

Log file analysis is the practice of reading a web server's access log — its own record of every request it received — to see exactly what search engine and AI crawlers did on your site. Each line records the client IP, timestamp, requested URL, HTTP status, bytes sent, referrer, and user-agent. It matters because it is the only data source for request-level crawler behaviour: which URLs were crawled, how often, and what response they got. Every other tool, including Search Console and crawl emulators, is a model of crawler behaviour rather than a record of it. For large or fast-changing sites, the log is where you find crawl-budget waste, orphan pages, and bot-only errors that no other source reveals.