Log file analysis is the only SEO technique that shows what crawlers actually did on your site, request by request, rather than what a third-party tool simulates they might do. Every other source — Search Console, crawl emulators, rank trackers — is a model of crawler behaviour. The server log is the ground truth.

That distinction matters more in 2026 than it has at any point in the discipline's history. The bot population hitting a typical site has fractured: Googlebot and Bingbot still crawl for the index, but they now share the access log with training crawlers like GPTBot and ClaudeBot, and a separate class of retrieval bots — OAI-SearchBot, Claude-SearchBot, PerplexityBot — that fetch pages live to answer questions inside AI assistants. Each class wants a different robots.txt decision, and the log is the only place you can see which ones are actually reaching you.

This guide is a working reference. It explains how Google defines and spends crawl budget, how to read the combined log format field by field, the three-way AI crawler split that is unique to this era, how to verify that a request claiming to be Googlebot really is, and how to translate all of that into crawl-budget recovery. Everything is sourced to primary documentation — Google Search Central, the bot operators' own docs, and Cloudflare's network-scale data.

Key takeaways

01. Logs are the only record of real crawler behaviour. Search Console shows what Google indexed; it does not record which URLs Googlebot requested or when. Server logs are the only data source for request-level bot activity; every other tool is a simulation.
02. Crawl budget is capacity plus demand. Google defines it as the URLs it can and wants to crawl: a crawl-capacity limit (connections plus fetch delay) and crawl demand (popularity, staleness, perceived inventory). It only matters for large or fast-changing sites.
03. AI crawlers now need their own accounting. Per Cloudflare, GPTBot raw requests grew 305% year over year and AI bots averaged 4.2% of all HTML requests in 2025. A log that lumps them in with Googlebot hides the decision you actually have to make.
04. Training, retrieval, and indexation are three buckets. Blocking GPTBot opts out of training; blocking OAI-SearchBot removes you from ChatGPT search results; blocking Googlebot removes you from the index. Same access log, three completely different robots.txt strategies.
05. User-agent strings can be spoofed, so verify with IPs. Cloudflare documented Perplexity running an undeclared crawler behind a generic Chrome user-agent after its declared bot was blocked. Reverse DNS or IP-list cross-check is the only reliable way to confirm a bot is what it claims.

01 — Why Logs: What Search Console can't tell you.

Google Search Console is indispensable, but it answers a narrow question: what is the indexed state of my URLs? Its Coverage report and URL Inspection tool tell you whether a page is indexed, excluded, or discovered-not-crawled — but they do not record which URLs Googlebot actually requested, in what order, or how often. For that request-level picture, the server log is the only source.

A log file is, in the plainest terms, the web server's own record of every request it received. Each line captures who asked for what, when, and how the server responded. Because it is written by the server at the moment of the request, it cannot be gamed by a crawl emulator's assumptions or a sampling window — it is the complete, unedited account of crawler activity on the origin.

"Search Console shows what Google has indexed, but it does not show what Google actually does on your website. For that, you need a log file analysis." (Visively, Log File Analysis for Technical SEO)

The gap is widest exactly where it hurts most. A page that is crawled but never indexed, an orphan page that gets bot traffic despite having no internal links, a section quietly throwing 5xx errors to Googlebot only — none of these are visible in the indexing reports, and all of them are obvious in the log. Treating the log as a complement to Search Console rather than a replacement is the correct framing: the two answer different questions.

02 — Crawl Budget: Capacity, demand, and who actually needs to care.

Google defines crawl budget as the set of URLs that Google can and wants to crawl, and it is governed by two components. The crawl-capacity limit is how much Google can fetch without overloading your server — the maximum number of simultaneous connections and the delay between fetches, adjusted up or down based on your server's observed response health. Crawl demand is how much Google wants to crawl, driven by a URL's popularity, its staleness, and Google's perception of your overall inventory.

The honest framing, and Google's own, is that crawl budget is not a problem most sites need to manage. Google's guidance flags it as a priority for large sites (roughly a million-plus unique pages with content that changes about weekly), medium-to-large sites (on the order of ten thousand-plus pages changing daily), or any site with a meaningful pile of "Discovered - currently not indexed" URLs in Search Console. Those page counts are rough estimates, not hard thresholds. If you run a 200-page brochure site, crawl budget is not your bottleneck.

Google's own warning

Do not use noindex to manage crawl budget. Google will still request the page to read the directive, so the crawl quota is spent either way. To keep URLs out of the crawl queue entirely, use robots.txt to block the path, or return a 404 / 410 for content that is genuinely gone.

This is the most common crawl-budget mistake we see in audits: teams sprinkle noindex across faceted URLs or thin archive pages expecting it to relieve crawl pressure, then wonder why the log still shows Googlebot hammering those paths. The directive controls indexation, not crawling. If the goal is to stop the crawl, the decision belongs in robots.txt or in the response status code, and the log is where you confirm the change actually landed.

03 — Reading the Log: Every field in the combined log, mapped to an SEO use.

The Apache/Nginx combined log format is the de-facto standard for SEO analysis, and the W3C Extended format used by Microsoft IIS carries the same information. Once you can read one line, you can read the whole file. Each request records the client IP, the timestamp, the request line, the HTTP status, the bytes sent, the referrer, and the user-agent. The table below is our field-to-insight reference: what each field is, and the SEO diagnosis it unlocks.

Log field	What it captures	SEO diagnosis it enables
`$remote_addr`	Client IP address	Bot verification. The IP is what you cross-check against the operator's published range or via reverse DNS — the only reliable way to confirm a self-declared Googlebot is genuine.
`$time_local`	Request timestamp	Crawl frequency and recency. Shows how often a URL or template is crawled, whether a new page was discovered, and how quickly Googlebot returns after a change.
`$request`	Method + URL + protocol	Crawl allocation by template. Group the request paths to see how much budget goes to product pages vs faceted navigation, internal search, or pagination.
`$status`	HTTP response code	Error and waste detection. Spikes in 4xx/5xx served to bots, redirect chains via 3xx, and 304 (Not Modified) responses that preserve budget all surface here.
`$body_bytes_sent`	Response size in bytes	Payload bloat. Unusually large responses to bots can flag uncompressed pages or rendered bloat that slows crawl rate and wastes capacity.
`$http_referer`	Referring URL	Internal-link path discovery. For bot requests this is sparse, but for human traffic it helps separate organic from referral when correlating logs with analytics.
`$http_user_agent`	Self-declared client string	Bot classification — the starting point, never the end. The string tells you which crawler claims to be visiting; it must be verified by IP because it is trivially spoofed.
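
To make the field mapping above concrete, here is a minimal parsing sketch in Python. It assumes the default Nginx/Apache combined layout and an English (C) locale for the month abbreviation; the sample line is invented for illustration, and user-agent strings containing escaped quotes would need a more careful pattern.

```python
import re
from datetime import datetime

# Default "combined" layout:
# $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent
# "$http_referer" "$http_user_agent"
COMBINED = re.compile(
    r'(?P<remote_addr>\S+) \S+ \S+ \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+|-) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict keyed by the field names in the table, or None if no match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    rec = match.groupdict()
    rec["time_local"] = datetime.strptime(rec["time_local"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["body_bytes_sent"] = 0 if rec["body_bytes_sent"] == "-" else int(rec["body_bytes_sent"])
    # The request line is "METHOD URL PROTOCOL"; malformed requests may be shorter.
    parts = rec["request"].split()
    rec["method"], rec["url"] = (parts[0], parts[1]) if len(parts) >= 2 else (None, None)
    return rec


# Invented sample line, not taken from a real log.
sample = (
    '192.0.2.10 - - [10/May/2026:06:25:13 +0000] '
    '"GET /products/widget?color=red HTTP/1.1" 200 5123 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
record = parse_line(sample)
```

Keeping the user-agent as a raw string at this stage is deliberate: it is only a claim until the IP in `remote_addr` has been verified.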

Three status codes deserve their own note because they are routinely misread. An HTTP 304 (Not Modified) is a good sign: Googlebot asked whether cached content had changed, the server said no, and the body was not re-sent, so budget is preserved for pages that did change. An HTTP 410 (Gone) signals that a URL was retired intentionally and is generally processed somewhat faster than a 404, dropping the URL from the index more quickly. Persistent 503 errors are the opposite of harmless: sustained over days or weeks, they cause Google to reduce crawl frequency and eventually drop the affected URLs.
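
One way to watch the status codes discussed above is a simple tally per crawler. The sketch below is illustrative only: the function name, the returned keys, and the sample codes are all hypothetical, and the input is assumed to be the status codes from lines already attributed to one verified bot.

```python
from collections import Counter


def status_summary(statuses):
    """Summarise the HTTP status codes served to a single crawler."""
    total = len(statuses)
    counts = Counter(statuses)
    classes = Counter(f"{code // 100}xx" for code in statuses)
    return {
        "total": total,
        "by_class": dict(classes),
        # 304s are budget preserved; 503s are the ones to watch over time.
        "not_modified_share": counts[304] / total if total else 0.0,
        "gone": counts[410],
        "service_unavailable_share": counts[503] / total if total else 0.0,
    }


# Hypothetical codes, as if pulled from verified Googlebot lines.
sample_statuses = [200, 200, 304, 304, 301, 404, 410, 503, 503, 200]
summary = status_summary(sample_statuses)
```

Run over a daily window, the 503 share becomes a trend line rather than a one-off number, which is what matters when the risk is sustained errors.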

04 — The AI Split: Three kinds of bot, three different decisions.

The defining change to log analysis in this era is that "bot traffic" is no longer one thing. A pre-2024 guide could treat every crawler as an indexing bot. That assumption is now wrong. The modern access log contains three functionally distinct classes, and conflating them hides the only decision that matters.

Bucket	Bots	Pipeline	What they do	Effect of blocking
Indexation	Googlebot & Bingbot	Crawl → index → rank	The classic search crawlers. Blocking them removes you from the traditional search index. This is the bucket every crawl-budget guide has always been about, and the one you almost never want to restrict.	Invisible in search
Training	GPTBot & ClaudeBot	Fetch → model training data	These collect content to train foundation models. Per OpenAI, disallowing GPTBot signals that your content should not be used in training. Blocking them is a data-rights decision, not a visibility one.	Opt out of training
Retrieval	OAI-SearchBot & PerplexityBot	Live fetch → AI answer citation	These fetch pages in real time to answer questions inside ChatGPT search, Claude search, and Perplexity. Blocking them removes you from AI-search results, increasingly a real referral channel.	Invisible in AI search
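
As a rough sketch of how the three buckets above might be separated in a log, the snippet below maps a user-agent string to a bucket by substring. The token list is limited to the bots named in this guide, the function name is hypothetical, and the result is only the claimed identity: it still has to be confirmed by IP.

```python
# Substring tokens for the bots named in this guide; extend as needed.
BUCKETS = {
    "indexation": ("Googlebot", "bingbot"),
    "training": ("GPTBot", "ClaudeBot"),
    "retrieval": ("OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"),
}


def classify_user_agent(user_agent):
    """Return the bucket a user-agent string claims to belong to, or 'other'."""
    ua = user_agent.lower()
    for bucket, tokens in BUCKETS.items():
        if any(token.lower() in ua for token in tokens):
            return bucket
    return "other"


# Simplified, illustrative user-agent strings.
claimed = {
    "googlebot": classify_user_agent("Mozilla/5.0 (compatible; Googlebot/2.1)"),
    "gptbot": classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.1)"),
    "searchbot": classify_user_agent("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"),
    "browser": classify_user_agent("Mozilla/5.0 (Windows NT 10.0) Chrome/124.0 Safari/537.36"),
}
```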

The interpretation we draw from the network-scale data is that this split is not a future concern — it is already material. Per Cloudflare, AI bots (excluding Googlebot) averaged 4.2% of all HTML requests across its network in 2025, peaking at 6.4% in late June. Within the AI category the growth has been wildly uneven: GPTBot raw requests grew +305% year over year, raising its share of crawler traffic from 2.2% to 7.7%, while ClaudeBot fell −46% over the same window. PerplexityBot grew +157,490% — but from near zero, which is why that figure is misleading on its own; in absolute terms PerplexityBot still crawled only a tiny fraction of sampled pages against Googlebot's share.

Looking forward, the trend line that matters is the one most teams are not watching. Only about 14% of the top 10,000 domains had any AI-specific robots.txt rules as of mid-2025, which means the overwhelming majority of sites are making the training-vs-retrieval decision by accident. As AI-search referral grows into a measurable channel, the sites that have already separated these buckets in their logs will be the ones that can make the allow/block call deliberately, per bot, instead of discovering the consequences after the fact.

AI crawler traffic, year over year · Cloudflare network data
Source: Cloudflare, From Googlebot to GPTBot (May 2024–May 2025)

Crawler	Role	Detail	Headline figure
Googlebot	Indexation	Share of all crawler traffic rose 30% → 50%	~50%
GPTBot	Training	+305% YoY; share 2.2% → 7.7%	+305%
ChatGPT-User	User-triggered retrieval	Year-over-year growth	+2,825%
ClaudeBot	Training	Share fell 11.7% → 5.4%	−46%
AI bots overall	Excluding Googlebot	Share of all HTML requests	4.2% avg

05 — Decision Matrix: The AI crawler decision matrix, 2026 edition.

This is the reference we wish existed when we started doing this work: every major AI crawler in one place, with the one thing most guides omit — what blocking it actually does to you — set beside the verification method and the robots.txt token you would use. Every user-agent token and JSON URL below is taken from the operator's own documentation; confirm the current values before you commit a rule, because operators do update them.

Crawler · purpose	Effect of blocking	Verify · robots.txt token
Googlebot · Google · indexation	Removed from Google Search index — almost never what you want.	Reverse DNS (`googlebot.com`) + IP-range JSON · token `Googlebot`
Google-Extended · Google · Gemini/Vertex training	Opts out of Gemini/Vertex AI training. No impact on Google Search rankings.	Same Googlebot fetcher · token `Google-Extended` (separate from Googlebot)
GPTBot · OpenAI · training	Opts your content out of OpenAI foundation-model training data.	IP JSON `openai.com/gptbot.json` · token `GPTBot`
OAI-SearchBot · OpenAI · retrieval	Removes your site from ChatGPT search results.	IP JSON `openai.com/searchbot.json` · token `OAI-SearchBot`
ChatGPT-User · OpenAI · user-triggered	Blocks fetches a user explicitly asked ChatGPT to make on your page.	Published IP ranges · token `ChatGPT-User`
ClaudeBot · Anthropic · training	Opts your content out of Anthropic model training.	IP list `claude.com/crawling/bots.json` · no reverse-DNS pattern · token `ClaudeBot`
Claude-SearchBot · Anthropic · retrieval	Removes you from Claude search results.	IP list (no PTR pattern) · token `Claude-SearchBot`
PerplexityBot · Perplexity · retrieval	Removes you from Perplexity answers. Retrieval only — not used for training.	IP JSON `perplexity.com/perplexitybot.json` · token `PerplexityBot`
Perplexity-User · Perplexity · user-triggered	Per Perplexity's own docs, this agent generally ignores robots.txt — block at the edge if required.	IP JSON `perplexity.com/perplexity-user.json` · robots.txt unreliable
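
Before committing a rule, it can help to check how a draft robots.txt treats each token in the matrix. The sketch below uses Python's standard `urllib.robotparser`; the policy shown (opt out of training, stay in AI search, keep internal search out of the crawl) and the `example.com` URLs are hypothetical, not a recommendation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft policy for illustration.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(token, path):
    """True if the draft policy lets this robots.txt token fetch the path."""
    return parser.can_fetch(token, "https://example.com" + path)


decisions = {
    token: allowed(token, "/guides/log-analysis")
    for token in ("Googlebot", "GPTBot", "OAI-SearchBot", "ClaudeBot")
}
googlebot_on_search = allowed("Googlebot", "/search/results")
```

Note that the standard-library parser applies the first matching rule within a group, which can differ from how individual crawlers resolve overlapping Allow and Disallow lines, so treat this as a sanity check rather than a guarantee of crawler behaviour. The log remains the place to confirm what each bot actually did.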

The verification gap to remember

Google and OpenAI both support IP-based verification, and Google publishes a reverse-DNS pattern (googlebot.com). Anthropic publishes an IP list but no reverse-DNS PTR pattern — so for ClaudeBot, matching against claude.com/crawling/bots.json is the only programmatic verification available. A request claiming to be ClaudeBot with an IP outside that list is a spoof.

06 — Verification: Trust the IP, never the user-agent.

The user-agent string is self-declared, which means it is trivially forged. Any scraper can send a header that reads like Googlebot. Counting those requests as genuine search-engine crawl inflates your numbers and corrupts every downstream conclusion. The fix is to verify the originating IP, and there are two official methods.

Reverse-DNS verification is Google's documented two-step process. First, run a reverse lookup on the request IP with host [IP]; for a genuine Googlebot the resulting domain should be googlebot.com, google.com, or googleusercontent.com. Second, run a forward lookup on that domain and confirm it resolves back to the same IP. Both steps must pass — a one-way match is not sufficient.
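
The two-step check can be sketched in Python as follows. The resolver functions are injectable so the logic can be exercised offline; the stub PTR and A records are invented for illustration, and a production version would likely add caching and timeouts.

```python
import ipaddress
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def reverse_lookup(ip):
    """PTR lookup: IP -> hostname (makes a network call)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname):
    """A/AAAA lookup: hostname -> set of IP strings (makes a network call)."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def is_verified_googlebot(ip, reverse=reverse_lookup, forward=forward_lookup):
    """Both steps must pass: PTR under a Google domain, then forward-confirm."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        resolved = {ipaddress.ip_address(addr) for addr in forward(hostname)}
        return ipaddress.ip_address(ip) in resolved
    except (OSError, ValueError):
        # Lookup failures and unparseable addresses count as unverified.
        return False


# Offline demo with stub resolvers; these records are invented.
PTR = {
    "192.0.2.10": "crawl-192-0-2-10.googlebot.com",
    "198.51.100.7": "fake.googlebot.com",
    "203.0.113.9": "host.example.net",
}
A = {
    "crawl-192-0-2-10.googlebot.com": {"192.0.2.10"},
    "fake.googlebot.com": {"192.0.2.99"},
}


def stub_reverse(ip):
    return PTR[ip]


def stub_forward(hostname):
    return A.get(hostname, set())


genuine = is_verified_googlebot("192.0.2.10", stub_reverse, stub_forward)
forward_mismatch = is_verified_googlebot("198.51.100.7", stub_reverse, stub_forward)
wrong_domain = is_verified_googlebot("203.0.113.9", stub_reverse, stub_forward)
```

The second stub case is the reason the forward lookup exists: whoever controls an IP's reverse zone can publish a PTR record that ends in `googlebot.com`, but they cannot make Google's forward DNS resolve that name back to their address.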

IP-list verification is the at-scale method. Google, OpenAI, and Anthropic each publish machine-readable JSON files of their crawler IP ranges, so you can validate a request programmatically against the published ranges without a DNS query per line. For ClaudeBot this is the only option, since Anthropic does not publish a reverse-DNS pattern.
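
A minimal sketch of the IP-list check, using Python's `ipaddress` module. The inline JSON stands in for a downloaded ranges file, and its `prefixes` / `ipv4Prefix` / `ipv6Prefix` key names are an assumption; check each operator's file for its actual schema before relying on this. The addresses are documentation ranges, not real crawler IPs.

```python
import ipaddress
import json

# Stand-in for a downloaded ranges file. Key names are assumed, not confirmed
# for every operator; the CIDRs are reserved documentation ranges.
SAMPLE_RANGES_JSON = """
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}
"""


def load_networks(raw_json):
    """Parse a ranges file into a list of ip_network objects."""
    data = json.loads(raw_json)
    networks = []
    for entry in data["prefixes"]:
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def ip_in_ranges(ip, networks):
    """True if the request IP falls inside any published range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


networks = load_networks(SAMPLE_RANGES_JSON)
inside_v4 = ip_in_ranges("192.0.2.44", networks)
inside_v6 = ip_in_ranges("2001:db8::1", networks)
outside = ip_in_ranges("198.51.100.1", networks)
```

Loading the ranges once and matching in memory is what makes this method practical for large logs, since no DNS query is needed per line.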

Why this stopped being optional

In August 2025 Cloudflare documented Perplexity running an undeclared crawler behind a generic Chrome user-agent after its declared PerplexityBot was blocked, on the order of millions of requests a day, observed across tens of thousands of domains. Cloudflare de-listed Perplexity as a verified bot in response. The lesson is concrete: user-agent matching alone cannot tell you who is really on your site. IP cross-checking is the floor.

Practically, your verification pass should run before any analysis. Tag every bot line in the log as verified or unverified by IP, then do all crawl-budget math on the verified set only. Unverified "Googlebot" traffic is its own finding — often a scraper or a spoofed AI crawler — and it belongs in the security review, not the SEO crawl-allocation chart. For the broader workflow of separating real bots from impersonators across a site, our agentic SEO service builds this verification step into ongoing monitoring rather than one-off audits.

07 — Finding Waste: Where the crawl leaks.

Once the log is filtered to verified search crawlers, the crawl-allocation question is simple to state and revealing to answer: what share of Googlebot's requests landed on URLs you actually want indexed? Group the requests by template and the leaks tend to announce themselves. The usual suspects are faceted-navigation permutations, internal search-result pages, deep pagination, session-ID and tracking parameters, and orphan pages that no internal link points to yet still draw bot traffic.

Industry estimates put the scale of this on large sites in the range of 30–50% of crawl budget consumed by non-essential pages, though the exact figure varies widely by site and the specific sourcing is inconsistent — treat it as an order-of-magnitude expectation, not a benchmark. The point is directional: on a big, parameter-heavy site, a substantial slice of Googlebot's effort is routinely spent on URLs that will never earn a ranking, and that slice is invisible to any tool that is not reading the real log.
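
Grouping by template can be as simple as the sketch below. The template rules, parameter names, and sample URLs are all hypothetical; a real site would define its own patterns, and the input is assumed to be URLs from verified Googlebot requests only.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Hypothetical parameter names; replace with the ones your site actually uses.
FACET_PARAMS = {"color", "size", "sort", "brand"}
TRACKING_PARAMS = {"sessionid", "utm_source", "utm_medium", "utm_campaign"}


def template_for(url):
    """Assign a requested URL to a coarse template bucket."""
    parts = urlsplit(url)
    params = set(parse_qs(parts.query, keep_blank_values=True))
    if parts.path.startswith("/search"):
        return "internal_search"
    if params & TRACKING_PARAMS:
        return "tracking_params"
    if params & FACET_PARAMS:
        return "faceted_navigation"
    if "page" in params:
        return "pagination"
    if parts.path.startswith("/products/"):
        return "product"
    return "other"


def crawl_allocation(urls):
    """Share of requests per template, largest first."""
    counts = Counter(template_for(url) for url in urls)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.most_common()}


# Invented request paths for illustration.
sample_urls = [
    "/products/widget",
    "/products/widget?color=red&size=m",
    "/products/widget?color=blue",
    "/category/shoes?sort=price",
    "/category/shoes?page=14",
    "/search?q=widgets",
    "/products/gadget?utm_source=news",
    "/products/gadget",
]
allocation = crawl_allocation(sample_urls)
```

In this made-up sample, faceted URLs take a larger share of requests than the product pages themselves, which is the kind of imbalance the grouping is meant to surface.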

Waste source	Signature	Marker	What the log shows	Role
Faceted navigation	Combinatorial URL blowup	1/m	Filter-and-sort permutations multiply into thousands of low-value URLs. In log analyses these often dominate Googlebot's requests while contributing nothing to the index. Restrict via robots.txt and canonicals, then re-check the log.	Top crawl-waste source
Orphan pages	Crawled but unlinked	0 links	Pages a crawler reaches that have no internal links pointing to them. The log is the only place orphans with bot traffic show up; crawl emulators that follow links never find them.	Log-only discovery
Status-code drift	Errors served to bots only	5xx	A section throwing intermittent 5xx or redirect chains to Googlebot can be invisible to users and to Search Console's sampled reports. Sustained 503s reduce crawl frequency and eventually drop URLs.	Crawl-rate suppressor

Read case-study numbers as illustrative

Vendor write-ups cite dramatic recoveries — one e-commerce case where Googlebot spent the majority of its budget on never-indexed faceted URLs until they were restricted, and a retailer found to be wasting the overwhelming bulk of its crawl allocation, discoverable only in the logs. These are vendor-stated, single-site figures, not independently verified benchmarks. The durable takeaway is the mechanism, not the percentage: the waste was invisible to every external simulation and visible only in the server log.

The remediation loop is the same regardless of scale. Identify the wasteful template in the log, restrict it at the correct layer (robots.txt for crawl, a status code for retired content, canonicals and internal-linking changes for consolidation), then return to the log a few weeks later to confirm Googlebot redistributed its effort toward the pages you care about. Crawl frequency is also a function of internal link density, which is why log work pairs so naturally with an internal linking strategy for large sites — and why slow server response, the kind you would chase in a Core Web Vitals optimization pass, can suppress crawl rate by signalling capacity limits.

08 — Tooling: From a spreadsheet to enterprise scale.

The right tool depends on log volume and how often you need to do this. The decision is less about features than about how many log events you are processing and whether crawl monitoring is continuous or occasional.

Site size · cadence	Tool	What it offers	Verdict
Small site · occasional	Spreadsheet or scripts	For modest logs and one-off audits, a filtered spreadsheet or a short parsing script does the job. You handle bot verification manually against the published IP ranges. Free, but it does not scale past a few thousand lines.	Start here
Mid-size · recurring	Dedicated log analyser	A purpose-built tool like Screaming Frog's Log File Analyser handles Apache, IIS, and Nginx formats plus ELB custom formats, auto-verifies bot legitimacy, and surfaces orphan pages and response-code inconsistencies. Free up to 1,000 events; an inexpensive annual licence above that.	The practical default
Enterprise · continuous	Botify or Lumar	Platforms that process millions of log events at scale, correlate logs with crawl data and analytics, and segment by page template, giving site-wide crawl-allocation views a spreadsheet cannot produce. Justified when crawl budget is genuinely a constraint.	For large sites

Whatever the tool, the workflow is identical: collect a representative log window, parse it into the standard fields, verify every bot line by IP, filter to the crawler class you are analysing, group by template, and compare against your intended crawl priorities. The tool changes; the method does not. Log analysis is one station in a larger technical-SEO pipeline, and it slots cleanly into a technical SEO audit checklist — and it pairs especially well with our companion 30-day site log study, which puts the methodology in this guide against real-world AI crawler data.

"Log file analysis provides better insight than any other external crawl tool available." (Builtvisible, The ultimate guide to log file analysis for SEO)

09 — Conclusion: The log is the ground truth.

Technical SEO, May 2026

Server logs are where crawler reality and SEO intent finally meet.

Log file analysis has always been the technique that separates what you think crawlers do from what they actually do. In 2026 that gap is wider than ever, because the bots reaching your origin no longer share a single purpose. The same access log now carries indexation crawlers, training crawlers, and AI-search retrieval crawlers — and the right robots.txt decision is different for each one.

The practical sequence is unchanged in shape and richer in detail. Verify every bot by IP, because user-agent strings are forged and the Perplexity stealth-crawler episode proved the cost of trusting them. Filter to the crawler class you care about. Map crawl allocation against intent, and recover the budget leaking into faceted navigation, orphan pages, and error-throwing templates. Then return to the log to confirm the change landed — the one verification step no simulation can give you.

The forward-looking move is to start accounting for the AI buckets now, while only a small minority of sites do. AI-search referral is becoming a measurable channel, training opt-outs are a real data-rights lever, and both decisions are only legible in the log. The teams that separate these three buckets today will make deliberate, per-bot calls tomorrow — instead of discovering, after the fact, that an accidental block cost them visibility in the surfaces their customers are starting to search from.
