Log file analysis is the only SEO technique that shows what crawlers actually did on your site, request by request, rather than what a third-party tool simulates they might do. Every other source — Search Console, crawl emulators, rank trackers — is a model of crawler behaviour. The server log is the ground truth.
That distinction matters more in 2026 than it has at any point in the discipline's history. The bot population hitting a typical site has fractured: Googlebot and Bingbot still crawl for the index, but they now share the access log with training crawlers like GPTBot and ClaudeBot, and a separate class of retrieval bots — OAI-SearchBot, Claude-SearchBot, PerplexityBot — that fetch pages live to answer questions inside AI assistants. Each class wants a different robots.txt decision, and the log is the only place you can see which ones are actually reaching you.
This guide is a working reference. It explains how Google defines and spends crawl budget, how to read the combined log format field by field, the three-way AI crawler split that is unique to this era, how to verify that a request claiming to be Googlebot really is, and how to translate all of that into crawl-budget recovery. Everything is sourced to primary documentation — Google Search Central, the bot operators' own docs, and Cloudflare's network-scale data.
- 01 · Logs are the only record of real crawler behaviour. Search Console shows what Google indexed; it does not record which URLs Googlebot requested or when. Server logs are the only data source for request-level bot activity; every other tool is a simulation.
- 02 · Crawl budget is capacity plus demand. Google defines it as the URLs it can and wants to crawl: a crawl-capacity limit (connections plus fetch delay) and crawl demand (popularity, staleness, perceived inventory). It only matters for large or fast-changing sites.
- 03 · AI crawlers now need their own accounting. Per Cloudflare, GPTBot raw requests grew +305% year over year and AI bots averaged 4.2% of all HTML requests in 2025. A log that lumps them in with Googlebot hides the decision you actually have to make.
- 04 · Training, retrieval, and indexation are three buckets. Blocking GPTBot opts out of training; blocking OAI-SearchBot removes you from ChatGPT search results; blocking Googlebot removes you from the index. Same access log, three completely different robots.txt strategies.
- 05 · User-agent strings can be spoofed, so verify with IPs. Cloudflare documented Perplexity running an undeclared crawler behind a generic Chrome user-agent after its declared bot was blocked. Reverse DNS or IP-list cross-check is the only reliable way to confirm a bot is what it claims.
01 — Why Logs
What Search Console can't tell you.
Google Search Console is indispensable, but it answers a narrow question: what is the indexed state of my URLs? Its Coverage report and URL Inspection tool tell you whether a page is indexed, excluded, or discovered-not-crawled — but they do not record which URLs Googlebot actually requested, in what order, or how often. For that request-level picture, the server log is the only source.
A log file is, in the plainest terms, the web server's own record of every request it received. Each line captures who asked for what, when, and how the server responded. Because it is written by the server at the moment of the request, it cannot be gamed by a crawl emulator's assumptions or a sampling window — it is the complete, unedited account of crawler activity on the origin.
"Search Console shows what Google has indexed, but it does not show what Google actually does on your website. For that, you need a log file analysis."
Visively, Log File Analysis for Technical SEO
The gap is widest exactly where it hurts most. A page that is crawled but never indexed, an orphan page that gets bot traffic despite having no internal links, a section quietly throwing 5xx errors to Googlebot only — none of these are visible in the indexing reports, and all of them are obvious in the log. Treating the log as a complement to Search Console rather than a replacement is the correct framing: the two answer different questions.
02 — Crawl Budget
Capacity, demand, and who actually needs to care.
Google defines crawl budget as the set of URLs that Google can and wants to crawl, and it is governed by two components. The crawl-capacity limit is how much Google can fetch without overloading your server: the maximum number of simultaneous connections and the delay between fetches, adjusted up or down based on your server's observed response health. Crawl demand is how much Google wants to crawl, driven by a URL's popularity, its staleness, and Google's perception of your overall inventory.
The honest framing — and Google's own — is that crawl budget is not a problem most sites need to manage. Google's guidance flags it as a priority for large sites (roughly a million-plus unique pages with weekly changes), medium-to-large sites (in the order of ten thousand-plus pages changing daily), or any site with a meaningful pile of "Discovered — currently not indexed" URLs in Search Console. Those page counts are rough estimates, not hard thresholds. If you run a 200-page brochure site, crawl budget is not your bottleneck.
Don't use noindex to manage crawl budget. Google will still request the page to read the directive, so the crawl quota is spent either way. To keep URLs out of the crawl queue entirely, use robots.txt to block the path, or return a 404 / 410 for content that is genuinely gone.
This is the most common crawl-budget mistake we see in audits: teams sprinkle noindex across faceted URLs or thin archive pages expecting it to relieve crawl pressure, then wonder why the log still shows Googlebot hammering those paths. The directive controls indexation, not crawling. If the goal is to stop the crawl, the decision belongs in robots.txt or in the response status code, and the log is where you confirm the change actually landed.
03 — Reading the Log
Every field in the combined log, mapped to an SEO use.
The Apache/Nginx combined log format is the de-facto standard for SEO analysis, and the W3C Extended format used by Microsoft IIS carries the same information. Once you can read one line, you can read the whole file. Each request records the client IP, the timestamp, the request line, the HTTP status, the bytes sent, the referrer, and the user-agent. The table below is our field-to-insight reference: what each field is, and the SEO diagnosis it unlocks.
| Log field | What it captures | SEO diagnosis it enables |
|---|---|---|
| $remote_addr | Client IP address | Bot verification. The IP is what you cross-check against the operator's published range or via reverse DNS, the only reliable way to confirm a self-declared Googlebot is genuine. |
| $time_local | Request timestamp | Crawl frequency and recency. Shows how often a URL or template is crawled, whether a new page was discovered, and how quickly Googlebot returns after a change. |
| $request | Method + URL + protocol | Crawl allocation by template. Group the request paths to see how much budget goes to product pages vs faceted navigation, internal search, or pagination. |
| $status | HTTP response code | Error and waste detection. Spikes in 4xx/5xx served to bots, redirect chains via 3xx, and 304 (Not Modified) responses that preserve budget all surface here. |
| $body_bytes_sent | Response size in bytes | Payload bloat. Unusually large responses to bots can flag uncompressed pages or rendered bloat that slows crawl rate and wastes capacity. |
| $http_referer | Referring URL | Internal-link path discovery. For bot requests this is sparse, but for human traffic it helps separate organic from referral when correlating logs with analytics. |
| $http_user_agent | Self-declared client string | Bot classification: the starting point, never the end. The string tells you which crawler claims to be visiting; it must be verified by IP because it is trivially spoofed. |
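To make the field mapping concrete, here is a minimal parsing sketch in Python. It assumes the stock combined format with unescaped quotes; the regex, the `parse_line` helper, and the sample line are our own illustrations, not part of any server's tooling, and a custom `log_format` would need a different pattern.

```python
import re
from datetime import datetime

# Regex for the combined log format:
# IP identity user [time] "request" status bytes "referer" "user-agent"
COMBINED_RE = re.compile(
    r'(?P<remote_addr>\S+) \S+ \S+ \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\S+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of combined-log fields, or None if the line does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # Apache writes "-" when no body bytes were sent.
    raw_bytes = record["body_bytes_sent"]
    record["body_bytes_sent"] = int(raw_bytes) if raw_bytes.isdigit() else 0
    record["time"] = datetime.strptime(record["time_local"], "%d/%b/%Y:%H:%M:%S %z")
    # The request line is "METHOD URL PROTOCOL"; keep method and URL separately.
    parts = record["request"].split(" ")
    record["method"] = parts[0] if len(parts) >= 2 else ""
    record["path"] = parts[1] if len(parts) >= 2 else ""
    return record


# Illustrative line only; 192.0.2.10 is a documentation address, not a real crawler IP.
SAMPLE_LINE = (
    '192.0.2.10 - - [12/Mar/2026:06:25:24 +0000] '
    '"GET /products/widget?color=red HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

record = parse_line(SAMPLE_LINE)
print(record["remote_addr"], record["status"], record["path"])
```

Returning `None` for non-matching lines lets you count and inspect malformed entries separately instead of silently dropping them.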
Two status codes deserve their own note because they are routinely misread. An HTTP 304 (Not Modified) is a good sign: Googlebot asked whether cached content had changed, the server said no, and the body was not re-fetched — budget preserved for pages that did change. An HTTP 410 (Gone) is processed faster than a 404 for URLs you have intentionally retired, dropping them from the index more quickly. Persistent 503 errors are the opposite of harmless: sustained over days or weeks, they cause Google to reduce crawl frequency and eventually drop the affected URLs.
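One way to put those status codes to work is a simple tally per verified crawler. The sketch below is an assumption-laden illustration: the bucket names and the `status_summary` helper are hypothetical, and the input is presumed to be status codes already filtered to one verified bot.

```python
from collections import Counter


def status_summary(statuses):
    """Bucket status codes served to one verified crawler into SEO-relevant classes."""
    buckets = Counter()
    for code in statuses:
        if code == 304:
            buckets["not_modified"] += 1   # body not re-sent, budget preserved
        elif code == 410:
            buckets["gone"] += 1           # intentionally retired URLs
        elif code == 404:
            buckets["not_found"] += 1
        elif 300 <= code < 400:
            buckets["redirect"] += 1       # candidates for redirect-chain review
        elif code >= 500:
            buckets["server_error"] += 1   # sustained 5xx can reduce crawl frequency
        elif 200 <= code < 300:
            buckets["ok"] += 1
        else:
            buckets["other"] += 1
    return buckets


def server_error_share(buckets):
    """Fraction of requests that ended in a 5xx response."""
    total = sum(buckets.values())
    return buckets["server_error"] / total if total else 0.0
```

Running the tally per day rather than over the whole window makes it easier to tell a brief 503 blip from the sustained pattern that actually costs crawl frequency.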
04 — The AI Split
Three kinds of bot, three different decisions.
The defining change to log analysis in this era is that "bot traffic" is no longer one thing. A pre-2024 guide could treat every crawler as an indexing bot. That assumption is now wrong. The modern access log contains three functionally distinct classes, and conflating them hides the only decision that matters.
Googlebot & Bingbot
The classic search crawlers. Blocking them removes you from the traditional search index. This is the bucket every crawl-budget guide has always been about — and the one you almost never want to restrict.
GPTBot & ClaudeBot
These collect content to train foundation models. Per OpenAI, disallowing GPTBot signals that your content should not be used in training. Blocking them is a data-rights decision, not a visibility one.
OAI-SearchBot & PerplexityBot
These fetch pages in real time to answer questions inside ChatGPT search, Claude search, and Perplexity. Blocking them removes you from AI-search results — increasingly a real referral channel.
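The three buckets can be sketched as a first-pass classifier over the user-agent field. This is only a labelling step under stated assumptions: the `BOT_CLASSES` mapping and `classify_user_agent` helper are hypothetical names, the token list covers just the bots named above, and the result still has to be verified by IP before it is trusted.

```python
# Hypothetical mapping from bot class to the user-agent tokens named in this guide.
BOT_CLASSES = {
    "indexation": ("Googlebot", "bingbot"),
    "training": ("GPTBot", "ClaudeBot"),
    "retrieval": ("OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"),
}


def classify_user_agent(user_agent):
    """Return (bot_class, token) for a self-declared user-agent, or ("other", None)."""
    ua_lower = user_agent.lower()
    for bot_class, tokens in BOT_CLASSES.items():
        for token in tokens:
            # Case-insensitive substring match on the declared token.
            if token.lower() in ua_lower:
                return bot_class, token
    return "other", None
```

Because this matches the claimed string only, a spoofed request lands in the same bucket as a genuine one; treat the output as "claims to be", not "is".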
The interpretation we draw from the network-scale data is that this split is not a future concern — it is already material. Per Cloudflare, AI bots (excluding Googlebot) averaged 4.2% of all HTML requests across its network in 2025, peaking at 6.4% in late June. Within the AI category the growth has been wildly uneven: GPTBot raw requests grew +305% year over year, raising its share of crawler traffic from 2.2% to 7.7%, while ClaudeBot fell −46% over the same window. PerplexityBot grew +157,490% — but from near zero, which is why that figure is misleading on its own; in absolute terms PerplexityBot still crawled only a tiny fraction of sampled pages against Googlebot's share.
Looking forward, the trend line that matters is the one most teams are not watching. Only about 14% of the top 10,000 domains had any AI-specific robots.txt rules as of mid-2025, which means the overwhelming majority of sites are making the training-vs-retrieval decision by accident. As AI-search referral grows into a measurable channel, the sites that have already separated these buckets in their logs will be the ones that can make the allow/block call deliberately, per bot, instead of discovering the consequences after the fact.
AI crawler traffic, year over year · Cloudflare network data
Source: Cloudflare, From Googlebot to GPTBot (May 2024–May 2025)
05 — Decision Matrix
The AI crawler decision matrix, 2026 edition.
This is the reference we wish existed when we started doing this work: every major AI crawler in one place, with the one thing most guides omit — what blocking it actually does to you — set beside the verification method and the robots.txt token you would use. Every user-agent token and JSON URL below is taken from the operator's own documentation; confirm the current values before you commit a rule, because operators do update them.
| Crawler · purpose | Effect of blocking | Verify · robots.txt token |
|---|---|---|
| Googlebot · Google · indexation | Removed from Google Search index — almost never what you want. | Reverse DNS (googlebot.com) + IP-range JSON · token Googlebot |
| Google-Extended · Google · Gemini/Vertex training | Opts out of Gemini/Vertex AI training. No impact on Google Search rankings. | Same Googlebot fetcher · token Google-Extended (separate from Googlebot) |
| GPTBot · OpenAI · training | Opts your content out of OpenAI foundation-model training data. | IP JSON openai.com/gptbot.json · token GPTBot |
| OAI-SearchBot · OpenAI · retrieval | Removes your site from ChatGPT search results. | IP JSON openai.com/searchbot.json · token OAI-SearchBot |
| ChatGPT-User · OpenAI · user-triggered | Blocks fetches a user explicitly asked ChatGPT to make on your page. | Published IP ranges · token ChatGPT-User |
| ClaudeBot · Anthropic · training | Opts your content out of Anthropic model training. | IP list claude.com/crawling/bots.json · no reverse-DNS pattern · token ClaudeBot |
| Claude-SearchBot · Anthropic · retrieval | Removes you from Claude search results. | IP list (no PTR pattern) · token Claude-SearchBot |
| PerplexityBot · Perplexity · retrieval | Removes you from Perplexity answers. Retrieval only — not used for training. | IP JSON perplexity.com/perplexitybot.json · token PerplexityBot |
| Perplexity-User · Perplexity · user-triggered | Per Perplexity's own docs, this agent generally ignores robots.txt — block at the edge if required. | IP JSON perplexity.com/perplexity-user.json · robots.txt unreliable |
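Before committing per-bot rules, it can help to check what a draft robots.txt would actually permit for each token. The sketch below uses Python's standard `urllib.robotparser`; the rule set is a hypothetical example (opt out of OpenAI training, stay in ChatGPT search, keep Googlebot out of internal search), and Python's parser uses simple prefix matching, so each operator's own parser may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft rules, one group per robots.txt token from the matrix.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Googlebot
Disallow: /search
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
for token in ("GPTBot", "OAI-SearchBot", "Googlebot", "PerplexityBot"):
    # example.com is a placeholder host; substitute your own URLs.
    print(token, parser.can_fetch(token, "https://example.com/guide"))
```

A token with no matching group and no `User-agent: *` fallback is allowed by default, which is how many sites end up making the AI-crawler decision by accident.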
Google crawlers can be verified by reverse DNS (googlebot.com). Anthropic publishes an IP list but no reverse-DNS PTR pattern, so for ClaudeBot, matching against claude.com/crawling/bots.json is the only programmatic verification available. A request claiming to be ClaudeBot with an IP outside that list is a spoof.
06 — Verification
Trust the IP, never the user-agent.
The user-agent string is self-declared, which means it is trivially forged. Any scraper can send a header that reads like Googlebot. Counting those requests as genuine search-engine crawl inflates your numbers and corrupts every downstream conclusion. The fix is to verify the originating IP, and there are two official methods.
Reverse-DNS verification is Google's documented two-step process. First, run a reverse lookup on the request IP with host [IP]; for a genuine Googlebot the resulting domain should be googlebot.com, google.com, or googleusercontent.com. Second, run a forward lookup on that domain and confirm it resolves back to the same IP. Both steps must pass; a one-way match is not sufficient.
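The two-step check can be sketched with Python's standard `socket` module. This is a minimal illustration, not Google's implementation: `is_verified_googlebot` and the injectable lookup parameters are our own names, the lookups are passed in so the logic can be exercised without live DNS, and a production version would want caching and timeouts.

```python
import ipaddress
import socket

# Domains Google documents for genuine crawler hostnames (leading dot avoids
# matching look-alikes such as "fakegooglebot.com").
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def _reverse(ip):
    """Step 1: reverse (PTR) lookup, returning the primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host):
    """Step 2: forward lookup, returning every address the hostname resolves to."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_verified_googlebot(ip, reverse_lookup=_reverse, forward_lookup=_forward):
    """True only if the PTR name is a Google domain AND it resolves back to the same IP."""
    try:
        target = ipaddress.ip_address(ip)
        host = reverse_lookup(ip)
    except (ValueError, OSError):
        return False
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        return any(ipaddress.ip_address(a) == target for a in forward_lookup(host))
    except (ValueError, OSError):
        return False
```

Calling `is_verified_googlebot(ip)` with the defaults performs real DNS queries, so on large logs it is worth deduplicating IPs first and verifying each one once.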
IP-list verification is the at-scale method. Google, OpenAI, and Anthropic each publish machine-readable JSON files of their crawler IP ranges, so you can validate a request programmatically against the published ranges without a DNS query per line. For ClaudeBot this is the only option, since Anthropic does not publish a reverse-DNS pattern.
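A minimal sketch of the IP-list check, using Python's standard `ipaddress` module. The JSON shape here (a `prefixes` array with `ipv4Prefix` / `ipv6Prefix` keys) is an assumption that should be confirmed against each operator's published file, and the sample ranges are documentation-only addresses, not real crawler ranges.

```python
import ipaddress
import json

# Placeholder data in the assumed schema; replace with the operator's published JSON.
SAMPLE_RANGES_JSON = """
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}
"""


def load_networks(json_text):
    """Turn a published prefix list into ip_network objects."""
    data = json.loads(json_text)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def ip_in_networks(ip, networks):
    """True if the request IP falls inside any published range."""
    addr = ipaddress.ip_address(ip)
    # Membership across IP versions (v4 address, v6 network) is simply False.
    return any(addr in net for net in networks)


networks = load_networks(SAMPLE_RANGES_JSON)
print(ip_in_networks("192.0.2.77", networks))
```

Loading the ranges once and checking each log line in memory is what makes this the at-scale option compared with one DNS query per request.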
Practically, your verification pass should run before any analysis. Tag every bot line in the log as verified or unverified by IP, then do all crawl-budget math on the verified set only. Unverified "Googlebot" traffic is its own finding — often a scraper or a spoofed AI crawler — and it belongs in the security review, not the SEO crawl-allocation chart. For the broader workflow of separating real bots from impersonators across a site, our agentic SEO service builds this verification step into ongoing monitoring rather than one-off audits.
07 — Finding Waste
Where the crawl leaks.
Once the log is filtered to verified search crawlers, the crawl-allocation question is simple to state and revealing to answer: what share of Googlebot's requests landed on URLs you actually want indexed? Group the requests by template and the leaks tend to announce themselves. The usual suspects are faceted-navigation permutations, internal search-result pages, deep pagination, session-ID and tracking parameters, and orphan pages that no internal link points to yet still draw bot traffic.
Industry estimates put the scale of this on large sites in the range of 30–50% of crawl budget consumed by non-essential pages, though the exact figure varies widely by site and the specific sourcing is inconsistent — treat it as an order-of-magnitude expectation, not a benchmark. The point is directional: on a big, parameter-heavy site, a substantial slice of Googlebot's effort is routinely spent on URLs that will never earn a ranking, and that slice is invisible to any tool that is not reading the real log.
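Grouping by template can be sketched as below. Everything site-specific here is hypothetical: the path patterns, the facet and tracking parameter names, and the `template_for` / `crawl_share` helpers are placeholders to adapt to your own URL structure, and the input is assumed to be request URLs from verified Googlebot lines only.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Hypothetical URL patterns and parameter names; adjust to your own site.
TEMPLATE_RULES = [
    ("internal_search", re.compile(r"^/search")),
    ("product", re.compile(r"^/products/")),
    ("category", re.compile(r"^/category/")),
]
FACET_PARAMS = {"color", "size", "sort"}
TRACKING_PARAMS = {"utm_source", "utm_medium", "sessionid"}


def template_for(url):
    """Assign a request URL to one template bucket, checking parameters before paths."""
    parts = urlsplit(url)
    params = set(parse_qs(parts.query, keep_blank_values=True))
    if params & TRACKING_PARAMS:
        return "tracking_or_session"
    if params & FACET_PARAMS:
        return "faceted"
    if "page" in params:
        return "pagination"
    for name, pattern in TEMPLATE_RULES:
        if pattern.search(parts.path):
            return name
    return "other"


def crawl_share(urls):
    """Fraction of crawler requests that landed on each template."""
    counts = Counter(template_for(u) for u in urls)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}
```

Checking parameters before paths is a deliberate choice: a faceted category URL should count as waste even though its path looks like a page you want indexed.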
Combinatorial URL blowup
Filter-and-sort permutations multiply into thousands of low-value URLs. In log analyses these often dominate Googlebot's requests while contributing nothing to the index. Restrict via robots.txt and canonicals, then re-check the log.
Crawled but unlinked
Pages a crawler reaches that have no internal links pointing to them. The log is the only place orphans with bot traffic show up — crawl emulators that follow links never find them.
Errors served to bots only
A section throwing intermittent 5xx or redirect chains to Googlebot can be invisible to users and to Search Console's sampled reports. Sustained 503s reduce crawl frequency and eventually drop URLs.
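Orphan detection in particular reduces to a set difference between what the bot requested and what your internal links reach. The sketch below assumes you already have both lists, for example verified bot paths from the log and linked paths exported from a site crawl; `normalise` and `find_orphans` are hypothetical helper names, and the normalisation (drop query string and trailing slash) is a simplification you may need to tighten.

```python
from collections import Counter
from urllib.parse import urlsplit


def normalise(path):
    """Strip the query string and trailing slash so equivalent URLs compare equal."""
    clean = urlsplit(path).path.rstrip("/")
    return clean or "/"


def find_orphans(bot_requested, internally_linked):
    """Paths a crawler requested that no internal link points to, busiest first."""
    linked = {normalise(p) for p in internally_linked}
    hits = Counter(normalise(p) for p in bot_requested)
    orphans = [(path, count) for path, count in hits.items() if path not in linked]
    # Sort by hit count descending, then path, for a stable report.
    return sorted(orphans, key=lambda item: (-item[1], item[0]))
```

Sorting by hit count surfaces the orphans that consume the most crawl effort, which are usually the ones worth linking, redirecting, or retiring first.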
The remediation loop is the same regardless of scale. Identify the wasteful template in the log, restrict it at the correct layer (robots.txt for crawl, a status code for retired content, canonicals and internal-linking changes for consolidation), then return to the log a few weeks later to confirm Googlebot redistributed its effort toward the pages you care about. Crawl frequency is also a function of internal link density, which is why log work pairs so naturally with an internal linking strategy for large sites — and why slow server response, the kind you would chase in a Core Web Vitals optimization pass, can suppress crawl rate by signalling capacity limits.
08 — Tooling
From a spreadsheet to enterprise scale.
The right tool depends on log volume and how often you need to do this. The decision is less about features than about how many log events you are processing and whether crawl monitoring is continuous or occasional.
Spreadsheet or scripts
For modest logs and one-off audits, a filtered spreadsheet or a short parsing script does the job. You handle bot verification manually against the published IP ranges. Free, but it does not scale past a few thousand lines.
Dedicated log analyser
A purpose-built tool like Screaming Frog's Log File Analyser handles Apache, IIS, and Nginx formats plus ELB custom formats, auto-verifies bot legitimacy, and surfaces orphan pages and response-code inconsistencies. Free up to 1,000 events; an inexpensive annual licence above that.
Botify or Lumar
Platforms that process millions of log events at scale, correlate logs with crawl data and analytics, and segment by page template — giving site-wide crawl-allocation views a spreadsheet cannot produce. Justified when crawl budget is genuinely a constraint.
Whatever the tool, the workflow is identical: collect a representative log window, parse it into the standard fields, verify every bot line by IP, filter to the crawler class you are analysing, group by template, and compare against your intended crawl priorities. The tool changes; the method does not. Log analysis is one station in a larger technical-SEO pipeline, and it slots cleanly into a technical SEO audit checklist — and it pairs especially well with our companion 30-day site log study, which puts the methodology in this guide against real-world AI crawler data.
"Log file analysis provides better insight than any other external crawl tool available."
Builtvisible, The ultimate guide to log file analysis for SEO
09 — Conclusion
The log is the ground truth.
Server logs are where crawler reality and SEO intent finally meet.
Log file analysis has always been the technique that separates what you think crawlers do from what they actually do. In 2026 that gap is wider than ever, because the bots reaching your origin no longer share a single purpose. The same access log now carries indexation crawlers, training crawlers, and AI-search retrieval crawlers — and the right robots.txt decision is different for each one.
The practical sequence is unchanged in shape and richer in detail. Verify every bot by IP, because user-agent strings are forged and the Perplexity stealth-crawler episode proved the cost of trusting them. Filter to the crawler class you care about. Map crawl allocation against intent, and recover the budget leaking into faceted navigation, orphan pages, and error-throwing templates. Then return to the log to confirm the change landed — the one verification step no simulation can give you.
The forward-looking move is to start accounting for the AI buckets now, while only a small minority of sites do. AI-search referral is becoming a measurable channel, training opt-outs are a real data-rights lever, and both decisions are only legible in the log. The teams that separate these three buckets today will make deliberate, per-bot calls tomorrow — instead of discovering, after the fact, that an accidental block cost them visibility in the surfaces their customers are starting to search from.