Robots.txt and Meta Robots: Complete SEO Reference
Complete 2026 reference to robots.txt directives and meta robots tags — crawling, indexing, noindex, X-Robots-Tag, and JS rendering pitfalls.
Robots.txt syntax and directives
Robots.txt is a plain-text file at the root of a domain that tells crawlers which URL paths they may fetch. It lives at /robots.txt and is fetched on nearly every first visit by every well-behaved crawler. The file controls crawling, not indexing — a critical distinction that catches out roughly 30% of the technical audits we run.
User-agent: *
Allow: /
Sitemap: https://www.example.com/sitemap.xml

This is the default for most marketing sites. User-agent: * applies to every crawler. Allow: / permits every path. The Sitemap line is not a crawl directive but a discovery hint, and is read by all major crawlers.
Each group of rules begins with a User-agent line. The match is case-insensitive and uses longest-match-wins. A crawler reads only the most specific matching group and ignores other groups entirely, so splitting rules across multiple generic and specific groups rarely behaves as expected.
User-agent: Googlebot
Disallow: /private/
User-agent: Bingbot
Disallow: /private/
Disallow: /beta/
User-agent: *
Disallow: /internal/

Disallow blocks a path; Allow grants an exception within a blocked path. Both take a URL path starting with /. An empty Disallow: allows everything for that group. More specific (longer) rules win regardless of order in the file.
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /search
Disallow: /cart
Disallow: /checkout

One or more Sitemap lines declare sitemap locations. These are independent of User-agent groups and can appear anywhere in the file. Always use an absolute URL. Multiple sitemaps are valid for sites with separate product, blog, and image sitemaps.
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap-products.xml
Sitemap: https://www.example.com/sitemap-blog.xml

Crawl-delay sets the number of seconds between requests. Bing, Yandex, and Seznam honor it. Google ignores crawl-delay entirely and manages rate from server response times, error rates, and response status codes (the Search Console crawl-rate limiter was retired in early 2024). If Googlebot is overwhelming your origin, return 503 with a Retry-After header rather than adding crawl-delay.
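The 503 backpressure approach can be sketched as a small helper. This is a minimal illustration, not a production pattern; crawlBackpressure is a hypothetical name and the 120-second Retry-After is an arbitrary choice.

```typescript
// Decide the crawler-facing response under load: 503 + Retry-After tells
// well-behaved bots to back off without any robots.txt change.
// The real "overloaded" signal would come from queue depth, CPU, or upstream latency.
function crawlBackpressure(overloaded: boolean): { status: number; headers: Record<string, string> } {
  return overloaded
    ? { status: 503, headers: { "Retry-After": "120" } } // ask crawlers to retry in 2 minutes
    : { status: 200, headers: {} };
}

// Wiring into a Node HTTP server (sketch):
// import { createServer } from "node:http";
// createServer((req, res) => {
//   const { status, headers } = crawlBackpressure(isOriginOverloaded());
//   res.writeHead(status, headers);
//   res.end(status === 503 ? "Service temporarily unavailable" : renderPage(req));
// }).listen(8080);
```

Because Googlebot reacts to 503s by slowing down and retrying later, this throttles crawl without the permanence of a Disallow rule.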
Anything after # on a line is a comment. Useful for documenting why a rule exists so the next engineer does not remove it in a cleanup pass.
# Block faceted navigation parameters to save crawl budget
User-agent: *
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*?sort=

Size limit: Google fetches and parses the first 500 KiB of robots.txt. Anything past that byte boundary is ignored. Most sites never approach this; enterprise sites with thousands of rules should audit total file size and collapse patterns where possible.
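The byte check is trivial to automate. A sketch, assuming a Node 18+ environment; robotsSizeReport is a hypothetical helper name:

```typescript
import { Buffer } from "node:buffer";

// Google parses only the first 500 KiB of robots.txt; anything beyond is ignored.
const GOOGLE_ROBOTS_LIMIT = 500 * 1024;

// Report the on-the-wire byte size (not character count) and whether
// Google would truncate the file at its parse limit.
function robotsSizeReport(body: string): { bytes: number; truncatedByGoogle: boolean } {
  const bytes = Buffer.byteLength(body, "utf8");
  return { bytes, truncatedByGoogle: bytes > GOOGLE_ROBOTS_LIMIT };
}

// Usage (Node 18+ global fetch):
// const body = await (await fetch("https://www.example.com/robots.txt")).text();
// console.log(robotsSizeReport(body));
```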
Pattern matching and wildcards
Google, Bing, and Yandex all support two wildcard operators in robots.txt: the asterisk * matches any sequence of characters, and the dollar sign $ anchors a pattern to the end of the URL. These operators are powerful but deceptively fragile — a wildcard with the wrong prefix can block half your site in seconds.
Use * to match any sequence. This is most useful for query parameters and dynamic path segments. Note that / in the URL is matched by * too.
User-agent: *
# Block every URL with a session ID parameter
Disallow: /*?sessionid=
Disallow: /*&sessionid=
# Block any path containing /preview/
Disallow: /*/preview/
# Block tracking parameters globally
Disallow: /*?utm_
Disallow: /*&utm_

$ marks the end of the URL string. Without it, a pattern matches every URL that starts with the prefix, whatever follows. This catches out anyone trying to block only PDFs or only a specific file extension.
User-agent: *
# Block every PDF anywhere on the site
Disallow: /*.pdf$
# Block URLs ending exactly in /print; /article/print?query=true is not matched
Disallow: /*/print$
# Combine: block any URL ending in .xml except the sitemap
Disallow: /*.xml$
Allow: /sitemap.xml$

Googlebot renders pages before indexing. If CSS or JavaScript is disallowed, Google sees an unstyled or broken version and may rank the page lower or classify it as cloaking. Explicitly allow the asset paths your framework uses.
User-agent: *
Disallow: /internal/
# Always permit rendering assets
Allow: /*.css$
Allow: /*.js$
Allow: /_next/static/
Allow: /static/
Allow: /wp-content/themes/
Allow: /wp-includes/js/

When Disallow and Allow both match a URL, Google picks the rule with the longer character count. Wildcards are counted as a single character. This differs from regex engines and frequently surprises developers writing their first rules.
User-agent: *
Disallow: /products/ # 10 chars
Allow: /products/featured/ # 19 chars — WINS for /products/featured/shoe
Result: /products/shoe is blocked, /products/featured/shoe is allowed.

Test every rule: Search Console's robots.txt report (which replaced the standalone robots.txt Tester in 2023) shows how Google fetched and parsed your file; to check individual URLs against specific user-agents, use the URL Inspection tool or Google's open-source robots.txt parser. Always test before deploying. A single stray Disallow: / in production can drop every page from crawl within a day.
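Longest-rule-wins can be sketched as a small matcher. This is a deliberate simplification: user-agent group selection, percent-encoding, and the 500 KiB limit are ignored, and patternMatches, isAllowed, and Rule are hypothetical names, not Google's implementation.

```typescript
type Rule = { type: "allow" | "disallow"; path: string };

// Translate a robots.txt pattern to a regex: '*' matches any run of
// characters, and a trailing '$' anchors the pattern to the end of the URL.
function patternMatches(pattern: string, url: string): boolean {
  let anchored = false;
  if (pattern.endsWith("$")) {
    anchored = true;
    pattern = pattern.slice(0, -1);
  }
  const body = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&")) // escape regex metacharacters
    .join(".*");
  return new RegExp("^" + body + (anchored ? "$" : "")).test(url);
}

// Longest matching rule wins; on a tie, Allow (the less restrictive rule) wins.
function isAllowed(rules: Rule[], url: string): boolean {
  let best: Rule | undefined;
  for (const rule of rules) {
    if (!patternMatches(rule.path, url)) continue;
    if (
      !best ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.type === "allow")
    ) {
      best = rule;
    }
  }
  return !best || best.type === "allow"; // no matching rule means the URL is crawlable
}
```

Running the /products/ example above through isAllowed reproduces the stated result: the 19-character Allow beats the 10-character Disallow for anything under /products/featured/.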
Meta robots tag values
The meta robots tag is a <meta> element placed in the <head> of an HTML document that instructs crawlers how to index and display the page. Unlike robots.txt, which only controls crawling, meta robots directly controls indexing behavior. Values can be combined in a comma-separated list and are case-insensitive.
Without a meta robots tag, Google defaults to index, follow — the page can be indexed and links can pass PageRank. Adding noindex removes the page from the index; adding nofollow prevents link equity from flowing through any on-page links.
<!-- Default behavior, no tag needed -->
<meta name="robots" content="index, follow">
<!-- Remove from index, still crawl links -->
<meta name="robots" content="noindex, follow">
<!-- Do not pass PageRank through links -->
<meta name="robots" content="index, nofollow">
<!-- Full block from index and no link equity -->
<meta name="robots" content="noindex, nofollow">

Replace name="robots" with a specific user-agent to target only that crawler. Conflicting directives resolve to the most restrictive combination across all applicable tags.
<!-- Applies to all crawlers -->
<meta name="robots" content="noindex">
<!-- Googlebot only -->
<meta name="googlebot" content="noindex">
Note that Google-Extended is a robots.txt user-agent token only; it is not a valid meta robots name, so a tag such as <meta name="Google-Extended" content="noindex"> has no effect. Block AI training via robots.txt instead.

noarchive prevents Google from offering a cached copy of the page. Since Google removed the cached-page feature from Search in early 2024, this directive is largely historical for Google, though Bing still honors it. nosnippet hides the text snippet and the video preview in search results, leaving only the title and URL. Publishers use nosnippet to keep content out of AI Overviews while remaining indexable.
<!-- Index the page, no cached copy -->
<meta name="robots" content="noarchive">
<!-- Index the page, no snippet shown in results or AI features -->
<meta name="robots" content="nosnippet">

The max-snippet, max-image-preview, and max-video-preview directives give precise control over snippet length, image preview size, and video preview duration. max-snippet takes a character count (-1 for unlimited, 0 to disable). max-image-preview accepts none, standard, or large. max-video-preview takes seconds (-1 for unlimited, 0 for none).
<!-- Recommended for editorial content -->
<meta name="robots" content="max-snippet:-1, max-image-preview:large, max-video-preview:-1">
<!-- Restrict to short snippets and no large images -->
<meta name="robots" content="max-snippet:160, max-image-preview:standard">

unavailable_after tells Google to drop the URL from the index after a specified date, given in a widely adopted format such as RFC 822 or ISO 8601 (the example below uses ISO 8601). Useful for time-limited offers, event pages, or seasonal campaigns. Google stops showing the URL in Search roughly 24 hours after the date passes.
<meta name="robots" content="unavailable_after: 2026-12-31T23:59:59+00:00">

Combining values: When multiple meta robots tags or values conflict, Google uses the most restrictive combination. noindex always wins over index; nofollow always wins over follow. For a full refresher on crawler behavior, see our how search engines work guide.
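The most-restrictive rule can be expressed directly. A tiny sketch with a hypothetical helper name (effectiveRobots); real resolution is per-crawler and covers more directives than index/follow:

```typescript
// Collapse every applicable robots signal (meta tags, X-Robots-Tag headers)
// into one effective decision: the most restrictive value always wins.
function effectiveRobots(signals: string[]): { index: boolean; follow: boolean } {
  const directives = signals.flatMap((s) =>
    s.toLowerCase().split(",").map((d) => d.trim())
  );
  return {
    index: !directives.includes("noindex"),   // noindex beats index
    follow: !directives.includes("nofollow"), // nofollow beats follow
  };
}
```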
X-Robots-Tag HTTP header
X-Robots-Tag is an HTTP response header that carries the same directives as meta robots but works on any resource, not just HTML. Use it when you cannot add a meta tag: PDFs, images, videos, binary files, or responses generated by frameworks where modifying the HTML head is awkward. The header accepts the same values as meta robots and can also target specific crawlers.
X-Robots-Tag: noindex
X-Robots-Tag: noindex, nofollow
X-Robots-Tag: googlebot: noindex, nosnippet
X-Robots-Tag: unavailable_after: 2026-12-31T00:00:00+00:00

In Apache (httpd.conf or .htaccess, with mod_headers enabled):

<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
<FilesMatch "\.(doc|docx|xls|xlsx)$">
Header set X-Robots-Tag "noindex"
</FilesMatch>

The Nginx equivalent:

# Block PDF indexing site-wide
location ~* \.pdf$ {
add_header X-Robots-Tag "noindex, nofollow";
}
# Block the entire staging subdomain
server {
server_name staging.example.com;
add_header X-Robots-Tag "noindex, nofollow" always;
# ... rest of config
}

In Next.js, custom headers go in the config file:

// next.config.ts
export default {
async headers() {
return [
{
source: "/internal/:path*",
headers: [
{ key: "X-Robots-Tag", value: "noindex, nofollow" },
],
},
{
source: "/(.*\\.pdf)",
headers: [
{ key: "X-Robots-Tag", value: "noindex" },
],
},
];
},
};

PDFs often leak into Search: Marketing teams upload whitepapers, internal reports, and legal documents as PDFs without realising Google indexes them aggressively. If a PDF should not appear in Search, set X-Robots-Tag: noindex at the webserver layer. Meta tags do not exist on PDFs.
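Auditing this is a one-liner per URL: fetch the headers and check the X-Robots-Tag value. A sketch assuming Node 18+; xRobotsNoindex and KNOWN_BOTS are hypothetical names, and only the plain and single-UA-scoped header forms are handled:

```typescript
const KNOWN_BOTS = ["googlebot", "bingbot"]; // extend as needed

// Does this X-Robots-Tag value carry noindex for the given bot?
// Handles "noindex, nofollow" and UA-scoped "googlebot: noindex" forms;
// a "unavailable_after: ..." prefix is correctly not treated as a bot name.
function xRobotsNoindex(value: string | null, bot = "googlebot"): boolean {
  if (!value) return false;
  let directives = value;
  const colon = value.indexOf(":");
  if (colon !== -1) {
    const prefix = value.slice(0, colon).trim().toLowerCase();
    if (KNOWN_BOTS.includes(prefix)) {
      if (prefix !== bot) return false; // scoped to a different crawler
      directives = value.slice(colon + 1);
    }
  }
  return directives.toLowerCase().split(",").some((d) => d.trim() === "noindex");
}

// Usage (Node 18+ global fetch):
// const res = await fetch("https://www.example.com/whitepaper.pdf", { method: "HEAD" });
// console.log(xRobotsNoindex(res.headers.get("x-robots-tag")));
```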
CMS-specific robots patterns
Every major CMS ships with a default robots configuration that works for the average site but rarely for yours. Below are the defaults and the overrides we apply in audits for WordPress, Shopify, Next.js app router, Webflow, and Squarespace.
WordPress
WordPress serves a virtual /robots.txt from wp-includes/functions.php unless a physical file exists. The default blocks /wp-admin/ and allows /wp-admin/admin-ajax.php. Yoast and Rank Math both let you edit robots.txt from the dashboard, which overrides the virtual file.
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-login.php
Disallow: /?s=
Disallow: /search/
# Ensure assets remain crawlable
Allow: /wp-content/uploads/
Allow: /wp-content/themes/*.css$
Allow: /wp-content/themes/*.js$
Allow: /wp-includes/js/
Sitemap: https://www.example.com/sitemap_index.xml

Shopify
Shopify ships a locked default robots.txt that blocks cart, checkout, search, and a handful of Liquid endpoints. Since the 2021 update, merchants can customize via a robots.txt.liquid template. Common additions: block tag archives, collection sort parameters, and the /collections/vendors URL space.
{% comment %} robots.txt.liquid overrides {% endcomment %}
{% for group in robots.default_groups %}
{{- group.user_agent }}
{% for rule in group.rules %}
{{ rule }}
{% endfor %}
# Custom rules
Disallow: /collections/*+*
Disallow: /*?pf_t_
Disallow: /collections/vendors
{% if group.sitemap != blank %}
{{ group.sitemap }}
{% endif %}
{% endfor %}

Next.js app router
Next.js supports both a static public/robots.txt and a dynamic app/robots.ts file that exports Metadata API rules. The dynamic version is easier to version-control alongside sitemap generation.
// app/robots.ts
import type { MetadataRoute } from "next";
export default function robots(): MetadataRoute.Robots {
return {
rules: [
{
userAgent: "*",
allow: "/",
disallow: ["/api/", "/dashboard/", "/auth/"],
},
{
userAgent: "GPTBot",
disallow: "/",
},
],
sitemap: "https://www.example.com/sitemap.xml",
host: "https://www.example.com",
};
}

Webflow
Webflow generates robots.txt from Project Settings, SEO tab, Indexing. Hosting on the webflow.io subdomain auto-generates Disallow: / so the preview domain never ranks. After connecting a custom domain, edit the robots.txt field to allow crawling and reference the CMS-generated sitemap at /sitemap.xml.
Squarespace
Squarespace manages robots.txt automatically and does not expose it for editing. Page-level noindex is controlled from SEO Settings on each page. For sitewide noindex during development, enable the Site-wide Noindex toggle in Home, Marketing, SEO. Remove it before launch — about 10% of Squarespace audits we run find this toggle still active months after go-live.
Staging domain hygiene: Every CMS supports a staging or preview environment. Apply sitewide noindex via meta robots, X-Robots-Tag, or HTTP basic auth on staging. Do not rely on robots.txt alone — disallowed URLs can still leak into Search via external links.
JavaScript-rendered pages and SPAs
Single-page apps built with React, Vue, Svelte, or Angular mount the UI client-side. Meta robots tags injected after hydration introduce subtle timing issues that can make noindex fail silently or take weeks longer to propagate. Understanding how Googlebot renders JavaScript is critical for SPAs.
Googlebot fetches the raw HTML first and extracts links, canonicals, and the initial meta robots tag from the static markup. The page is then queued for rendering in a headless Chromium pool. Rendering can take seconds to days depending on load. If the meta robots tag only exists after JavaScript execution, the first-pass indexing decision is made from the initial HTML, usually with index, follow by default.
Server-rendered frameworks like Next.js, Nuxt, Remix, and SvelteKit include meta robots in the initial HTML response. This is the only reliable way to signal noindex to Googlebot.
// Next.js app router — per-route noindex
export const metadata = {
robots: {
index: false,
follow: true,
nocache: true,
},
};
// Produces in the HTML head:
// <meta name="robots" content="noindex, follow, nocache" />

Libraries like React Helmet or Vue Head mutate the document head after mount. Googlebot does honor the rendered-HTML meta robots eventually, but the first crawl uses the raw HTML. If your site only adds noindex via JavaScript, expect a lag of days or weeks before pages leave the index.
// React Helmet — runs only after client mount
import { Helmet } from "react-helmet-async";
function PrivatePage() {
return (
<>
<Helmet>
<meta name="robots" content="noindex" />
</Helmet>
<Dashboard />
</>
);
}
// Raw HTML initially has no meta robots tag.
// Googlebot indexes the route on first crawl, then updates later.

A classic SPA bug: the server returns 200 for every route and the client decides whether to show content or an error. If a user hits a deleted URL, the client renders a friendly "Not found" message but the HTTP status is still 200. Google flags these as soft 404s and wastes crawl budget. Fix by routing unknown paths to a URL that genuinely returns 404, or by pre-rendering known routes and returning proper status codes server-side. See our HTTP status codes reference for the full breakdown.
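The fix reduces to deciding the status server-side before any HTML ships. A minimal sketch under stated assumptions: the route table and statusFor are hypothetical, and in the Next.js app router the equivalent is calling notFound() from a server component so the framework emits a real 404.

```typescript
// Hypothetical route table; in a real app this comes from the router or CMS.
const KNOWN_ROUTES = new Set(["/", "/pricing", "/blog"]);

// Decide the HTTP status before rendering: unknown paths get a genuine 404,
// never a 200 carrying a client-rendered "Not found" message (a soft 404).
function statusFor(path: string): 200 | 404 {
  return KNOWN_ROUTES.has(path) ? 200 : 404;
}
```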
Verify with URL Inspection: Search Console URL Inspection shows the rendered HTML Google actually sees. Compare View crawled page with View tested page to confirm your meta robots tag survives rendering. If the rendered HTML lacks the tag, move the directive server-side.
Canonical, noindex, and blocking strategies
Robots.txt disallow, meta noindex, and rel=canonical each solve different problems. Combining them incorrectly is the most common cause of indexing regressions we find during client audits. The sections below show when to use each mechanism and what fails when they are mixed carelessly.
If robots.txt blocks a URL, Googlebot cannot fetch the page, which means it cannot read the meta robots tag. The URL remains eligible for indexing via external links and shows up in Search with no snippet. To guarantee a URL is de-indexed, allow crawling and apply noindex via meta tag or X-Robots-Tag. Once the page is removed from the index, disallow can be added back if crawl budget is a concern.
rel=canonical tells Google that two or more URLs represent the same content and that the canonical URL should be the one indexed. The non-canonical versions remain crawlable and pass signals. noindex removes the page from the index entirely. Use canonical for duplicate URLs (parameters, print versions, mobile/desktop splits) and noindex for pages that should never rank (internal search results, user dashboards, thin tag archives).
When a URL is currently indexed and must be removed, follow this sequence.
- Ensure robots.txt does not disallow the URL. Googlebot must be able to crawl the page to read any noindex directive.
- Add meta robots content="noindex" or X-Robots-Tag: noindex to the response.
- Submit the URL via Search Console Removals for a six-month temporary hide while the noindex propagates.
- Wait for Googlebot to re-crawl. Confirm removal in URL Inspection — the Indexing state should switch to "Excluded by noindex tag."
- Once de-indexed, optionally add the URL to robots.txt disallow to save future crawl budget. Do not add disallow before the page is out of the index.
Every major AI provider now operates a dedicated crawler for training-data collection. Google-Extended controls Google Gemini and Vertex AI training and is separate from Googlebot; blocking Google-Extended does not affect Search. OpenAI operates GPTBot, Common Crawl operates CCBot, Anthropic operates ClaudeBot, Perplexity operates PerplexityBot, and Apple operates Applebot-Extended for Apple Intelligence. All of these respect robots.txt. Note that AI Overviews are a Google Search feature, so blocking Google-Extended does not keep a site out of them; nosnippet is the control that matters for what surfaces there.
# Block all major AI training crawlers
User-agent: Google-Extended
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
# Keep normal search indexing open
User-agent: *
Allow: /
Sitemap: https://www.example.com/sitemap.xml

AI Overviews and nosnippet: Google has confirmed that pages with meta robots nosnippet will not appear as sources in AI Overviews. If you want to stay indexed for Search but out of AI answer boxes, apply nosnippet rather than noindex. For broader AI-blocking strategy, combine nosnippet in meta tags with Google-Extended and GPTBot disallow in robots.txt, and cross-reference our technical SEO audit checklist and SEO glossary for supporting terminology.
Audit Crawl and Index Control
Misconfigured robots.txt and stale noindex tags silently drain traffic. Our technical SEO audits catch them, and we ship the fixes alongside ongoing indexing governance.