Robots.txt and Meta Robots: Complete SEO Reference
Complete 2026 reference to robots.txt directives and meta robots tags — crawling, indexing, noindex, X-Robots-Tag, and JS rendering pitfalls.
Robots.txt syntax and directives
Robots.txt is a plain-text file at the root of a domain that tells crawlers which URL paths they may fetch. It lives at /robots.txt and is fetched on nearly every first visit by every well-behaved crawler. The file controls crawling, not indexing — a critical distinction that catches out roughly 30% of the technical audits we run.
User-agent: *
Allow: /
Sitemap: https://www.example.com/sitemap.xml

This is the default for most marketing sites. User-agent: * applies to every crawler. Allow: / permits every path. The Sitemap line is not a crawl directive but a discovery hint, and is read by all major crawlers.
Each group of rules begins with a User-agent line. The match is case-insensitive and uses longest-match-wins. A crawler reads only the most specific matching group and ignores other groups entirely, so splitting rules across multiple generic and specific groups rarely behaves as expected.
User-agent: Googlebot
Disallow: /private/
User-agent: Bingbot
Disallow: /private/
Disallow: /beta/
User-agent: *
Disallow: /internal/

Disallow blocks a path; Allow grants an exception within a blocked path. Both take a URL path starting with /. An empty Disallow: allows everything for that group. More specific (longer) rules win regardless of order in the file.
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /search
Disallow: /cart
Disallow: /checkout

One or more Sitemap lines declare sitemap locations. These are independent of User-agent groups and can appear anywhere in the file. Always use an absolute URL. Multiple sitemaps are valid for sites with separate product, blog, and image sitemaps.
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap-products.xml
Sitemap: https://www.example.com/sitemap-blog.xml

Crawl-delay sets the number of seconds between requests. Bing, Yandex, and Seznam honor it. Google ignores crawl-delay entirely and manages rate from server response times, error rates, and response status codes (the Search Console crawl-rate limiter was retired in early 2024). If Googlebot is overwhelming your origin, return 503 with a Retry-After header rather than adding crawl-delay.
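The 503 backpressure approach can be sketched as a small helper. This is a minimal illustration, not a production pattern; crawlBackpressure is a hypothetical name and the 120-second Retry-After is an arbitrary choice.

```typescript
// Decide the crawler-facing response under load: 503 + Retry-After tells
// well-behaved bots to back off without any robots.txt change.
// The real "overloaded" signal would come from queue depth, CPU, or upstream latency.
function crawlBackpressure(overloaded: boolean): { status: number; headers: Record<string, string> } {
  return overloaded
    ? { status: 503, headers: { "Retry-After": "120" } } // ask crawlers to retry in 2 minutes
    : { status: 200, headers: {} };
}

// Wiring into a Node HTTP server (sketch):
// import { createServer } from "node:http";
// createServer((req, res) => {
//   const { status, headers } = crawlBackpressure(isOriginOverloaded());
//   res.writeHead(status, headers);
//   res.end(status === 503 ? "Service temporarily unavailable" : renderPage(req));
// }).listen(8080);
```

Because Googlebot reacts to 503s by slowing down and retrying later, this throttles crawl without the permanence of a Disallow rule.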
Anything after # on a line is a comment. Useful for documenting why a rule exists so the next engineer does not remove it in a cleanup pass.
# Block faceted navigation parameters to save crawl budget
User-agent: *
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*?sort=

Size limit: Google fetches and parses the first 500 KiB of robots.txt. Anything past that byte boundary is ignored. Most sites never approach this; enterprise sites with thousands of rules should audit total file size and collapse patterns where possible.
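The byte check is trivial to automate. A sketch, assuming a Node 18+ environment; robotsSizeReport is a hypothetical helper name:

```typescript
import { Buffer } from "node:buffer";

// Google parses only the first 500 KiB of robots.txt; anything beyond is ignored.
const GOOGLE_ROBOTS_LIMIT = 500 * 1024;

// Report the on-the-wire byte size (not character count) and whether
// Google would truncate the file at its parse limit.
function robotsSizeReport(body: string): { bytes: number; truncatedByGoogle: boolean } {
  const bytes = Buffer.byteLength(body, "utf8");
  return { bytes, truncatedByGoogle: bytes > GOOGLE_ROBOTS_LIMIT };
}

// Usage (Node 18+ global fetch):
// const body = await (await fetch("https://www.example.com/robots.txt")).text();
// console.log(robotsSizeReport(body));
```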
Pattern matching and wildcards
Google, Bing, and Yandex all support two wildcard operators in robots.txt: the asterisk * matches any sequence of characters, and the dollar sign $ anchors a pattern to the end of the URL. These operators are powerful but deceptively fragile — a wildcard with the wrong prefix can block half your site in seconds.
Use * to match any sequence. This is most useful for query parameters and dynamic path segments. Note that / in the URL is matched by * too.
User-agent: *
# Block every URL with a session ID parameter
Disallow: /*?sessionid=
Disallow: /*&sessionid=
# Block any path containing /preview/
Disallow: /*/preview/
# Block tracking parameters globally
Disallow: /*?utm_
Disallow: /*&utm_

$ marks the end of the URL string. Without it, a pattern matches every URL that starts with the prefix, whatever follows. This catches out anyone trying to block only PDFs or only a specific file extension.
User-agent: *
# Block every PDF anywhere on the site
Disallow: /*.pdf$
# Block URLs ending exactly in /print; /article/print?query=true is not matched
Disallow: /*/print$
# Combine: block any URL ending in .xml except the sitemap
Disallow: /*.xml$
Allow: /sitemap.xml$

Googlebot renders pages before indexing. If CSS or JavaScript is disallowed, Google sees an unstyled or broken version and may rank the page lower or classify it as cloaking. Explicitly allow the asset paths your framework uses.
User-agent: *
Disallow: /internal/
# Always permit rendering assets
Allow: /*.css$
Allow: /*.js$
Allow: /_next/static/
Allow: /static/
Allow: /wp-content/themes/
Allow: /wp-includes/js/

When Disallow and Allow both match a URL, Google picks the rule with the longer character count. Wildcards are counted as a single character. This differs from regex engines and frequently surprises developers writing their first rules.
User-agent: *
Disallow: /products/ # 10 chars
Allow: /products/featured/ # 19 chars — WINS for /products/featured/shoe
Result: /products/shoe is blocked, /products/featured/shoe is allowed.

Test every rule: Search Console's robots.txt report (which replaced the standalone robots.txt Tester in 2023) shows how Google fetched and parsed your file; to check individual URLs against specific user-agents, use the URL Inspection tool or Google's open-source robots.txt parser. Always test before deploying. A single stray Disallow: / in production can drop every page from crawl within a day.
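Longest-rule-wins can be sketched as a small matcher. This is a deliberate simplification: user-agent group selection, percent-encoding, and the 500 KiB limit are ignored, and patternMatches, isAllowed, and Rule are hypothetical names, not Google's implementation.

```typescript
type Rule = { type: "allow" | "disallow"; path: string };

// Translate a robots.txt pattern to a regex: '*' matches any run of
// characters, and a trailing '$' anchors the pattern to the end of the URL.
function patternMatches(pattern: string, url: string): boolean {
  let anchored = false;
  if (pattern.endsWith("$")) {
    anchored = true;
    pattern = pattern.slice(0, -1);
  }
  const body = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&")) // escape regex metacharacters
    .join(".*");
  return new RegExp("^" + body + (anchored ? "$" : "")).test(url);
}

// Longest matching rule wins; on a tie, Allow (the less restrictive rule) wins.
function isAllowed(rules: Rule[], url: string): boolean {
  let best: Rule | undefined;
  for (const rule of rules) {
    if (!patternMatches(rule.path, url)) continue;
    if (
      !best ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.type === "allow")
    ) {
      best = rule;
    }
  }
  return !best || best.type === "allow"; // no matching rule means the URL is crawlable
}
```

Running the /products/ example above through isAllowed reproduces the stated result: the 19-character Allow beats the 10-character Disallow for anything under /products/featured/.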
Meta robots tag values
The meta robots tag is a <meta> element placed in the <head> of an HTML document that instructs crawlers how to index and display the page. Unlike robots.txt, which only controls crawling, meta robots directly controls indexing behavior. Values can be combined in a comma-separated list and are case-insensitive.
Without a meta robots tag, Google defaults to index, follow — the page can be indexed and links can pass PageRank. Adding noindex removes the page from the index; adding nofollow prevents link equity from flowing through any on-page links.
<!-- Default behavior, no tag needed -->
<meta name="robots" content="index, follow">
<!-- Remove from index, still crawl links -->
<meta name="robots" content="noindex, follow">
<!-- Do not pass PageRank through links -->
<meta name="robots" content="index, nofollow">
<!-- Full block from index and no link equity -->
<meta name="robots" content="noindex, nofollow">

Replace name="robots" with a specific user-agent to target only that crawler. Conflicting directives resolve to the most restrictive combination across all applicable tags.
<!-- Applies to all crawlers -->
<meta name="robots" content="noindex">
<!-- Googlebot only -->
<meta name="googlebot" content="noindex">
Note that Google-Extended is a robots.txt user-agent token only; it is not a valid meta robots name, so a tag such as <meta name="Google-Extended" content="noindex"> has no effect. Block AI training via robots.txt instead.

noarchive prevents Google from offering a cached copy of the page. Since Google removed the cached-page feature from Search in early 2024, this directive is largely historical for Google, though Bing still honors it. nosnippet hides the text snippet and the video preview in search results, leaving only the title and URL. Publishers use nosnippet to keep content out of AI Overviews while remaining indexable.
<!-- Index the page, no cached copy -->
<meta name="robots" content="noarchive">
<!-- Index the page, no snippet shown in results or AI features -->
<meta name="robots" content="nosnippet">

The max-snippet, max-image-preview, and max-video-preview directives give precise control over snippet length, image preview size, and video preview duration. max-snippet takes a character count (-1 for unlimited, 0 to disable). max-image-preview accepts none, standard, or large. max-video-preview takes seconds (-1 for unlimited, 0 for none).
<!-- Recommended for editorial content -->
<meta name="robots" content="max-snippet:-1, max-image-preview:large, max-video-preview:-1">
<!-- Restrict to short snippets and no large images -->
<meta name="robots" content="max-snippet:160, max-image-preview:standard">

unavailable_after tells Google to drop the URL from the index after a specified date, given in a widely adopted format such as RFC 822 or ISO 8601 (the example below uses ISO 8601). Useful for time-limited offers, event pages, or seasonal campaigns. Google stops showing the URL in Search roughly 24 hours after the date passes.
<meta name="robots" content="unavailable_after: 2026-12-31T23:59:59+00:00">

Combining values: When multiple meta robots tags or values conflict, Google uses the most restrictive combination. noindex always wins over index; nofollow always wins over follow. For a full refresher on crawler behavior, see our how search engines work guide.
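The most-restrictive rule can be expressed directly. A tiny sketch with a hypothetical helper name (effectiveRobots); real resolution is per-crawler and covers more directives than index/follow:

```typescript
// Collapse every applicable robots signal (meta tags, X-Robots-Tag headers)
// into one effective decision: the most restrictive value always wins.
function effectiveRobots(signals: string[]): { index: boolean; follow: boolean } {
  const directives = signals.flatMap((s) =>
    s.toLowerCase().split(",").map((d) => d.trim())
  );
  return {
    index: !directives.includes("noindex"),   // noindex beats index
    follow: !directives.includes("nofollow"), // nofollow beats follow
  };
}
```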
X-Robots-Tag HTTP header
X-Robots-Tag is an HTTP response header that carries the same directives as meta robots but works on any resource, not just HTML. Use it when you cannot add a meta tag: PDFs, images, videos, binary files, or responses generated by frameworks where modifying the HTML head is awkward. The header accepts the same values as meta robots and can also target specific crawlers.
X-Robots-Tag: noindex
X-Robots-Tag: noindex, nofollow
X-Robots-Tag: googlebot: noindex, nosnippet
X-Robots-Tag: unavailable_after: 2026-12-31T00:00:00+00:00

In Apache (httpd.conf or .htaccess, with mod_headers enabled):

<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
<FilesMatch "\.(doc|docx|xls|xlsx)$">
Header set X-Robots-Tag "noindex"
</FilesMatch>

The Nginx equivalent:

# Block PDF indexing site-wide
location ~* \.pdf$ {
add_header X-Robots-Tag "noindex, nofollow";
}
# Block the entire staging subdomain
server {
server_name staging.example.com;
add_header X-Robots-Tag "noindex, nofollow" always;
# ... rest of config
}

In Next.js, custom headers go in the config file:

// next.config.ts
export default {
async headers() {
return [
{
source: "/internal/:path*",
headers: [
{ key: "X-Robots-Tag", value: "noindex, nofollow" },
],
},
{
source: "/(.*\\.pdf)",
headers: [
{ key: "X-Robots-Tag", value: "noindex" },
],
},
];
},
};

PDFs often leak into Search: Marketing teams upload whitepapers, internal reports, and legal documents as PDFs without realising Google indexes them aggressively. If a PDF should not appear in Search, set X-Robots-Tag: noindex at the webserver layer. Meta tags do not exist on PDFs.
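Auditing this is a one-liner per URL: fetch the headers and check the X-Robots-Tag value. A sketch assuming Node 18+; xRobotsNoindex and KNOWN_BOTS are hypothetical names, and only the plain and single-UA-scoped header forms are handled:

```typescript
const KNOWN_BOTS = ["googlebot", "bingbot"]; // extend as needed

// Does this X-Robots-Tag value carry noindex for the given bot?
// Handles "noindex, nofollow" and UA-scoped "googlebot: noindex" forms;
// a "unavailable_after: ..." prefix is correctly not treated as a bot name.
function xRobotsNoindex(value: string | null, bot = "googlebot"): boolean {
  if (!value) return false;
  let directives = value;
  const colon = value.indexOf(":");
  if (colon !== -1) {
    const prefix = value.slice(0, colon).trim().toLowerCase();
    if (KNOWN_BOTS.includes(prefix)) {
      if (prefix !== bot) return false; // scoped to a different crawler
      directives = value.slice(colon + 1);
    }
  }
  return directives.toLowerCase().split(",").some((d) => d.trim() === "noindex");
}

// Usage (Node 18+ global fetch):
// const res = await fetch("https://www.example.com/whitepaper.pdf", { method: "HEAD" });
// console.log(xRobotsNoindex(res.headers.get("x-robots-tag")));
```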
CMS-specific robots patterns
Every major CMS ships with a default robots configuration that works for the average site but rarely for yours. Below are the defaults and the overrides we apply in audits for WordPress, Shopify, Next.js app router, Webflow, and Squarespace.
WordPress
WordPress serves a virtual /robots.txt from wp-includes/functions.php unless a physical file exists. The default blocks /wp-admin/ and allows /wp-admin/admin-ajax.php. Yoast and Rank Math both let you edit robots.txt from the dashboard, which overrides the virtual file.
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-login.php
Disallow: /?s=
Disallow: /search/
# Ensure assets remain crawlable
Allow: /wp-content/uploads/
Allow: /wp-content/themes/*.css$
Allow: /wp-content/themes/*.js$
Allow: /wp-includes/js/
Sitemap: https://www.example.com/sitemap_index.xml

Shopify
Shopify ships a locked default robots.txt that blocks cart, checkout, search, and a handful of Liquid endpoints. Since the 2021 update, merchants can customize via a robots.txt.liquid template. Common additions: block tag archives, collection sort parameters, and the /collections/vendors URL space.
{% comment %} robots.txt.liquid overrides {% endcomment %}
{% for group in robots.default_groups %}
{{- group.user_agent }}
{% for rule in group.rules %}
{{ rule }}
{% endfor %}
# Custom rules
Disallow: /collections/*+*
Disallow: /*?pf_t_
Disallow: /collections/vendors
{% if group.sitemap != blank %}
{{ group.sitemap }}
{% endif %}
{% endfor %}

Next.js app router
Next.js supports both a static public/robots.txt and a dynamic app/robots.ts file that exports Metadata API rules. The dynamic version is easier to version-control alongside sitemap generation.
// app/robots.ts
import type { MetadataRoute } from "next";
export default function robots(): MetadataRoute.Robots {
return {
rules: [
{
userAgent: "*",
allow: "/",
disallow: ["/api/", "/dashboard/", "/auth/"],
},
{
userAgent: "GPTBot",
disallow: "/",
},
],
sitemap: "https://www.example.com/sitemap.xml",
host: "https://www.example.com",
};
}

Webflow
Webflow generates robots.txt from Project Settings, SEO tab, Indexing. Hosting on the webflow.io subdomain auto-generates Disallow: / so the preview domain never ranks. After connecting a custom domain, edit the robots.txt field to allow crawling and reference the CMS-generated sitemap at /sitemap.xml.
Squarespace
Squarespace manages robots.txt automatically and does not expose it for editing. Page-level noindex is controlled from SEO Settings on each page. For sitewide noindex during development, enable the Site-wide Noindex toggle in Home, Marketing, SEO. Remove it before launch — about 10% of Squarespace audits we run find this toggle still active months after go-live.
Staging domain hygiene: Every CMS supports a staging or preview environment. Apply sitewide noindex via meta robots, X-Robots-Tag, or HTTP basic auth on staging. Do not rely on robots.txt alone — disallowed URLs can still leak into Search via external links.
JavaScript-rendered pages and SPAs
Single-page apps built with React, Vue, Svelte, or Angular mount the UI client-side. Meta robots tags injected after hydration introduce subtle timing issues that can make noindex fail silently or take weeks longer to propagate. Understanding how Googlebot renders JavaScript is critical for SPAs.
Googlebot fetches the raw HTML first and extracts links, canonicals, and the initial meta robots tag from the static markup. The page is then queued for rendering in a headless Chromium pool. Rendering can take seconds to days depending on load. If the meta robots tag only exists after JavaScript execution, the first-pass indexing decision is made from the initial HTML, usually with index, follow by default.
Server-rendered frameworks like Next.js, Nuxt, Remix, and SvelteKit include meta robots in the initial HTML response. This is the only reliable way to signal noindex to Googlebot.
// Next.js app router — per-route noindex
export const metadata = {
robots: {
index: false,
follow: true,
nocache: true,
},
};
// Produces in the HTML head:
// <meta name="robots" content="noindex, follow, nocache" />

Libraries like React Helmet or Vue Head mutate the document head after mount. Googlebot does honor the rendered-HTML meta robots eventually, but the first crawl uses the raw HTML. If your site only adds noindex via JavaScript, expect a lag of days or weeks before pages leave the index.
// React Helmet — runs only after client mount
import { Helmet } from "react-helmet-async";
function PrivatePage() {
return (
<>
<Helmet>
<meta name="robots" content="noindex" />
</Helmet>
<Dashboard />
</>
);
}
// Raw HTML initially has no meta robots tag.
// Googlebot indexes the route on first crawl, then updates later.

A classic SPA bug: the server returns 200 for every route and the client decides whether to show content or an error. If a user hits a deleted URL, the client renders a friendly "Not found" message but the HTTP status is still 200. Google flags these as soft 404s and wastes crawl budget. Fix by routing unknown paths to a URL that genuinely returns 404, or by pre-rendering known routes and returning proper status codes server-side. See our HTTP status codes reference for the full breakdown.
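The fix reduces to deciding the status server-side before any HTML ships. A minimal sketch under stated assumptions: the route table and statusFor are hypothetical, and in the Next.js app router the equivalent is calling notFound() from a server component so the framework emits a real 404.

```typescript
// Hypothetical route table; in a real app this comes from the router or CMS.
const KNOWN_ROUTES = new Set(["/", "/pricing", "/blog"]);

// Decide the HTTP status before rendering: unknown paths get a genuine 404,
// never a 200 carrying a client-rendered "Not found" message (a soft 404).
function statusFor(path: string): 200 | 404 {
  return KNOWN_ROUTES.has(path) ? 200 : 404;
}
```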
Verify with URL Inspection: Search Console URL Inspection shows the rendered HTML Google actually sees. Compare View crawled page with View tested page to confirm your meta robots tag survives rendering. If the rendered HTML lacks the tag, move the directive server-side.
Canonical, noindex, and blocking strategies
Robots.txt disallow, meta noindex, and rel=canonical each solve different problems. Combining them incorrectly is the most common cause of indexing regressions we find during client audits. The sections below show when to use each mechanism and what fails when they are mixed carelessly.
If robots.txt blocks a URL, Googlebot cannot fetch the page, which means it cannot read the meta robots tag. The URL remains eligible for indexing via external links and shows up in Search with no snippet. To guarantee a URL is de-indexed, allow crawling and apply noindex via meta tag or X-Robots-Tag. Once the page is removed from the index, disallow can be added back if crawl budget is a concern.
rel=canonical tells Google that two or more URLs represent the same content and that the canonical URL should be the one indexed. The non-canonical versions remain crawlable and pass signals. noindex removes the page from the index entirely. Use canonical for duplicate URLs (parameters, print versions, mobile/desktop splits) and noindex for pages that should never rank (internal search results, user dashboards, thin tag archives).
When a URL is currently indexed and must be removed, follow this sequence.
- Ensure robots.txt does not disallow the URL. Googlebot must be able to crawl the page to read any noindex directive.
- Add meta robots content="noindex" or X-Robots-Tag: noindex to the response.
- Submit the URL via Search Console Removals for a six-month temporary hide while the noindex propagates.
- Wait for Googlebot to re-crawl. Confirm removal in URL Inspection — the Indexing state should switch to "Excluded by noindex tag."
- Once de-indexed, optionally add the URL to robots.txt disallow to save future crawl budget. Do not add disallow before the page is out of the index.
Every major AI provider now operates a dedicated crawler for training-data collection. Google-Extended controls Google Gemini and Vertex AI training and is separate from Googlebot; blocking Google-Extended does not affect Search. OpenAI operates GPTBot, Common Crawl operates CCBot, Anthropic operates ClaudeBot, Perplexity operates PerplexityBot, and Apple operates Applebot-Extended for Apple Intelligence. All of these respect robots.txt. Note that AI Overviews are a Google Search feature, so blocking Google-Extended does not keep a site out of them; nosnippet is the control that matters for what surfaces there.
# Block all major AI training crawlers
User-agent: Google-Extended
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
# Keep normal search indexing open
User-agent: *
Allow: /
Sitemap: https://www.example.com/sitemap.xml

AI Overviews and nosnippet: Google has confirmed that pages with meta robots nosnippet will not appear as sources in AI Overviews. If you want to stay indexed for Search but out of AI answer boxes, apply nosnippet rather than noindex. For broader AI-blocking strategy, combine nosnippet in meta tags with Google-Extended and GPTBot disallow in robots.txt, and cross-reference our technical SEO audit checklist and SEO glossary for supporting terminology.
Audit Crawl and Index Control
Misconfigured robots.txt and stale noindex tags silently drain traffic. Our technical SEO audits catch them, and we ship the fixes alongside ongoing indexing governance.