How to Block AI Crawlers: A Complete Guide
AI companies are crawling the web to train their models on publicly accessible content. Whether you want to prevent that, allow it selectively, or just understand what's happening to your site—this guide covers everything you need to know.
The Current Crawler Landscape
AI companies need text to train language models. They get it several ways: by crawling the web directly with their own bots, by licensing curated datasets, and by using public archives like Common Crawl. Your site is probably being crawled by multiple AI bots right now, and most of their traffic doesn't show up in JavaScript-based analytics tools like Google Analytics—you need raw server logs to see them.
The number of distinct AI crawlers grew from a handful in 2022 to dozens by 2026. The major ones are documented and respect robots.txt. Many smaller ones aren't documented and don't. The robots.txt approach is a meaningful signal to the companies that respect it—and those happen to be the ones training the most widely deployed models.
Training Bots vs. Browsing Bots
This distinction matters when deciding what to block: the two kinds of bots have different purposes and different effects on your site.
Training bots collect your content to train AI models. Once your content is used in a training run, it influences the model permanently: there is no retrieval or citation, because the text is absorbed into the weights.
Examples: GPTBot, ClaudeBot, Google-Extended, CCBot, Meta-ExternalAgent, Amazonbot, Applebot-Extended
Browsing bots access your site in real time when a user asks an AI assistant a question about current information, much like a user clicking your link. They may cite your content and send referral traffic.
Examples: ChatGPT-User, Claude-Web, PerplexityBot, OAI-SearchBot
Most site owners want to block training crawlers while allowing browsing crawlers. Blocking browsing crawlers means AI assistants can't access your current content to answer questions—you lose the citation traffic. Allowing them means your content might be summarized in AI answers, which some view as competition for their own traffic.
All Known AI Crawlers
OpenAI (ChatGPT, GPT-4)
| User-Agent | Purpose | Recommendation |
|---|---|---|
| GPTBot | Training data collection for GPT models | Block to prevent training |
| ChatGPT-User | Real-time browsing for user queries | Allow for ChatGPT citations |
| OAI-SearchBot | OpenAI search and index features | Allow for search visibility |
Anthropic (Claude)
| User-Agent | Purpose | Recommendation |
|---|---|---|
| ClaudeBot | Training data for Claude models | Block |
| anthropic-ai | General Anthropic crawling | Block alongside ClaudeBot |
| Claude-Web | Real-time browsing for Claude users | Allow for Claude citations |
Google (Gemini)
| User-Agent | Purpose | Recommendation |
|---|---|---|
| Google-Extended | AI training for Gemini and AI Overview features | Block |
| Googlebot | Web search indexing (SEO) | Never block |
| Google-CloudVertexBot | Vertex AI training data | Block |
Critical: Google-Extended and Googlebot are completely separate. Blocking Google-Extended has no effect on your Google Search rankings.
Meta (Llama)
| User-Agent | Purpose | Recommendation |
|---|---|---|
| Meta-ExternalAgent | Training data for Llama and Meta AI | Block |
| Meta-ExternalFetcher | Link previews in Facebook/Instagram | Usually allow |
Other notable crawlers
| User-Agent | Company | Purpose | Action |
|---|---|---|---|
CCBot | Common Crawl | Public dataset used by many AI companies for training | Block |
Bytespider | ByteDance (TikTok) | Aggressive crawler, suspected AI training | Block |
Applebot-Extended | Apple | Apple Intelligence training | Block |
Amazonbot | Amazon | Alexa and Amazon AI training | Block |
PerplexityBot | Perplexity | AI-powered search engine (browsing, not training) | Allow |
cohere-ai | Cohere | LLM training data | Block |
Diffbot | Diffbot | Web data extraction for AI applications | Block |
ImagesiftBot | ImageSift | Image training data collection | Block |
Omgilibot | Omgili/Webz.io | Data collection for AI training datasets | Block |
robots.txt: The Primary Method
Your robots.txt file lives at the root of your domain: https://yoursite.com/robots.txt. Every compliant crawler checks this file before visiting any other page on your site. This is the most reliable method for controlling AI crawlers that have committed to respecting it.
Block all training crawlers
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Google-CloudVertexBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Diffbot
Disallow: /
# Allow browsing bots (can cite your content)
User-agent: ChatGPT-User
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
# Always allow search engines
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
Selective blocking (specific directories only)
# Block AI training bots from premium content
User-agent: GPTBot
Disallow: /premium/
Disallow: /courses/
Disallow: /members/
# Allow access to public pages
User-agent: GPTBot
Allow: /blog/
Allow: /about/
Block everything with an allowlist
# Deny all by default, allow only known search engines
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: DuckDuckBot
Allow: /
This is the nuclear option: it blocks all unlisted bots, including future AI crawlers you haven't heard of yet, and only the search engines you explicitly allow can crawl your site. Note that it also blocks legitimate non-search bots (archivers, link-preview fetchers, uptime monitors), so audit the allowlist before deploying it.
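Before deploying rules like these, you can sanity-check them with Python's built-in urllib.robotparser, which evaluates robots.txt roughly the way a compliant crawler does. A minimal sketch (yoursite.com is a placeholder, and the rules are a trimmed version of the examples above):

```python
# Check what each bot may fetch under a "block training, allow browsing"
# robots.txt, using the standard library's parser.
import urllib.robotparser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://yoursite.com/blog/post"))        # False
print(rp.can_fetch("ChatGPT-User", "https://yoursite.com/blog/post"))  # True
```

The same check catches typos in user-agent names before a misformatted file silently fails to block anything.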
Server Configuration
nginx
# nginx.conf — block AI training bots with 403
# (the map block goes in the http { } context)
map $http_user_agent $is_ai_bot {
    default 0;
    "~GPTBot" 1;
    "~ClaudeBot" 1;
    "~anthropic-ai" 1;
    "~Google-Extended" 1;
    "~CCBot" 1;
    "~Bytespider" 1;
    "~Meta-ExternalAgent" 1;
    "~Amazonbot" 1;
}

server {
    # ... other config
    if ($is_ai_bot) {
        return 403;
    }
}
Alternative: rate limit instead of block
# Slow AI bots down instead of refusing them outright. nginx does not
# allow limit_req inside an "if" block, and keying the zone directly on
# $http_user_agent would throttle normal browsers too. Instead, derive a
# key that is empty (meaning "not limited") for everything except AI bots:
map $is_ai_bot $ai_bot_key {
    default "";
    1 $http_user_agent;
}
limit_req_zone $ai_bot_key zone=bots:10m rate=1r/m;

server {
    location / {
        limit_req zone=bots burst=2 nodelay;
    }
}
Apache .htaccess
# .htaccess — block AI training bots
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ClaudeBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} anthropic-ai [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Google-Extended [NC,OR]
RewriteCond %{HTTP_USER_AGENT} CCBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Bytespider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Meta-ExternalAgent [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Amazonbot [NC]
RewriteRule .* - [F,L]
Cloudflare (WAF rules)
If you use Cloudflare, you can create WAF custom rules to block AI bots at the edge before they reach your server. Go to Security → WAF → Custom Rules and create a rule matching (http.user_agent contains "GPTBot") or (http.user_agent contains "ClaudeBot") etc., with action "Block." Cloudflare also has managed bot lists that include known AI crawlers.
Server-side blocking vs. robots.txt
Server-side blocking (nginx, .htaccess, WAF) returns an HTTP error to the bot. This is technically stronger than robots.txt: the bot can't read the content even if it ignores the robots.txt directive. The downside is that your server still has to receive and answer every request. With robots.txt, compliant bots never send the request in the first place. For maximum protection, use both.
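One way to verify a server-side block is to request a page while presenting a bot's user-agent string and inspect the status code. A minimal sketch using only the standard library (https://yoursite.com/ is a placeholder for your own domain):

```python
# Fetch a URL with a chosen User-Agent and report the HTTP status code.
# A working nginx/.htaccess block should answer 403 for blocked bot UAs.
import urllib.request
import urllib.error

def status_for_ua(url, ua):
    req = urllib.request.Request(url, headers={"User-Agent": ua})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # error statuses (403, 404, ...) arrive as exceptions

# Against your own site, you would expect something like:
#   status_for_ua("https://yoursite.com/", "GPTBot")       -> 403 if blocked
#   status_for_ua("https://yoursite.com/", "Mozilla/5.0")  -> 200
```

The equivalent one-liner with curl is `curl -A "GPTBot" -o /dev/null -s -w "%{http_code}" https://yoursite.com/`.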
Monitoring Bot Traffic
Web analytics tools that rely on JavaScript (Google Analytics, Plausible, Fathom) don't capture bot traffic, because crawlers generally don't execute JavaScript. Raw server access logs are the only reliable way to see what's visiting your site.
Analyzing access logs
# Find all AI crawler hits in nginx access log
grep -E "(GPTBot|ClaudeBot|anthropic-ai|Google-Extended|CCBot|Bytespider)" \
/var/log/nginx/access.log
# Count by crawler type
grep -oE "(GPTBot|ClaudeBot|anthropic-ai|Google-Extended|CCBot|Bytespider)" \
/var/log/nginx/access.log | sort | uniq -c | sort -rn
# Show full log lines for GPTBot requests
grep "GPTBot" /var/log/nginx/access.log | tail -50
# Count requests per day for AI bots
grep "GPTBot" /var/log/nginx/access.log |
awk '{print $4}' | cut -d: -f1 | sort | uniq -c
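If you want to post-process the numbers rather than eyeball grep output, the same counting works in a few lines of Python. The log lines below are fabricated samples standing in for lines read from /var/log/nginx/access.log:

```python
# Count hits per AI crawler in access-log lines (combined log format).
import re
from collections import Counter

BOTS = re.compile(r"GPTBot|ClaudeBot|anthropic-ai|Google-Extended|CCBot|Bytespider")

sample = [
    '1.2.3.4 - - [01/Jan/2026:00:00:01 +0000] "GET /a HTTP/1.1" 200 5000 "-" "GPTBot/1.0"',
    '5.6.7.8 - - [01/Jan/2026:00:00:02 +0000] "GET /b HTTP/1.1" 200 3000 "-" "ClaudeBot/1.0"',
    '1.2.3.4 - - [01/Jan/2026:00:00:03 +0000] "GET /c HTTP/1.1" 200 2000 "-" "GPTBot/1.0"',
]

counts = Counter(m.group(0) for line in sample if (m := BOTS.search(line)))
print(counts.most_common())  # [('GPTBot', 2), ('ClaudeBot', 1)]
```

Swapping `sample` for `open("/var/log/nginx/access.log")` runs the same count over a real log.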
What to look for
- Volume: Some bots are aggressive. Bytespider in particular has been reported crawling thousands of pages per hour on small sites. High bot traffic wastes bandwidth and server resources.
- IP ranges: You can verify a bot's identity by checking if the request IP belongs to the company's documented IP ranges. OpenAI, Google, and others publish their crawler IP lists.
- Compliance: After adding robots.txt blocks, check the logs a few days later to confirm the named bots stopped visiting. If they didn't, they're not honoring robots.txt.
- New bots: Periodically check for unfamiliar user-agent strings. New AI crawlers emerge regularly.
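The IP-range check above can be automated with forward-confirmed reverse DNS, the method Google documents for verifying Googlebot: reverse-resolve the IP, check the hostname's domain, then forward-resolve the hostname and confirm it maps back to the same IP. A sketch; the trusted-suffix list is illustrative, and the two socket lookups need network access:

```python
# Verify a claimed crawler identity via forward-confirmed reverse DNS.
import socket

TRUSTED_SUFFIXES = (".googlebot.com", ".search.msn.com")  # illustrative list

def hostname_is_trusted(hostname):
    # endswith() accepts a tuple; strip a trailing dot from the PTR record
    return hostname.rstrip(".").endswith(TRUSTED_SUFFIXES)

def verify_bot_ip(ip):
    try:
        hostname = socket.gethostbyaddr(ip)[0]         # reverse lookup
    except socket.herror:
        return False
    if not hostname_is_trusted(hostname):
        return False
    return ip in socket.gethostbyname_ex(hostname)[2]  # forward-confirm
```

The suffix check matters: a spoofer can set a PTR record like `googlebot.com.evil.example`, which is why the match must be against the end of the hostname and the forward lookup must agree.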
What Actually Works
There's a spectrum of effectiveness. Here's an honest assessment:
| Method | Effectiveness | Against whom |
|---|---|---|
| robots.txt with named user-agents | High | Major AI companies that have committed to compliance (OpenAI, Anthropic, Google, Meta, Apple, Amazon) |
| Server-level blocking (nginx, .htaccess) | High | Any bot using its documented user-agent string |
| Meta noai/noimageai tags | Moderate | OpenAI (committed), some others; adoption still inconsistent |
| robots.txt wildcard block (User-agent: *) | Moderate | Polite bots that check robots.txt; ignored by scrapers that don't |
| IP blocking | Moderate | Bots using documented IP ranges; ineffective against distributed or rotating IP bots |
| Paywalls / login walls | High | Any automated crawler that doesn't have credentials |
| CAPTCHA challenges | Moderate | Bots without CAPTCHA-solving capability; increasingly less effective |
What doesn't work: anything that relies on bots voluntarily identifying themselves honestly. A bad actor can trivially set any user-agent string. The methods above only work against bots that (a) use their documented user-agent and (b) respect the response. For major AI companies, both conditions hold. For smaller or undisclosed scrapers, neither may hold.
The hardest problem: undisclosed scraping
Some content ends up in training datasets through intermediaries—scraped via Common Crawl, licensed from data brokers, or collected by companies that don't disclose their AI use. You can block CCBot to prevent the Common Crawl path, but the broker and licensing paths have no technical countermeasure. This is where legal approaches become relevant.
Legal Landscape
The legal situation around AI training data is genuinely unsettled. Several lawsuits are at various stages in US and EU courts as of early 2026. Here's the current picture (none of this is legal advice):
Copyright arguments
Several class-action lawsuits from authors, news publishers, and coders argue that scraping copyrighted content for training violates copyright law. The AI companies generally argue training constitutes "fair use" under US law. Courts haven't issued definitive rulings yet. The New York Times case against OpenAI and the Authors Guild cases are the most watched.
Terms of service
Your website's terms of service can explicitly prohibit scraping for AI training. This has legal effect—violating enforceable ToS can constitute breach of contract. It won't stop bad actors technically, but it strengthens any legal claim if your content appears in a model trained after the prohibition was published.
# Sample ToS language (get this reviewed by a lawyer):
"Automated scraping, crawling, or systematic downloading of
content from this site is prohibited without prior written
consent. Use of any content from this site to train,
fine-tune, or evaluate artificial intelligence models is
prohibited."
EU AI Act
The EU AI Act (fully effective 2026) requires AI providers using copyright-protected material for training to provide transparency about what data was used and honor opt-outs. This creates an explicit opt-out right for EU-based rights holders. How it will work in practice is still being determined through implementing regulations.
What robots.txt means legally
robots.txt has no inherent legal status—it's a voluntary convention. However, some legal arguments treat robots.txt as a technical access restriction, and violating it (by a bot that reads and then ignores it) may be relevant to computer fraud statutes in some jurisdictions. Legal scholars disagree on this. It's not a reliable legal mechanism on its own.
The Balanced Approach
Most site owners land in the same place: block training, allow citation. The configuration below implements this:
# robots.txt — Block training, allow AI browsing/citations
# ===== BLOCK: AI Training Crawlers =====
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Google-CloudVertexBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: cohere-ai
Disallow: /
# ===== ALLOW: AI Browsing/Citation Bots =====
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: PerplexityBot
Allow: /
# ===== ALWAYS ALLOW: Search Engines =====
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: DuckDuckBot
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
Generate a correctly formatted version for your site—including the most up-to-date list of user-agents—with the ToolsDock robots.txt generator.
After setting up robots.txt
- Wait 48–72 hours, then check your server logs to verify named bots have stopped visiting
- Monitor quarterly for new AI crawler user-agents—the field is still growing rapidly
- Add ToS language prohibiting AI training use if you want a legal paper trail
- Consider the EU opt-out mechanisms as they become available for cross-border protection