How to Block AI Crawlers: A Complete Guide
AI companies are crawling the web to train their models on publicly accessible content. Whether you want to prevent that, allow it selectively, or just understand what's happening to your site—this guide covers everything you need to know.
The Current Crawler Landscape
AI companies need text to train language models. They get it several ways: by crawling the web directly with their own bots, by licensing curated datasets, and by using public archives like Common Crawl. Your site is probably being crawled by multiple AI bots right now, and most of their traffic doesn't show up in JavaScript-based analytics tools like Google Analytics—you need raw server logs to see them.
The number of distinct AI crawlers grew from a handful in 2022 to dozens by 2026. The major ones are documented and respect robots.txt. Many smaller ones aren't documented and don't. The robots.txt approach is a meaningful signal to the companies that respect it—and those happen to be the ones training the most widely deployed models.
Training Bots vs. Browsing Bots
This distinction matters when deciding what to block: the two kinds of bots have different purposes and different effects on your site.
Training bots collect your content to train AI models. Once your content is used in a training run, it influences the model permanently: there is no retrieval or citation, because the text is absorbed into the weights.
Examples: GPTBot, ClaudeBot, Google-Extended, CCBot, Meta-ExternalAgent, Amazonbot, Applebot-Extended
Browsing bots access your site in real time when a user asks an AI assistant a question about current information, much like a user clicking your link. They may cite your content and send referral traffic.
Examples: ChatGPT-User, Claude-Web, PerplexityBot, OAI-SearchBot
Most site owners want to block training crawlers while allowing browsing crawlers. Blocking browsing crawlers means AI assistants can't access your current content to answer questions—you lose the citation traffic. Allowing them means your content might be summarized in AI answers, which some view as competition for their own traffic.
All Known AI Crawlers
OpenAI (ChatGPT, GPT-4)
| User-Agent | Purpose | Recommendation |
|---|---|---|
| GPTBot | Training data collection for GPT models | Block to prevent training |
| ChatGPT-User | Real-time browsing for user queries | Allow for ChatGPT citations |
| OAI-SearchBot | OpenAI search and index features | Allow for search visibility |
Anthropic (Claude)
| User-Agent | Purpose | Recommendation |
|---|---|---|
| ClaudeBot | Training data for Claude models | Block |
| anthropic-ai | General Anthropic crawling | Block alongside ClaudeBot |
| Claude-Web | Real-time browsing for Claude users | Allow for Claude citations |
Google (Gemini)
| User-Agent | Purpose | Recommendation |
|---|---|---|
| Google-Extended | AI training for Gemini and AI Overview features | Block |
| Googlebot | Web search indexing (SEO) | Never block |
| Google-CloudVertexBot | Vertex AI training data | Block |
Critical: Google-Extended and Googlebot are completely separate. Blocking Google-Extended has no effect on your Google Search rankings.
Meta (Llama)
| User-Agent | Purpose | Recommendation |
|---|---|---|
| Meta-ExternalAgent | Training data for Llama and Meta AI | Block |
| Meta-ExternalFetcher | Link previews in Facebook/Instagram | Usually allow |
Other notable crawlers
| User-Agent | Company | Purpose | Action |
|---|---|---|---|
CCBot | Common Crawl | Public dataset used by many AI companies for training | Block |
Bytespider | ByteDance (TikTok) | Aggressive crawler, suspected AI training | Block |
Applebot-Extended | Apple | Apple Intelligence training | Block |
Amazonbot | Amazon | Alexa and Amazon AI training | Block |
PerplexityBot | Perplexity | AI-powered search engine (browsing, not training) | Allow |
cohere-ai | Cohere | LLM training data | Block |
Diffbot | Diffbot | Web data extraction for AI applications | Block |
ImagesiftBot | ImageSift | Image training data collection | Block |
Omgilibot | Omgili/Webz.io | Data collection for AI training datasets | Block |
robots.txt: The Primary Method
Your robots.txt file lives at the root of your domain: https://yoursite.com/robots.txt. Every compliant crawler checks this file before visiting any other page on your site. This is the most reliable method for controlling AI crawlers that have committed to respecting it.
Block all training crawlers
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Google-CloudVertexBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Diffbot
Disallow: /
# Allow browsing bots (can cite your content)
User-agent: ChatGPT-User
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
# Always allow search engines
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
Selective blocking (specific directories only)
# Block AI training bots from premium content
User-agent: GPTBot
Disallow: /premium/
Disallow: /courses/
Disallow: /members/
# Allow access to public pages
User-agent: GPTBot
Allow: /blog/
Allow: /about/
Block everything with an allowlist
# Deny all by default, allow only known search engines
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: DuckDuckBot
Allow: /
This is the nuclear option: it blocks all unlisted bots, including future AI crawlers you haven't heard of yet, and only the search engines you explicitly allow can crawl your site. Note that it also blocks legitimate non-search bots (archivers, link-preview fetchers, uptime monitors), so audit the allowlist before deploying it.
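Before deploying rules like these, you can sanity-check them with Python's built-in urllib.robotparser, which evaluates robots.txt roughly the way a compliant crawler does. A minimal sketch (yoursite.com is a placeholder, and the rules are a trimmed version of the examples above):

```python
# Check what each bot may fetch under a "block training, allow browsing"
# robots.txt, using the standard library's parser.
import urllib.robotparser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://yoursite.com/blog/post"))        # False
print(rp.can_fetch("ChatGPT-User", "https://yoursite.com/blog/post"))  # True
```

The same check catches typos in user-agent names before a misformatted file silently fails to block anything.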
Server Configuration
nginx
# nginx.conf — block AI training bots with 403
# (the map block goes in the http { } context)
map $http_user_agent $is_ai_bot {
    default 0;
    "~GPTBot" 1;
    "~ClaudeBot" 1;
    "~anthropic-ai" 1;
    "~Google-Extended" 1;
    "~CCBot" 1;
    "~Bytespider" 1;
    "~Meta-ExternalAgent" 1;
    "~Amazonbot" 1;
}

server {
    # ... other config
    if ($is_ai_bot) {
        return 403;
    }
}
Alternative: rate limit instead of block
# Slow AI bots down instead of refusing them outright. nginx does not
# allow limit_req inside an "if" block, and keying the zone directly on
# $http_user_agent would throttle normal browsers too. Instead, derive a
# key that is empty (meaning "not limited") for everything except AI bots:
map $is_ai_bot $ai_bot_key {
    default "";
    1 $http_user_agent;
}
limit_req_zone $ai_bot_key zone=bots:10m rate=1r/m;

server {
    location / {
        limit_req zone=bots burst=2 nodelay;
    }
}
Apache .htaccess
# .htaccess — block AI training bots
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ClaudeBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} anthropic-ai [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Google-Extended [NC,OR]
RewriteCond %{HTTP_USER_AGENT} CCBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Bytespider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Meta-ExternalAgent [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Amazonbot [NC]
RewriteRule .* - [F,L]
Cloudflare (WAF rules)
If you use Cloudflare, you can create WAF custom rules to block AI bots at the edge before they reach your server. Go to Security → WAF → Custom Rules and create a rule matching (http.user_agent contains "GPTBot") or (http.user_agent contains "ClaudeBot") etc., with action "Block." Cloudflare also has managed bot lists that include known AI crawlers.
Server-side blocking vs. robots.txt
Server-side blocking (nginx, .htaccess, WAF) returns an HTTP error to the bot. This is technically stronger than robots.txt: the bot can't read the content even if it ignores the robots.txt directive. The downside is that your server still has to receive and answer every request. With robots.txt, compliant bots never send the request in the first place. For maximum protection, use both.
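One way to verify a server-side block is to request a page while presenting a bot's user-agent string and inspect the status code. A minimal sketch using only the standard library (https://yoursite.com/ is a placeholder for your own domain):

```python
# Fetch a URL with a chosen User-Agent and report the HTTP status code.
# A working nginx/.htaccess block should answer 403 for blocked bot UAs.
import urllib.request
import urllib.error

def status_for_ua(url, ua):
    req = urllib.request.Request(url, headers={"User-Agent": ua})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # error statuses (403, 404, ...) arrive as exceptions

# Against your own site, you would expect something like:
#   status_for_ua("https://yoursite.com/", "GPTBot")       -> 403 if blocked
#   status_for_ua("https://yoursite.com/", "Mozilla/5.0")  -> 200
```

The equivalent one-liner with curl is `curl -A "GPTBot" -o /dev/null -s -w "%{http_code}" https://yoursite.com/`.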
Monitoring Bot Traffic
Web analytics tools that rely on JavaScript (Google Analytics, Plausible, Fathom) don't capture bot traffic, because crawlers generally don't execute JavaScript. Raw server access logs are the only reliable way to see what's visiting your site.
Analyzing access logs
# Find all AI crawler hits in nginx access log
grep -E "(GPTBot|ClaudeBot|anthropic-ai|Google-Extended|CCBot|Bytespider)" \
/var/log/nginx/access.log
# Count by crawler type
grep -oE "(GPTBot|ClaudeBot|anthropic-ai|Google-Extended|CCBot|Bytespider)" \
/var/log/nginx/access.log | sort | uniq -c | sort -rn
# Show full log lines for GPTBot requests
grep "GPTBot" /var/log/nginx/access.log | tail -50
# Count requests per day for AI bots
grep "GPTBot" /var/log/nginx/access.log |
awk '{print $4}' | cut -d: -f1 | sort | uniq -c
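If you want to post-process the numbers rather than eyeball grep output, the same counting works in a few lines of Python. The log lines below are fabricated samples standing in for lines read from /var/log/nginx/access.log:

```python
# Count hits per AI crawler in access-log lines (combined log format).
import re
from collections import Counter

BOTS = re.compile(r"GPTBot|ClaudeBot|anthropic-ai|Google-Extended|CCBot|Bytespider")

sample = [
    '1.2.3.4 - - [01/Jan/2026:00:00:01 +0000] "GET /a HTTP/1.1" 200 5000 "-" "GPTBot/1.0"',
    '5.6.7.8 - - [01/Jan/2026:00:00:02 +0000] "GET /b HTTP/1.1" 200 3000 "-" "ClaudeBot/1.0"',
    '1.2.3.4 - - [01/Jan/2026:00:00:03 +0000] "GET /c HTTP/1.1" 200 2000 "-" "GPTBot/1.0"',
]

counts = Counter(m.group(0) for line in sample if (m := BOTS.search(line)))
print(counts.most_common())  # [('GPTBot', 2), ('ClaudeBot', 1)]
```

Swapping `sample` for `open("/var/log/nginx/access.log")` runs the same count over a real log.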
What to look for
- Volume: Some bots are aggressive. Bytespider in particular has been reported crawling thousands of pages per hour on small sites. High bot traffic wastes bandwidth and server resources.
- IP ranges: You can verify a bot's identity by checking if the request IP belongs to the company's documented IP ranges. OpenAI, Google, and others publish their crawler IP lists.
- Compliance: After adding robots.txt blocks, check the logs a few days later to confirm the named bots stopped visiting. If they didn't, they're not honoring robots.txt.
- New bots: Periodically check for unfamiliar user-agent strings. New AI crawlers emerge regularly.
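The IP-range check above can be automated with forward-confirmed reverse DNS, the method Google documents for verifying Googlebot: reverse-resolve the IP, check the hostname's domain, then forward-resolve the hostname and confirm it maps back to the same IP. A sketch; the trusted-suffix list is illustrative, and the two socket lookups need network access:

```python
# Verify a claimed crawler identity via forward-confirmed reverse DNS.
import socket

TRUSTED_SUFFIXES = (".googlebot.com", ".search.msn.com")  # illustrative list

def hostname_is_trusted(hostname):
    # endswith() accepts a tuple; strip a trailing dot from the PTR record
    return hostname.rstrip(".").endswith(TRUSTED_SUFFIXES)

def verify_bot_ip(ip):
    try:
        hostname = socket.gethostbyaddr(ip)[0]         # reverse lookup
    except socket.herror:
        return False
    if not hostname_is_trusted(hostname):
        return False
    return ip in socket.gethostbyname_ex(hostname)[2]  # forward-confirm
```

The suffix check matters: a spoofer can set a PTR record like `googlebot.com.evil.example`, which is why the match must be against the end of the hostname and the forward lookup must agree.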
What Actually Works
There's a spectrum of effectiveness. Here's an honest assessment:
| Method | Effectiveness | Against whom |
|---|---|---|
| robots.txt with named user-agents | High | Major AI companies that have committed to compliance (OpenAI, Anthropic, Google, Meta, Apple, Amazon) |
| Server-level blocking (nginx, .htaccess) | High | Any bot using its documented user-agent string |
| Meta noai/noimageai tags | Moderate | OpenAI (committed), some others; adoption still inconsistent |
| robots.txt wildcard block (User-agent: *) | Moderate | Polite bots that check robots.txt; ignored by scrapers that don't |
| IP blocking | Moderate | Bots using documented IP ranges; ineffective against distributed or rotating IP bots |
| Paywalls / login walls | High | Any automated crawler that doesn't have credentials |
| CAPTCHA challenges | Moderate | Bots without CAPTCHA-solving capability; increasingly less effective |
What doesn't work: anything that relies on bots voluntarily identifying themselves honestly. A bad actor can trivially set any user-agent string. The methods above only work against bots that (a) use their documented user-agent and (b) respect the response. For major AI companies, both conditions hold. For smaller or undisclosed scrapers, neither may hold.
The hardest problem: undisclosed scraping
Some content ends up in training datasets through intermediaries—scraped via Common Crawl, licensed from data brokers, or collected by companies that don't disclose their AI use. You can block CCBot to prevent the Common Crawl path, but the broker and licensing paths have no technical countermeasure. This is where legal approaches become relevant.
Legal Landscape
The legal situation around AI training data is genuinely unsettled. Several lawsuits are at various stages in US and EU courts as of early 2026. Here's the current picture (none of this is legal advice):
Copyright arguments
Several class-action lawsuits from authors, news publishers, and coders argue that scraping copyrighted content for training violates copyright law. The AI companies generally argue training constitutes "fair use" under US law. Courts haven't issued definitive rulings yet. The New York Times case against OpenAI and the Authors Guild cases are the most watched.
Terms of service
Your website's terms of service can explicitly prohibit scraping for AI training. This has legal effect—violating enforceable ToS can constitute breach of contract. It won't stop bad actors technically, but it strengthens any legal claim if your content appears in a model trained after the prohibition was published.
# Sample ToS language (get this reviewed by a lawyer):
"Automated scraping, crawling, or systematic downloading of
content from this site is prohibited without prior written
consent. Use of any content from this site to train,
fine-tune, or evaluate artificial intelligence models is
prohibited."
EU AI Act
The EU AI Act (fully effective 2026) requires AI providers using copyright-protected material for training to provide transparency about what data was used and honor opt-outs. This creates an explicit opt-out right for EU-based rights holders. How it will work in practice is still being determined through implementing regulations.
What robots.txt means legally
robots.txt has no inherent legal status—it's a voluntary convention. However, some legal arguments treat robots.txt as a technical access restriction, and violating it (by a bot that reads and then ignores it) may be relevant to computer fraud statutes in some jurisdictions. Legal scholars disagree on this. It's not a reliable legal mechanism on its own.
The Balanced Approach
Most site owners land in the same place: block training, allow citation. The configuration below implements this:
# robots.txt — Block training, allow AI browsing/citations
# ===== BLOCK: AI Training Crawlers =====
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Google-CloudVertexBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: cohere-ai
Disallow: /
# ===== ALLOW: AI Browsing/Citation Bots =====
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: PerplexityBot
Allow: /
# ===== ALWAYS ALLOW: Search Engines =====
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: DuckDuckBot
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
Generate a correctly formatted version for your site—including the most up-to-date list of user-agents—with the ToolsDock robots.txt generator.
After setting up robots.txt
- Wait 48–72 hours, then check your server logs to verify named bots have stopped visiting
- Monitor quarterly for new AI crawler user-agents—the field is still growing rapidly
- Add ToS language prohibiting AI training use if you want a legal paper trail
- Consider the EU opt-out mechanisms as they become available for cross-border protection