Complete Guide to Blocking AI Crawlers
As AI companies train their models on web content, many website owners want to control how their work is used. This guide covers the major AI crawlers, what each one does, and how to block them.
What Are AI Crawlers?
AI crawlers are automated bots that visit websites to collect content for training artificial intelligence models. Unlike traditional search engine crawlers (like Googlebot) that index pages so users can find them, AI crawlers scrape content to teach language models how to write, code, and answer questions.
When you interact with ChatGPT, Claude, or Gemini, the AI's knowledge comes partly from content collected by these crawlers. Your blog posts, documentation, creative writing, and code examples may have been used to train these models—often without explicit consent or compensation.
AI Crawlers by Company
OpenAI (ChatGPT, GPT-4)
| Crawler | Purpose | Recommendation |
|---|---|---|
| GPTBot | Collects content for training future GPT models | Block to prevent training |
| ChatGPT-User | Real-time browsing when users ask about current events | Allow if you want ChatGPT citations |
| OAI-SearchBot | Powers OpenAI's search features | Allow for search visibility |
Anthropic (Claude)
| Crawler | Purpose | Recommendation |
|---|---|---|
| ClaudeBot | Training data collection for Claude models | Block to prevent training |
| anthropic-ai | General Anthropic crawler | Block alongside ClaudeBot |
| Claude-Web | Claude's web browsing feature | Allow if you want Claude citations |
Google (Gemini, Bard)
| Crawler | Purpose | Recommendation |
|---|---|---|
| Google-Extended | AI training for Gemini/Bard products | Block to prevent AI training |
| Googlebot | Search indexing (SEO) | Never block; required for search rankings |
Meta (Llama, Facebook AI)
| Crawler | Purpose | Recommendation |
|---|---|---|
| Meta-ExternalAgent | Training data for Llama and Meta AI | Block to prevent training |
| Meta-ExternalFetcher | Link previews in Facebook/Instagram | Usually allow for social sharing |
| FacebookBot | General Meta crawling | Consider blocking |
Other Notable AI Crawlers
| Crawler | Company | Purpose | Action |
|---|---|---|---|
| Applebot-Extended | Apple | Apple AI training | Block |
| PerplexityBot | Perplexity | AI-powered search engine | Allow |
| cohere-ai | Cohere | LLM training | Block |
| Bytespider | ByteDance (TikTok) | Aggressive crawler, suspected AI training | Block |
| CCBot | Common Crawl | Public dataset used by many AI companies | Block |
| Amazonbot | Amazon | Alexa and Amazon AI | Block |
| Diffbot | Diffbot | Web data extraction for AI | Block |
Understanding Crawler Categories
Training Crawlers
Collect data to train AI models. Once used, your content becomes part of the model's "knowledge"—there's no way to remove it.
Examples: GPTBot, ClaudeBot, Google-Extended, Meta-ExternalAgent
Browsing Crawlers
Access your site in real-time when users ask AI assistants questions. Similar to how a human would browse.
Examples: ChatGPT-User, Claude-Web, PerplexityBot
Dataset Crawlers
Build large public archives that AI companies use for training. Blocking prevents your content from entering these datasets.
Examples: CCBot (Common Crawl), Diffbot, Omgilibot
Aggressive Crawlers
Known for high request rates, poor rate limiting, or unclear data usage policies. Worth blocking regardless of AI concerns.
Examples: Bytespider, FriendlyCrawler, ISSCyberRiskCrawler
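If you generate your blocklist programmatically, the four categories above map naturally onto a small lookup table. A minimal Python sketch (the crawler-to-category assignments mirror the examples above and are illustrative, not exhaustive):

```python
# Illustrative mapping of crawlers to the four categories described above
CRAWLER_CATEGORIES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Google-Extended": "training",
    "Meta-ExternalAgent": "training",
    "ChatGPT-User": "browsing",
    "Claude-Web": "browsing",
    "PerplexityBot": "browsing",
    "CCBot": "dataset",
    "Diffbot": "dataset",
    "Bytespider": "aggressive",
}

def bots_to_block(categories=("training", "dataset", "aggressive")):
    """Return crawlers whose category we choose to block, sorted by name."""
    return sorted(
        bot for bot, cat in CRAWLER_CATEGORIES.items() if cat in categories
    )

print(bots_to_block())
# ['Bytespider', 'CCBot', 'ClaudeBot', 'Diffbot', 'GPTBot',
#  'Google-Extended', 'Meta-ExternalAgent']
```

Keeping the category decision separate from the bot names makes it easy to flip a single crawler from "block" to "allow" later.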
How to Block AI Crawlers
Method 1: robots.txt (Recommended)
Add rules to your robots.txt file at your website root:
```
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
```
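Before deploying, you can sanity-check your rules with Python's standard-library robots.txt parser. A quick sketch, with the rules abbreviated to two of the entries above:

```python
from urllib.robotparser import RobotFileParser

# Abbreviated version of the rules above, parsed in-memory
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))  # True
```

This checks only how a compliant parser reads your file; whether a given crawler actually honors it is up to the crawler.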
Method 2: Meta Tags
Add to your HTML `<head>` for page-level control:

```html
<meta name="robots" content="noai, noimageai">
```

Note: Not all AI crawlers respect meta tags. The noai and noimageai directives are an informal community proposal rather than an official standard, and adoption varies.
Method 3: HTTP Headers
Configure your server to send headers:
```
X-Robots-Tag: noai, noimageai
```

Method 4: Rate Limiting
Implement aggressive rate limiting for known AI user-agents:
```nginx
# Nginx example: throttle known AI crawlers to 10 KB/s
if ($http_user_agent ~* (GPTBot|ClaudeBot|CCBot)) {
    set $limit_rate 10k;
}
```

Common Questions
How can you tell whether AI crawlers are visiting your site? Check your server access logs for their user-agent strings:

```shell
grep -E "(GPTBot|ClaudeBot|CCBot|Google-Extended)" access.log
```

The Balanced Approach
Most website owners want to:
- Prevent their content from training AI models (no compensation, no control)
- Allow AI assistants to cite them in answers (visibility, traffic)
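Deciding where each bot falls is easier with data. A short Python sketch that tallies AI-crawler hits from an access log (the log lines and bot list here are illustrative):

```python
import re
from collections import Counter

# Illustrative subset of the crawlers discussed above
BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended", "PerplexityBot"]
BOT_RE = re.compile("|".join(BOTS))

def tally_bots(log_lines):
    """Count hits per AI crawler in access-log lines (substring match)."""
    counts = Counter()
    for line in log_lines:
        match = BOT_RE.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts

# Hypothetical access-log lines
sample = [
    '1.2.3.4 - - [01/Jan/2025] "GET /post HTTP/1.1" 200 "GPTBot/1.0"',
    '5.6.7.8 - - [01/Jan/2025] "GET / HTTP/1.1" 200 "CCBot/2.0"',
    '9.9.9.9 - - [01/Jan/2025] "GET /about HTTP/1.1" 200 "Mozilla/5.0"',
]
print(tally_bots(sample))  # Counter({'GPTBot': 1, 'CCBot': 1})
```

If a training crawler dominates your traffic, blocking it saves bandwidth as well as content.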
Recommended Configuration:
- Block: GPTBot, ClaudeBot, Google-Extended, CCBot, Meta-ExternalAgent
- Allow: ChatGPT-User, Claude-Web, PerplexityBot
- Always allow: Googlebot, Bingbot (for SEO)
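Put together, the balanced policy above translates into a robots.txt along these lines (a sketch; adjust the bot list to taste):

```
# Block AI training and dataset crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Allow real-time AI browsing (citations)
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

# Allow search engines (SEO)
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
```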
Key Takeaways
- AI training and AI browsing are different—you can block one while allowing the other
- Blocking Google-Extended doesn't affect your Google Search rankings
- robots.txt is respected by major companies but isn't a guarantee
- You can only prevent future use—already-trained models can't be "untrained"