Complete Guide to Blocking AI Crawlers

12 min read · Last updated: January 2025

As AI companies train their models on web content, many website owners want to control how their work is used. This guide explains every AI crawler, what they do, and how to block them.

What Are AI Crawlers?

AI crawlers are automated bots that visit websites to collect content for training artificial intelligence models. Unlike traditional search engine crawlers (like Googlebot) that index pages so users can find them, AI crawlers scrape content to teach language models how to write, code, and answer questions.

When you interact with ChatGPT, Claude, or Gemini, the AI's knowledge comes partly from content collected by these crawlers. Your blog posts, documentation, creative writing, and code examples may have been used to train these models—often without explicit consent or compensation.

AI Crawlers by Company

OpenAI (ChatGPT, GPT-4)

| Crawler | Purpose | Recommendation |
| --- | --- | --- |
| GPTBot | Collects content for training future GPT models | Block to prevent training |
| ChatGPT-User | Real-time browsing when users ask about current events | Allow if you want ChatGPT citations |
| OAI-SearchBot | Powers OpenAI's search features | Allow for search visibility |

Anthropic (Claude)

| Crawler | Purpose | Recommendation |
| --- | --- | --- |
| ClaudeBot | Training data collection for Claude models | Block to prevent training |
| anthropic-ai | General Anthropic crawler | Block alongside ClaudeBot |
| Claude-Web | Claude's web browsing feature | Allow if you want Claude citations |

Google (Gemini, Bard)

| Crawler | Purpose | Recommendation |
| --- | --- | --- |
| Google-Extended | AI training for Gemini/Bard products | Block to prevent AI training |
| Googlebot | Search indexing (SEO) | Never block - needed for search rankings |

Meta (Llama, Facebook AI)

| Crawler | Purpose | Recommendation |
| --- | --- | --- |
| Meta-ExternalAgent | Training data for Llama and Meta AI | Block to prevent training |
| Meta-ExternalFetcher | Link previews in Facebook/Instagram | Usually allow for social sharing |
| FacebookBot | General Meta crawling | Consider blocking |

Other Notable AI Crawlers

| Crawler | Company | Purpose | Action |
| --- | --- | --- | --- |
| Applebot-Extended | Apple | Apple AI training | Block |
| PerplexityBot | Perplexity | AI-powered search engine | Allow |
| cohere-ai | Cohere | LLM training | Block |
| Bytespider | ByteDance (TikTok) | Aggressive crawler, suspected AI training | Block |
| CCBot | Common Crawl | Public dataset used by many AI companies | Block |
| Amazonbot | Amazon | Alexa and Amazon AI | Block |
| Diffbot | Diffbot | Web data extraction for AI | Block |

Understanding Crawler Categories

Training Crawlers

Collect data to train AI models. Once used, your content becomes part of the model's "knowledge"—there's no way to remove it.

Examples: GPTBot, ClaudeBot, Google-Extended, Meta-ExternalAgent

Browsing Crawlers

Access your site in real-time when users ask AI assistants questions. Similar to how a human would browse.

Examples: ChatGPT-User, Claude-Web, PerplexityBot

Dataset Crawlers

Build large public archives that AI companies use for training. Blocking prevents your content from entering these datasets.

Examples: CCBot (Common Crawl), Diffbot, Omgilibot

Aggressive Crawlers

Known for high request rates, poor rate limiting, or unclear data usage policies. Worth blocking regardless of AI concerns.

Examples: Bytespider, FriendlyCrawler, ISSCyberRiskCrawler

How to Block AI Crawlers

Method 1: robots.txt (Recommended)

Add rules to your robots.txt file at your website root:

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
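Before deploying, you can sanity-check your rules with Python's built-in robots.txt parser. This sketch feeds it a subset of the rules above (the URL is just a placeholder) and confirms that GPTBot is blocked while Googlebot is allowed:

```python
from urllib import robotparser

# A subset of the example rules above, fed to the parser directly
rules = """
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Verify each crawler's access to an arbitrary page
print(rp.can_fetch("GPTBot", "https://example.com/post"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/post"))  # True
```

The same check works against a live site by calling `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` instead of `rp.parse()`.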

Method 2: Meta Tags

Add to your HTML <head> for page-level control:

<meta name="robots" content="noai, noimageai">

Note: Not all AI crawlers respect meta tags. The noai and noimageai directives (introduced by DeviantArt in 2022) are not a formal standard, and adoption varies.

Method 3: HTTP Headers

Configure your server to send headers:

X-Robots-Tag: noai, noimageai
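How you send the header depends on your server. As a sketch, for nginx and Apache (the Apache form assumes mod_headers is enabled):

```
# nginx: inside your server block
add_header X-Robots-Tag "noai, noimageai";

# Apache: in the vhost config or .htaccess (requires mod_headers)
Header set X-Robots-Tag "noai, noimageai"
```

As with the meta tag, this header relies on crawlers choosing to honor it.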

Method 4: Rate Limiting

Implement aggressive rate limiting for known AI user-agents:

# Nginx example: throttle matched AI crawlers to 10 KB/s
if ($http_user_agent ~* (GPTBot|ClaudeBot|CCBot)) {
    set $limit_rate 10k;
}
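A stricter alternative, if you prefer refusing these crawlers outright rather than throttling them, is to return an error from the same user-agent match (a sketch; adjust the bot list to your needs):

```
# Nginx: reject matched AI crawlers entirely
if ($http_user_agent ~* (GPTBot|ClaudeBot|CCBot)) {
    return 403;
}
```

Unlike robots.txt, this is enforced server-side, so it also stops crawlers that ignore your robots.txt rules (as long as they identify themselves honestly in the User-Agent header).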

Common Questions

Does blocking AI crawlers hurt my search rankings?

No. Search engine crawlers (Googlebot, Bingbot) are completely separate from AI training crawlers. Google explicitly created a separate user-agent (Google-Extended) for AI training so website owners can block AI use without affecting search rankings.

Do AI crawlers actually respect robots.txt?

Major companies (OpenAI, Anthropic, Google, Meta, Apple, Amazon) have publicly committed to respecting robots.txt for their AI crawlers. However, smaller companies and scrapers may ignore it, and robots.txt is a request, not an enforcement mechanism.

Can I remove my content from models that have already been trained?

Unfortunately, there's no way to remove content from an already-trained model. The training data becomes part of the model's weights—it's not stored in a retrievable database. Blocking crawlers only prevents future training.

Should I block Common Crawl (CCBot)?

Common Crawl is a nonprofit that archives the web. Many AI companies use its datasets for training. If you're concerned about AI training, blocking CCBot is effective because it prevents your content from entering these widely-used datasets.

How do I know if AI crawlers are visiting my site?

Check your server access logs for User-Agent strings:
grep -E "(GPTBot|ClaudeBot|CCBot|Google-Extended)" access.log
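For a quick tally of which bots are hitting you hardest, a short Python script can count matches per crawler. This is a sketch using inline sample log lines; in practice you would iterate over your real access log file instead:

```python
import re
from collections import Counter

BOTS = re.compile(r"GPTBot|ClaudeBot|CCBot|Google-Extended")

# Stand-in log lines; replace with open("access.log") in practice
log_lines = [
    '1.2.3.4 - - [10/Jan/2025] "GET / HTTP/1.1" 200 "-" "Mozilla/5.0; compatible; GPTBot/1.0"',
    '5.6.7.8 - - [10/Jan/2025] "GET /post HTTP/1.1" 200 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
    '9.9.9.9 - - [10/Jan/2025] "GET /about HTTP/1.1" 200 "-" "Mozilla/5.0 (real browser)"',
]

# Count one hit per line that mentions a known AI crawler
hits = Counter(m.group(0) for line in log_lines if (m := BOTS.search(line)))
print(dict(hits))
```

Extend the `BOTS` pattern with any other user-agents from the tables above that you want to track.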

The Balanced Approach

Most website owners want to:

  1. Prevent their content from training AI models (no compensation, no control)
  2. Allow AI assistants to cite them in answers (visibility, traffic)

Recommended Configuration:

Block

GPTBot, ClaudeBot, Google-Extended, CCBot, Meta-ExternalAgent

Allow

ChatGPT-User, Claude-Web, PerplexityBot

Always Allow

Googlebot, Bingbot (for SEO)
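Put together, this "block training, allow browsing" configuration might look like the following robots.txt (a sketch; the allowed crawlers need no explicit entries, since crawlers without a matching rule are allowed by default):

```
# Block training and dataset crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /
```

ChatGPT-User, Claude-Web, PerplexityBot, Googlebot, and Bingbot remain unblocked because no rule matches them.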


Key Takeaways

  1. AI training and AI browsing are different—you can block one while allowing the other
  2. Blocking Google-Extended doesn't affect your Google Search rankings
  3. robots.txt is respected by major companies but isn't a guarantee
  4. You can only prevent future use—already-trained models can't be "untrained"

Privacy Notice: This site works entirely in your browser. We don't collect or store your data. Optional analytics help us improve the site. You can deny without affecting functionality.