Complete Guide to Blocking AI Crawlers

12 min read · Last updated: January 2025

As AI companies train their models on web content, many website owners want to control how their work is used. This guide explains every AI crawler, what they do, and how to block them.

What Are AI Crawlers?

AI crawlers are automated bots that visit websites to collect content for training artificial intelligence models. Unlike traditional search engine crawlers (like Googlebot) that index pages so users can find them, AI crawlers scrape content to teach language models how to write, code, and answer questions.

When you interact with ChatGPT, Claude, or Gemini, the AI's knowledge comes partly from content collected by these crawlers. Your blog posts, documentation, creative writing, and code examples may have been used to train these models—often without explicit consent or compensation.

AI Crawlers by Company

OpenAI (ChatGPT, GPT-4)

| Crawler | Purpose | Recommendation |
| --- | --- | --- |
| GPTBot | Collects content for training future GPT models | Block to prevent training |
| ChatGPT-User | Real-time browsing when users ask about current events | Allow if you want ChatGPT citations |
| OAI-SearchBot | Powers OpenAI's search features | Allow for search visibility |

Anthropic (Claude)

| Crawler | Purpose | Recommendation |
| --- | --- | --- |
| ClaudeBot | Training data collection for Claude models | Block to prevent training |
| anthropic-ai | General Anthropic crawler | Block alongside ClaudeBot |
| Claude-Web | Claude's web browsing feature | Allow if you want Claude citations |

Google (Gemini, Bard)

| Crawler | Purpose | Recommendation |
| --- | --- | --- |
| Google-Extended | AI training for Gemini/Bard products | Block to prevent AI training |
| Googlebot | Search indexing (SEO) | Never block - needed for search rankings |

Meta (Llama, Facebook AI)

| Crawler | Purpose | Recommendation |
| --- | --- | --- |
| Meta-ExternalAgent | Training data for Llama and Meta AI | Block to prevent training |
| Meta-ExternalFetcher | Link previews in Facebook/Instagram | Usually allow for social sharing |
| FacebookBot | General Meta crawling | Consider blocking |

Other Notable AI Crawlers

| Crawler | Company | Purpose | Action |
| --- | --- | --- | --- |
| Applebot-Extended | Apple | Apple AI training | Block |
| PerplexityBot | Perplexity | AI-powered search engine | Allow |
| cohere-ai | Cohere | LLM training | Block |
| Bytespider | ByteDance (TikTok) | Aggressive crawler, suspected AI training | Block |
| CCBot | Common Crawl | Public dataset used by many AI companies | Block |
| Amazonbot | Amazon | Alexa and Amazon AI | Block |
| Diffbot | Diffbot | Web data extraction for AI | Block |

Understanding Crawler Categories

Training Crawlers

Collect data to train AI models. Once used, your content becomes part of the model's "knowledge"—there's no way to remove it.

Examples: GPTBot, ClaudeBot, Google-Extended, Meta-ExternalAgent

Browsing Crawlers

Access your site in real-time when users ask AI assistants questions. Similar to how a human would browse.

Examples: ChatGPT-User, Claude-Web, PerplexityBot

Dataset Crawlers

Build large public archives that AI companies use for training. Blocking prevents your content from entering these datasets.

Examples: CCBot (Common Crawl), Diffbot, Omgilibot

Aggressive Crawlers

Known for high request rates, poor rate limiting, or unclear data usage policies. Worth blocking regardless of AI concerns.

Examples: Bytespider, FriendlyCrawler, ISSCyberRiskCrawler

How to Block AI Crawlers

Method 1: robots.txt (Recommended)

Add rules to your robots.txt file at your website root:

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
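Before deploying, you can sanity-check your rules with Python's built-in robots.txt parser. This sketch feeds it a subset of the rules above (the URL is just a placeholder) and confirms that GPTBot is blocked while Googlebot is allowed:

```python
from urllib import robotparser

# A subset of the example rules above, fed to the parser directly
rules = """
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Verify each crawler's access to an arbitrary page
print(rp.can_fetch("GPTBot", "https://example.com/post"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/post"))  # True
```

The same check works against a live site by calling `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` instead of `rp.parse()`.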

Method 2: Meta Tags

Add to your HTML <head> for page-level control:

<meta name="robots" content="noai, noimageai">

Note: Not all AI crawlers respect meta tags. The noai and noimageai directives (introduced by DeviantArt in 2022) are not a formal standard, and adoption varies.

Method 3: HTTP Headers

Configure your server to send headers:

X-Robots-Tag: noai, noimageai
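How you send the header depends on your server. As a sketch, for nginx and Apache (the Apache form assumes mod_headers is enabled):

```
# nginx: inside your server block
add_header X-Robots-Tag "noai, noimageai";

# Apache: in the vhost config or .htaccess (requires mod_headers)
Header set X-Robots-Tag "noai, noimageai"
```

As with the meta tag, this header relies on crawlers choosing to honor it.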

Method 4: Rate Limiting

Implement aggressive rate limiting for known AI user-agents:

# Nginx example: throttle matched AI crawlers to 10 KB/s
if ($http_user_agent ~* (GPTBot|ClaudeBot|CCBot)) {
    set $limit_rate 10k;
}
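A stricter alternative, if you prefer refusing these crawlers outright rather than throttling them, is to return an error from the same user-agent match (a sketch; adjust the bot list to your needs):

```
# Nginx: reject matched AI crawlers entirely
if ($http_user_agent ~* (GPTBot|ClaudeBot|CCBot)) {
    return 403;
}
```

Unlike robots.txt, this is enforced server-side, so it also stops crawlers that ignore your robots.txt rules (as long as they identify themselves honestly in the User-Agent header).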

Common Questions

Does blocking AI crawlers hurt my search rankings?

No. Search engine crawlers (Googlebot, Bingbot) are completely separate from AI training crawlers. Google explicitly created a separate user-agent (Google-Extended) for AI training so website owners can block AI use without affecting search rankings.

Do AI crawlers actually respect robots.txt?

Major companies (OpenAI, Anthropic, Google, Meta, Apple, Amazon) have publicly committed to respecting robots.txt for their AI crawlers. However, smaller companies and scrapers may ignore it, and robots.txt is a request, not an enforcement mechanism.

Can I remove my content from models that have already been trained?

Unfortunately, there's no way to remove content from an already-trained model. The training data becomes part of the model's weights—it's not stored in a retrievable database. Blocking crawlers only prevents future training.

Should I block Common Crawl (CCBot)?

Common Crawl is a nonprofit that archives the web. Many AI companies use its datasets for training. If you're concerned about AI training, blocking CCBot is effective because it prevents your content from entering these widely-used datasets.

How do I know if AI crawlers are visiting my site?

Check your server access logs for User-Agent strings:
grep -E "(GPTBot|ClaudeBot|CCBot|Google-Extended)" access.log
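For a quick tally of which bots are hitting you hardest, a short Python script can count matches per crawler. This is a sketch using inline sample log lines; in practice you would iterate over your real access log file instead:

```python
import re
from collections import Counter

BOTS = re.compile(r"GPTBot|ClaudeBot|CCBot|Google-Extended")

# Stand-in log lines; replace with open("access.log") in practice
log_lines = [
    '1.2.3.4 - - [10/Jan/2025] "GET / HTTP/1.1" 200 "-" "Mozilla/5.0; compatible; GPTBot/1.0"',
    '5.6.7.8 - - [10/Jan/2025] "GET /post HTTP/1.1" 200 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
    '9.9.9.9 - - [10/Jan/2025] "GET /about HTTP/1.1" 200 "-" "Mozilla/5.0 (real browser)"',
]

# Count one hit per line that mentions a known AI crawler
hits = Counter(m.group(0) for line in log_lines if (m := BOTS.search(line)))
print(dict(hits))
```

Extend the `BOTS` pattern with any other user-agents from the tables above that you want to track.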

The Balanced Approach

Most website owners want to:

  1. Prevent their content from training AI models (no compensation, no control)
  2. Allow AI assistants to cite them in answers (visibility, traffic)

Recommended Configuration:

Block

GPTBot, ClaudeBot, Google-Extended, CCBot, Meta-ExternalAgent

Allow

ChatGPT-User, Claude-Web, PerplexityBot

Always Allow

Googlebot, Bingbot (for SEO)
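Put together, this "block training, allow browsing" configuration might look like the following robots.txt (a sketch; the allowed crawlers need no explicit entries, since crawlers without a matching rule are allowed by default):

```
# Block training and dataset crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /
```

ChatGPT-User, Claude-Web, PerplexityBot, Googlebot, and Bingbot remain unblocked because no rule matches them.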


Key Takeaways

  1. AI training and AI browsing are different—you can block one while allowing the other
  2. Blocking Google-Extended doesn't affect your Google Search rankings
  3. robots.txt is respected by major companies but isn't a guarantee
  4. You can only prevent future use—already-trained models can't be "untrained"

Privacy Notice: This site works entirely in your browser. We don't collect or store your data. Optional analytics help us improve the site. You can deny without affecting functionality.