
How to Block AI Crawlers: A Complete Guide

AI companies are crawling the web to train their models on publicly accessible content. Whether you want to prevent that, allow it selectively, or just understand what's happening to your site—this guide covers everything you need to know.

The Current Crawler Landscape

AI companies need text to train language models. They get it several ways: by crawling the web directly with their own bots, by licensing curated datasets, and by using public archives like Common Crawl. Your site is probably being crawled by multiple AI bots right now, and most of their traffic doesn't show up in JavaScript-based analytics tools like Google Analytics—you need raw server logs to see them.
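Raw logs are easy to inspect once you know where the user-agent lives. In the common "combined" log format it is the sixth double-quote-delimited field — a minimal sketch (the log line below is invented for illustration):

```shell
# A combined-format access log line for a GPTBot visit
# (sample line invented for illustration):
cat > /tmp/sample_access.log <<'EOF'
203.0.113.7 - - [12/Mar/2026:09:14:02 +0000] "GET /blog/post HTTP/1.1" 200 5120 "-" "Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"
EOF

# In the combined format, the user-agent is the sixth
# double-quote-delimited field:
awk -F'"' '{print $6}' /tmp/sample_access.log
```

On a real server, point the same awk at /var/log/nginx/access.log and pipe the result through sort | uniq -c to rank visitors by user-agent.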

The number of distinct AI crawlers grew from a handful in 2022 to dozens by 2026. The major ones are documented and respect robots.txt. Many smaller ones aren't documented and don't. The robots.txt approach is a meaningful signal to the companies that respect it—and those happen to be the ones training the most widely deployed models.

Training Bots vs. Browsing Bots

This distinction matters when deciding what to block: training crawlers and browsing crawlers serve different purposes and have different effects on your site.

Training Crawlers

Collect your content to train AI models. Once your content is used in a training run, it influences the model permanently—there's no retrieval or citation, it's absorbed into the weights.

Examples: GPTBot, ClaudeBot, Google-Extended, CCBot, Meta-ExternalAgent, Amazonbot, Applebot-Extended

Browsing Crawlers

Access your site in real-time when a user asks an AI assistant a question about current information. Similar to a user clicking your link. May cite your content and send referral traffic.

Examples: ChatGPT-User, Claude-Web, PerplexityBot, OAI-SearchBot

Most site owners want to block training crawlers while allowing browsing crawlers. Blocking browsing crawlers means AI assistants can't access your current content to answer questions—you lose the citation traffic. Allowing them means your content might be summarized in AI answers, which some view as competition for their own traffic.

All Known AI Crawlers

OpenAI (ChatGPT, GPT-4)

User-Agent | Purpose | Recommendation
GPTBot | Training data collection for GPT models | Block to prevent training
ChatGPT-User | Real-time browsing for user queries | Allow for ChatGPT citations
OAI-SearchBot | OpenAI search and index features | Allow for search visibility

Anthropic (Claude)

User-Agent | Purpose | Recommendation
ClaudeBot | Training data for Claude models | Block
anthropic-ai | General Anthropic crawling | Block alongside ClaudeBot
Claude-Web | Real-time browsing for Claude users | Allow for Claude citations

Google (Gemini)

User-Agent | Purpose | Recommendation
Google-Extended | AI training for Gemini and AI Overview features | Block
Googlebot | Web search indexing (SEO) | Never block
Google-CloudVertexBot | Vertex AI training data | Block

Critical: Google-Extended and Googlebot are completely separate. Blocking Google-Extended has no effect on your Google Search rankings.
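In robots.txt terms the two are addressed independently — a sketch of opting out of AI training while keeping search indexing:

```
# Opt out of Gemini / AI Overviews training...
User-agent: Google-Extended
Disallow: /

# ...while leaving normal Google Search indexing untouched
User-agent: Googlebot
Allow: /
```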

Meta (Llama)

User-Agent | Purpose | Recommendation
Meta-ExternalAgent | Training data for Llama and Meta AI | Block
Meta-ExternalFetcher | Link previews in Facebook/Instagram | Usually allow

Other notable crawlers

User-Agent | Company | Purpose | Action
CCBot | Common Crawl | Public dataset used by many AI companies for training | Block
Bytespider | ByteDance (TikTok) | Aggressive crawler, suspected AI training | Block
Applebot-Extended | Apple | Apple Intelligence training | Block
Amazonbot | Amazon | Alexa and Amazon AI training | Block
PerplexityBot | Perplexity | AI-powered search engine (browsing, not training) | Allow
cohere-ai | Cohere | LLM training data | Block
Diffbot | Diffbot | Web data extraction for AI applications | Block
ImagesiftBot | ImageSift | Image training data collection | Block
Omgilibot | Omgili/Webz.io | Data collection for AI training datasets | Block

robots.txt: The Primary Method

Your robots.txt file lives at the root of your domain: https://yoursite.com/robots.txt. Every compliant crawler checks this file before visiting any other page on your site. This is the most reliable method for controlling AI crawlers that have committed to respecting it.

Block all training crawlers

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Google-CloudVertexBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Diffbot
Disallow: /

Allow browsing bots (can cite your content)

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

Always allow search engines

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

Selective blocking (specific directories only)

# Block AI training bots from premium content
User-agent: GPTBot
Disallow: /premium/
Disallow: /courses/
Disallow: /members/

# Allow access to public pages
User-agent: GPTBot
Allow: /blog/
Allow: /about/

Block everything with an allowlist

# Deny all by default, allow only known search engines
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: DuckDuckBot
Allow: /

The nuclear option. This blocks all unlisted bots including future AI crawlers you haven't heard of yet. Only search engines you explicitly allow can crawl your site.

Generate your robots.txt automatically: The ToolsDock robots.txt generator lets you select which crawlers to block and generates a correctly formatted file with all current user-agent strings.

Meta Tags and HTTP Headers

Meta tags provide page-level control that's separate from robots.txt. Useful when you want to allow some pages but block specific content pages from AI training.

Meta robots tag

<!-- In your HTML <head> -->

<!-- Standard: prevent indexing and following links -->
<meta name="robots" content="noindex, nofollow">

<!-- Proposed noai directives (adoption varies by company) -->
<meta name="robots" content="noai, noimageai">

<!-- Combined -->
<meta name="robots" content="noindex, noai, noimageai">

<!-- Target specific bots -->
<meta name="GPTBot" content="noindex">

Adoption of noai and noimageai across AI companies varies. These tags are a supplement to robots.txt, not a replacement. If a bot is already blocked in robots.txt, it shouldn't be fetching pages to read meta tags anyway.
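A quick way to audit which pages carry the tag is to grep your built HTML — a sketch using a throwaway /tmp/public directory and made-up filenames (your real build output path will differ):

```shell
# Create a throwaway build directory with one tagged and one
# untagged page (hypothetical files for illustration).
mkdir -p /tmp/public
cat > /tmp/public/protected.html <<'EOF'
<html><head><meta name="robots" content="noindex, noai, noimageai"></head></html>
EOF
cat > /tmp/public/open.html <<'EOF'
<html><head><meta name="robots" content="index, follow"></head></html>
EOF

# -r recurse, -l list only the names of matching files
grep -rl 'content="[^"]*noai' /tmp/public
```

Only protected.html is listed; pages without a noai directive don't match.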

HTTP response headers

# X-Robots-Tag header (equivalent to meta robots tag, but via HTTP)
# Add to your server configuration:

X-Robots-Tag: noai, noimageai

Or combine with other directives:

X-Robots-Tag: noindex, noai

HTTP headers work for any content type—not just HTML. A PDF served with X-Robots-Tag: noai communicates the restriction even though PDFs can't contain meta tags. The challenge: bots must make the request to receive the header, so this doesn't prevent the network request the way robots.txt does.
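In nginx, for example, the header can be attached to every PDF response with a single location block — a sketch to adapt (goes inside your server block, alongside your existing root/try_files configuration):

```
# Serve every PDF with an X-Robots-Tag header (nginx sketch)
location ~* \.pdf$ {
    add_header X-Robots-Tag "noai, noimageai";
}
```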

Server Configuration

nginx

# nginx.conf — block AI training bots with 403

map $http_user_agent $is_ai_bot {
    default 0;
    "~GPTBot" 1;
    "~ClaudeBot" 1;
    "~anthropic-ai" 1;
    "~Google-Extended" 1;
    "~CCBot" 1;
    "~Bytespider" 1;
    "~Meta-ExternalAgent" 1;
    "~Amazonbot" 1;
}

server {
    # ... other config

    if ($is_ai_bot) {
        return 403;
    }
}

# Alternative: rate limit instead of block.
# This slows bots without refusing them outright. Note that
# limit_req can't go inside an "if" block — key the zone on a
# mapped variable instead (an empty key disables limiting
# for normal visitors).

map $http_user_agent $ai_bot_key {
    default "";
    "~GPTBot" $http_user_agent;
    # ... same bot patterns as above
}

limit_req_zone $ai_bot_key zone=bots:10m rate=1r/m;

location / {
    limit_req zone=bots burst=2 nodelay;
}

Apache .htaccess

# .htaccess — block AI training bots
RewriteEngine On

RewriteCond %{HTTP_USER_AGENT} GPTBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ClaudeBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} anthropic-ai [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Google-Extended [NC,OR]
RewriteCond %{HTTP_USER_AGENT} CCBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Bytespider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Meta-ExternalAgent [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Amazonbot [NC]
RewriteRule .* - [F,L]

Cloudflare (WAF rules)

If you use Cloudflare, you can create WAF custom rules to block AI bots at the edge before they reach your server. Go to Security → WAF → Custom Rules and create a rule matching (http.user_agent contains "GPTBot") or (http.user_agent contains "ClaudeBot") etc., with action "Block." Cloudflare also has managed bot lists that include known AI crawlers.

Server-side blocking vs. robots.txt

Server-side blocking (nginx, .htaccess, WAF) returns an HTTP error to the bot. This is technically stronger than robots.txt—the bot can't read the content even if it ignores the robots.txt directive. The downside: your server still spends resources handling and responding to each request. With robots.txt, a compliant bot never makes the request in the first place. For maximum protection, use both.
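A quick pre-deployment check helps keep the two layers in sync — a sketch that greps a robots.txt for each bot named in your server rules (the /tmp path and the abbreviated bot list are illustrative; use your real file and full list):

```shell
# A deliberately incomplete robots.txt for demonstration:
cat > /tmp/robots.txt <<'EOF'
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
EOF

# Flag any server-side-blocked bot missing from robots.txt
for bot in GPTBot ClaudeBot CCBot; do
  if grep -q "^User-agent: $bot" /tmp/robots.txt; then
    echo "$bot: listed"
  else
    echo "$bot: MISSING from robots.txt"
  fi
done
```

Here CCBot is reported missing, since the sample file never names it.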

Monitoring Bot Traffic

Web analytics tools that rely on JavaScript (Google Analytics, Plausible, Fathom) don't capture bot traffic. Bots don't execute JavaScript. Raw server access logs are the only reliable way to see what's visiting your site.

Analyzing access logs

# Find all AI crawler hits in nginx access log
grep -E "(GPTBot|ClaudeBot|anthropic-ai|Google-Extended|CCBot|Bytespider)" \
  /var/log/nginx/access.log

# Count by crawler type
grep -oE "(GPTBot|ClaudeBot|anthropic-ai|Google-Extended|CCBot|Bytespider)" \
  /var/log/nginx/access.log | sort | uniq -c | sort -rn

# Show full log lines for GPTBot requests
grep "GPTBot" /var/log/nginx/access.log | tail -50

# Count requests per day for AI bots
grep "GPTBot" /var/log/nginx/access.log | \
  awk '{print $4}' | cut -d: -f1 | sort | uniq -c

What to look for

  • Volume: Some bots are aggressive. Bytespider in particular has been reported crawling thousands of pages per hour on small sites. High bot traffic wastes bandwidth and server resources.
  • IP ranges: You can verify a bot's identity by checking if the request IP belongs to the company's documented IP ranges. OpenAI, Google, and others publish their crawler IP lists.
  • Compliance: After adding robots.txt blocks, check the logs a few days later to confirm the named bots stopped visiting. If they didn't, they're not honoring robots.txt.
  • New bots: Periodically check for unfamiliar user-agent strings. New AI crawlers emerge regularly.
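The check for unfamiliar user-agents can be scripted — a sketch over an invented sample log ("MysteryCrawler" is a made-up name; extend the known-agent pattern to your full list):

```shell
# Two sample combined-format log lines (invented for illustration):
cat > /tmp/access.log <<'EOF'
198.51.100.2 - - [12/Mar/2026:10:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "GPTBot/1.2"
198.51.100.3 - - [12/Mar/2026:10:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "MysteryCrawler/0.1"
EOF

# Extract the user-agent (6th quoted field in combined format),
# drop known agents, and rank whatever is left.
awk -F'"' '{print $6}' /tmp/access.log \
  | grep -ivE 'GPTBot|ClaudeBot|Googlebot|Bingbot' \
  | sort | uniq -c | sort -rn
```

Anything that survives the filter is a candidate for a closer look and, if it turns out to be an AI crawler, a new robots.txt entry.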

What Actually Works

There's a spectrum of effectiveness. Here's an honest assessment:

Method | Effectiveness | Against whom
robots.txt with named user-agents | High | Major AI companies that have committed to compliance (OpenAI, Anthropic, Google, Meta, Apple, Amazon)
Server-level blocking (nginx, .htaccess) | High | Any bot using its documented user-agent string
Meta noai/noimageai tags | Moderate | Some companies; adoption still inconsistent
robots.txt wildcard block (User-agent: *) | Moderate | Polite bots that check robots.txt; ignored by scrapers that don't
IP blocking | Moderate | Bots using documented IP ranges; ineffective against distributed or rotating IP bots
Paywalls / login walls | High | Any automated crawler that doesn't have credentials
CAPTCHA challenges | Moderate | Bots without CAPTCHA-solving capability; increasingly less effective

What doesn't work: anything that relies on bots voluntarily identifying themselves honestly. A bad actor can trivially set any user-agent string. The methods above only work against bots that (a) use their documented user-agent and (b) respect the response. For major AI companies, both conditions hold. For smaller or undisclosed scrapers, neither may hold.

The hardest problem: undisclosed scraping

Some content ends up in training datasets through intermediaries—scraped via Common Crawl, licensed from data brokers, or collected by companies that don't disclose their AI use. You can block CCBot to prevent the Common Crawl path, but the broker and licensing paths have no technical countermeasure. This is where legal approaches become relevant.

The Balanced Approach

Most site owners land in the same place: block training, allow citation. The configuration below implements this:

# robots.txt — Block training, allow AI browsing/citations

# ===== BLOCK: AI Training Crawlers =====
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Google-CloudVertexBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: cohere-ai
Disallow: /

# ===== ALLOW: AI Browsing/Citation Bots =====
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

# ===== ALWAYS ALLOW: Search Engines =====
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: DuckDuckBot
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

Generate a correctly formatted version for your site—including the most up-to-date list of user-agents—with the ToolsDock robots.txt generator.

After setting up robots.txt

  1. Wait 48–72 hours, then check your server logs to verify named bots have stopped visiting
  2. Monitor quarterly for new AI crawler user-agents—the field is still growing rapidly
  3. Add ToS language prohibiting AI training use if you want a legal paper trail
  4. Consider the EU opt-out mechanisms as they become available for cross-border protection

Last updated: March 2026. The AI crawler landscape changes frequently—check quarterly for new user-agents and updated commitments from AI companies.
