Detect AI Agent Bots from User-Agent
Identify AI agent crawlers like ChatGPT-User, GPTBot, and ClaudeBot from User-Agent strings. Understand how to manage AI crawler access with robots.txt.
Detailed Explanation
Detecting AI Agent Crawlers
With the rise of large language models (LLMs), a new category of web crawlers has emerged: AI agent bots. These crawlers fetch web content to train AI models or to provide real-time web browsing capabilities.
Known AI Agent User-Agents
OpenAI:
- ChatGPT-User — used when ChatGPT browses the web in real time
- GPTBot/1.0 — OpenAI's general training data crawler
- OAI-SearchBot/1.0 — OpenAI's search product crawler
Anthropic:
ClaudeBot— Anthropic's web crawler for Claude
Others:
- Bytespider — ByteDance (powers TikTok's AI features)
- CCBot/2.0 — Common Crawl (open dataset used by many AI companies)
- Google-Extended — Google's robots.txt token for opting out of AI training (separate from Googlebot)
- PerplexityBot — Perplexity AI's search crawler
Example UA Strings
ChatGPT browsing:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ChatGPT-User/1.0; +https://openai.com/bot)
GPTBot:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
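The tokens listed above can be matched directly against an incoming User-Agent header. A minimal sketch in Python (the token list is an illustrative subset, not exhaustive — new AI crawlers appear frequently):

```python
import re
from typing import Optional

# Known AI-crawler tokens (illustrative subset; the real list changes often)
AI_BOT_TOKENS = [
    "ChatGPT-User",
    "GPTBot",
    "OAI-SearchBot",
    "ClaudeBot",
    "Bytespider",
    "CCBot",
    "PerplexityBot",
]

# One case-insensitive pattern matching any known token
AI_BOT_PATTERN = re.compile(
    "|".join(re.escape(token) for token in AI_BOT_TOKENS), re.IGNORECASE
)

def detect_ai_bot(user_agent: str) -> Optional[str]:
    """Return the matched bot token, or None for a regular client UA."""
    match = AI_BOT_PATTERN.search(user_agent or "")
    return match.group(0) if match else None
```

For example, `detect_ai_bot("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)")` returns `"GPTBot"`, while a typical browser UA returns `None`. Substring matching is deliberate: real UA strings embed the token among other components, and it also catches version suffixes like `GPTBot/1.0`.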
Managing AI Crawler Access
Use robots.txt to control AI crawler access:
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
# Allow search engine crawlers
User-agent: Googlebot
Allow: /
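You can sanity-check rules like these with Python's standard-library robots.txt parser before deploying them. A short sketch (the rules are a subset of the block above):

```python
from urllib.robotparser import RobotFileParser

# A subset of the robots.txt rules above, as inline text
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse() takes a list of lines

print(parser.can_fetch("GPTBot", "/some/article"))     # False: blocked
print(parser.can_fetch("Googlebot", "/some/article"))  # True: allowed
```

Note that robots.txt is purely advisory; a crawler that ignores it must be blocked server-side (for example, by UA matching or IP filtering).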
Key Considerations
- AI crawlers are a rapidly evolving space — new bots appear frequently
- Some AI companies respect robots.txt; others may not
- Blocking GPTBot does not affect Googlebot or regular Google Search
- Google-Extended controls Gemini/Bard training but is separate from Google Search indexing
- Many AI crawlers identify themselves voluntarily, but some use generic or spoofed UAs
Use Case
Content publishers and website operators use AI bot detection to control whether their content is used for AI training purposes. Legal and compliance teams monitor AI crawler activity to enforce copyright policies. DevOps teams implement rate limiting specifically for AI crawlers to manage server load.
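Rate limiting for AI crawlers can be as simple as a per-bot token bucket keyed on the detected bot name. A minimal sketch (the class name and parameters are illustrative, not a specific library's API):

```python
import time
from collections import defaultdict
from typing import Optional

class BotRateLimiter:
    """Per-bot token bucket: `rate` requests/second, bursts up to `capacity`."""

    def __init__(self, rate: float = 1.0, capacity: float = 5.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = defaultdict(lambda: capacity)  # bot name -> available tokens
        self.last = {}                               # bot name -> last request time

    def allow(self, bot: str, now: Optional[float] = None) -> bool:
        """Return True if this bot's request should be served, False to throttle."""
        now = time.monotonic() if now is None else now
        elapsed = now - self.last.get(bot, now)
        self.last[bot] = now
        # Refill tokens for the elapsed time, capped at bucket capacity
        self.tokens[bot] = min(self.capacity, self.tokens[bot] + elapsed * self.rate)
        if self.tokens[bot] >= 1.0:
            self.tokens[bot] -= 1.0
            return True
        return False
```

With `rate=1, capacity=2`, a bot can burst two requests, then is limited to roughly one per second; requests from regular browsers (no bot match) would simply bypass the limiter.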