I requested the robots.txt files of the 100 most visited news websites in the world, as reported by Ahrefs.
Of the 93 that responded with a 200 (the other 7 returned a 403 via their CDN), the table below shows the percentage that partly or wholly disallow user agents used explicitly for training LLMs. Of course, this doesn't necessarily prevent their content from being used in RAG or other systems that reference external sources at runtime.
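For reference, a "whole" disallow blocks a bot from the entire site, while a "partial" one blocks only certain paths. Here's an illustrative robots.txt snippet (made up for this post, not taken from any of the surveyed sites):

```
# whole-site disallow: GPTBot may fetch nothing
User-agent: GPTBot
Disallow: /

# partial disallow: only /premium/ is off-limits to CCBot
User-agent: CCBot
Disallow: /premium/
```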
| Bot | % of sites disallowing |
|---|---|
| GPTBot | 57% |
| CCBot | 47% |
| Google-Extended | 40% |
| anthropic-ai | 38% |
| Bytespider | 32% |
| FacebookBot | 27% |
| Applebot-Extended | 23% |
| FriendlyCrawler | 11% |
| Baiduspider | 5% |
| img2dataset | 5% |
| cohere-training-data-crawler | 3% |
| AmazonBot | 1% |
These user agents are the ones identified as being used for training models, rather than for sending traffic via RAG at query time, e.g. GPTBot rather than ChatGPT-User.
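If you want to reproduce the check, here's a minimal Python sketch using the standard library's robotparser together with requests. The domain, User-Agent string, and bot list handling are my assumptions, not the author's actual script, and testing only the site root means it catches whole-site blocks; partial disallows would need per-path checks against the paths each robots.txt lists.

```python
import urllib.robotparser

import requests  # assumption: third-party dependency, used so we can set our own User-Agent

# The user agents surveyed in the table above.
TRAINING_BOTS = [
    "GPTBot", "CCBot", "Google-Extended", "anthropic-ai", "Bytespider",
    "FacebookBot", "Applebot-Extended", "FriendlyCrawler", "Baiduspider",
    "img2dataset", "cohere-training-data-crawler", "AmazonBot",
]

def blocked_bots(domain: str) -> dict[str, bool]:
    """Fetch a site's robots.txt and report, per bot, whether the site root
    is disallowed. can_fetch() on "/" only detects whole-site blocks; a
    partial disallow would need checks against the specific paths it names."""
    resp = requests.get(
        f"https://{domain}/robots.txt",
        timeout=10,
        headers={"User-Agent": "robots-survey/0.1"},  # hypothetical UA string
    )
    resp.raise_for_status()  # some CDNs 403 non-browser clients, as 7 sites did here
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(resp.text.splitlines())
    return {bot: not rp.can_fetch(bot, f"https://{domain}/") for bot in TRAINING_BOTS}

if __name__ == "__main__":
    print(blocked_bots("example.com"))  # hypothetical example domain
```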
It's surprising, though, that the figure isn't closer to 100%, so getting a brand mentioned in the press will still sometimes get it into the training corpus of major LLMs.