I requested the robots.txt files of the 100 most visited news websites in the world, as reported by Ahrefs.
Of the 93 that responded with a 200 (the other 7 returned a 403 via their CDN), the table below shows the percentage that partly or wholly disallow user agents used explicitly for training LLMs. Of course, this doesn't necessarily prevent their content from being used in RAG or other systems that reference external sources at runtime.
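For reference, a "whole" disallow blocks a bot from the entire site, while a "partial" one blocks only certain paths. Here's an illustrative robots.txt snippet (made up for this post, not taken from any of the surveyed sites):

```
# whole-site disallow: GPTBot may fetch nothing
User-agent: GPTBot
Disallow: /

# partial disallow: only /premium/ is off-limits to CCBot
User-agent: CCBot
Disallow: /premium/
```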
| Bot | % of sites disallowing |
|---|---|
| GPTBot | 57% |
| CCBot | 47% |
| Google-Extended | 40% |
| anthropic-ai | 38% |
| Bytespider | 32% |
| FacebookBot | 27% |
| Applebot-Extended | 23% |
| FriendlyCrawler | 11% |
| Baiduspider | 5% |
| img2dataset | 5% |
| cohere-training-data-crawler | 3% |
| AmazonBot | 1% |
These user agents are the ones identified as being used for training models, rather than for sending traffic via RAG at query time, e.g. GPTBot rather than ChatGPT-User.
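If you want to reproduce the check, here's a minimal Python sketch using the standard library's robotparser together with requests. The domain, User-Agent string, and bot list handling are my assumptions, not the author's actual script, and testing only the site root means it catches whole-site blocks; partial disallows would need per-path checks against the paths each robots.txt lists.

```python
import urllib.robotparser

import requests  # assumption: third-party dependency, used so we can set our own User-Agent

# The user agents surveyed in the table above.
TRAINING_BOTS = [
    "GPTBot", "CCBot", "Google-Extended", "anthropic-ai", "Bytespider",
    "FacebookBot", "Applebot-Extended", "FriendlyCrawler", "Baiduspider",
    "img2dataset", "cohere-training-data-crawler", "AmazonBot",
]

def blocked_bots(domain: str) -> dict[str, bool]:
    """Fetch a site's robots.txt and report, per bot, whether the site root
    is disallowed. can_fetch() on "/" only detects whole-site blocks; a
    partial disallow would need checks against the specific paths it names."""
    resp = requests.get(
        f"https://{domain}/robots.txt",
        timeout=10,
        headers={"User-Agent": "robots-survey/0.1"},  # hypothetical UA string
    )
    resp.raise_for_status()  # some CDNs 403 non-browser clients, as 7 sites did here
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(resp.text.splitlines())
    return {bot: not rp.can_fetch(bot, f"https://{domain}/") for bot in TRAINING_BOTS}

if __name__ == "__main__":
    print(blocked_bots("example.com"))  # hypothetical example domain
```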
It's surprising, though, that the figure isn't closer to 100%, so getting a brand mentioned in the press will still sometimes get it into the training corpus of major LLMs.