News sites that block user agents for training LLMs

May 03, 2025

A nightclub bouncer turns away a robot

I requested the robots.txt of the 100 most visited news websites in the world as reported by Ahrefs.

Of the 93 that returned a 200 response (the other 7 returned a 403 from their CDN), the table shows the percentage that partly or wholly disallow user agents used explicitly for training LLMs. Of course, this doesn’t necessarily prevent their content from being used in RAG or other systems that reference external sources at runtime.

BOT NAME                        % OF SITES DISALLOWING
GPTBot                          57%
CCBot                           47%
Google-Extended                 40%
anthropic-ai                    38%
Bytespider                      32%
FacebookBot                     27%
Applebot-Extended               23%
FriendlyCrawler                 11%
Baiduspider                     5%
img2dataset                     5%
cohere-training-data-crawler    3%
AmazonBot                       1%

These are the user agents identified as being used for training models, rather than for sending traffic via RAG, e.g. GPTBot but not ChatGPT-User.
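As a rough sketch of how a check like this could be run (not necessarily the script behind the numbers above), the Python below fetches a robots.txt, finds which training bots it names explicitly, and turns the counts into percentages. The function names `fetch_robots`, `disallowed_bots` and `percent_disallowing`, and the request User-Agent string, are made up for illustration. It also assumes that "partly or wholly disallow" means the bot appears in a group with at least one non-empty Disallow rule, and that a blanket `User-agent: *` rule does not count.

```python
import urllib.error
import urllib.request
from collections import Counter
from typing import Optional

# Training-related user agents, as listed in the table above.
TRAINING_BOTS = [
    "GPTBot", "CCBot", "Google-Extended", "anthropic-ai", "Bytespider",
    "FacebookBot", "Applebot-Extended", "FriendlyCrawler", "Baiduspider",
    "img2dataset", "cohere-training-data-crawler", "AmazonBot",
]


def fetch_robots(domain: str, timeout: float = 10.0) -> Optional[str]:
    """Return the robots.txt body, or None if the request fails (e.g. a 403)."""
    request = urllib.request.Request(
        f"https://{domain}/robots.txt",
        headers={"User-Agent": "robots-survey-sketch/0.1"},  # hypothetical UA
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:  # covers HTTPError, URLError and timeouts
        return None


def disallowed_bots(robots_txt: str, bots=TRAINING_BOTS) -> set[str]:
    """Bots named explicitly in a group that has a non-empty Disallow rule."""
    wanted = {bot.lower(): bot for bot in bots}
    blocked = set()
    current_agents = []  # user agents of the group being read
    in_rules = False     # True once the current group's rules have started
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value:  # empty Disallow blocks nothing
                blocked.update(wanted[a] for a in current_agents if a in wanted)
    return blocked


def percent_disallowing(robots_files: list[str]) -> dict[str, int]:
    """Percentage of the given robots.txt files that disallow each bot."""
    counts = Counter()
    for text in robots_files:
        counts.update(disallowed_bots(text))
    total = len(robots_files) or 1  # avoid dividing by zero on an empty list
    return {bot: round(100 * counts[bot] / total) for bot in TRAINING_BOTS}
```

To reproduce a table like the one above, you would call `fetch_robots` for each domain, drop the `None` results, and pass the rest to `percent_disallowing`. The parsing is deliberately done by hand rather than with `urllib.robotparser`, because the question here is whether a bot is named explicitly, not whether a particular URL can be fetched.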

It is surprising that these figures are not closer to 100%. It means that brand mentions in the press will still sometimes make it into the training corpora of major LLMs.


Chris Reynolds

Chris Reynolds is a Bay Area Product Manager with 15 years of international experience in SEO, digital marketing, UX, analytics and team management.

© 2025 Chris Reynolds