Managing crawl budget for large sites

March 25, 2016

If you manage a site with millions or even billions of URLs, bear in mind that Google and Bing each have a crawl budget: a limit on the number of URLs they are prepared to crawl for any given domain, determined by that domain's authority.

If a less authoritative domain has billions of URLs, Google may never crawl potentially important sections of the site, and that costs you traffic.

So one of the biggest SEO challenges for large ecommerce sites is balancing:

Not missing traffic by excluding product and aspect (aka attribute) combinations that have search demand

vs.

Not spamming the search engines with many combinations for which there is no demand

An example

Take the Size and Colour aspects on a site that sells both televisions and shoes:

	Size	Colour
Televisions	High search volume; important filter	Low search volume; unimportant filter
Shoes	Low search volume; important filter	High search volume; important filter

Why not open up every combination?

Take one category e.g. Men’s sports shoes, with 6 aspects in your catalogue:

Aspect	Example	Number of aspect values
Brand	Nike	150
Shoe size	10	30
Colour	Blue	15
Style	Basketball shoes	5
Material	Leather	4
Line	Air Jordan	200

Every combination of every aspect value multiplies up very quickly:

150 * 30 * 15 * 5 * 4 * 200 = 270,000,000

That is, 270 million possible URLs for this category alone!
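As a quick sanity check, the multiplication can be reproduced in a few lines of Python. This is only a sketch: the dictionary name is made up for illustration, and the counts are copied from the table above.

```python
from math import prod

# Number of values per aspect, copied from the table above.
ASPECT_VALUES = {
    "Brand": 150,
    "Shoe size": 30,
    "Colour": 15,
    "Style": 5,
    "Material": 4,
    "Line": 200,
}

# One URL per combination of one value from every aspect.
total = prod(ASPECT_VALUES.values())
print(f"{total:,} possible URLs")
```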

The solution

Understand which aspects have values which are primarily searched in Google and only open those aspects for crawling.

You’ll need a large and representative keyword set, potentially millions of keywords depending on the scope of your site, but here’s a rough example using a limited keyword set:

Category: Mens Trainers	Matching Aspect 1	Matching Aspect 2	Avg. Monthly UK Google Searches
mens white trainers	Colour		2900
mens running trainers	Style		2900
mens black trainers	Colour		2400
nike mens trainers	Brand		1600
white trainers mens	Colour		1600
mens trainers uk	None		2400
black trainers mens	Colour		1900
mens gym trainers white	Style	Colour	1300
all black trainers mens	Colour		1000
mens red trainers	Colour		880

Full list here (note that the data has been tweaked to better illustrate the concept). Also, for American readers, trainers = sneakers 🙂

The search volume for the keywords can be clustered into the category’s aspects e.g.

Aspect	Searches Containing a value for this aspect
Colour	19,920
Style	13,420
None	6,000
Brand	3,750
Material	3,150
Size	2,720

From this we can see that Colour and Style are important to open up to crawl, Material and Size less so.
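The clustering step might look something like the sketch below. It assumes each keyword has already been tagged with its matching aspects, and it uses only the ten sample rows shown earlier, so its totals are smaller than the full-list figures in the table. The names KEYWORDS and volume_by_aspect are made up for illustration.

```python
from collections import defaultdict

# (keyword, matching aspects, avg. monthly searches) from the ten sample rows.
KEYWORDS = [
    ("mens white trainers", ["Colour"], 2900),
    ("mens running trainers", ["Style"], 2900),
    ("mens black trainers", ["Colour"], 2400),
    ("nike mens trainers", ["Brand"], 1600),
    ("white trainers mens", ["Colour"], 1600),
    ("mens trainers uk", [], 2400),
    ("black trainers mens", ["Colour"], 1900),
    ("mens gym trainers white", ["Style", "Colour"], 1300),
    ("all black trainers mens", ["Colour"], 1000),
    ("mens red trainers", ["Colour"], 880),
]

def volume_by_aspect(rows):
    """Sum search volume per aspect, largest first.

    A keyword matching two aspects counts towards both;
    a keyword matching no aspect is counted under "None".
    """
    totals = defaultdict(int)
    for _keyword, aspects, searches in rows:
        for aspect in aspects or ["None"]:
            totals[aspect] += searches
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

totals = volume_by_aspect(KEYWORDS)
for aspect, searches in totals.items():
    print(f"{aspect}\t{searches:,}")
```

On a real keyword set, the hard part is the tagging itself (matching keywords against your catalogue's aspect values), which this sketch takes as given.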

Good:

Potentially a waste:

Just removing Material and Size dramatically reduces the number of aspect combinations:

150 × 15 × 5 × 200 = 2,250,000

That saves us 267,750,000 URLs that would otherwise need crawling. Not bad!
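To try different choices of which aspects to close, a small helper like the one below may be handy. It is a sketch only: url_count and its closed parameter are hypothetical names, and the counts are the ones from the Men’s sports shoes table.

```python
from math import prod

# Number of values per aspect, from the Men's sports shoes table.
ASPECT_VALUES = {
    "Brand": 150,
    "Shoe size": 30,
    "Colour": 15,
    "Style": 5,
    "Material": 4,
    "Line": 200,
}

def url_count(aspect_values, closed=()):
    """Count combinations across every aspect that is not closed to crawling."""
    return prod(n for aspect, n in aspect_values.items() if aspect not in closed)

full = url_count(ASPECT_VALUES)
reduced = url_count(ASPECT_VALUES, closed={"Material", "Shoe size"})
saved = full - reduced

print(f"open to crawl: {reduced:,}")
print(f"saved: {saved:,} ({saved / full:.1%} of the original)")
```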

As aspects can be common across categories (e.g. size and colour), they need to be excluded selectively, category by category. To do that, append a string that you have disallowed in robots.txt to the URLs you choose not to have crawled, e.g. the search=nope string matched by the rule below.

And then in your robots.txt:

User-agent: *
Disallow: /*search=nope
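Before deploying a wildcard rule like this, it is worth checking which URLs it would actually catch. The sketch below is a simplified take on the wildcard matching that Google and Bing document (* matches any run of characters, a trailing $ anchors the end of the URL). It is not a full robots.txt parser, and the example paths are made up. Python’s built-in urllib.robotparser is not used here because it does plain prefix matching and does not handle the * wildcard.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path plus query string."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn each escaped '*' into "match anything".
    regex = re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        regex += "$"
    # Rules match from the start of the path, so re.match is the right anchor.
    return re.match(regex, path) is not None

RULE = "/*search=nope"

# Hypothetical URLs: the first carries the excluded string, the second does not.
blocked = rule_matches(RULE, "/mens-trainers?material=leather&search=nope")
allowed = not rule_matches(RULE, "/mens-trainers?colour=white")
print(blocked, allowed)
```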

Summary

Sites with broad inventories, whether products, jobs, holiday destinations or anything else, should take care to open for crawling only those aspect combinations for which there is real external demand.
