Managing crawl budget for large sites

If you manage a site with millions or even billions of URLs, bear in mind that Google and Bing each have a crawl budget: a limit on the number of URLs they are prepared to crawl for a given domain, determined largely by its authority.


If a less authoritative domain has billions of URLs, Google may never crawl potentially important sections of the site, which costs you traffic.

So one of the biggest SEO challenges for large ecommerce sites is balancing:

Not missing traffic by excluding product and aspect (aka attribute) combinations that have search demand

vs.

Not spamming the search engines with many combinations for which there is no demand

An example

Consider the Size and Colour aspects on a site that sells both televisions and shoes:

Televisions
  • Size: high search volume, important filter aspect
  • Colour: low search volume, unimportant filter
Shoes
  • Size: low search volume, important filter
  • Colour: high search volume, important filter aspect

Why not open up every combination?

Take one category, e.g. Men’s sports shoes, with six aspects in your catalogue:

Aspect | Example | Number of aspect values
Brand | Nike | 150
Shoe size | 10 | 30
Colour | Blue | 15
Style | Basketball shoes | 5
Material | Leather | 4
Line | Air Jordan | 200

Picking one value for every aspect, the number of combinations multiplies up very quickly:

150 × 30 × 15 × 5 × 4 × 200 = 270,000,000

That is, 270 million possible URLs for this category alone!

The solution

Work out which aspects have values that people actually search for in Google, and open only those aspects for crawling.

You’ll need a large and representative keyword set, potentially millions of keywords depending on the scope of your site, but here’s a rough example using a limited keyword set:

Category: Mens Trainers | Matching Aspect 1 | Matching Aspect 2 | Avg. Monthly UK Google Searches
mens white trainers | Colour | - | 2900
mens running trainers | Style | - | 2900
mens black trainers | Colour | - | 2400
nike mens trainers | Brand | - | 1600
white trainers mens | Colour | - | 1600
mens trainers uk | None | - | 2400
black trainers mens | Colour | - | 1900
mens gym trainers white | Style | Colour | 1300
all black trainers mens | Colour | - | 1000
mens red trainers | Colour | - | 880

Full list here. Note that the data has been tweaked to better illustrate the concept. Also, for American readers, trainers = sneakers :-)

The search volume for these keywords can then be clustered by the category’s aspects, e.g.

Aspect | Searches containing a value for this aspect
Colour | 19,920
Style | 13,420
None | 6,000
Brand | 3,750
Material | 3,150
Size | 2,720

From this we can see that Colour and Style are important to open up for crawling, while Material and Size are less so.
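The clustering step can be sketched in a few lines of Python. This is only a rough illustration on the ten sample rows shown earlier, so its totals will not match the full-list figures in the table. It also assumes that a keyword matching two aspects counts towards both, which is my guess at a reasonable rule rather than a fixed method; the names sample_rows and demand_by_aspect are made up for the example.

```python
from collections import defaultdict

# (keyword, matched aspects, avg. monthly searches): the ten sample rows only.
sample_rows = [
    ("mens white trainers", ["Colour"], 2900),
    ("mens running trainers", ["Style"], 2900),
    ("mens black trainers", ["Colour"], 2400),
    ("nike mens trainers", ["Brand"], 1600),
    ("white trainers mens", ["Colour"], 1600),
    ("mens trainers uk", [], 2400),
    ("black trainers mens", ["Colour"], 1900),
    ("mens gym trainers white", ["Style", "Colour"], 1300),
    ("all black trainers mens", ["Colour"], 1000),
    ("mens red trainers", ["Colour"], 880),
]


def demand_by_aspect(rows):
    """Sum search volume per aspect, highest first.

    Assumption: a keyword counts once for each aspect it matches, and
    keywords that match no aspect are grouped under "None".
    """
    totals = defaultdict(int)
    for _keyword, aspects, searches in rows:
        for aspect in aspects or ["None"]:
            totals[aspect] += searches
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


for aspect, searches in demand_by_aspect(sample_rows).items():
    print(f"{aspect}: {searches:,}")
```

On a real keyword set the hard part is the matching itself (deciding that "white" is a Colour value and "gym" a Style value), which this sketch takes as given.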

Good:

  • https://site.com/trainers/blue
  • https://site.com/trainers/running
  • https://site.com/trainers/blue-running

Potentially a waste:

  • https://site.com/trainers/size-10
  • https://site.com/trainers/leather/size-10

Just removing Material and Size dramatically reduces the number of aspect combinations:

150 × 15 × 5 × 200 = 2,250,000

That saves 267,750,000 URLs from needing to be crawled. Not bad!
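If you want to try different sets of excluded aspects, the arithmetic is easy to script. Here is a minimal sketch using the value counts from the Men’s sports shoes table; the function name count_combinations is just an illustration, and it assumes one value is chosen for every aspect that stays open.

```python
from math import prod

# Number of values per aspect, copied from the Men's sports shoes table.
aspect_value_counts = {
    "Brand": 150,
    "Shoe size": 30,
    "Colour": 15,
    "Style": 5,
    "Material": 4,
    "Line": 200,
}


def count_combinations(counts, excluded=()):
    """Combinations that pick one value for every aspect not in `excluded`."""
    return prod(n for aspect, n in counts.items() if aspect not in excluded)


all_urls = count_combinations(aspect_value_counts)
open_urls = count_combinations(aspect_value_counts, excluded={"Material", "Shoe size"})

print(f"All aspects open: {all_urls:,}")
print(f"Without Material and Shoe size: {open_urls:,}")
print(f"Saved: {all_urls - open_urls:,}")
```

Swapping other aspects into `excluded` shows quickly which ones contribute most to the URL count.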

Because aspects such as size and colour are shared across categories, you can’t simply block an aspect site-wide. To exclude aspect pages selectively, append a string that you have disallowed in robots.txt to the URLs you don’t want crawled, e.g. a query parameter such as search=nope.

And then in your robots.txt:

# Block any URL that contains the search=nope marker
User-agent: *
Disallow: /*search=nope
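To sanity-check which URLs a rule like this would block, here is a rough Python sketch. Everything in it is an assumption for illustration: the OPEN_ASPECTS set, the tag_url helper, and the idea of appending the marker as a query parameter. The matcher is a hand-rolled approximation of the wildcard behaviour Google and Bing document ('*' matches any characters, a trailing '$' anchors the end); as far as I’m aware Python’s built-in urllib.robotparser does not interpret '*', so it isn’t used here.

```python
import re

OPEN_ASPECTS = {"Colour", "Style"}  # aspects with search demand in the example
BLOCK_MARKER = "search=nope"        # the string disallowed in robots.txt
RULE = "/*search=nope"              # the Disallow pattern from robots.txt


def tag_url(path, aspects):
    """Append the marker when the page filters on any aspect that is not open."""
    if set(aspects) <= OPEN_ASPECTS:
        return path
    separator = "&" if "?" in path else "?"
    return f"{path}{separator}{BLOCK_MARKER}"


def is_disallowed(url_path, pattern):
    """Approximate wildcard matching against a path plus query string."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal parts and let each '*' match any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, url_path) is not None


for path, aspects in [
    ("/trainers/blue-running", ["Colour", "Style"]),
    ("/trainers/leather/size-10", ["Material", "Size"]),
]:
    url = tag_url(path, aspects)
    print(url, "-> blocked" if is_disallowed(url, RULE) else "-> crawlable")
```

Treat this as a quick check only; the robots.txt testing tools the search engines provide remain the authority on how a rule is actually interpreted.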

Summary

Sites with broad inventories, whether products, jobs, holiday destinations or anything else, should open for crawling only those aspect combinations where there is real external demand.

