Managing crawl budget for large sites

March 25, 2016


If you manage a site with millions or even billions of URLs, bear in mind that Google and Bing work to a crawl budget: a limit on the number of URLs they are prepared to crawl for each domain, determined by its authority.

If a less authoritative domain has billions of URLs, Google may never crawl potentially important sections of the site, which loses you traffic.

So one of the biggest SEO challenges for large ecommerce sites is balancing:

Not missing traffic by excluding product and aspect (aka attribute) combinations that have search demand

vs.

Not spamming the search engines with many combinations for which there is no demand

An example

Consider the Size and Colour aspects for a site that sells both televisions and shoes:

| | Size | Colour |
|---|---|---|
| Televisions | High search volume; important filter aspect | Low search volume; unimportant filter |
| Shoes | Low search volume; important filter | High search volume; important filter aspect |

Why not open up every combination?

Take one category, e.g. Men’s sports shoes, with six aspects in your catalogue:

| Aspect | Example | Number of aspect values |
|---|---|---|
| Brand | Nike | 150 |
| Shoe size | 10 | 30 |
| Colour | Blue | 15 |
| Style | Basketball shoes | 5 |
| Material | Leather | 4 |
| Line | Air Jordan | 200 |

Every combination of every aspect value multiplies up very quickly:

150 * 30 * 15 * 5 * 4 * 200 = 270,000,000

That is, 270 million possible URLs for this category alone!
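To see how quickly the total grows as each aspect is added, here is a minimal Python sketch. The value counts come from the table above; the dictionary and function names are illustrative only, and the sketch assumes (as the arithmetic above does) that every URL carries exactly one value for every aspect.

```python
from itertools import accumulate
from operator import mul

# Number of values per aspect, taken from the table above.
ASPECT_VALUE_COUNTS = {
    "Brand": 150,
    "Shoe size": 30,
    "Colour": 15,
    "Style": 5,
    "Material": 4,
    "Line": 200,
}


def running_url_counts(counts):
    """Return the cumulative product of value counts as each aspect is added."""
    return list(accumulate(counts.values(), mul))


for aspect, total in zip(ASPECT_VALUE_COUNTS, running_url_counts(ASPECT_VALUE_COUNTS)):
    print(f"after adding {aspect}: {total:,} combinations")
```

The last line printed is the 270,000,000 figure from the calculation above.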

The solution

Work out which aspects have values that people actually search for in Google, and open only those aspects for crawling.

You’ll need a large and representative keyword set, potentially millions of keywords depending on the scope of your site, but here’s a rough illustration using a limited keyword set:

| Category: Mens Trainers | Matching Aspect 1 | Matching Aspect 2 | Avg. Monthly UK Google Searches |
|---|---|---|---|
| mens white trainers | Colour | | 2900 |
| mens running trainers | Style | | 2900 |
| mens black trainers | Colour | | 2400 |
| nike mens trainers | Brand | | 1600 |
| white trainers mens | Colour | | 1600 |
| mens trainers uk | None | | 2400 |
| black trainers mens | Colour | | 1900 |
| mens gym trainers white | Style | Colour | 1300 |
| all black trainers mens | Colour | | 1000 |
| mens red trainers | Colour | | 880 |

Full list here. Note that the data has been tweaked to better illustrate the concept. Also, for American readers, trainers = sneakers 🙂

The search volume for the keywords can then be clustered by the category’s aspects, e.g.

| Aspect | Searches containing a value for this aspect |
|---|---|
| Colour | 19,920 |
| Style | 13,420 |
| None | 6,000 |
| Brand | 3,750 |
| Material | 3,150 |
| Size | 2,720 |

From this we can see that Colour and Style are important to open up for crawling, while Material and Size are less so.
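The clustering step can be sketched in Python roughly as follows. This is only an illustration: it uses the ten sample keywords shown earlier rather than the full list, so its totals are smaller than those in the table above, and the ASPECT_VALUES vocabulary and function names are assumptions made for the example. A real catalogue would supply its own aspect values and would need to handle multi-word values such as "air jordan".

```python
from collections import defaultdict

# Hypothetical aspect vocabulary; a real one would come from your catalogue.
ASPECT_VALUES = {
    "Colour": {"white", "black", "red"},
    "Style": {"running", "gym"},
    "Brand": {"nike"},
}

# (keyword, avg. monthly UK searches) from the sample keyword table.
KEYWORDS = [
    ("mens white trainers", 2900),
    ("mens running trainers", 2900),
    ("mens black trainers", 2400),
    ("nike mens trainers", 1600),
    ("white trainers mens", 1600),
    ("mens trainers uk", 2400),
    ("black trainers mens", 1900),
    ("mens gym trainers white", 1300),
    ("all black trainers mens", 1000),
    ("mens red trainers", 880),
]


def matching_aspects(keyword):
    """Return the aspects that have at least one value among the keyword's words."""
    tokens = set(keyword.lower().split())
    return [aspect for aspect, values in ASPECT_VALUES.items() if tokens & values]


def searches_by_aspect(rows):
    """Sum search volume per aspect; a keyword counts towards every aspect it matches."""
    totals = defaultdict(int)
    for keyword, searches in rows:
        for aspect in matching_aspects(keyword) or ["None"]:
            totals[aspect] += searches
    # Highest-volume aspects first.
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


for aspect, searches in searches_by_aspect(KEYWORDS).items():
    print(f"{aspect}: {searches:,}")
```

On the sample rows, Colour comes out on top, followed by Style, which matches the ordering in the full clustered table.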

Good:

Potentially a waste:

Just removing Material and Size dramatically reduces the number of aspect combinations:

150 × 15 × 5 × 200 = 2,250,000

That saves 267,750,000 URLs from needing to be crawled. Not bad!
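The same saving can be checked with a short Python sketch that closes chosen aspects to crawling and recomputes the total. The value counts are those from the earlier aspect table (where Size appears as "Shoe size"); the function name and the closed_aspects parameter are illustrative assumptions.

```python
import math

# Number of values per aspect, from the Men's sports shoes example.
ASPECT_VALUE_COUNTS = {
    "Brand": 150,
    "Shoe size": 30,
    "Colour": 15,
    "Style": 5,
    "Material": 4,
    "Line": 200,
}


def crawlable_combinations(counts, closed_aspects=frozenset()):
    """Multiply the value counts of every aspect that stays open for crawling."""
    return math.prod(n for aspect, n in counts.items() if aspect not in closed_aspects)


everything = crawlable_combinations(ASPECT_VALUE_COUNTS)
reduced = crawlable_combinations(ASPECT_VALUE_COUNTS, {"Material", "Shoe size"})
print(f"{everything:,} -> {reduced:,} (saved {everything - reduced:,})")
```

Passing a different set of closed aspects makes it easy to compare how much crawl budget each choice would free up.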

Because aspects can be common across categories (e.g. size and colour), you cannot simply block an aspect everywhere. To exclude it selectively, append a string that you have disallowed in robots.txt to the URLs in those categories that you do not want crawled, e.g. search=nope, the string used in the rule below.

And then in your robots.txt:

User-agent: *
  Disallow: /*search=nope
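As a rough sketch of how this could be wired up, the Python below appends the marker to a URL and then checks the result against the Disallow pattern. Everything here is an assumption for illustration: the example.com URLs, the function names, and the choice to add the marker as a query parameter. The matcher is a simplified stand-in for the wildcard handling that Google and Bing document (`*` matches any run of characters, a trailing `$` anchors the end) and ignores Allow rules and rule precedence.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DISALLOW_PATTERN = "/*search=nope"  # the rule from the robots.txt above


def add_no_crawl_marker(url, key="search", value="nope"):
    """Append the marker as a query parameter unless it is already present."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    if (key, value) not in query:
        query.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


def is_disallowed(url, pattern=DISALLOW_PATTERN):
    """Simplified wildcard match of the URL's path and query against one pattern."""
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text and turn each '*' into "match anything".
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.match("^" + regex + ("$" if anchored else ""), target) is not None


# Hypothetical URLs: a low-demand aspect gets the marker, a high-demand one does not.
closed = add_no_crawl_marker("https://example.com/mens-trainers?material=leather")
opened = "https://example.com/mens-trainers?colour=white"
print(closed, is_disallowed(closed))
print(opened, is_disallowed(opened))
```

A check like this can run in a test suite to confirm that every URL your templates mark as closed is actually matched by the rule, before relying on the search engines' own robots.txt testing tools.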

Summary

Sites with broad inventories, whether products, jobs, holiday destinations or anything else, should open for crawling only those aspect combinations for which there is real external search demand.



Chris Reynolds is a Bay Area Product Manager with 15 years of international experience in SEO, digital marketing, UX, analytics and team management.
