If you manage a site with millions or even billions of URLs, bear in mind that Google and Bing give every domain a crawl budget: a limit on the number of URLs they are prepared to crawl, determined by the domain’s authority.
If your domain is less authoritative and has billions of URLs, Google may never crawl potentially important sections of your site, losing you traffic.
So one of the biggest SEO challenges for large ecommerce sites is balancing:
- Not missing traffic by excluding product and aspect (aka attribute) combinations that have search demand
vs.
- Not spamming the search engines with many combinations for which there is no demand
An example
Take the Size and Colour aspects on a site that sells both televisions and shoes:
Category | Size | Colour |
---|---|---|
Televisions | High search volume, important filter aspect | Low search volume, unimportant filter |
Shoes | Low search volume, important filter | High search volume, important filter aspect |
Why not open up every combination?
Take one category, e.g. men’s sports shoes, with 6 aspects in your catalogue:
Aspect | Example | Number of aspect values |
---|---|---|
Brand | Nike | 150 |
Shoe size | 10 | 30 |
Colour | Blue | 15 |
Style | Basketball shoes | 5 |
Material | Leather | 4 |
Line | Air Jordan | 200 |
Every combination of every aspect value multiplies up very quickly:
150 * 30 * 15 * 5 * 4 * 200 = 270,000,000
That is, 270 million possible URLs for this category alone!
The solution
Understand which aspects have values which are primarily searched in Google and only open those aspects for crawling.
You’ll need a large and representative keyword set, potentially millions of keywords depending on the scope of your site, but here’s a rough example using a limited keyword set:
Category: Mens Trainers | Matching Aspect 1 | Matching Aspect 2 | Avg. Monthly UK Google Searches |
---|---|---|---|
mens white trainers | Colour | | 2900 |
mens running trainers | Style | | 2900 |
mens black trainers | Colour | | 2400 |
nike mens trainers | Brand | | 1600 |
white trainers mens | Colour | | 1600 |
mens trainers uk | None | | 2400 |
black trainers mens | Colour | | 1900 |
mens gym trainers white | Style | Colour | 1300 |
all black trainers mens | Colour | | 1000 |
mens red trainers | Colour | | 880 |
Full list here. Note that the data has been tweaked to better illustrate the concept. Also, for American readers, trainers = sneakers 🙂
The search volume for the keywords can be clustered into the category’s aspects e.g.
Aspect | Searches Containing a value for this aspect |
---|---|
Colour | 19,920 |
Style | 13,420 |
None | 6,000 |
Brand | 3,750 |
Material | 3,150 |
Size | 2,720 |
From this we can see that Colour and Style are important to open up to crawl, Material and Size less so.
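The clustering step can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used to produce the table above: the `ASPECT_VALUES` vocabularies and the single-word matching rule are assumptions, and the keyword list is just the ten sample rows, so the totals will not match the full-list figures.

```python
from collections import defaultdict

# Hypothetical aspect vocabularies; a real catalogue would supply these.
ASPECT_VALUES = {
    "Colour": {"white", "black", "red"},
    "Style": {"running", "gym"},
    "Brand": {"nike"},
}

# (keyword, average monthly searches) from the ten sample rows only.
KEYWORDS = [
    ("mens white trainers", 2900),
    ("mens running trainers", 2900),
    ("mens black trainers", 2400),
    ("nike mens trainers", 1600),
    ("white trainers mens", 1600),
    ("mens trainers uk", 2400),
    ("black trainers mens", 1900),
    ("mens gym trainers white", 1300),
    ("all black trainers mens", 1000),
    ("mens red trainers", 880),
]


def matching_aspects(keyword):
    """Return the aspects whose vocabulary shares a word with the keyword."""
    tokens = set(keyword.lower().split())
    return [aspect for aspect, values in ASPECT_VALUES.items() if tokens & values]


def cluster_by_aspect(keywords):
    """Sum search volume per aspect; a keyword counts once for each aspect it matches."""
    totals = defaultdict(int)
    for keyword, searches in keywords:
        for aspect in matching_aspects(keyword) or ["None"]:
            totals[aspect] += searches
    return dict(totals)


totals = cluster_by_aspect(KEYWORDS)
for aspect, searches in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{aspect}: {searches:,}")
```

On these ten rows it attributes 11,980 searches to Colour and 4,200 to Style. Matching on single words is the simplest thing that works here; multi-word values such as a product line name would need phrase matching instead.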
Good:
- https://site.com/trainers/blue
- https://site.com/trainers/running
- https://site.com/trainers/blue-running
Potentially a waste:
Just removing Material and Size dramatically reduces the number of aspect combinations:
150 × 15 × 5 × 200 = 2,250,000
That saves 267,750,000 URLs from needing to be crawled. Not bad!
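To rerun this arithmetic for other categories, a small helper is enough. The sketch below uses the aspect value counts from the men's sports shoes table; the function name `combination_count` is made up for illustration, and it mirrors the simple multiplication used here (every aspect contributes one value to each URL).

```python
from math import prod

# Number of values per aspect, from the men's sports shoes table.
ASPECT_VALUE_COUNTS = {
    "Brand": 150,
    "Shoe size": 30,
    "Colour": 15,
    "Style": 5,
    "Material": 4,
    "Line": 200,
}


def combination_count(counts, excluded=()):
    """Multiply the value counts of every aspect that is not excluded."""
    return prod(n for aspect, n in counts.items() if aspect not in excluded)


all_urls = combination_count(ASPECT_VALUE_COUNTS)
open_urls = combination_count(ASPECT_VALUE_COUNTS, excluded={"Material", "Shoe size"})

print(f"All combinations:  {all_urls:,}")
print(f"Open for crawling: {open_urls:,}")
print(f"URLs saved:        {all_urls - open_urls:,}")
```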
As aspects can be common across categories (e.g. size and colour), you need to exclude them selectively, category by category. To do that, append a string that you have disallowed in robots.txt to the URLs of the combinations you choose not to have crawled, e.g.
- https://site.com/trainers/leather/size-10?search=nope
- https://site.com/televisions/colour-black?search=nope
And then in your robots.txt:
User-agent: *
Disallow: /*search=nope
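Before shipping a rule like this, it is worth checking that it blocks the URLs you expect and nothing else. The sketch below is a hand-rolled matcher, not a full robots.txt parser: it assumes the wildcard behaviour Google and Bing document (`*` matches any run of characters, a trailing `$` anchors the end of the URL), and the function names are invented for this example.

```python
import re
from urllib.parse import urlsplit


def rule_to_regex(rule):
    """Translate a wildcard Disallow pattern into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_disallowed(url, disallow_rules):
    """Check a URL's path and query string against a list of Disallow patterns."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # re.match anchors at the start, the way robots.txt rules match from the path root.
    return any(rule_to_regex(rule).match(target) for rule in disallow_rules)


RULES = ["/*search=nope"]

for url in [
    "https://site.com/trainers/leather/size-10?search=nope",
    "https://site.com/televisions/colour-black?search=nope",
    "https://site.com/trainers/blue",
]:
    print(url, "->", "blocked" if is_disallowed(url, RULES) else "crawlable")
```

The first two URLs come out blocked and the third stays crawlable. Allow rules and rule precedence are left out on purpose, since the file above contains a single Disallow line.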
Summary
Sites with broad inventories, whether products, jobs, holiday destinations or anything else, should take care to open for crawling only those aspect combinations where there is real external demand.
Also important, sign up to my totally unrelated side project, Mustard Threads 🙂