In September, I set up a poll here on Search Engine Land to see if readers would like to have an instruction in robots.txt to mark pages for no indexing. Today I’m going to present the results along with a review of the top issues (and why Google won’t add support for this).
Why would this be interesting?
In today’s environment, robots.txt is used exclusively to guide web crawling behavior. The current approach to tagging a page “NoIndex” is to place a meta robots noindex tag on the page itself. Unfortunately, if you block the page in robots.txt, Google will never see that tag and could still index the page even if you don’t want that to happen.
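A minimal sketch of that conflict, using Python’s standard-library robots.txt parser (the domain and paths are hypothetical). The parser models what a compliant crawler does: it decides only whether a URL may be fetched, and has no concept of indexing at all.

```python
# Sketch: robots.txt governs crawling, not indexing.
# example.com and /private/ are hypothetical.
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Crawling the blocked page is forbidden, so a crawler can never read a
# meta robots noindex tag placed on that page...
blocked = rp.can_fetch("Googlebot", "https://example.com/private/page.html")
print(blocked)  # False

# ...yet nothing in robots.txt tells the engine to keep the URL out of
# its index; it can still be indexed (without content) via inbound links.
allowed = rp.can_fetch("Googlebot", "https://example.com/mens-shoes.html")
print(allowed)  # True
```

The point is simply that “don’t crawl” and “don’t index” are independent signals, and robots.txt only carries the first one.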
On large sites, this presents challenges when you have classes of pages that you want to both prevent from being crawled AND exclude from the Google index. This can happen, for example, in complex faceted navigation implementations where you create pages that are of significant value to users but end up presenting too many pages to Google. For example, I went to a shoe retailer’s website and found that it had over 70,000 different pages related to “Nike shoes for men,” covering a wide variety of sizes, widths, colors and more.
In some tests I have participated in on sites with complex faceted navigation like the example above, we have found that this large volume of pages is a significant issue. In one of these tests, we worked with a client to implement most of their faceted navigation in AJAX, so that most of their faceted navigation pages were invisible to Google but still easily accessible to users. The number of pages on the site fell from 200 million to 200,000 – a thousand-to-one reduction. Over the next year, traffic to the site tripled – an incredibly good result. However, traffic initially went down, and it took about four months to get back to previous levels before it spiked from there.
In another scenario, I saw a site implement a new ecommerce platform, after which its page count increased from around 5,000 pages to over 1 million. Its traffic dropped, and we were brought in to help it recover. The fix? Bringing the number of indexable pages back to where it was before. Unfortunately, as is the case with tools like noindex and canonical tags, the speed of recovery was largely determined by the time it took for Google to revisit a significant number of pages on the site.
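As an aside, when editing page templates is impractical, a noindex can also be delivered via the X-Robots-Tag HTTP response header, which Google does support. A minimal Apache sketch (the URL pattern is hypothetical) might look like this:

```
# Apache (mod_headers) sketch – the path pattern is hypothetical.
# Send a noindex header for every URL under an over-indexed
# faceted-navigation directory, without touching page HTML.
<LocationMatch "^/nike-shoes/filters/">
    Header set X-Robots-Tag "noindex"
</LocationMatch>
```

Note that this doesn’t solve the crawl-budget problem described above: Google must still fetch each URL to see the header, which is exactly the gap a robots.txt-level directive would have filled.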
In both cases, the results for the companies involved were determined by Google’s crawl budget and the time it took Google to crawl enough pages to fully understand the new structure of the site. Having such an instruction in robots.txt would dramatically speed up these types of processes.
What are the downsides of this idea?
I had the opportunity to discuss it with Patrick Stox, Product Advisor and Brand Ambassador at Ahrefs, and his quick take was: “I don’t think that’s going to happen, at least in robots.txt – maybe in another system like GSC. Google has made it clear that they want robots.txt just for crawling control. The biggest downside will probably be all the people who accidentally remove their entire site from the index.”
And of course, this problem of getting the whole site (or key parts of a site) removed from the index is the big deal. Across the web, we don’t have to wonder whether or not this will happen – it will. Unfortunately, it will probably happen with some important sites, and unfortunately, it will probably happen a lot.
In my 20 years of SEO experience, I have found that misconceptions about how to use various SEO tags are rampant. For example, back when Google Authorship was a thing and we had rel=author tags, I did a study of how sites implemented them and found that 72% of sites used the tags incorrectly. This included some really well-known sites in our industry!
In my discussion with Stox, he added, “Thinking of other downsides, they have to figure out how to deal with it when a robots.txt file is temporarily unavailable. Do they suddenly start indexing pages that were marked noindex before?”
I also reached out to Google for comment and was pointed to the blog post from 2019 in which the company announced it would stop supporting noindex in robots.txt. Here is what the post said about it:
“While open-sourcing our parser library, we analyzed the usage of robots.txt rules. In particular, we focused on rules unsupported by the internet draft, such as crawl-delay, nofollow, and noindex. Since these rules were never documented by Google, naturally, their usage in relation to Googlebot is very low. Digging further, we saw their usage was contradicted by other rules in all but 0.001% of all robots.txt files on the internet. These mistakes hurt websites’ presence in Google’s search results in ways we don’t think webmasters intended.”
* The emphasis on the last sentence is mine.
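For context, before Google dropped support for these unsupported rules in September 2019, some sites used an undocumented Noindex line directly in robots.txt, which Googlebot honored inconsistently. It looked something like this (the path is hypothetical):

```
User-agent: Googlebot
Disallow: /old-archive/
Noindex: /old-archive/
```

This is exactly the kind of never-documented rule the quoted post is referring to – and the contradictions Google found (for example, a Noindex line clashing with other rules in the same file) are part of why it was retired rather than formalized.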
I think that’s the deciding factor here. Google acts to protect the quality of its index, and what may seem like a good idea can have many unintended consequences. Personally, I would love the ability to mark pages as NoCrawl and NoIndex in a clear and easy way, but the truth is I don’t think that’s going to happen.
Overall results of the robots.txt survey
First of all, I would like to acknowledge a flaw in the survey: question 2 was a mandatory question that assumed you had answered “yes” to question 1. Fortunately, most of the people who answered “no” to question 1 clicked “other” for question 2 and then indicated why they didn’t want this feature. One of those responses noted the flaw, saying, “Your survey is misleading.” My apologies for the error there.
The overall results were as follows:
In total, 84% of the 87 respondents said “yes” they would like this feature. Some of the reasons given for wanting this feature were:
- There is no situation where I want to block crawling but have indexed pages.
- Noindexing a large number of pages takes a long time because Google has to crawl each page to see the noindex. When we had the noindex directive, we could get faster results for clients with over-indexing issues.
- We have a really big problem with very old content… hundreds of old directories and subdirectories, and it apparently takes months, even years, to deindex them once we delete them (and they therefore 404). It looks like we could just add a NoIndex rule to the robots.txt file and trust that Google would honor that instruction much faster than having to crawl all the old URLs over time… and over and over… to find repeated 404s before finally removing them… so this is one way to help clean up our domains.
- Save development effort and easily adjustable if anything breaks due to changes
- Can’t always use a “noindex,” and there are too many indexed pages that shouldn’t be indexed. Standard blocking for spiders should at least also “noindex” pages. If I want a search engine not to crawl a URL/folder, why would I want it to index those “empty” pages?
- Adding new instructions to a .txt file is much faster than getting development resources
- It is difficult to change the meta tags in the head on an enterprise CMS, so a noindex feature in robots.txt would fix this problem.
- Faster and less problematic blocking of site indexing 🙂
Some of the reasons given for saying no:
- The Noindex tag is quite good
- New directives in robots.txt file are not needed
- I don’t need it and I can’t see it working
- Do not bother
- Do not change
There you have it. Most of the people who responded to this survey are in favor of adding this feature. However, keep in mind that SEL’s readership is a very sophisticated audience – with much more understanding and expertise than the average webmaster. In addition, even among the positive responses, there were answers to question 4 (“Would this feature be useful to you as an SEO? If so, how?”) that indicated a misunderstanding of how the current system works.
Ultimately, while I personally would love to have this feature, it’s highly unlikely to happen.
The opinions expressed in this article are those of the guest author and not necessarily Search Engine Land.