Google Search Console can report "Indexed, though blocked by robots.txt" for PrestaShop search results pages when it discovers such pages by means other than crawling them. A properly created robots.txt disallows crawling of search pages, but they can still be indexed if links to them are found elsewhere, which happens easily. For example, tags used on the front end can appear as search links, and those links get indexed.
Proposed fix:
controllers/front/listing/SearchController.php should always include <meta name="robots" content="noindex"> in the page output in case a search engine finds a link to a search results page.
1.7.5.0 diff output below:
diff --git a/controllers/front/listing/SearchController.php b/controllers/front/listing/SearchController.php
index 400abc4d60..a31e9710d0 100644
--- a/controllers/front/listing/SearchController.php
+++ b/controllers/front/listing/SearchController.php
@@ -60,6 +60,15 @@ class SearchControllerCore extends ProductListingFrontController
         );
     }

+    public function getTemplateVarPage()
+    {
+        $page = parent::getTemplateVarPage();
+
+        $page['meta']['robots'] = 'noindex';
+
+        return $page;
+    }
+
     /**
      * Performs the search.
      */
Steps to reproduce the behavior:
Google Search Console shows "Indexed, though blocked by robots.txt" warnings under Coverage for every encountered search link.
PrestaShop version: 1.7.5.0
PHP version: 7.0
Hi @watou,
Thank you for your report.
We'll first try to reproduce it and we'll come back to you if we need more information
Thanks!
Hi @watou,
In the BO => Shop Parameters => Search => Tags page, I added the "headache" tag to some products, and I also added a CMS page with this tag.
In the FO => the link to the search results in my shop => OK

Now I need to wait while Google Search Console processes the site.

Thanks!
There are two problems I can see from here regarding your test setup:
There is no <a href="http://testip6.tk/1751/en/search?tag=headache"> or equivalent in any menu or other crawlable page, so unless I missed it, the link won't be reached by Googlebot and thus won't create the issue described here. Maybe you could put the link http://testip6.tk/1751/en/search?tag=headache directly in the footer so that it's found for certain?
I looked at the headers and source of http://testip6.tk/1751/en/search?tag=headache, and the page does not contain instructions not to be indexed, which will cause the issue once the above two problems are corrected. Thank you!
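The check on headers and source described above can be sketched roughly as follows. This is only an illustration in Python, not part of PrestaShop: the names `RobotsMetaParser` and `has_noindex` are hypothetical, and the sample pages are made up. It looks for a `noindex` directive in either an `X-Robots-Tag` response header or a `<meta name="robots">` tag.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the content of every <meta name="robots"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name.lower(): (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            self.directives.append(attributes.get("content", "").lower())


def has_noindex(headers, html):
    """Return True if the response asks crawlers not to index the page."""
    # Header form: X-Robots-Tag: noindex
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    # Markup form: <meta name="robots" content="noindex">
    parser = RobotsMetaParser()
    parser.feed(html)
    return any("noindex" in directive for directive in parser.directives)


# Made-up sample pages: one without any indexing instruction, one with it.
page_without = "<html><head><title>Search</title></head><body></body></html>"
page_with = '<html><head><meta name="robots" content="noindex"></head></html>'

print(has_noindex({}, page_without))  # False
print(has_noindex({}, page_with))     # True
```

A search results page in the state described in this report would behave like `page_without`: neither form of the instruction is present.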
@watou, thanks for your feedback.
Thanks!
Hi @khouloudbelguith, thanks for that. There still needs to be a link to http://testip6.tk/1751/en/search?tag=headache somewhere within the crawlable site. Tagging the CMS page itself does not create a link anywhere. Let's say you wanted to let visitors see all products tagged "headache": you would have to provide them the link http://testip6.tk/1751/en/search?tag=headache somewhere in your site. There are "tag cloud" modules for PrestaShop that do this all over sites, so visitors can see various arbitrary product groupings. The problem is the conflict between robots.txt disallowing the crawling of search results pages and the very real possibility that links to search results pages exist in the site.
@watou, I added the tag "headache" to some categories

and to some pages as well.

PS: I'm still waiting for now.

Thanks!
Hi @khouloudbelguith, your test setup is still incorrect, and as such it will never show the issue I've described. Please tell me which page in your test site contains a clickable link to http://testip6.tk/1751/en/search?tag=headache. If the answer is that your site does not contain such a link on any page, please change your test site so that it does, in order for your test to be valid.
You can also use the robots.txt testing tool https://www.google.com/webmasters/tools/robots-testing-tool?pli=1 to see that the search results URL is blocked from crawling by line 132 of robots.txt, but Google would still index the page because the link is in your site (because you want it there).
The answer is not to remove line 132 from robots.txt, but instead to not index search results pages.
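For anyone who prefers a local check over the testing tool, here is a minimal sketch using Python's standard `urllib.robotparser`. The rule below is a simplified, hypothetical stand-in and not the actual line 132: the real PrestaShop robots.txt is longer and uses wildcard patterns that this parser does not expand. The `example.com` URLs and the `crawl_allowed` name are likewise assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical, simplified rule standing in for the real robots.txt,
# which uses wildcard patterns this standard-library parser does not expand.
ROBOTS_TXT = """\
User-agent: *
Disallow: /en/search
"""


def crawl_allowed(url, user_agent="Googlebot"):
    """Return True if the rules above let the given user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


search_url = "http://example.com/en/search?tag=headache"
category_url = "http://example.com/en/3-clothes"

print(crawl_allowed(search_url))    # False: search results are blocked from crawling
print(crawl_allowed(category_url))  # True: ordinary pages may be crawled
```

A `False` result only says that the crawler may not fetch the page. It says nothing about whether the URL can end up in the index when a link to it is found elsewhere, which is the situation this issue describes.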
Hi @watou,
I added the module Tags block v1.3.1.
This module adds a block containing your product tags.
In fact, this module redirects to the link shop.com/search?tag=headache
But this link only contains products, not pages & categories.
I attached a video recording.
https://drive.google.com/file/d/1g_Srt_mOyVOLtGQztVTIbYEPkaBt--DP/view
Please check and give feedback.
Hi @khouloudbelguith, if you were to add that Tags block module to the public site http://testip6.tk, it would put the links to search pages on the crawlable site as your video demonstrated, and Google would eventually complain (as this issue reports) that search results pages are blocked from crawling by robots.txt but the search results pages were indexed anyway. Thanks for making that video.
With my proposed change to suppress the indexing of search results pages, Google will not index them and thus will not complain. This is preferable, in my opinion, to 1) allowing Google to crawl any and all search results pages by removing the Disallow from robots.txt, or 2) having Google complain as in the current situation.
@watou, thanks for these clarifications.
I will check and get back to you.
@watou, Done.
I tried with another shop, I added the module, you can follow this link: http://presta1and1.com/prestashop_1.7.5.1/index.php?controller=search&tag=headache&page=2
The block of this module is added in this page: http://presta1and1.com/prestashop_1.7.5.1/index.php?id_category=2&controller=category
http://presta1and1.com/robots.txt => OK.
http://presta1and1.com/prestashop_1.7.5.1/robots.txt => OK.
Please check and give feedback.
I think you now have the correct circumstances for the test. You can verify this yourself with the robots.txt testing tool against the presta1and1.com property to make sure the URL http://presta1and1.com/prestashop_1.7.5.1/index.php?controller=search&tag=headache is blocked; I believe it is blocked by line 54 in /robots.txt.
After Google indexes the page, the issue described here should be reported in Google Search Console.
@watou, thanks for your feedback.
Yes, this link is blocked.
Now, we need to wait to get the exact error reported in the Google Search Console.
In the meantime, would you be willing to make a pull request on GitHub with your code suggestion?
https://github.com/PrestaShop/PrestaShop/tree/develop
Thanks a lot!
Your robots.txt file does not block itself, which is correct. Your robots.txt file should block all search results pages from being crawled. If it doesn't, then there would be a problem with your configuration, but I don't think you have that problem. The problem in this report is that search pages can be indexed, when they should instead carry an instruction that suppresses indexing.
@watou, thanks!
So, the problem is reproduced in this screenshot

Is it?
Thanks!
> So, the problem is reproduced in this screenshot
No, the problem is not reproduced. The image above shows that search pages are blocked from crawling, and this is correct. What is incorrect is that search pages can still be indexed when links to them are encountered, because the search pages carry no instruction not to index them. Google quite rightly complains about this (the subject of this issue). The proposed fix in #12817 would change that.
Please study this issue more carefully in order to fully understand what is being reported, and what is being proposed to fix it. I remain happy to answer any new questions.
@watou, thanks a lot for your feedback.
I attached a screenshot

Please check and give feedback.
Your screenshot exactly shows this issue. Now if my PR #12817 had been applied, the page would not have been indexed. Would PrestaShop and SEO experts agree that this is a correct approach?
@watou, thanks a lot for your help.
Your PR should be tested by our QA team.
Thanks!