We've talked about blowing away the protected designation, so I'm not sure it makes sense to special-case the protected privacy level, but maybe a separate option for docs that shouldn't be crawled?
@agjohnson any momentum on this particular item? What is the current recommendation to NOINDEX/NOFOLLOW a site?
At the very least, we could kill our global robots.txt redirect in nginx and allow projects to contribute their own robots.txt via a static page in Sphinx.
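For illustration, a project could ship a static robots.txt from Sphinx with something like the sketch below (just an assumption on my part, not an agreed approach; `html_extra_path` copies the listed files into the HTML output root, which on Read the Docs is the version's subdirectory rather than the domain root):

```python
# conf.py -- minimal sketch: copy a robots.txt that lives next to conf.py
# into the root of the built HTML output.
html_extra_path = ["robots.txt"]
```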
@agjohnson what's the status of this issue?
I'm not sure I clearly understand what action is needed here.
Regarding robots.txt, I think the solution that I proposed at https://github.com/rtfd/readthedocs.org/issues/2430#issuecomment-418471125 should work (there is also an example repository in that conversation) and we can close this issue. If none of those are what you have in mind, please elaborate a little more on what you are considering here.
@humitos the solution provided in #2430 (comment) is not optimal:
- Your site can have only one robots.txt file.
- The robots.txt file must be located at the root of the website host that it applies to. For instance, to control crawling on all URLs below http://www.example.com/, the robots.txt file must be located at http://www.example.com/robots.txt. It cannot be placed in a subdirectory (for example, at http://example.com/pages/robots.txt). If you're unsure about how to access your website root, or need permissions to do so, contact your web hosting service provider. If you can't access your website root, use an alternative blocking method such as meta tags.
I think the only viable option is using the "meta tags" method [1][2]. I am working on a workaround for Astropy's docs (refer to issue #7794 and pull request #7874).
I'll be done by the end of the day and will let you know. If it's a good workaround, I'd be happy to document the process.
@dasdachs I see. You are right.
> I'll be done by the end of the day and will let you know. If it's a good workaround, I'd be happy to document the process.
If the workaround using meta tags is a good one, maybe it would be a good solution to implement it as a Sphinx extension. It's still a hack, but at least "an automatic one" :grimacing:
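For example, the "automatic hack" could look roughly like this conf.py snippet (a sketch, not a tested extension; it assumes the active theme extends Sphinx's basic layout, which renders the `metatags` context variable):

```python
# conf.py -- sketch: add a noindex/nofollow meta tag to every rendered page
# by hooking Sphinx's html-page-context event.

def _add_robots_meta(app, pagename, templatename, context, doctree):
    # Append our tag to whatever meta tags Sphinx already generated for the page.
    context["metatags"] = context.get("metatags", "") + (
        '\n<meta name="robots" content="noindex, nofollow">'
    )

def setup(app):
    app.connect("html-page-context", _add_robots_meta)
```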
After reading the docs you linked, I don't see a solution coming from Sphinx or without a hack, so I think we should implement this from Read the Docs itself by adding a `robotstxt_file:` option in our YAML (or similar) and copying it to the root of the subdomain. Not sure if that's possible, though.
> I think we should implement this from Read the Docs itself by adding a `robotstxt_file:` option in our YAML
This is not trivial.
With that file, we would need to serve the robots.txt from the project/version and, as a fallback, serve ours. This raises another problem: we have one subdomain with multiple versions but only _one_ root place to serve the robots.txt file. Which one should we serve?
Being a "global setting" makes me doubt if it isn't better to add a text box in the admin where the user can paste the contents of that file or think something easier like that.
> I think we should implement this from Read the Docs itself by adding a `robotstxt_file:` option in our YAML
I doubt this will go in the YAML, as this is a per-project configuration rather than a per-version one.
Unfortunately, adding meta tags isn't an ideal solution, because we can't add them to all the old versions we host. In the case of astropy, for example, we host a lot of old versions built from GitHub tags, e.g.:
http://docs.astropy.org/en/v1.0/
We can't change all the tags in our GitHub repo for all the old versions, so any solution that involves changes to the repository is a no-go. The only real solution would be to be able to customize robots.txt from the RTD settings interface.
@dasdachs @astrofrog we just merged a PR that will allow you to use a custom robots.txt. It will be deployed soon. Here are the docs: https://docs.readthedocs.io/en/latest/faq.html#how-can-i-avoid-search-results-having-a-deprecated-version-of-my-docs
Please, after the deploy, follow the docs and let us know if it works as you expected.
@humitos This is amazing. Thanks for the great work!
What is the best way to add a custom robots.txt file and sitemap.xml file to a readthedocs.com external domain?
@AmmaraAnis Hi! For robots.txt you can read this FAQ at https://docs.readthedocs.io/en/latest/faq.html#how-can-i-avoid-search-results-having-a-deprecated-version-of-my-docs
Regarding sitemap.xml, there is no way to modify the default one served at the root yet (see #6938), although you can change the Sitemap: entry in your robots.txt to point to a custom one and that _may_ work.
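For illustration, such a robots.txt might look like the following (a hypothetical example: the version paths and the sitemap URL are placeholders, and I haven't verified how every crawler treats a Sitemap: entry pointing outside the root):

```
# Hypothetical example -- adjust the paths and URLs to your project.
User-agent: *
Disallow: /en/v1.0/
Disallow: /en/v1.1/

Sitemap: https://docs.example.com/en/latest/custom-sitemap.xml
```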