Readthedocs.org: Custom robots.txt support?

Created on 13 Oct 2017 · 13 comments · Source: readthedocs/readthedocs.org

We've talked about blowing away the protected designation, so I'm not sure it makes sense to special-case the protected privacy level, but maybe a separate option for docs that shouldn't be crawled?

Accepted · Feature · Design decision

Most helpful comment

@dasdachs @astrofrog we just merged a PR that will allow you to use a custom robots.txt. It will be deployed soon. Here are the docs: https://docs.readthedocs.io/en/latest/faq.html#how-can-i-avoid-search-results-having-a-deprecated-version-of-my-docs

Please, after the deploy, follow the docs and let us know if it works as you expect.

All 13 comments

@agjohnson any momentum on this particular item? What is the current recommendation to NOINDEX/NOFOLLOW a site?

At the very least, we could kill our global robots.txt redirect in nginx and allow projects to contribute their own robots.txt via a static file in Sphinx
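A minimal sketch of that Sphinx route, assuming the project keeps its robots.txt in an `_extra/` directory next to conf.py (the directory name is just an example):

```python
# conf.py -- hypothetical sketch: copy everything in _extra/ (e.g. _extra/robots.txt)
# verbatim into the root of the built HTML output.
html_extra_path = ["_extra"]
```

Note that this only places the file at the root of that version's build output; whether it ends up at the root of the served subdomain is the open question discussed below.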

@agjohnson what's the status of this issue?

I'm not sure I clearly understand what action is needed here.

  1. if it's about the Protected privacy level, I think we can close it as won't fix, since we are removing privacy levels from the Community site.
  2. if it's about giving our users a way to upload their own robots.txt, I think the solution I proposed at https://github.com/rtfd/readthedocs.org/issues/2430#issuecomment-418471125 should work (there is also an example repository in that conversation) and we can close this issue.

If none of those are what you have in mind, please elaborate a little more on what you are considering here.

@humitos the solution provided in #2430 (comment) is not optimal:

  1. Your site can have only one robots.txt file.
  2. The robots.txt file must be located at the root of the website host that it applies to. For instance, to control crawling on all URLs below http://www.example.com/, the robots.txt file must be located at http://www.example.com/robots.txt. It cannot be placed in a subdirectory (for example, at http://example.com/pages/robots.txt). If you're unsure about how to access your website root, or need permissions to do so, contact your web hosting service provider. If you can't access your website root, use an alternative blocking method such as meta tags.

Google support

I think the only viable option is using the "meta tags" method [1][2]. I am working on a workaround for Astropy's docs (refer to issue #7794 and pull request #7874).

I'll be done by the end of the day and will let you know. If it's a good workaround, I'd be happy to document the process.

@dasdachs I see. You are right.

I'll be done by the end of the day and will let you know. If it's a good workaround, I'd be happy to document the process.

If the meta tags workaround is a good one, maybe it would be a good solution to implement as a Sphinx extension. It's still a hack, but at least "an automatic one" :grimacing:
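As a rough sketch of what such an extension could do (the event-handler approach and the tag contents are assumptions here, not the actual Astropy workaround):

```python
# Hypothetical sketch of a Sphinx extension (or conf.py snippet) that asks
# crawlers not to index the built pages, by appending a robots meta tag to
# every generated HTML page.

def add_robots_meta(app, pagename, templatename, context, doctree):
    # "metatags" is the string of <meta> tags the HTML builder injects into
    # each page's <head>.
    context["metatags"] = context.get("metatags", "") + (
        '\n<meta name="robots" content="noindex, nofollow">'
    )

def setup(app):
    app.connect("html-page-context", add_robots_meta)
    return {"parallel_read_safe": True}
```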

After reading the docs you linked, I don't see a solution coming from Sphinx or without a hack, so I think we should implement this from Read the Docs itself by adding a robotstxt_file: option in our YAML (or similar) and copying it to the root of the subdomain. Not sure if that's possible, though.

I think we should implement this from Read the Docs itself by adding a robotstxt_file: option in our YAML

This is not trivial.

With that file, we will need to do:

  1. append our own set of rules to the custom robots.txt (see the sketch after this list)
  2. sync the result to all our web servers

    • since this file will be _outside_ Sphinx output, we need to adapt that code

  3. modify the nginx rule to try serving first the custom robots.txt from the project/version and as a fallback serve ours
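
For step 1, the merging itself would be straightforward; a minimal sketch (the function and the appended rules are hypothetical, just to illustrate the idea):

```python
# Hypothetical sketch of step 1: append Read the Docs' own rules to a
# user-supplied robots.txt before syncing the result to the web servers.
RTD_RULES = """
# Rules appended by Read the Docs (illustrative content only)
User-agent: *
Disallow: /404.html
"""

def merge_robots_txt(user_robots_path, merged_path):
    with open(user_robots_path) as fh:
        user_rules = fh.read().rstrip("\n")
    with open(merged_path, "w") as fh:
        fh.write(user_rules + "\n" + RTD_RULES)
```

Steps 2 and 3 are the harder part, since they touch the sync and nginx layers.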

This raises another problem: we have one subdomain with multiple versions but only _one_ root location to serve the robots.txt file from. Which one should we serve?

Since this would be a "global setting", I wonder whether it would be better to add a text box in the admin where the user can paste the contents of that file, or something simpler along those lines.

I think we should implement this from Read the Docs itself by adding a robotstxt_file: option in our YAML

I doubt this will be in the YAML, as this is per-project configuration rather than per-version

The hack I found could be quite simple (this): add meta tags to files you don't want indexed.
But because of the global robots.txt, it would have no effect (referring to this answer from Google). Some solution using YAML or a text box seems like the way to go.

Unfortunately, the idea of adding meta tags isn't really an ideal solution, because we can't add them to all the old versions we host. In the case of astropy, for example, we host a lot of old versions based on GitHub tags, e.g.:

http://docs.astropy.org/en/v1.0/

We can't change all the tags in our GitHub repo for all the old versions, so any solution that involves changes to the repository is a no-go. The only real solution would be to be able to customize robots.txt from the RTD settings interface.

@dasdachs @astrofrog we just merged a PR that will allow you to use a custom robots.txt. It will be deployed soon. Here are the docs: https://docs.readthedocs.io/en/latest/faq.html#how-can-i-avoid-search-results-having-a-deprecated-version-of-my-docs

Please, after the deploy, follow the docs and let us know if it works as you expect.

@humitos This is amazing. Thanks for the great work!

What is the best way to add a custom robots.txt file and sitemap.xml file to a readthedocs.com external domain?

@AmmaraAnis Hi! For robots.txt you can read this FAQ at https://docs.readthedocs.io/en/latest/faq.html#how-can-i-avoid-search-results-having-a-deprecated-version-of-my-docs

Regarding sitemap.xml, there is no way to modify the default one served at the root yet (see #6938), although you can change the Sitemap: entry in your robots.txt to point to a custom one, and that _may_ work.
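Purely as an illustration of that suggestion (the version path and sitemap URL are made up), a custom robots.txt could look something like:

```
User-agent: *
# Keep crawlers out of a deprecated version
Disallow: /en/v1.0/

# Point crawlers at a custom sitemap instead of the default one
Sitemap: https://docs.example.com/en/latest/custom-sitemap.xml
```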
