Readthedocs.org: Custom robots.txt support?

Created on 13 Oct 2017 · 13 comments · Source: readthedocs/readthedocs.org

We've talked about blowing away the protected designation, so I'm not sure it makes sense to special-case the protected privacy level, but maybe a separate option for docs that shouldn't be crawled?

Accepted · Feature · Design decision

Most helpful comment

@dasdachs @astrofrog we just merged a PR that will allow you to use a custom robots.txt. It will be deployed soon. Here are the docs: https://docs.readthedocs.io/en/latest/faq.html#how-can-i-avoid-search-results-having-a-deprecated-version-of-my-docs

Please, after the deploy, follow the docs and let us know if it works as you expect.

All 13 comments

@agjohnson any momentum on this particular item? What is the current recommendation to NOINDEX/NOFOLLOW a site?

At the very least, we could kill our global robots.txt redirect in nginx and allow projects to contribute their own robots.txt via a static file in Sphinx
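A minimal sketch of that Sphinx route, assuming the project keeps its robots.txt in an `_extra/` directory next to conf.py (the directory name is just an example):

```python
# conf.py -- hypothetical sketch: copy everything in _extra/ (e.g. _extra/robots.txt)
# verbatim into the root of the built HTML output.
html_extra_path = ["_extra"]
```

Note that this only places the file at the root of that version's build output; whether it ends up at the root of the served subdomain is the open question discussed below.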

@agjohnson what's the status of this issue?

I'm not sure I clearly understand what action is needed here.

  1. if it's about the Protected privacy level, I think we can close it as won't fix, since we are removing privacy levels from the Community site.
  2. if it's about giving our users a way to upload their own robots.txt, I think the solution I proposed at https://github.com/rtfd/readthedocs.org/issues/2430#issuecomment-418471125 should work (there is also an example repository in that conversation) and we can close this issue.

If none of those are what you have in mind, please elaborate a little more on what you are considering here.

@humitos the solution provided in #2430 (comment) is not optimal:

  1. Your site can have only one robots.txt file.
  2. The robots.txt file must be located at the root of the website host that it applies to. For instance, to control crawling on all URLs below http://www.example.com/, the robots.txt file must be located at http://www.example.com/robots.txt. It cannot be placed in a subdirectory (for example, at http://example.com/pages/robots.txt). If you're unsure about how to access your website root, or need permissions to do so, contact your web hosting service provider. If you can't access your website root, use an alternative blocking method such as meta tags.

Google support

I think the only viable option is using the "meta tags" method [1][2]. I am working on a workaround for Astropy's docs (refer to issue #7794 and pull request #7874).

I'll be done by the end of the day and will let you know. If it's a good workaround, I'd be happy to document the process.

@dasdachs I see. You are right.

I'll be done by the end of the day and will let you know. If it's a good workaround, I'd be happy to document the process.

If the meta tags workaround is a good one, maybe it would be a good solution to implement as a Sphinx extension. It's still a hack, but at least "an automatic one" :grimacing:
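As a rough sketch of what such an extension could do (the event-handler approach and the tag contents are assumptions here, not the actual Astropy workaround):

```python
# Hypothetical sketch of a Sphinx extension (or conf.py snippet) that asks
# crawlers not to index the built pages, by appending a robots meta tag to
# every generated HTML page.

def add_robots_meta(app, pagename, templatename, context, doctree):
    # "metatags" is the string of <meta> tags the HTML builder injects into
    # each page's <head>.
    context["metatags"] = context.get("metatags", "") + (
        '\n<meta name="robots" content="noindex, nofollow">'
    )

def setup(app):
    app.connect("html-page-context", add_robots_meta)
    return {"parallel_read_safe": True}
```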

After reading the docs you linked, I don't see a solution coming from Sphinx or without a hack, so I think we should implement this from Read the Docs itself by adding a robotstxt_file: option in our YAML (or similar) and copying it to the root of the subdomain. Not sure if that's possible, though.

I think we should implement this from Read the Docs itself by adding a robotstxt_file: option in our YAML

This is not trivial.

With that file, we will need to do:

  1. append our own set of rules to the custom robots.txt (see the sketch after this list)
  2. sync the result to all our web servers

    • since this file will be _outside_ Sphinx output, we need to adapt that code

  3. modify the nginx rule to try serving first the custom robots.txt from the project/version and as a fallback serve ours
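
For step 1, the merging itself would be straightforward; a minimal sketch (the function and the appended rules are hypothetical, just to illustrate the idea):

```python
# Hypothetical sketch of step 1: append Read the Docs' own rules to a
# user-supplied robots.txt before syncing the result to the web servers.
RTD_RULES = """
# Rules appended by Read the Docs (illustrative content only)
User-agent: *
Disallow: /404.html
"""

def merge_robots_txt(user_robots_path, merged_path):
    with open(user_robots_path) as fh:
        user_rules = fh.read().rstrip("\n")
    with open(merged_path, "w") as fh:
        fh.write(user_rules + "\n" + RTD_RULES)
```

Steps 2 and 3 are the harder part, since they touch the sync and nginx layers.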

This raises another problem: we have one subdomain with multiple versions but only _one_ root location to serve the robots.txt file from. Which one should we serve?

Since this would be a "global setting", I wonder whether it would be better to add a text box in the admin where the user can paste the contents of that file, or something simpler along those lines.

I think we should implement this from Read the Docs itself by adding a robotstxt_file: option in our YAML

I doubt this will be in the YAML, as this is per-project configuration rather than per-version

The hack I found could be quite simple (this): add meta tags to files you don't want indexed.
But because of the global robots.txt, it would have no effect (referring to this answer from Google). Some solution using YAML or a text box seems like the way to go.

Unfortunately, the idea of adding meta tags isn't really an ideal solution, because we can't add them to all the old versions we host. In the case of astropy, for example, we host a lot of old versions based on GitHub tags, e.g.:

http://docs.astropy.org/en/v1.0/

We can't change all the tags in our GitHub repo for all the old versions, so any solution that involves changes to the repository is a no-go. The only real solution would be to be able to customize robots.txt from the RTD settings interface.

@dasdachs @astrofrog we just merged a PR that will allow you to use a custom robots.txt. It will be deployed soon. Here are the docs: https://docs.readthedocs.io/en/latest/faq.html#how-can-i-avoid-search-results-having-a-deprecated-version-of-my-docs

Please, after the deploy, follow the docs and let us know if it works as you expect.

@humitos This is amazing. Thanks for the great work!

What is the best way to add a custom robots.txt file and sitemap.xml file to a readthedocs.com external domain?

@AmmaraAnis Hi! For robots.txt you can read this FAQ at https://docs.readthedocs.io/en/latest/faq.html#how-can-i-avoid-search-results-having-a-deprecated-version-of-my-docs

Regarding sitemap.xml, there is no way to modify the default one served at the root yet (see #6938), although you can change the Sitemap: entry in your robots.txt to point to a custom one, and that _may_ work.
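Purely as an illustration of that suggestion (the version path and sitemap URL are made up), a custom robots.txt could look something like:

```
User-agent: *
# Keep crawlers out of a deprecated version
Disallow: /en/v1.0/

# Point crawlers at a custom sitemap instead of the default one
Sitemap: https://docs.example.com/en/latest/custom-sitemap.xml
```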
