_Note: This is about enhancing SEO._
Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling.
The detailed technical specifications are available here.
It tells search engines how often the crawler should come back:
* Set priority as:
  * 1 for the pages of the latest or stable version. This option could be set in conf.py.
  * -0.1 for each older version.
  * 0.1 for the pages of other versions if there are more than 9 versions.
* Set lastmod.
* Set changefreq as:
  * daily for the pages of the latest version
  * weekly for the pages of the last tag version
  * never for the pages of other versions

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://django.readthedocs.org/en/latest/</loc>
<lastmod>2013-12-01T19:20:30.45+01:00</lastmod>
<changefreq>daily</changefreq>
<priority>1</priority>
</url>
<url>
<loc>http://django.readthedocs.org/en/1.6.x/</loc>
<lastmod>2013-11-30T19:20:30.45+01:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>http://django.readthedocs.org/en/1.5.x/</loc>
<lastmod>2013-10-03T19:20:30.45+01:00</lastmod>
<changefreq>never</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>http://django.readthedocs.org/en/0.1.x/</loc>
<lastmod>2013-10-03T19:20:30.45+01:00</lastmod>
<changefreq>never</changefreq>
<priority>0.1</priority>
</url>
</urlset>
We currently have logic in the code base for determining version order. We could just subtract 0.1 for each supported version until we hit 0.1. We could also change the logic for tags and branches: since tags should never change, they can be updated much less frequently.
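As a rough illustration of the scheme discussed so far, the sketch below assigns a priority and changefreq to each version by its position in an already-sorted list (newest first) and renders the result with the standard library. The function names, the `/en/<version>/` URL layout, and the example base URL are assumptions for illustration only, not Read the Docs code.

```python
from typing import Sequence
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def priority_for(rank: int) -> str:
    """Priority for the version at position `rank` (0 = latest)."""
    # Count in tenths to avoid floating-point drift; never go below 0.1.
    tenths = max(10 - rank, 1)
    return "1" if tenths == 10 else f"0.{tenths}"


def changefreq_for(rank: int) -> str:
    """daily for the latest version, weekly for the next one, never otherwise."""
    return {0: "daily", 1: "weekly"}.get(rank, "never")


def build_sitemap(base_url: str, versions: Sequence[str]) -> str:
    """Render a sitemap for `versions`, ordered from newest to oldest."""
    # Setting xmlns as a plain attribute keeps the child tags unprefixed.
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for rank, slug in enumerate(versions):
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = f"{base_url}/en/{slug}/"
        ET.SubElement(url, "changefreq").text = changefreq_for(rank)
        ET.SubElement(url, "priority").text = priority_for(rank)
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap("http://django.readthedocs.org", ["latest", "1.6.x", "1.5.x"]))
```

The sketch leaves out lastmod because it would have to come from real build timestamps.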
Is this feature being worked on? I can work on this if no one has worked on it yet.
A theme that is doing this with an extension: https://github.com/guzzle/guzzle_sphinx_theme/blob/master/guzzle_sphinx_theme/__init__.py#L30
Another interesting approach: https://github.com/openstack/openstack-doc-tools/tree/master/sitemap
I took the sitemap logic out of guzzle_sphinx_theme and made it an extension/package here: https://github.com/jdillard/sphinx-sitemap
It's my first time making a package, but several people are using it successfully and I have it running in a few production environments myself.
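Some are even using it on RTD, for example:
* build: [sitemap.xml](http://docs.bonobo-project.org/en/master/sitemap.xml)
* source: [conf.py](https://github.com/python-bonobo/bonobo/blob/71039ddcb125a6bf6681cb590dd775d3d8e30dea/docs/conf.py#L25)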
Neat, this is definitely something we could incorporate into the standard build process.
Great! This has been a project to help me learn new things and I'm very much still learning, so let me know if you need anything from me.
I'd also support this feature being available in the standard build process, as it might be especially relevant for multilingual RTD projects, see https://en.wikipedia.org/wiki/Sitemaps#Multilingual_and_multinational_Sitemaps
Context: For Godot Engine, we recently put up localized RTD instances, most of which are still over 80% English text while translators work through them string by string. Search engines seem to have taken a particular liking to the Ukrainian instance for English queries, which puzzles many users. I hope the sitemap trick mentioned in the above link could fix that.
(I'll try jdillard's extension in the meantime)
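For readers unfamiliar with the multilingual sitemap trick referenced above: each `<url>` entry lists every translation of the same page as an `xhtml:link` alternate with an `hreflang` attribute. The sketch below shows the shape of such a file; the function name, the `docs.example.org` domain, and the `/<lang>/<version>/<page>` layout are assumptions for illustration, and this is not how sphinx-sitemap necessarily implements it.

```python
from typing import Sequence
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Serialize sitemap tags unprefixed and alternates as <xhtml:link>.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("xhtml", XHTML_NS)


def multilingual_urlset(base_url: str, version: str, page: str,
                        languages: Sequence[str]) -> str:
    """One <url> per translation, each listing all translations as alternates."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for lang in languages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        loc = ET.SubElement(url, f"{{{SITEMAP_NS}}}loc")
        loc.text = f"{base_url}/{lang}/{version}/{page}"
        for alt in languages:
            # Every entry repeats the full set, including a link to itself.
            ET.SubElement(
                url,
                f"{{{XHTML_NS}}}link",
                rel="alternate",
                hreflang=alt,
                href=f"{base_url}/{alt}/{version}/{page}",
            )
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(multilingual_urlset("https://docs.example.org", "latest", "index.html", ["en", "uk"]))
```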
@akien-mga I created a PR, jdillard/sphinx-sitemap#15, on my extension that adds support for multilingual sitemaps if you want to test it out and leave feedback there. I don't have much first hand experience with multi-lingual sphinx/RTD setups, so I might have missed some nuances.
> Neat, this is definitely something we could incorporate into the standard build process.
@ericholscher What's your idea to accomplish this?
I'm thinking of installing the sphinx-sitemap extension by default (together with the other default packages that we install) and adding it to the user's conf.py when the append_conf method is called, along with the site_url setting populated from the canonical_url of the project.
What do you think? Is this the path to follow?
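For context, the settings that would end up in a project's conf.py look roughly like the fragment below. The extension module name and the `site_url` option come from this thread; the URL value is a placeholder assumption standing in for the project's canonical URL.

```python
# conf.py (fragment): a sketch of the values that would be appended.

# Enable the sitemap extension alongside whatever the project already uses.
extensions = [
    "sphinx_sitemap",
]

# Base URL used to build absolute page URLs in sitemap.xml.
# Placeholder value; in the proposal it would come from the project's canonical_url.
site_url = "https://project.readthedocs.io/en/latest/"
```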
There are a few challenges with sitemaps. One challenge is that a sitemap is normally at /sitemap.xml by default. You can also specify where it is in robots.txt. So the sitemap should not be language specific (although you can have sub-sitemaps). It won't be at /$lang/latest/sitemap.xml.
One possibility is to make https://project.readthedocs.io/sitemap.xml a dynamic page which scans the different versions and translations under that domain for sitemap.xml files and links to them as sub-sitemaps. Another possibility is to let people upload a sitemap file that applies to all their versions and translations. Perhaps the simplest (but not a totally complete solution) is to just have the sitemap link to the root of different versions and translations.
@davidfischer You can also use a sitemapindex to manage multiple sitemaps. I'm not sure if the RTD build process could create that file (containing links to the sub-sitemaps) and place it in the root directory.
@humitos I didn't realize there was already an html_baseurl config value that could have been used instead of site_url, and was thinking about switching to that instead (with a backwards compatibility check for site_url). I'm not sure if that would make things easier.
> @davidfischer You can also use a sitemapindex to manage multiple sitemaps. I'm not sure if the RTD build process could create that file (containing links to the sub-sitemaps) and place it in the root directory.
This is exactly what I'm thinking! It is possible that RTD could dynamically generate the root sitemap rather than creating/updating it when builds happen.
Just to put it all together and continue with the next step. We need to decide:
* /sitemap.xml will be a Django view that will search for all the sitemap.xml files in the project's output generated by Sphinx at build time, and generate the sitemap index with those files.

I personally like the idea of doing all of this automatically, but in that case we need to think about whether there could be users who don't want this for some particular reason (it could also be an option in the admin).
How about we do a combination of both! Here's my proposal:
* /sitemap.xml is a Django view. When requested, it looks for $lang/$version/sitemap.xml files and includes them in a sitemapindex as @jdillard proposes.

I don't think any users will actively not want this so I don't know if being able to disable it is critical in the first implementation.
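The proposal above can be sketched in plain Python, leaving out the Django view wiring. The helper names, the on-disk `<lang>/<version>/sitemap.xml` layout under a single directory, and the example domain are assumptions for illustration; the real storage layout may differ.

```python
from pathlib import Path
from typing import Iterable, List
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def find_sub_sitemaps(docs_root: Path) -> List[str]:
    """Relative paths of every <lang>/<version>/sitemap.xml under docs_root."""
    return sorted(
        path.relative_to(docs_root).as_posix()
        for path in docs_root.glob("*/*/sitemap.xml")
    )


def build_sitemap_index(base_url: str, sub_sitemaps: Iterable[str]) -> str:
    """Render a <sitemapindex> that links to each sub-sitemap."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for relative_path in sub_sitemaps:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = f"{base_url}/{relative_path}"
    return ET.tostring(index, encoding="unicode")


if __name__ == "__main__":
    found = find_sub_sitemaps(Path("."))
    print(build_sitemap_index("https://project.readthedocs.io", found))
```

A view serving /sitemap.xml could call these two helpers on each request, which matches the idea of generating the index dynamically rather than at build time.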
I like your proposal, @davidfischer
I think that we have something that's _actionable_ now, and we can implement it. I'd love to see/get/receive a PR for this.
> I don't think any users will actively _not_ want this so I don't know if being able to disable it is critical in the first implementation.
We will need to install a new dependency that could affect build time (not too much, though), and that could bring new issues.
That was my only concern, but I think we are fine installing and running this by default. It's a new feature that will benefit all projects and may have a minimal impact on some particular projects (we could add a feature flag if we find problems with it).
> We will need to install a new dependency that could affect build time (not too much, though), and that could bring new issues.

I propose that we do not add the extra Sphinx extension for generating sitemaps by default. I think users should opt in to it.
If users choose not to opt in, the sitemap we display would just point to the active versions:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://django.readthedocs.org/en/latest/</loc>
<lastmod>2013-12-01T19:20:30.45+01:00</lastmod>
<changefreq>daily</changefreq>
<priority>1</priority>
</url>
<url>
<loc>http://django.readthedocs.org/en/1.6.x/</loc>
<lastmod>2013-11-30T19:20:30.45+01:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>http://django.readthedocs.org/en/1.5.x/</loc>
<lastmod>2013-10-03T19:20:30.45+01:00</lastmod>
<changefreq>never</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>http://django.readthedocs.org/en/0.1.x/</loc>
<lastmod>2013-10-03T19:20:30.45+01:00</lastmod>
<changefreq>never</changefreq>
<priority>0.1</priority>
</url>
</urlset>
Opt in sounds good (at least for now) -- we should however write a guide about how to enable it, documenting our integration and how users can enable it (once we build the integration :D)
I am on the fence (only slightly!) on making this a Django view. We've been talking more about pushing docs off our servers and to Azure storage and historically served docs entirely from nginx on the community side.
However, we could maybe redirect to a .org endpoint in Azure storage (similar to S3 redirects), or could reverse proxy the request to an API endpoint through Nginx. Worst case for an Azure implementation is we could just plop the sitemap index on the storage on any project save.
I would however be into making this a more integrated feature, like a sitemap: true in our YAML. I'd be :+1: on just making this the default too (perhaps eventually?)
I created a proposal for this at #5122. This initial version does not yet allow users to serve their own generated sitemap.xml at the root (/sitemap.xml), but we will get there soon using the sitemapindex tag, hopefully.
> I took the sitemap logic out of guzzle_sphinx_theme and made it an extension/package here: https://github.com/jdillard/sphinx-sitemap
> It's my first time making a package, but several people are using it successfully and I have it running in a few production environments myself.
> Some are even using it on RTD, for example:
> * build: [sitemap.xml](http://docs.bonobo-project.org/en/master/sitemap.xml)
> * source: [conf.py](https://github.com/python-bonobo/bonobo/blob/71039ddcb125a6bf6681cb590dd775d3d8e30dea/docs/conf.py#L25)
If I add sphinx_sitemap to the extensions variable in the conf.py file, does it get installed automatically in the Read the Docs build process?
@omidraha not yet.
In #5122 we will be generating a general sitemap that will live at /sitemap.xml. In the near future, we will be generating sitemap indexes, which will allow you to generate your own sitemap via sphinx_sitemap; Read the Docs will recognize it and serve it.
The PR with the general sitemap.xml generation is about to get merged. However, I want to link this comment from David here since it's an important one to consider when working on the next phase (sitemap indexes and more).
This is already implemented in https://github.com/rtfd/readthedocs.org/pull/5122