Godot-docs: [WIP] Revisiting ;) robots.txt, sitemap.xml & Google search results

Created on 10 Mar 2020 · 5 Comments · Source: godotengine/godot-docs

_[WIP -- There's a lot of detail to cover so my initial version of this issue will be incomplete but will hopefully be updated (by me :D ). Maybe even a TL;DR:!]_

Prologue

For some time I've noticed that Google provides astonishingly bad search results for the Godot documentation pages.

(Even allowing for Google's searches getting worse in general & their obscuring a page's actual URL by eliding portions of the path and/or replacing / with >. )

In some cases the Ukrainian (language code uk) results were returned so frequently (for an English search) that I wondered if somewhere "uk" was being interpreted as "United Kingdom"!

Latest incident

After once again encountering a "No information is available for this page" message for the first result (a link to https://docs.godotengine.org/en/3.1/classes/class_file.html) from a search for _"file godot"_ (without quotes) I decided to investigate further.

There was only one more link to docs.godotengine.org on the first page of results, and it was to https://docs.godotengine.org/en/3.1/tutorials/io/ which does include a page text preview. The other results all seemed to be on https://godotengine.org/qa/ and did have page text previews.

Underlying cause(s)

It seems there are multiple intersecting issues that I suspect are the cause of the poor search results:

  • robots.txt related.

  • sitemap.xml related.

    • Language related.

    • Other related.

  • Per page language & canonical link related.

  • Missing pages/broken links (particularly in per language navigation).

Cause: robots.txt related.

[TODO]

Cause: sitemap.xml related.

[TODO]

Language related.


(Unfortunately, the site doesn't appear to enable linking directly to the results--you'll need to select "HTML & HTTP Headers"/"XML Sitemaps" as appropriate, supply the URL and select Googlebot as the User Agent. )

sitemap.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:xhtml="http://www.w3.org/1999/xhtml">
      <url>
        <loc>https://docs.godotengine.org/en/latest/</loc>
        <xhtml:link rel="alternate" hreflang="en" href="https://docs.godotengine.org/en/latest/" />
        <xhtml:link rel="alternate" hreflang="de" href="https://docs.godotengine.org/de/latest/" />
      </url>
      <url>
        <loc>https://docs.godotengine.org/de/latest/</loc>
        <xhtml:link rel="alternate" hreflang="en" href="https://docs.godotengine.org/en/latest/" />
        <xhtml:link rel="alternate" hreflang="de" href="https://docs.godotengine.org/de/latest/" />
      </url>
    </urlset>
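For anyone who wants to check a sitemap like this locally rather than through an online tool, here is a minimal Python sketch. It assumes the usual hreflang expectation that every listed page references itself and that every alternate links back. The function names are my own, and the embedded XML is just the two-entry sample from this issue, not the live sitemap.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# The two-entry sample from this issue (not the live sitemap).
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://docs.godotengine.org/en/latest/</loc>
    <xhtml:link rel="alternate" hreflang="en" href="https://docs.godotengine.org/en/latest/" />
    <xhtml:link rel="alternate" hreflang="de" href="https://docs.godotengine.org/de/latest/" />
  </url>
  <url>
    <loc>https://docs.godotengine.org/de/latest/</loc>
    <xhtml:link rel="alternate" hreflang="en" href="https://docs.godotengine.org/en/latest/" />
    <xhtml:link rel="alternate" hreflang="de" href="https://docs.godotengine.org/de/latest/" />
  </url>
</urlset>
"""


def read_alternates(xml_text):
    """Map each <loc> URL to its {hreflang: href} alternates."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    pages = {}
    for url in root.findall(f"{{{SITEMAP_NS}}}url"):
        loc = url.findtext(f"{{{SITEMAP_NS}}}loc").strip()
        pages[loc] = {
            link.get("hreflang"): link.get("href")
            for link in url.findall(f"{{{XHTML_NS}}}link")
            if link.get("rel") == "alternate"
        }
    return pages


def find_hreflang_problems(pages):
    """Report missing self-references and alternates that do not link back."""
    problems = []
    for loc, alternates in pages.items():
        if loc not in alternates.values():
            problems.append(f"{loc}: no self-referencing alternate")
        for lang, href in alternates.items():
            if href == loc:
                continue
            target = pages.get(href)
            if target is None:
                problems.append(f"{loc}: alternate {href} ({lang}) is not in the sitemap")
            elif loc not in target.values():
                problems.append(f"{loc}: {href} ({lang}) does not link back")
    return problems


PAGES = read_alternates(SITEMAP_XML)
PROBLEMS = find_hreflang_problems(PAGES)
print(PROBLEMS or "hreflang annotations are reciprocal")
```

Running the same two functions over the real sitemap (fetched separately) would flag any language entry whose alternates are one-directional.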

[TODO]

Other related.

[TODO]

Cause: Per page language & canonical link related.

[TODO]

Cause: Missing pages/broken links (particularly in per language navigation).

[TODO]

Pre-existing/related godot-docs issues/commits

[...paused...]

bug enhancement

All 5 comments

Yeah, I have this problem too, where only the old sites are found and stable / latest are never indexed. Maybe we can put noindex in all the HTML headers for the old sites' pages?

Working on a fix for this.

@mhilbrunner Which specific aspect is "this"? Or do you mean all of them? :)

  • Note that the underlying issue for the hreflang issue originates with readthedocs and/or a library it uses--I have a tab open somewhere with the details...

  • Also, FWIW, I know that readthedocs recommend preventing earlier versions from being indexed but I don't think that is the correct way to go because it removes them from Google & that's not useful. If canonical links (in relation to both stable/latest & English/complete/incomplete translations) are used correctly then AFAICT older versions should serve to boost the latest versions when they are indexed.
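To see which of the two approaches a given page currently uses (a robots noindex directive versus a canonical link), something like the following Python sketch can read both out of a page's markup. The class name and the sample HTML are hypothetical; the sample is not copied from the live site, and its canonical URL is only an illustration of the "point old versions at stable" idea.

```python
from html.parser import HTMLParser


class HeadLinkParser(HTMLParser):
    """Collect the canonical link, hreflang alternates and robots meta tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.robots = None
        self.alternates = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link":
            rel = (attrs.get("rel") or "").lower().split()
            if "canonical" in rel:
                self.canonical = attrs.get("href")
            elif "alternate" in rel and attrs.get("hreflang"):
                self.alternates[attrs["hreflang"]] = attrs.get("href")
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots = attrs.get("content")


# Hypothetical markup for an old-version page (an assumption for illustration).
SAMPLE_HTML = """<html><head>
<link rel="canonical" href="https://docs.godotengine.org/en/stable/classes/class_file.html" />
<link rel="alternate" hreflang="en" href="https://docs.godotengine.org/en/3.1/classes/class_file.html" />
</head><body></body></html>"""

PARSER = HeadLinkParser()
PARSER.feed(SAMPLE_HTML)

# A page is excluded from the index only if it carries a noindex directive.
IS_NOINDEX = PARSER.robots is not None and "noindex" in PARSER.robots.lower()
print(PARSER.canonical, PARSER.alternates, IS_NOINDEX)
```

With the canonical approach the old page stays indexable (no noindex) while still telling the crawler which URL to prefer.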

So, I did a thing. See https://github.com/godotengine/godot-docs/pull/3422.

@follower

robots.txt related

As you correctly point out, all robots.txt files besides the root one should be ignored.
With the change in the above PR I removed the language-specific versions from the robots.txt disallow: blacklist, as you are correct that they should help boost the canonical (stable, current) version, and it would be nice if I could find pages that are not in stable, i.e. docs for removed features that were available in Godot 2.1.
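As a rough illustration of what dropping entries from the disallow list changes for a crawler, here is a small Python sketch using the standard library's robots.txt parser. The rule sets and URLs below are hypothetical stand-ins (the actual before/after robots.txt contents are in the PR, not reproduced here), and is_allowed is just a helper name I made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical "before" rules that block an old version's pages.
OLD_RULES = """User-agent: *
Disallow: /en/2.1/
"""

# Hypothetical "after" rules with the version entry removed
# (an empty Disallow value allows everything).
NEW_RULES = """User-agent: *
Disallow:
"""

OLD_VERSION_URL = "https://docs.godotengine.org/en/2.1/classes/class_file.html"
STABLE_URL = "https://docs.godotengine.org/en/stable/classes/class_file.html"


def is_allowed(rules, url, agent="Googlebot"):
    """Return True if the given robots.txt text lets the agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


for label, rules in (("before", OLD_RULES), ("after", NEW_RULES)):
    print(label, is_allowed(rules, OLD_VERSION_URL), is_allowed(rules, STABLE_URL))
```

Note that a Disallow rule only stops crawling; a blocked URL can still show up in results without a description, which would be consistent with the "No information is available for this page" message reported above.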

sitemap.xml related

While not optimal, this one looks mostly fine at first glance.

Per page language & canonical link related.

My PR should fix those.

Missing pages/broken links (particularly in per language navigation)

Those I haven't looked at yet.

Akien created the stable branches for the translations, and the English class docs are now mirrored to all translations as of today.

Together with my fixes in https://github.com/godotengine/godot-docs/pull/3422, most if not all of these issues should now be fixed (after Google reindexes):

For translations, the class docs are no longer missing with a 404, but exist as mirrors of the English class docs.

All pages now point to the stable version of themselves as canonical, which should lead to Google preferring them over latest or specific versions.

All translated pages now correctly identify themselves as translated versions via hreflang tags (correctly, with full, absolute paths), which should hopefully let Google only show results suitable for the language you're browsing in.

The robots.txt no longer (somewhat ineffectively) prohibits indexing of older versions' pages.

Indeed, using the linked tool https://technicalseo.com/tools/hreflang/ now verifies the links work correctly.

The sitemap seems to be fine as-is.
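For reference, the canonical and hreflang behaviour described above can be sketched roughly as follows in Python. This is not the code from the PR: the helper names, the language list, and the assumptions that URLs follow /<lang>/<version>/<page>, that the canonical target is the same-language stable page, and that alternates keep the current version are all mine.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed subset of languages, for illustration only.
LANGUAGES = ["en", "de"]


def split_doc_url(url):
    """Split a docs URL into (parts, language, version, page path)."""
    parts = urlsplit(url)
    lang, version, *rest = parts.path.lstrip("/").split("/")
    return parts, lang, version, "/".join(rest)


def canonical_url(url):
    """Point any version of a page at its same-language 'stable' counterpart."""
    parts, lang, _version, page = split_doc_url(url)
    return urlunsplit((parts.scheme, parts.netloc, f"/{lang}/stable/{page}", "", ""))


def head_links(url, languages=LANGUAGES):
    """Build canonical and hreflang <link> tags using full, absolute URLs."""
    parts, _lang, version, page = split_doc_url(url)
    lines = [f'<link rel="canonical" href="{canonical_url(url)}" />']
    for lang in languages:
        href = urlunsplit((parts.scheme, parts.netloc, f"/{lang}/{version}/{page}", "", ""))
        lines.append(f'<link rel="alternate" hreflang="{lang}" href="{href}" />')
    return lines


PAGE_URL = "https://docs.godotengine.org/en/3.1/classes/class_file.html"
for line in head_links(PAGE_URL):
    print(line)
```

The point of building every href from the scheme and host is the "full, absolute paths" requirement: relative hreflang targets are what the checking tool complained about.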

I'm closing this now. If more specific issues still crop up, please let us know by opening a new issue :)

Thanks for your work!
