_[WIP -- There's a lot of detail to cover so my initial version of this issue will be incomplete but will hopefully be updated (by me :D ). Maybe even a TL;DR:!]_
For some time I've noticed that Google provides astonishingly bad search results for the Godot documentation pages.
(That's even allowing for Google's search results getting worse in general & for Google obscuring a page's actual URL by eliding portions of the path and/or replacing / with >.)
In some cases the Ukrainian (language code uk) results got returned so frequently (for an English search) that I wondered if somewhere "uk" was being interpreted as "United Kingdom"!
After once again encountering a "No information is available for this page" message for the first result (a link to https://docs.godotengine.org/en/3.1/classes/class_file.html) from a search for _"file godot"_ (without quotes) I decided to investigate further.
There was only one more link to docs.godotengine.org on the first page of the results and it was to https://docs.godotengine.org/en/3.1/tutorials/io/ which does include a page text preview. The other results all seemed to be on https://godotengine.org/qa/ and did have a page text preview.
It seems there are multiple intersecting issues that I suspect are the cause of the poor search results:
robots.txt related.
sitemap.xml related.
Language related.
Other related.
Per page language & canonical link related.
Missing pages/broken links (particularly in per language navigation).
robots.txt related.
robots.txt in multiple locations (identical content but seemingly not up to date with robots.txt in current master):
and more...
Example: http://example.com/folder/robots.txt
Not a valid robots.txt file. Crawlers don't check for robots.txt files in subdirectories.
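To make the rule quoted above concrete, here is a minimal Python sketch (standard library only) that derives the one robots.txt location a crawler would consult for any page on a host. The helper name `root_robots_url` is hypothetical and only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def root_robots_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the host serving page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); drop path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Any copy under /en/3.1/ or similar is never requested by a compliant crawler.
print(root_robots_url("https://docs.godotengine.org/en/3.1/classes/class_file.html"))
```

Copies of robots.txt that live under a language or version directory are therefore dead weight as far as crawlers are concerned.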
sitemap.xml I believe) but it's only usable by a "property owner", i.e. even if you only want to test (and not update) you need to verify that you own the domain/site/pages/whatever. (Although that might just be the sitemap.xml tester; I don't currently recall exactly.)
Action Required: Anyway, we need a person(s) with sufficient server/DNS access to verify ownership & then run the robots.txt tests / sitemap.xml report.
sitemap.xml related.
(Unfortunately, the site doesn't appear to enable linking directly to the results--you'll need to select "HTML & HTTP Headers"/"XML Sitemaps" as appropriate, supply the URL and select Googlebot as the User Agent.)
sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
<loc>https://docs.godotengine.org/en/latest/</loc>
<xhtml:link rel="alternate" hreflang="en" href="https://docs.godotengine.org/en/latest/" />
<xhtml:link rel="alternate" hreflang="de" href="https://docs.godotengine.org/de/latest/" />
</url>
<url>
<loc>https://docs.godotengine.org/de/latest/</loc>
<xhtml:link rel="alternate" hreflang="en" href="https://docs.godotengine.org/en/latest/" />
<xhtml:link rel="alternate" hreflang="de" href="https://docs.godotengine.org/de/latest/" />
</url>
</urlset>
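As a rough local alternative to an online checker, a sitemap of the shape above could be checked for reciprocal hreflang links with a short Python sketch like the following. It uses only the standard library; the function names `hreflang_map` and `unreciprocated` are hypothetical, and `SAMPLE` simply repeats the two entries shown above.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
<url>
<loc>https://docs.godotengine.org/en/latest/</loc>
<xhtml:link rel="alternate" hreflang="en" href="https://docs.godotengine.org/en/latest/" />
<xhtml:link rel="alternate" hreflang="de" href="https://docs.godotengine.org/de/latest/" />
</url>
<url>
<loc>https://docs.godotengine.org/de/latest/</loc>
<xhtml:link rel="alternate" hreflang="en" href="https://docs.godotengine.org/en/latest/" />
<xhtml:link rel="alternate" hreflang="de" href="https://docs.godotengine.org/de/latest/" />
</url>
</urlset>
"""


def hreflang_map(sitemap_xml):
    """Map each <loc> to a dict of {hreflang: href} taken from its alternates."""
    root = ET.fromstring(sitemap_xml.strip().encode("utf-8"))
    result = {}
    for url in root.findall("sm:url", NS):
        loc = url.find("sm:loc", NS).text.strip()
        result[loc] = {
            link.get("hreflang"): link.get("href")
            for link in url.findall("xhtml:link", NS)
            if link.get("rel") == "alternate"
        }
    return result


def unreciprocated(alternates):
    """Return (loc, href) pairs where href's own entry does not link back to loc."""
    problems = []
    for loc, links in alternates.items():
        for href in links.values():
            back = alternates.get(href, {})
            if loc not in back.values():
                problems.append((loc, href))
    return problems


print(unreciprocated(hreflang_map(SAMPLE)))  # prints [] for the sample above
```

This only checks that the entries reference each other; it does not fetch the URLs, so links that return 404 would need a separate check.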
[TODO]
[TODO]
TL;DW: The hreflang links in this commit are all 404: https://github.com/godotengine/godot-docs/commit/b6cb5c3ee7ec86b8730bb14af80c4a4089328175
TL;DW: Relevant canonical link related information in this post suggested in an earlier issue: https://yoast.com/rel-canonical/
[TODO]
TL;DW: On https://docs.godotengine.org/en/latest/classes/class_tween.html (and similar pages) most/all language related links are broken (due to missing class docs?), e.g.:
An example of non-404 language links:
[TODO]
godot-docs issues/commits
"Show latest Godot docs in search" https://github.com/godotengine/godot-docs/issues/2912
"Google sometimes returns results for docs in Ukranian" https://github.com/godotengine/godot-docs/issues/1574
"Non-English docs appear on the top of Google search results" https://github.com/godotengine/godot-docs/issues/1520
"Add layout.html template to set hreflang attributes" https://github.com/godotengine/godot-docs/commit/b6cb5c3ee7ec86b8730bb14af80c4a4089328175
"Once 3.0 release, make sure Google search results reflect 3.0 pages" https://github.com/godotengine/godot-docs/issues/899
"Add robots.txt file to prevent indexing outdated docs" https://github.com/godotengine/godot-docs/pull/3002
"Prevent the 3.2 branch from being indexed by search engines" https://github.com/godotengine/godot-docs/pull/3142 / https://github.com/godotengine/godot-docs/commit/c3dfdb8c8fceac85a4e99e05e9a1d76a4751f3cb
[...paused...]
Yeah, I have this problem too, where only the old sites are found and the stable / latest are never indexed. Maybe we can put noindex in the HTML head of all the old sites' pages?
Working on a fix for this.
@mhilbrunner Which specific aspect is "this"? Or do you mean all of them? :)
Note that the underlying issue for the hreflang issue originates with readthedocs and/or a library it uses--I have a tab open somewhere with the details...
Also, FWIW, I know that readthedocs recommend preventing earlier versions from being indexed but I don't think that is the correct way to go because it removes them from Google & that's not useful. If canonical links (in relation to both stable/latest & English/complete/incomplete translations) are used correctly then AFAICT older versions should serve to boost the latest versions when they are indexed.
So, I did a thing. See https://github.com/godotengine/godot-docs/pull/3422.
@follower
robots.txt related
As you correctly point out, all robots.txt files besides the root one should be ignored.
With the change in the above PR I removed the language-specific versions from the robots.txt disallow: blacklist, as you are correct that they should help boost the canonical (stable, current) version, and it would be nice if I could find pages that are not in stable, i.e. docs for removed features that were available in Godot 2.1.
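For anyone wanting to see what a disallow rule does from a crawler's point of view, here is a small Python sketch using the standard library's `urllib.robotparser`. The rules in `RULES` are a made-up example for illustration, not the actual contents of the docs site's robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block one old version, leave everything else crawlable.
RULES = """\
User-agent: *
Disallow: /en/2.1/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

old_page = "https://docs.godotengine.org/en/2.1/index.html"
stable_page = "https://docs.godotengine.org/en/stable/index.html"

# A blocked page is never fetched, so a crawler cannot read its canonical link.
print(parser.can_fetch("Googlebot", old_page))
print(parser.can_fetch("Googlebot", stable_page))
```

With these example rules the first check is refused and the second is allowed, which is why dropping the disallow entries lets older pages pass their ranking on to the canonical version.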
sitemap.xml related
While not optimal, this one looks mostly fine at first glance.
Per page language & canonical link related.
My PR should fix those.
Missing pages/broken links (particularly in per language navigation)
Those I haven't looked at yet.
Akien created the stable branches for the translations, and the English classdoc is now mirrored to all translations as of today.
Together with my fixes in https://github.com/godotengine/godot-docs/pull/3422, most if not all of these issues should now be fixed (after Google reindexes):
For translations, the class docs are no longer missing with a 404, but exist as mirrors of the English class docs.
All pages now point to the stable version of themselves as canonical, which should lead to Google preferring them over latest or specific versions.
All translated pages now correctly identify themselves as translated versions via hreflang tags (correctly, with full, absolute paths), which should hopefully let Google only show results suitable for the language you're browsing in.
The robots.txt no longer somewhat ineffectively prohibits indexing of older versions' pages.
Indeed, using the linked tool https://technicalseo.com/tools/hreflang/ now verifies the links work correctly.
The sitemap seems to be fine as-is.
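To illustrate the canonical and hreflang points above, here is a rough Python sketch of the link tags a page would end up with. It is not the code from the PR: it assumes the URL layout `/<language>/<version>/<path>`, assumes alternates keep the page's own version, and the function names and the language list passed in are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def _segments(page_url):
    """Split a docs URL into its parsed parts and path segments."""
    parts = urlsplit(page_url)
    segments = parts.path.split("/")  # e.g. ['', 'en', '3.1', 'classes', 'x.html']
    if len(segments) < 4 or not segments[1] or not segments[2]:
        raise ValueError("expected a URL of the form /<language>/<version>/...")
    return parts, segments


def canonical_url(page_url):
    """Point any version of a page at the same page under 'stable'."""
    parts, segments = _segments(page_url)
    segments[2] = "stable"
    return urlunsplit((parts.scheme, parts.netloc, "/".join(segments), "", ""))


def head_links(page_url, languages):
    """Build a canonical tag plus one absolute hreflang tag per language."""
    parts, segments = _segments(page_url)
    tags = ['<link rel="canonical" href="%s" />' % canonical_url(page_url)]
    for lang in languages:
        alt = segments.copy()
        alt[1] = lang  # swap only the language segment
        href = urlunsplit((parts.scheme, parts.netloc, "/".join(alt), "", ""))
        tags.append('<link rel="alternate" hreflang="%s" href="%s" />' % (lang, href))
    return tags


for tag in head_links("https://docs.godotengine.org/en/3.1/classes/class_file.html", ["en", "de"]):
    print(tag)
```

The important details are that every href is a full, absolute URL and that each translation lists the same set of alternates, including itself.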
I'm closing this now. If more specific issues still crop up, please let us know by opening a new issue :)
Thanks for your work!