Sphinx: Non-Latin headings are not converted into proper anchor links

Created on 3 Jan 2015  ·  9 Comments  ·  Source: sphinx-doc/sphinx

If a heading includes non-Latin characters, they are ignored.

If none of the characters in a heading is Latin, an anchor link like "id1" is created.

It gets really bad if you have 2 headings with the same Latin word and the rest in non-Latin. e.g. "Инструкция по Searchanise" and "Модуль Searchanise"; this way both anchor links will be built upon the same word Searchanise, which really kills it.

You can check a live example here: http://searchanise-supporters-guide.readthedocs.org/ru/latest/

Cyrillic anchor names are valid and should be used (see Wikipedia for example: http://ru.wikipedia.org/wiki/Pantera#.D0.92.D0.BB.D0.B8.D1.8F.D0.BD.D0.B8.D0.B5_.D0.B8_.D1.82.D0.B5.D0.BD.D0.B4.D0.B5.D0.BD.D1.86.D0.B8.D0.B8).
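
A side note on that Wikipedia link: the fragment itself is plain ASCII. It appears to use MediaWiki's legacy anchor encoding, where the heading's UTF-8 bytes are percent-encoded and `%` is written as `.`. A minimal sketch that decodes it under that assumption (the helper name `decode_legacy_anchor` is made up for illustration):

```python
from urllib.parse import unquote


def decode_legacy_anchor(fragment: str) -> str:
    """Decode a dot-encoded anchor such as '.D0.92' (hypothetical helper).

    Assumes every '.XX' pair is a percent-encoded UTF-8 byte with '%'
    swapped for '.'. A fragment containing literal dots would decode wrongly.
    """
    return unquote(fragment.replace(".", "%"))


fragment = (
    ".D0.92.D0.BB.D0.B8.D1.8F.D0.BD.D0.B8.D0.B5_"
    ".D0.B8_.D1.82.D0.B5.D0.BD.D0.B4.D0.B5.D0.BD.D1.86.D0.B8.D0.B8"
)
print(decode_legacy_anchor(fragment))  # Влияние_и_тенденции
```

So the browser still navigates to a Cyrillic heading, even though the characters in the URL are all ASCII.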

Thanks!


bug wontfix

All 9 comments

_From Takayuki Shimizukawa on 2014-04-11 06:01:26+00:00_

Thanks for reporting.
I can't find the location you mentioned because the live example has many pages. Please tell me exactly where the problem occurs.

Technically, the behavior relies on the docutils implementation.
http://sourceforge.net/p/docutils/code/HEAD/tree/tags/docutils-0.11/docutils/nodes.py#l2081
Please let me know which docutils version you use.
Thanks.
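
To make the docutils behavior concrete, here is a rough sketch of how the ID generation in `nodes.py` seems to work. It is a simplified approximation, not the real function: the name `simplified_make_id` is made up, and the character translation tables that docutils applies to some accented letters are left out.

```python
import re
import unicodedata

# Patterns modeled on the ones in docutils' nodes.py (simplified sketch).
_NON_ID_CHARS = re.compile(r"[^a-z0-9]+")
_NON_ID_AT_ENDS = re.compile(r"^[-0-9]+|-+$")


def simplified_make_id(text: str) -> str:
    """Approximate docutils' ID generation; not the real implementation."""
    text = text.lower()
    # Decompose accented characters, then drop everything that is not ASCII.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Collapse whitespace, then turn runs of other characters into hyphens.
    text = _NON_ID_CHARS.sub("-", " ".join(text.split()))
    # Strip leading digits/hyphens and trailing hyphens.
    return _NON_ID_AT_ENDS.sub("", text)


for heading in ("Бесплатный", "Инструкция по Searchanise", "Модуль Searchanise"):
    print(repr(heading), "->", repr(simplified_make_id(heading)))
# 'Бесплатный' -> ''
# 'Инструкция по Searchanise' -> 'searchanise'
# 'Модуль Searchanise' -> 'searchanise'
```

Under this sketch, a purely Cyrillic heading yields an empty string, and two headings that share only a Latin word yield the same candidate ID.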

_From Konstantin Molchanov on 2014-04-11 08:48:41+00:00_

Apologies for the misleading example. Here are some concrete links:

http://searchanise-supporters-guide.readthedocs.org/ru/latest/widget.html#id1
(heading—Бесплатный)

http://searchanise-supporters-guide.readthedocs.org/ru/latest/magento.html#searchanise
(heading—После установки расширения Searchanise админка недоступна)

http://searchanise-supporters-guide.readthedocs.org/ru/latest/admin.html#id2
(heading—Клиентская панель управления Searchanise; note that in this case, for some reason, even the word Searchanise is ignored).

I'm using docutils v. 0.11.

Should I redirect the issue to docutils then?

_From Takayuki Shimizukawa on 2014-04-11 09:07:46+00:00_

Thanks. I'm looking for an example of the case you mentioned:

> It gets really bad if you have 2 headings with the same Latin word and the rest in non-Latin. e.g. "Инструкция по Searchanise" and "Модуль Searchanise"; this way both anchor links will be built upon the same word Searchanise, which really kills it.

If that is what actually happens, it's a bug. However, if these 2 headings generate 2 different IDs (such as "id1" and "id2"), I think that is docutils' current specified behavior.
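
A toy model of that fallback, assuming docutils keeps one set of used IDs per document and switches to an auto-numbered `idN` whenever the name-derived ID is empty or already taken. The `IdRegistry` class is hypothetical and only illustrates the idea; it is not docutils API.

```python
class IdRegistry:
    """Toy model of per-document ID assignment (hypothetical, not docutils API)."""

    def __init__(self, auto_prefix: str = "id") -> None:
        self.used = set()
        self.auto_prefix = auto_prefix
        self.next_auto = 1

    def assign(self, candidate: str) -> str:
        # Use the name-derived ID when it is non-empty and still free.
        if candidate and candidate not in self.used:
            new_id = candidate
        else:
            # Otherwise fall back to the next free auto-numbered ID.
            new_id = ""
            while not new_id or new_id in self.used:
                new_id = f"{self.auto_prefix}{self.next_auto}"
                self.next_auto += 1
        self.used.add(new_id)
        return new_id


registry = IdRegistry()
# Candidate IDs as they might come out of ID generation for three headings on one page.
print([registry.assign(c) for c in ("searchanise", "searchanise", "")])
# ['searchanise', 'id1', 'id2']
```

If this model is right, it would also be consistent with the admin.html example above, where a heading ending in Searchanise got `id2`: another heading on the same page may already have claimed `searchanise`.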

> Should I redirect the issue to docutils then?

Yes. Whether it is a bug or specified behavior, I think that is the faster, more direct way.

_From Konstantin Molchanov on 2014-04-11 20:34:08+00:00_

OK, I've posted it as a bug at docutils: https://sourceforge.net/p/docutils/bugs/254/

I think it is a bug rather than the desired behavior since many resources like Wikipedia use non-Latin anchor names and are OK with it.

UPD: I haven't gotten any response so far, and, according to how they handle other tickets, I won't get any soon. This is really sad since the feature is crucial for all users who use non-Latin alphabets.

_From Gleb Goncharov on 2014-04-24 10:26:09+00:00_

Hi!

This issue is really annoying for me since I too use Russian headings.

The Docutils issue tracker on SourceForge appears to be kind of dead—the issues are not even reviewed.

Is there any chance this issue is fixed in Sphinx? I really don't think the Docutils people are going to do anything related anytime soon.

Thanks!

_From Takayuki Shimizukawa on 2014-04-24 14:51:44+00:00_

I think it is too hard for Sphinx to override or substitute the ID generation function, because the function sits deep inside docutils. If Sphinx overrode it with a monkey patch, I think the result would be fragile.

I think it is very difficult to support multibyte IDs in many outputs. I know HTML5 allows them for its ID system. Do you know how HTMLHelp works? How about LaTeX? How about roff? Sphinx (and docutils) supports many kinds of output formats, and the ID system has to support all of them. If you have a good idea, please post it to the docutils group. I'll join the discussion.
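
For the HTML case only, a Unicode-preserving ID generator could look roughly like the sketch below. The function `unicode_slug` is hypothetical and is not part of Sphinx or docutils. Whether HTMLHelp, LaTeX, or roff would accept such IDs is exactly the open question raised above.

```python
import re


def unicode_slug(text: str) -> str:
    """Hypothetical ID generator that keeps non-Latin letters and digits."""
    # For str patterns in Python 3, \w matches Unicode letters, digits and '_'.
    slug = re.sub(r"[^\w]+", "-", text.lower())
    return slug.strip("-")


print(unicode_slug("Инструкция по Searchanise"))  # инструкция-по-searchanise
print(unicode_slug("Модуль Searchanise"))         # модуль-searchanise
```

With this approach the two headings from the report no longer collide, because the Cyrillic words stay in the ID.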

Edit: This may be unrelated to the issue, sorry!

I worked around this issue by prefixing the labels with letters.

This may not be related, not sure, but I am having headers like this:

0.4.2 (2020-08-01)
--------------------

converted to id1, id2...

but if I wrote "melon", it'd work.

Is this related to this particular issue? Where in the Sphinx code are these header IDs / labels generated? Or is it 100% in docutils?

Is there a config/tool/extension that works around this automatically?

I see mentions of set_id around the net:

When I did

libvcs 0.4.2 (2020-08-01)
---------------------------

Then I got good results: #libvcs-0-4-1-2020-08-01

@tony This behavior originates from docutils:
https://repo.or.cz/docutils.git/blob/HEAD:/docutils/docutils/nodes.py#l2196
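
The step that seems to matter for version-number headings is the final cleanup, which strips leading digits and hyphens from the generated ID. A small sketch with regular expressions modeled on the ones in `nodes.py` (a simplified copy for ASCII headings only; the helper name `heading_to_id` is made up):

```python
import re

# Same shape as the patterns in docutils' nodes.py (assumption: simplified copy).
non_id_chars = re.compile(r"[^a-z0-9]+")
non_id_at_ends = re.compile(r"^[-0-9]+|-+$")


def heading_to_id(heading: str) -> str:
    """Hypothetical helper covering only ASCII headings."""
    return non_id_at_ends.sub("", non_id_chars.sub("-", heading.lower()))


print(repr(heading_to_id("0.4.2 (2020-08-01)")))         # ''
print(repr(heading_to_id("libvcs 0.4.2 (2020-08-01)")))  # 'libvcs-0-4-2-2020-08-01'
```

A heading made only of digits and punctuation ends up empty, so the `idN` fallback applies. Starting the heading with a letter keeps the rest of the ID intact.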
