Sphinx: Breaking change to Python domain IDs

Created on 11 Mar 2020 · 15Comments · Source: sphinx-doc/sphinx

Describe the bug

Previously, anchors for Python functions were using underscores, #7236 changed this to dashes.

To Reproduce

Document some Python function whose name contains underscores:

.. py:function:: example_python_function(foo)

    Some function.

Expected behavior

This used to create a fragment identifier #example_python_function , but since #7236 this creates #example-python-function.

Your project

This breaks links to python functions when used with nbsphinx: https://nbsphinx.readthedocs.io/en/0.5.1/markdown-cells.html#Links-to-Domain-Objects

Apart from that all links (containing underscores) from external sites to Python API docs created by Sphinx (which I guess are a lot) will break!

bug domain

Source

mgeier

Most helpful comment

Done. Closing.

tk0miya on 29 Mar 2020

🎉2

All 15 comments

Are you sure the old links are broken? While the permalink is indeed using a new ID generation scheme, the old ID should still be attached to the declaration (using <span id="theOldID"></span>).

jakobandersen on 11 Mar 2020

Yes, I changed the style of node_ids in #7236. Therefore, the main hyperlink anchor will be changed in the next release. But old-styled node_ids are still available. So old hyperlinks are still working.

There are some reasons why I changed them. First is that their naming is very simple and getting conflicted often (refs: #6903). Second is the rule of naming is against docutils specification. Last is that it allows sharing one node_ids to multiple names. For example, it helps to represent io.StringIO and _io.StringIO are the same (or they have the same ID).

To improve both python domain and autodoc, we have to change the structure of the domain and the rule of naming IDs. I don't know the change is harmful. But it is needed to improve Sphinx, I believe.

tk0miya on 11 Mar 2020

Thanks for the quick responses!

Are you sure the old links are broken?

I thought so, but links from external sites seem to be not yet broken.

However, it looks like they will be broken at some point in the future, according to this comment:

https://github.com/sphinx-doc/sphinx/blob/f85b870ad59f39c8637160a4cd4d865ce1e1628e/sphinx/domains/python.py#L367-L370

Whether that happens sooner or later, it will be quite bad.

But that's not actually the situation where I found the problem. The ID change is breaking my Sphinx extension nbsphinx because the code assumed that underscores are not changed to dashes:

https://github.com/spatialaudio/nbsphinx/blob/5be6da7b212e0cfed34ebd7da0ede5501549571d/src/nbsphinx.py#L1446-L1467

This causes a build warning (and a missing link) when running Sphinx.
You can reproduce this by building the nbsphinx docs, see https://nbsphinx.readthedocs.io/en/0.5.1/contributing.html.

I could of course change the implementation of nbsphinx (depending on the Sphinx version), but that would still break all the API links my users have made in their notebooks!

And it would break them depending on which Sphinx version they are using, wouldn't that be horrible?

First is that their naming is very simple and getting conflicted often (refs: #6903).

How does changing #example_python_function to #example-python-function help in this case?

Second is the rule of naming is against docutils specification.

Because underscores are not allowed?

I think it would be worth violating the specification for that.
I don't see any negative consequences to this.
And it has worked fine for many years.

Also, having dots and underscores in link to API docs just looks so much more sensible!

Here's an example: https://sfs-python.readthedocs.io/en/0.5.0/sfs.fd.source.html#sfs.fd.source.point_velocity

The link contains the correct Python function name: #sfs.fd.source.point_velocity.

When the ID is changed, this becomes: #sfs-fd-source-point-velocity, which doesn't really make any sense anymore.

Last is that it allows sharing one node_ids to multiple names. For example, it helps to represent io.StringIO and _io.StringIO are the same (or they have the same ID).

I don't understand. Are io and _io not different names in Python?

And what does that have to do with changing underscores to dashes?

mgeier on 12 Mar 2020

First is that their naming is very simple and getting conflicted often (refs: #6903).

How does changing #example_python_function to #example-python-function help in this case?

No, this change is not only replacing underscores by hyphens. New ID Generator tries to generate node_id by following steps; 1) Generate node_id by the given string. But generated one is already used in the document, 2) Generate node_id by sequence number like id0.

It means the node_id is not guess-able by its name. Indeed, it would be almost working fine if we use hyphens and dots for the ID generation. But it will be sometimes broken.

Second is the rule of naming is against docutils specification.

Because underscores are not allowed?

I think it would be worth violating the specification for that.
I don't see any negative consequences to this.
And it has worked fine for many years.

Yes, docutils' spec does not allow to use hyphens and dots in node_id.

I know the current rule for node_id generation is not so wrong. But it surely contains problems. Have you ever try to use invalid characters to the signature? How about multibyte characters?

For example, this is an attacking code for the ID generator:

.. py:function:: "><script>alert('hello sphinx')</script>

I know this is a very mean example and not related to hyphens' problem directly. But our code and docutils do not expect to pass malicious characters as a node_id. I suppose dots and hyphens may not harm our code. But we need to investigate all of our code to prove the safety.

Last is that it allows sharing one node_ids to multiple names. For example, it helps to represent io.StringIO and _io.StringIO are the same (or they have the same ID).

I don't understand. Are io and _io not different names in Python?

And what does that have to do with changing underscores to dashes?

Indeed, io and _io are different names in python interpreter. But please read the python-doc. The latter one is not documented in it. We have some issues to document a python object as "public name" instead of "canonical name". see #4065. It is one of them. This feature is not implemented yet. But I'll do that in the (nearly) future. It tells us the real name of the living object does not match the documentation of it.

As you know, it is not related to hyphens problem. It also conflicts with the hyperlinks which human builds manually. It's no longer guess-able. If we'll keep using dots and hyphens for node_id, the cross-reference feature is needed to create references for nbsphinx, I think.

tk0miya on 15 Mar 2020

How does changing #example_python_function to #example-python-function help in this case?

No, this change is not only replacing underscores by hyphens. New ID Generator tries to generate node_id by following steps; 1) Generate node_id by the given string. But generated one is already used in the document, 2) Generate node_id by sequence number like id0.

OK, that sounds great. So what about doing that, but also allow underscores (_) and dots (.)?

I know the current rule for node_id generation is not so wrong. But it surely contains problems. Have you ever try to use invalid characters to the signature? How about multibyte characters?

OK, I understand that it might be problematic to allow arbitrary characters/code points.

But what about just adding _ and . to the allowed characters?

For example, this is an attacking code for the ID generator:
.. py:function:: "><script>alert('hello sphinx')</script>

I'm not sure if that's really a problematic case, because the attack would have to come from the document content itself. I'm not a security specialist, so I'm probably wrong.

Anyway, I'm not suggesting to allow arbitrary characters.

_ and . should be safe from a security standpoint, right?

Last is that it allows sharing one node_ids to multiple names. For example, it helps to represent io.StringIO and _io.StringIO are the same (or they have the same ID).

I don't understand. Are io and _io not different names in Python?
And what does that have to do with changing underscores to dashes?

Indeed, io and _io are different names in python interpreter. But please read the python-doc.

I can see that those are not the same name.

What IDs are those supposed to get?

IMHO it would make perfect sense to give them the IDs #io and #_io, respectively, wouldn't it?

The latter one is not documented in it. We have some issues to document a python object as "public name" instead of "canonical name". see #4065. It is one of them. This feature is not implemented yet. But I'll do that in the (nearly) future. It tells us the real name of the living object does not match the documentation of it.

I don't really understand any of this, but would it make a difference if underscores (_) were allowed in IDs?

As you know, it is not related to hyphens problem. It also conflicts with the hyperlinks which human builds manually. It's no longer guess-able. If we'll keep using dots and hyphens for node_id, the cross-reference feature is needed to create references for nbsphinx, I think.

I don't understand.
Do you mean I should create my own custom IDs in nbsphinx and overwrite the ones generated by Sphinx?
I guess I will have to do something like that if you are not modifying the way IDs are generated by Sphinx.
I could probably do something similar to https://github.com/spatialaudio/nbsphinx/blob/559fc4e82bc9e2123e546e67b8032643c87cfaf6/src/nbsphinx.py#L1384-L1407.

I do understand that IDs should be unique per HTML page, and I don't mind if the second (and third etc.) duplicate is re-written to #id0 etc., but I would really like to have readable and understandable IDs (for Python API links) for the case that there are no duplicate IDs (and for the first one even if there are duplicates). And (probably more importantly?) I would like to avoid too much breakage in the projects of nbsphinx users.

mgeier on 18 Mar 2020

OK, I understand that it might be problematic to allow arbitrary characters/code points.
But what about just adding _ and . to the allowed characters?

Surely, I don't think _ and . will not cause the problem, as you said. But are you satisfied if I agree to support _ and .? How about capital characters too. How about the latin-1 characters? If you need to change the charset, we need to define the allowed charset for node IDs (and confirm they are safe as possible). Is it okay to support only _ and .?

I don't understand.
Do you mean I should create my own custom IDs in nbsphinx and overwrite the ones generated by Sphinx?
I guess I will have to do something like that if you are not modifying the way IDs are generated by Sphinx.

So far, a node_id of a python object had been the same as its name. Since Sphinx-3.0, it will be changed. The new implementation is almost the same as other domains do for the cross-references.

To realize the new cross-reference feature, we use a "reference name" and location info. A reference name equals the name of the object. For example, io, io.StringIO, int, MyClass, MyClass.meth and so on. A location info is a pair of a docname and a node ID.

On building a document, the python domain goes the following steps:

Collects definitions of python objects written by their directives (ex. py:function::, py:class::, etc.) and creates a mapping from the reference-names to the location info.
Every time find cross-reference roles, look up the location info from the mapping using the reference name of the role. For example, when a role :py:mod:`io` found, the python domain look up the location info by the reference name; io. If succeeded, the role is converted to a reference to the specified node in the document.
Renders the references to the output in the arbitrary format.

This means generating URL manually is not recommended. The node_id is not guess-able because it is sometimes auto-generated (ex. id0). I still don't think about how to implement io and _io case. But it will obey the structure above.

Note: docutils spec says node_id should be starts with a letter [a-z]. So _io will be converted into io (if not already registered).

tk0miya on 20 Mar 2020

It's a tricky issue, but I think it would be good to be a bit more permissive on the IDs, and ignore the docutils spec a bit as it is not enforced anyway.

My reasoning behind the ID generation for, primarily the C++ domain, but also the C domain:

IDs should be as stable as possible, as they may be used in external links.
That means, if one permutes declarations, then it should have no effect on their generated IDs. This rules out having counters.
The extreme cases are for example singlehtml and Latex where IDs need to be unique throughout the whole document, and not just per page.
Domains are independent and users may implement their own, so all IDs of a domain should have a unique prefix (as best as possible*). E.g., currently c. for C and _CPPv4 for C++.
Changes are sometimes needed to the ID generation scheme, so nodes should also be assigned their old IDs, though on a best-effort basis. If this gets out of hand we can deprecate the oldest IDs.

For most languages/domains a unique ID can be made up by identifiers (i.e., something like [a-zA-Z_][a-zA-Z_0-9]*), and then a scope separator (e.g., . in Python and C).
The docutils restrictions/guidelines (https://docutils.sourceforge.io/docs/ref/rst/directives.html#identifier-normalization) seems to be based on "old" technologies, HTML 4.1 and CSS 1. Though it doesn't look like this normalisation is applied to IDs coming from Sphinx. Anyway, HTML 5 loosened the restrictions: https://mathiasbynens.be/notes/html5-id-class
Therefore, if we use restrict IDs to non-empty strings matching [A-Za-z0-9_.]* I believe we can guarantee uniqueness almost always. Optionally, we can map . into - to make CSS selection easier (i.e., no escaping), though has this been a problem so far?
Regarding *, domain ID prefixes: if we get really strict on this, we can require domains to register their primary ID prefix they use and enforce uniqueness.

jakobandersen on 20 Mar 2020

I can accept adding . and _ to the allowed character list. I suppose it will harm nobody.

The docutils restrictions/guidelines (https://docutils.sourceforge.io/docs/ref/rst/directives.html#identifier-normalization) seems to be based on "old" technologies, HTML 4.1 and CSS 1. Though it doesn't look like this normalisation is applied to IDs coming from Sphinx. Anyway, HTML 5 loosened the restrictions:

Unfortunately, we still support HTML4 technology... Especially HTMLHelp builder depends on it. I don't know how many users use it. But bug reports are sometimes filed. So it seems used.

So I hesitate to use . and _ to the first character.

Optionally, we can map . into - to make CSS selection easier (i.e., no escaping), though has this been a problem so far?

Unfortunately, I don't know. To be exact, I've never seen the CSS class converted from node ID.

tk0miya on 21 Mar 2020

I made a PR #7356 to allow . and _ in node id generation. It seems nobody objects to support them for node ids. So I'll merge it soon.

tk0miya on 22 Mar 2020

Now I merged #7356 for beta release. But reopened to continue this discussion. Please check it and give me comments. I'd like to refine it before 3.0 final (if needed).

tk0miya on 22 Mar 2020

Thanks! As a last thing I believe it is ok to allow capital letters as well, which would make it much easier to guarantee uniqueness. Maybe I just missed the docutils rationale for converting to lowercase, but I haven't seen any, and I don't know of any output format where IDs are not case sensitive.

jakobandersen on 23 Mar 2020

Thanks @tk0miya, you are the best!

I agree with @jakobandersen that keeping capital letters would be great.

For example, right now both these links work:

... but only the first one creates the yellow highlighting.
It's not necessary to have both, I think the lowercase target should be removed.

The corresponding [source] link respects uppercase, which is good: https://www.sphinx-doc.org/en/master/_modules/sphinx/application.html#TemplateBridge

And the [docs] link back from there also respects uppercase, which is also good: https://www.sphinx-doc.org/en/master/extdev/appapi.html#sphinx.application.TemplateBridge

Now the last missing piece would be to remove the lower-case link target and replace it with the correct case.

How about capital characters too.

Yes, please!

How about the latin-1 characters?

I guess we'll have to draw the line somewhere. Python allows many Unicode characters in identifiers, but AFAICT, in practice most people still use only ASCII letters (lower and upper case) and numbers. And, very importantly, underscores. And the dot (.) is used for namespacing, also very important.

I don't care about any other latin-1 or Unicode characters.

mgeier on 24 Mar 2020

@jakobandersen @mgeier Thank you for comment. I agree to use capital characters also.

After applying #7374, we can use node_ids that matches with [a-zA-Z][a-zA-Z0-9._-]*.
The difference from docutils' is:

Allow to use capital alphabet characters
Allow to use dots (".") and underscores ("_") for an identifier without a leading character.

Does this rule make sense?

tk0miya on 24 Mar 2020

👍1

Does this rule make sense?

Yes, looks good to me. Thanks!

jakobandersen on 24 Mar 2020

Done. Closing.

tk0miya on 29 Mar 2020

🎉2

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Traceback when running sphinx-build because SuppressWarning can use Sphinx.config before it's set

ewjoachim · 3Comments

`brew install python` fails

ubershmekel · 3Comments

Sphinx 1.4.5 fails to build API documentation for a Flask application

MichiK · 3Comments

stub-columns have no effect in LaTeX output

jessetan · 3Comments

Building another locale with a non-captioned toctree produces `None` captions

miketheman · 3Comments