Hugo: Incomplete auto heading ID for extended Latin and CJK characters

Created on 14 Dec 2019  路  9Comments  路  Source: gohugoio/hugo

What version of Hugo are you using (hugo version)?

$ hugo version
Hugo Static Site Generator v0.61.0/extended darwin/amd64 BuildDate: unknown

Does this issue reproduce with the latest release?

Yes

Issue description

The following is the status of auto heading ID generation since Hugo v0.60.0.

  • Only ASCII alphanumeric (one-byte) characters are preserved, and extended latin characters (2 bytes) and other international characters (3 bytes) are discarded.
  • Only a small set of punctuations (hyphen, underscore, and slash) are taken into account. As a result, the IDs for headings with other punctuations become less readable or identifiable.

This proved to be caused by goldmark, but the author himself made it clear that it's no issue on goldmark and he won't change the default logic, as stated in the following issues and PR:

Possible solution

So I've wrote an extension for better auto heading ID generation.

I've tested with random heading strings including the ones from Kubernetes site (good place to test against former Hugo version + plenty of multi-language strings), and it appears to generate auto heading IDs on a par with Blackfriday.

I hope Hugo maintainers take a look at the extension and consider adopting it to resolve the issue.

Bug Markdown

Most helpful comment

@bep , many thanks for implementing the ID generation code to resolve this issue. I've tested with the latest master branch and verified that the first issue stated has been resolved.

  • (Resolved) Only ASCII alphanumeric (one-byte) characters are preserved, and extended latin characters (2 bytes) and other international characters (3 bytes) are discarded.

However, it appears that there are some loose ends regarding the other issue below:

  • Only a small set of punctuations (hyphen, underscore, and slash) are taken into account. As a result, the IDs for headings with other punctuations become less readable or identifiable.

On previous versions of Hugo, period and apostrophe were maintained. As a sample, you can test with the following examples:
https://kubernetes.io/docs/setup/release/notes/#downloads-for-v1-17-0
https://kubernetes.io/docs/setup/release/notes/#what-s-new-major-themes

There seems to be no set rule for how to handle punctuation, but it would be ideal if the new code can behave on a par with Blackfriday's, so that all the existing outgoing links to anchors in Hugo sites can remain accessible.

P.S. It's regrettable that I've made the subject of this issue too narrow to highlight the punctuation issue.

All 9 comments

Hey, and thanks for investigating this issue. I saw some ID diffs when testing this, but I kind of closed my eyes as there was so much to do...

I also forgat that we have an anchorize function that is supposed to create compatible IDs:

https://github.com/gohugoio/hugo/blob/a03c631c420a03f9d90699abdf9be7e4fca0ff61/tpl/urls/urls.go#L93

What would be great is if you could:

  1. Copy the blackfriday function above into your extension. It's MIT and that is fine, but keep a link back to the source so people know its origin.
  2. Export it so Hugo can use it in the anchorize func above.

It would be good if you also stated in your README that this "id format" is Blackfriday compatible, so to speak.

@bep : If possible, I'd like to keep the code intact. I think the implementation is functionally equivalent to the algorithm used in Blackfriday (and other library mentioned in its package documentation, not sure which one is original).

Could you provide counter example strings or reasons (performance, etc.) that I have to replace the code with Blackfriday's?

As you are probably aware, the logic of Blackfriday or the extension code is at the very basic level and cannot satisfy more serious needs (e.g. c++ becomes c by both) and my hope is that somebody adds real extended features, such as the slugify I mentioned in the forum.

My point: for any regressions (that are being reported to goldmark, such as incomplete SmartyPants processing), should we borrow code directly from Blackfriday?

And for your second request,
Export it so Hugo can use it in the anchorize func above.
I doubt the function within the extension can be used as-is for Hugo's Anchorize function because it's specific to goldmark's IDs interface. Could you explain how it can be? (FYI, I'm still learning golang)

Could you provide counter example strings or reasons (performance, etc.) that I have to replace the code with Blackfriday's?

The thing is, that would take time for me, so in reality, we wouldn't know until someone complains. I understand if you don't want to pull someone else's code into your project, it was just a suggestion.

@bep : I understand your concern. If auto heading ID and anchorize function must produce the same output, those two should rely on the same logic.

Judging from the commit, the "other library" is the original and Blackfriday used its code.
And it's quite evident that its algorithm is computationally better than the extension code that uses three strings function calls for each heading. So if they are functionally equivalent (presumably), it's desirable to use better logic.

So, how would we go about this? Shall we import the original library or copy its code to make an exported function?

I'll follow your suggestion in general, as you're the expert in Golang and Hugo.

Hi @bep , could we make some progress on this issue?
Let me know which way you prefer to go among the suggestions in the comment above.
And also let me know if there's anything else that I need to augment on the extension (for example, go module).

@bep , many thanks for implementing the ID generation code to resolve this issue. I've tested with the latest master branch and verified that the first issue stated has been resolved.

  • (Resolved) Only ASCII alphanumeric (one-byte) characters are preserved, and extended latin characters (2 bytes) and other international characters (3 bytes) are discarded.

However, it appears that there are some loose ends regarding the other issue below:

  • Only a small set of punctuations (hyphen, underscore, and slash) are taken into account. As a result, the IDs for headings with other punctuations become less readable or identifiable.

On previous versions of Hugo, period and apostrophe were maintained. As a sample, you can test with the following examples:
https://kubernetes.io/docs/setup/release/notes/#downloads-for-v1-17-0
https://kubernetes.io/docs/setup/release/notes/#what-s-new-major-themes

There seems to be no set rule for how to handle punctuation, but it would be ideal if the new code can behave on a par with Blackfriday's, so that all the existing outgoing links to anchors in Hugo sites can remain accessible.

P.S. It's regrettable that I've made the subject of this issue too narrow to highlight the punctuation issue.

This issue is closed. If you still think there are things to do here, raise another issue.

One quick note: Hugo's main strategy in this department is to create GitHub compatible anchor links. I didn't find their spec/implementation (I did not look too hard), but I'm pretty sure that the implementation is correct.

So with that in mind, if you find some cases where the anchors differ from GitHub, then I'll look at it.

This situation isn't ideal. I can add a note that I did ask for the Blackfriday implementation (and tests) in a PR related to this, but didn't get much response, so I implemented my own. I may consider adding "Blackfriday anchors" as an option, but that needs some real convincing for me to spend time on.

If you really want to help this situation, go get some movement in https://talk.commonmark.org/t/anchors-in-markdown/247 -- where people have been washing their hands for the last six years. Which is why we have this situation with many different auto ID implementations, some that support the {#anchorName} syntax (Hugo does, GitHub does not) ...

Hugo has now taken a stand in this: GitHub is a big player and has set the standard in other Markdown related issues (tables etc.), so it is natural to also follow their lead in anchor IDs.

@bep, I appreciate your explanation and https://github.com/gohugoio/hugo/issues/6707.
I see some issue for whitespace handling for auto ID generation, so I'll submit another issue as you suggested.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

MunifTanjim picture MunifTanjim  路  3Comments

geddski picture geddski  路  3Comments

vielmetti picture vielmetti  路  3Comments

bep picture bep  路  3Comments

mumblecrunch picture mumblecrunch  路  3Comments