Hugo: Add a "no UTF-8 stripping URL" option

Created on 31 Oct 2017 · 7 Comments · Source: gohugoio/hugo

I am working with Chinese content (UTF-8). Most of the time Hugo generates the right URL, but sometimes it strips certain Chinese characters from the URL.

Some examples of these characters are:

  • 〇
  • ○
  • 〡
  • 〤
  • 〢
  • ⺮
  • 〣

When generating a page for each of these characters, e.g. example.com/post/〇, Hugo produces an empty path segment: example.com/post//.

Steps

To reproduce the bug, add

slug: "foo〇○〡〤〢⺮〣21三bar"

to the front matter of any page. Hugo will then generate the following stripped path:

http://localhost:1313/post/foo21三bar/

removing 〇○〡〤〢⺮〣.
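
The stripped result is consistent with a filter that keeps only runes Unicode classifies as letters, decimal digits or combining marks. The Go sketch below illustrates that assumption; it is not Hugo's actual code, and `keepLettersDigitsMarks` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// keepLettersDigitsMarks is a hypothetical stand-in for a "unicode sanitize"
// step. It is NOT Hugo's implementation; it only assumes that runes outside
// the letter, decimal-digit and mark categories are dropped.
func keepLettersDigitsMarks(s string) string {
	var b strings.Builder
	for _, r := range s {
		if unicode.IsLetter(r) || unicode.IsDigit(r) || unicode.IsMark(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	slug := "foo〇○〡〤〢⺮〣21三bar"
	fmt.Println(keepLettersDigitsMarks(slug))

	// Show how Go's unicode package classifies each reported character,
	// with 三 included for comparison.
	for _, r := range "〇○〡〤〢⺮〣三" {
		fmt.Printf("%c U+%04X letter=%t digit=%t number=%t symbol=%t\n",
			r, r, unicode.IsLetter(r), unicode.IsDigit(r),
			unicode.IsNumber(r), unicode.IsSymbol(r))
	}
}
```

Under this assumption, 〇, 〡, 〢, 〣 and 〤 fall in the "letter number" category (Nl) and ○ and ⺮ are symbols (So), so none of them pass the filter, while 三 is an ordinary letter (Lo) and is kept.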

*Tested with the latest Hugo release: Hugo Static Site Generator v0.30.2 linux/amd64 BuildDate: 2017-10-19T08:34:27-03:00, OS: 4.10.0-37-generic #41-Ubuntu SMP Fri Oct 6 20:20:37 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux Ubuntu 17.04*

(x-post: stackoverflow.com, forum)

Enhancement

All 7 comments

@marcanuy I'm reopening this. I quoted you a part of the comment describing the current behaviour. I'm sure the original motivation for this "unicode sanitize" was good and founded in file system support or something (that function precedes my time on Hugo).

So we cannot just change that behaviour; that would break lots of sites. But we could consider adding some "no URL sanitize whatsoever" option.

Great, a configuration flag to avoid it would be really helpful, especially for SEO purposes.

Am I right that this issue is about the same IRI/IRL (Internationalized Resource Identifier/Locator) support as this forum topic https://discourse.gohugo.io/t/bug-feature-hugo-wrong-support-non-acsii-symbols-in-url/8375 and closed issue #3039?

It would be great to avoid converting valid UTF-8 IRIs into percent-encoded URIs, for at least two reasons:

  1. User-friendly links on non-ASCII, non-English or multilingual sites (though this also depends on the browser).
  2. Readable diffs of generated HTML files in git commits, and easier review, support and debugging of code changes. It's impossible to visually decode and understand links in HTML like this: href="/%D0%BA%D0%BE%D0%BD%D1%82%D0%B0%D0%BA%D1%82%D1%8B/">

And probably better SEO too.

A simple option like EnableIRI (false by default) would be great!

Looking forward to the unicode sanitize option,
or add a urldecode function that transforms %e5%a5%bd to 好

A workaround to disable encoding of UTF-8 urls:
<img {{ printf "src='%s%s'" .Site.BaseURL .imageUrl | safeHTMLAttr }} >

This issue has been automatically marked as stale because it has not had recent activity. The resources of the Hugo team are limited, and so we are asking for your help.
If this is a bug and you can still reproduce this error on the master branch, please reply with all of the information you have about it in order to keep the issue open.
If this is a feature request, and you feel that it is still relevant and valuable, please tell us why.
This issue will automatically be closed in the near future if no further activity occurs. Thank you for all your contributions.

This is still relevant. Another small benefit of this feature (in addition to those listed above) would be a smaller generated HTML page when it contains a lot of links. Percent-encoding adds a lot of extra bytes.
