Html: Add &nnbsp; entity for U+202F

Created on 3 Dec 2019  ·  11 Comments  ·  Source: whatwg/html

There's &nbsp; for U+00A0. It's a full-width no-break space. It can be used between numbers and their short unit names, or in other places.

Typography and regional norms require (or at least recommend) using a thin no-break space (or narrow no-break space) in several places:

  • As thousands separator, Source or DIN 5008 (to avoid ambiguous presentation of point or comma)
  • Between abbreviated words like “z. B.” (German: zum Beispiel), Source
  • As fine space before certain punctuation in French, Source
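As a rough illustration of the first use case in the list above, here is a minimal Python sketch that groups digits with U+202F. The function name `group_thousands` is made up for this example, and the grouping in threes is an assumption that fits the cited conventions but not every locale.

```python
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE


def group_thousands(value: int) -> str:
    """Format an integer with U+202F between groups of three digits."""
    # Python's "," format spec groups digits in threes; swap the comma for NNBSP.
    return f"{value:,}".replace(",", NNBSP)


# The separators in the result are narrow no-break spaces, not ordinary spaces.
print(group_thousands(1234567))
```

The output looks like it contains ordinary spaces, which is exactly the readability problem described in the rest of this post.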

(These are the first and best sources I could find now. There may be better or more authoritative sources available, but they're usually hard to find.)

While it is technically possible to create a keyboard layout that produces this character, not many users have this installed and even then it's hard to distinguish it from other space characters when reading and revising text. Most editors don't even show a replacement symbol for this space character.
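One possible workaround when revising text is to make these characters visible with a small script. The sketch below is only an illustration: the set `INVISIBLE` and the function `reveal_spaces` are hypothetical names, and the selection of characters is an assumption, not a complete list.

```python
import unicodedata

# A few space-like characters that render as blank (or not at all) in most editors.
INVISIBLE = {"\u00a0", "\u202f", "\u2009", "\u200b"}


def reveal_spaces(text: str) -> str:
    """Replace hard-to-see space characters with a visible [U+XXXX NAME] tag."""
    out = []
    for ch in text:
        if ch in INVISIBLE:
            out.append(f"[U+{ord(ch):04X} {unicodedata.name(ch)}]")
        else:
            out.append(ch)
    return "".join(out)


print(reveal_spaces("10\u202fkm or 10\u00a0km"))
```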

AFAIK Wikipedia suggests writing &nbsp; in these places. And that's probably a good idea in team projects as well. But this is actually the wrong character in these places.

To use the correct narrow no-break space, one has to use a different HTML entity representation, like &#8239; or &#x202F;, which are frankly hard to remember or recognise.
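The numeric character references for U+202F follow directly from its code point. This short Python sketch derives the decimal and hexadecimal forms and checks, using the standard library's `html.unescape`, that both decode to the same character. The variable names are only illustrative.

```python
import html

NNBSP = chr(0x202F)  # NARROW NO-BREAK SPACE

decimal_ref = f"&#{ord(NNBSP)};"   # decimal numeric character reference
hex_ref = f"&#x{ord(NNBSP):X};"    # hexadecimal numeric character reference

# Both forms decode to the same single character.
assert html.unescape(decimal_ref) == html.unescape(hex_ref) == NNBSP

print(decimal_ref, hex_ref)
```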

As a solution, the new entity &nnbsp; should be added to HTML to make it easy to write readable text following the correct typographic rules and recommendations.

addition/proposal i18n-alreq i18n-amlreq i18n-mlreq i18n-tracker needs implementer interest parser

All 11 comments

If a new entity is added, the effort should be coordinated with MathML to keep the entity definitions synchronized -- https://w3c.github.io/xml-entities/

Mozilla is not interested in this. I guess that's a bad starting point already? I don't have the best experiences with the Chrome developers, maybe I'll try it there anyway.

Unfortunately, entities are not extensible in HTML, so I can't even run my own little happy solution.

If the HTML standard evolves, Mozilla and others must follow the new specification; that much is evident.

I'm currently interested in having an &nnbsp; entity, or an equivalent, for a French wiki project, as the narrow non-breaking space is recommended in some cases, as explained by ygoe.

Furthermore, HTML entities exist for numerous characters that are, in my opinion, almost never used, like ≺ and such.

In my opinion this would be extremely useful for French authors, but also for other languages. The NNBSP character was initially added to Unicode for Mongolian suffix handling, where it is important to visually distinguish between spaces separating suffixes and those separating words. It is also being proposed as an ideal fit for a morphological separator in the numerous languages written in the Canadian Aboriginal script (see https://github.com/w3c/amlreq/issues/4). An entity would significantly help authors produce correct (and better machine-readable) text in all these languages.

[@annevk could you add i18n-mlreq and i18n-amlreq labels to the repo, so i can alert those folks to the discussion? Thanks.]

Here is an extension of this issue, which i can raise in a new issue if preferred.

There are other invisible characters for which a named character reference would be very useful for producing correctly authored Unicode text, for the same reasons as mentioned in the first comment. Here, for example, is a list of formatting characters used for Arabic, but most are essential characters for all RTL script-based languages.

Characters with entities:

&zwj; (ZWJ, U+200D)
&zwnj; (ZWNJ, U+200C)
&rlm; (RLM, U+200F)
&lrm; (LRM, U+200E)

Characters without entities:
RLI
LRI
FSI
PDI
RLE
LRE
PDF
RLM
LRM
CGJ
ALM

Keyboards generally don't address the problem of inputting the characters, but it's also a problem that the characters themselves are invisible. It would really help to have Named character references. As someone who works with people who use these languages, and works with them myself, it seems to me that from a user's perspective it would be well worth the effort to add them. I don't remember why that hasn't happened before now.
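For reference, the abbreviations in the lists above can be mapped to code points and checked against an existing entity table. The sketch below uses the `html.entities.html5` table from Python's standard library as a stand-in for the HTML named character reference list. The dictionary `FORMATTING_CHARS` and the helper `reference_for` are names invented for this example. By this lookup, RLM and LRM resolve to &rlm; and &lrm;, while the isolates and embeddings fall back to numeric references.

```python
import html.entities
import unicodedata

# Abbreviations from the lists above, with their Unicode code points.
FORMATTING_CHARS = {
    "ZWJ": 0x200D, "ZWNJ": 0x200C, "RLM": 0x200F, "LRM": 0x200E,
    "RLI": 0x2067, "LRI": 0x2066, "FSI": 0x2068, "PDI": 0x2069,
    "RLE": 0x202B, "LRE": 0x202A, "PDF": 0x202C,
    "CGJ": 0x034F, "ALM": 0x061C, "NNBSP": 0x202F,
}

# Reverse map from a single character to one of its named references.
NAMED = {}
for name, value in html.entities.html5.items():
    if name.endswith(";") and len(value) == 1:
        NAMED.setdefault(value, "&" + name)


def reference_for(codepoint: int) -> str:
    """Return a named reference if the table has one, else a hex numeric one."""
    return NAMED.get(chr(codepoint), f"&#x{codepoint:X};")


for abbr, cp in FORMATTING_CHARS.items():
    print(f"{abbr:5} U+{cp:04X} {reference_for(cp):10} {unicodedata.name(chr(cp))}")
```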

(New labels are to be introduced through https://github.com/whatwg/meta.)

I just filed https://github.com/whatwg/meta/issues/182

I believe I've commented previously along the following lines when this has come up:

  1. For wiki projects, it's irrelevant whether this is in HTML. The wiki software processes the wiki syntax before generating HTML output, so wiki software can introduce whatever macro expansions its developers see fit and users find useful.
  2. In the case of HTML itself, I think the backward-compatibility characteristics of this feature request are bad. The requested feature doesn't expand the expressiveness of HTML in any way: You can already express U+202F unescaped in UTF-8 or escaped as a numeric reference. However, if a named entity were added, it would break in the currently-existing HTML parsers (not only in the currently-existing browsers). This could lead either to unwanted breakage or to non-usage of the feature (i.e. using the numeric form or unescaped UTF-8 _anyway_ for better compat).
  3. Making _this_ change would set a precedent for others to request named entities for characters they find important causing a repeat of the previous point over and over again.
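Points 1 and 2 can be illustrated together. In the sketch below, Python's `html.unescape` stands in for an existing parser that does not know the name: it passes the reference through as literal text. A preprocessing step, in the spirit of the macro expansion wiki software could do, rewrites the name to a numeric reference that any parser decodes. The table `CUSTOM_ENTITIES` and the function `expand_custom_entities` are hypothetical and not part of HTML or of any library.

```python
import html
import re

# Hypothetical project-local names mapped to code points; not part of HTML.
CUSTOM_ENTITIES = {"nnbsp": 0x202F}


def expand_custom_entities(source: str) -> str:
    """Rewrite custom named references to numeric ones before HTML output."""
    def repl(match: re.Match) -> str:
        codepoint = CUSTOM_ENTITIES.get(match.group(1))
        # Leave any name we do not know untouched.
        return match.group(0) if codepoint is None else f"&#x{codepoint:X};"

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, source)


raw = "10&nnbsp;km"

# A parser without the entity passes the reference through as literal text.
assert html.unescape(raw) == "10&nnbsp;km"

# After expansion, an existing parser decodes it to U+202F.
assert html.unescape(expand_custom_entities(raw)) == "10\u202fkm"
```

Standard references such as &amp; are left alone by the expansion step, since only names in the custom table are rewritten.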

Curious to hear what others think, but I tend to agree. Perhaps the best course of action here would be to update https://github.com/whatwg/html/blob/master/FAQ.md and close this type of feature request.

@hsivonen I think what makes this request a bit different from others is that it's for invisible characters. As @r12a points out, it's hard to work with invisible characters. And letting wiki markup handle it isn't helpful at all: this is something that needs to work across all input modes into HTML, because it has to be reliable and consistent to be useful to the people who need them.

So while I understand your general premise about the update cycle being, potentially, 5 years or so, I think it's worth it in this case. If we want to take the time to batch up all the invisible characters we need to care about so we can do it at once, let's do that and make a coordinated update to the parser that makes languages that need invisible characters easier to typeset in HTML.

What wikis or any other applications do is entirely irrelevant here. And following @hsivonen's argument, any progress is bad. So why care at all? Just leave it forever as it was defined some 30 years ago. Never change a running system (which is generally bad advice).

I'm fully aware that not all existing HTML parsers and renderers will properly handle this overnight when it's added. It'll take time. But we're in the fortunate (and also unfortunate) situation that the number of relevant HTML parsers in use is very limited, and these are actively maintained and automatically updated most of the time. So changes like this will eventually trickle through to all users and in a few years we can benefit from it without worrying too much. If you're not willing to wait such a long time, you shouldn't work in such projects. Web projects already have a large number of dependencies on browsers and this could be just one of them. As soon as you discover that all browsers that support everything else you already need also support this entity, you can safely use it.

Also, of course I can use any Unicode character directly. But this one hasn't made it onto physical or software-defined keyboards, just like NBSP, SHY, or MINUS, so that argument is moot. And of course I can escape any Unicode character by its code point value. But nobody will remember those numbers, which means that 1. nobody will be able to fluently write these characters and 2. nobody will be able to fluently read and understand them. This is about as big a usability failure as it gets. Then, we already have similar entities, like NBSP. Why do they exist? I imagine they exist because they cannot be typed on keyboards, their code points cannot be remembered, NBSP in particular is visually indistinguishable from a more common character (SP), and its use is sometimes required.

While not being strictly "required" and not used as often, NNBSP falls exactly in the same category. So I definitely see reason for its existence as an entity. On the other hand, it doesn't hurt anybody. Any undefined HTML entity is invalid markup, and the "nnbsp" entity is undefined, so it can safely be assigned. As could other invisible Unicode whitespace, like some zero-width characters that affect wrapping and/or hyphenation.

> But this one hasn't made it onto physical or software-defined keyboards.

Why is that?
