Design: Unicode normalization

Created on 30 Jan 2017 · 6 Comments · Source: WebAssembly/design

Forking from #970: JS.md talks about Unicode normalization.

A simple example from that document: my name "Jean-François Bastien" can be normalized two ways, with the cedilla either precomposed or written as a separate combining mark:

Ç ↔ C+◌̧

This is a nice gotcha in Unicode. While interfacing between JS and wasm it would be good to know what to expect from producers and consumers. We may choose not to normalize, but we should say so.
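To make the gotcha concrete, here is a minimal sketch in JavaScript. The variable names are made up for illustration; the only API used is the standard `String.prototype.normalize`. It shows that the two spellings compare as different strings until both are normalized to the same form.

```javascript
// Two spellings of the same visible name.
const precomposed = "Jean-Fran\u00E7ois"; // ç as the single code point U+00E7
const decomposed = "Jean-Franc\u0327ois"; // c followed by U+0327 COMBINING CEDILLA

// They render identically but are different strings of different lengths.
const sameRaw = precomposed === decomposed; // false
const lengths = [precomposed.length, decomposed.length]; // [13, 14]

// Normalizing both sides to the same form (NFC here) makes them compare equal.
const sameNFC = precomposed.normalize("NFC") === decomposed.normalize("NFC"); // true

console.log({ sameRaw, lengths, sameNFC });
```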

I see 4 ways in which we can discuss normalization in JS.md:

1. Forbid / error / ignore non-normalized inputs.
2. Automatically normalize wasm inputs, and compare the normalized JS values.
3. Automatically normalize wasm inputs, and expect the user to call normalize.
4. Not normalize anything.

If we choose 2. or 3. we should specify which form of normalization we expect (because of course there are multiple forms of normalization).
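As a rough illustration of why the form would have to be named, the sketch below runs one sample string through the four forms that JavaScript's `String.prototype.normalize` accepts (NFC, NFD, NFKC, NFKD). The sample string and the `codePoints` helper are assumptions chosen for the example, not anything taken from JS.md.

```javascript
// Sample input: precomposed ç (U+00E7) followed by the "fi" ligature (U+FB01).
const input = "\u00E7\uFB01";

// Hypothetical helper: list the code points of a string as U+XXXX labels.
function codePoints(s) {
  return Array.from(s, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Apply each normalization form and record the resulting code points.
const results = {};
for (const form of ["NFC", "NFD", "NFKC", "NFKD"]) {
  results[form] = codePoints(input.normalize(form));
}

console.log(results);
// NFC:  U+00E7 U+FB01               (composed, ligature kept)
// NFD:  U+0063 U+0327 U+FB01        (decomposed, ligature kept)
// NFKC: U+00E7 U+0066 U+0069        (composed, ligature expanded to "fi")
// NFKD: U+0063 U+0327 U+0066 U+0069 (decomposed, ligature expanded)
```

The four outputs all differ, so a producer and a consumer that each normalize, but to different forms, could still disagree on a name.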


jfbastien

All 6 comments

1, 2, and 3 seem like good sources of esoteric bugs in JS engines.

I vote 4.

RyanLamansky on 30 Jan 2017

I think the convertToJSString function in Web.md#names already specifies 4. Seems fine to add clarifying text to say that no normalization occurs, though.

lukewagner on 30 Jan 2017

👍4
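A small sketch of what option 4 could look like from the JS side, under stated assumptions: `decodeName` is a hypothetical stand-in for a plain UTF-8 decode (it is not the `convertToJSString` definition from Web.md), and the import object and names are invented for the example. It relies only on the standard `TextEncoder` and `TextDecoder`, which decode bytes without normalizing them.

```javascript
// Hypothetical helper: decode a UTF-8 name exactly as written, with no normalization.
function decodeName(bytes) {
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

// Name bytes as a producer might emit them, using the precomposed ç (U+00E7).
const nameInModule = new TextEncoder().encode("fran\u00E7ois");

// Import object whose key was typed in decomposed form (c + U+0327).
const importObject = { "franc\u0327ois": () => "decomposed" };

const key = decodeName(nameInModule);

// With no normalization, the exact-match lookup misses.
const foundExact = Object.prototype.hasOwnProperty.call(importObject, key); // false

// A user who wants a match has to normalize both sides explicitly.
const foundAfterNormalize = Object.keys(importObject).some(
  (k) => k.normalize("NFC") === key.normalize("NFC")
); // true

console.log({ foundExact, foundAfterNormalize });
```

Under this reading, names are compared exactly as decoded, and any normalization is left to the user.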

Agreed with @lukewagner.

rossberg on 31 Jan 2017

FWIW, CSS doesn't normalize at all either.

tabatkins on 4 Feb 2017

Yeah, nothing in the web platform uses Unicode normalization, other than string.normalize() in JavaScript and IDNA in URLs. 4 is definitely what you want here.