Design: Unicode normalization

Created on 30 Jan 2017 · 6 Comments · Source: WebAssembly/design

Forking from #970: JS.md talks about Unicode normalization.

A simple example from that document: my name "Jean-François Bastien" can be normalized two ways with

Ç ↔ C+◌̧
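To make the gotcha concrete, here is a minimal sketch in plain JavaScript. The variable names are illustrative only; the two string literals spell the same visible text, once with the precomposed ç (U+00E7) and once with c followed by U+0327 COMBINING CEDILLA.

```javascript
// Two encodings of the same visible text (variable names are illustrative).
const precomposed = "Jean-Fran\u00E7ois"; // ç as the single code point U+00E7
const decomposed = "Jean-Franc\u0327ois"; // c followed by U+0327 COMBINING CEDILLA

// A plain comparison works on code units, so the two strings differ.
const sameCodeUnits = precomposed === decomposed;

// After normalizing both sides to the same form, they compare equal.
const sameAfterNFC =
  precomposed.normalize("NFC") === decomposed.normalize("NFC");

// The lengths differ as well, because the decomposed form has one extra code unit.
const lengths = [precomposed.length, decomposed.length];

console.log(sameCodeUnits, sameAfterNFC, lengths);
```

Without an explicit `normalize` call, the two spellings are simply different strings as far as JavaScript is concerned.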

This is a nice gotcha in Unicode. When interfacing between JS and wasm, it would be good to know what to expect from producers and consumers. We may choose not to normalize, but we should say so.

I see 4 ways in which we can discuss normalization in JS.md:

  1. Forbid / error / ignore non-normalized inputs.
  2. Automatically normalize wasm inputs, and compare the normalized JS values.
  3. Automatically normalize wasm inputs, and expect the user to call normalize.
  4. Not normalize anything.

If we choose 2. or 3. we should specify which form of normalization we expect (because of course there are multiple forms of normalization).
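As a rough illustration of why the form matters, the sketch below uses JavaScript's built-in `String.prototype.normalize`, which accepts the four Unicode normalization forms (NFC, NFD, NFKC, NFKD). The sample strings are chosen purely for demonstration and do not come from JS.md.

```javascript
// Canonical forms: NFC composes, NFD decomposes.
const nfc = "C\u0327".normalize("NFC"); // C + combining cedilla -> single code point
const nfd = "\u00C7".normalize("NFD"); // single code point -> C + combining cedilla

// Compatibility forms additionally fold "compatibility" characters.
const ligature = "\uFB01"; // LATIN SMALL LIGATURE FI
const ligatureNFC = ligature.normalize("NFC"); // left unchanged by canonical normalization
const ligatureNFKC = ligature.normalize("NFKC"); // folded to the two letters "fi"

console.log(nfc === "\u00C7", nfd === "C\u0327");
console.log(ligatureNFC === ligature, ligatureNFKC === "fi");
```

Two names that are equal under NFKC may therefore still differ under NFC, which is why options 2 and 3 would need to name a specific form.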

All 6 comments

Options 1, 2, and 3 seem like a good source of esoteric bugs in JS engines.

I vote 4.

I think the convertToJSString function Web.md#names already specifies 4. Seems fine to add clarifying text to say that no normalization occurs, though.
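For readers who want to see what option 4 means in practice, here is a hedged sketch. It is not the `convertToJSString` algorithm itself; `importObject` and `lookupImport` are hypothetical names introduced only to show exact, non-normalizing name matching on the JS side.

```javascript
// Hypothetical import object whose key uses the precomposed ç (U+00E7).
const importObject = { "fran\u00E7ois": () => 42 };

// Hypothetical helper: exact code-unit match, no normalization (option 4).
function lookupImport(imports, name) {
  return Object.prototype.hasOwnProperty.call(imports, name)
    ? imports[name]
    : undefined;
}

// Same code units: found.
const foundExact = lookupImport(importObject, "fran\u00E7ois");

// Decomposed spelling (c + U+0327): a different string, so not found.
const foundDecomposed = lookupImport(importObject, "franc\u0327ois");

// A caller who wants normalization has to opt in explicitly.
const foundAfterNormalize = lookupImport(
  importObject,
  "franc\u0327ois".normalize("NFC")
);

console.log(typeof foundExact, foundDecomposed, typeof foundAfterNormalize);
```

Under this behavior, any normalization policy is left entirely to producers and consumers, which is what clarifying text in the docs would spell out.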

Agreed with @lukewagner.

FWIW, CSS doesn't normalize at all either.

Yeah, nothing in the web platform uses Unicode normalization, other than `String.prototype.normalize()` in JavaScript and IDNA in URLs. 4 is definitely what you want here.

Sweet. I want to make sure we document these decisions, and it seems we've reached consensus. Closing.
