Design: UTF-8 decoding of import/export names in JS

Created on 30 Jan 2017 · 16 Comments · Source: WebAssembly/design

Related to #968, I noticed that https://github.com/WebAssembly/design/blob/master/Web.md#names says:

"Property names in JS are UTF-16 encoded strings. A WebAssembly module may fail validation on the Web if it imports or exports functions whose names do not transcode cleanly to UTF-16 according to the following conversion algorithm"

There are at least two problems with this.

The first sentence is simply incorrect. JS property names can be arbitrary strings, and JS strings are arbitrary sequences of (unsigned) 16 bit values. They can happily contain zeros or lonely halves of what would be surrogate pairs. The only relevance of UTF-16 is that some ES6+ library functions assume UTF-16 inputs.
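To make this concrete, here is a minimal sketch (the object and key names are invented for illustration) showing that JS accepts property names that are not well-formed UTF-16:

```javascript
// Property names are arbitrary sequences of 16-bit code units,
// so none of these keys needs to be well-formed UTF-16.
const props = {};
props["\uD800"] = "lone high surrogate";   // first half of a surrogate pair, alone
props["\0"] = "zero code unit";            // a NUL code unit
props["a\uDC00b"] = "lone low surrogate";  // second half of a pair, mid-string

// List each key as hexadecimal code units.
function codeUnits(s) {
  return Array.from({ length: s.length }, (_, i) => s.charCodeAt(i).toString(16));
}

const keys = Object.keys(props);
for (const k of keys) {
  console.log(codeUnits(k), "->", props[k]);
}
```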

The second statement seems rather problematic. String formats are not prescribed or restricted by Wasm; validation explicitly allows any byte sequence. Hence I think we must not allow modules with malformed encodings to fail validation, no matter what the platform.

I see several options for resolving the latter (in decreasing order of preference):

  1. Do not perform UTF-8 decoding (UTF-8 to UTF-16 transcoding, really) in the JS API. Instead, simply treat the strings as a sequence of (unsigned) 8 bit values extended to 16 bit code points pointwise.

  2. Do not throw on invalid UTF-8 encodings. Import/export names that do not decode are merely inaccessible from JS.

  3. Change the wording such that this is not described as a validation failure but an instantiation failure in the JS API. In particular, it does not cause W.validate to return false.
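The three options can be sketched roughly as follows. All function names are hypothetical and not part of the WebAssembly JS API, and the `TypeError` in option 3 is a placeholder, since the thread does not say which error would be thrown. The sketch assumes a runtime that provides the standard `TextDecoder`.

```javascript
// Option 1: no UTF-8 decoding. Each byte becomes one 16-bit code unit.
function nameFromBytesPointwise(bytes) {
  return String.fromCharCode(...bytes);
}

// Options 2 and 3 both decode strictly; they differ only in what happens on failure.
function tryDecodeUtf8(bytes) {
  try {
    // fatal: throw on malformed input. ignoreBOM: keep a leading U+FEFF in the name.
    return new TextDecoder("utf-8", { fatal: true, ignoreBOM: true }).decode(bytes);
  } catch (e) {
    return null; // malformed UTF-8
  }
}

// Option 2: names that do not decode are left out, so they are inaccessible from JS.
function exportNamesSkippingInvalid(rawNames) {
  return rawNames.map((bytes) => tryDecodeUtf8(bytes)).filter((name) => name !== null);
}

// Option 3: a name that does not decode is an instantiation-time error.
function exportNamesOrThrow(rawNames) {
  return rawNames.map((bytes) => {
    const name = tryDecodeUtf8(bytes);
    if (name === null) throw new TypeError("export name is not valid UTF-8");
    return name;
  });
}
```

Under option 1 the bytes `0xC3 0xA9` (UTF-8 for "é") become the two-character name "Ã©", while options 2 and 3 produce "é".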


All 16 comments

  1. This would limit names to the Basic Latin (aka ASCII) and Latin-1 Supplement blocks. Seems like an unnecessary restriction to me. It disallows interesting scenarios that could leverage higher code points while still accepting code points that are not legal in names in many mainstream languages.
  2. I like this. Might be worth a single debug console warning that reports that one or more unusable names are present, but that doesn't have to be part of the spec.
  3. I suppose this depends on the intended use of validate. If it's to ensure spec conformance, I agree, but it doesn't provide a reliable way to check if instantiation will actually work before trying. In the future, this could put an unnecessary burden on JS implementors if WASM gains features not relevant to JS.

@Kardax, re 3, validation is no reliable check for successful instantiation. There are a number of failures that only occur at instantiation/link time, because they depend on the circumstances. Arguably, import/export names fall into this class: you only have a problem when you try to link them via JS objects. In the future, there will probably be ways to link groups of modules more directly that do not require reflecting those names in JS. In that case, there is no particular reason to reject them.

Agreed that the current wording is problematic. I think we shouldn't restrict to latin1 and I like loud errors over silent ones, so I think it makes sense to throw during instantiation if utf8 decoding fails when trying to produce a JS string (option 3).

An issue with the loud error of Option 3 is it makes the JS embedding the least common denominator for a cross-platform (or cross-embedding) WASM file. The binary encoding spec that claims support for any bytes in names becomes unworkable in the real world.

If Option 3 is the consensus, the binary encoding spec should require UTF8. I think this would be beneficial for many reasons, chief among them being that WASM parsers could always treat names as text strings and know it will always work.

I don't think it makes sense to require UTF-8, or anything else, in the core spec. Some non-web environments might have different constraints, and might not even be able to handle UTF-8 strings. And for the web platform, later failure seems pretty much in line with everything else (though soft failure a la 2 would be, too).

Option 1 sounds like the simplest approach TBH. I'd rather not force UTF-8 in the core spec because wasm isn't JS-specific.

We can also take another approach where invalid code points and mismatched pairs get replaced by the replacement character.
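As a rough illustration of the replacement approach, the standard `TextDecoder` in its default (non-fatal) mode substitutes U+FFFD for malformed input. The byte values below are made up for the example. One consequence is that distinct byte sequences can decode to the same name:

```javascript
// Non-fatal UTF-8 decoding: each malformed sequence becomes U+FFFD.
const lossy = new TextDecoder("utf-8"); // `fatal` defaults to false

const nameA = lossy.decode(new Uint8Array([0x66, 0xff])); // "f" + an invalid byte
const nameB = lossy.decode(new Uint8Array([0x66, 0xfe])); // "f" + a different invalid byte

console.log(nameA === "f\uFFFD"); // true
console.log(nameA === nameB);     // true: two different byte names now collide
```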

This text should also speak about normalization, see issue #971 for this. Do we:

A. Forbid / error / ignore non-normalized inputs.
B. Automatically normalize wasm inputs, and compare the normalized JS values.
C. Automatically normalize wasm inputs, and expect the user to call normalize.
D. Not normalize anything.
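A small sketch of why this matters for lookups from JS. It assumes NFC as the normalization form, which the thread does not specify, and uses a plain object as a stand-in for an exports object:

```javascript
const composed = "\u00e9";    // "é" as a single code point (NFC form)
const decomposed = "e\u0301"; // "e" followed by a combining acute accent (NFD form)

// Stand-in for an exports object whose name was stored in NFC form.
const table = { [composed]: 1 };

// Without normalization (option D) the two spellings are different property names.
const foundRaw = decomposed in table;
// If the caller normalizes before the lookup (as in option C), the name is found.
const foundNormalized = decomposed.normalize("NFC") in table;

console.log(foundRaw, foundNormalized); // false true
```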

@Kardax I expect that a wholly-non-web ecosystem would not have much interoperability with the web ecosystem (due to totally separate APIs) and thus would be able to define a completely different meaning for imports/exports since there wouldn't be an expectation that these modules were loadable via the web.

@jfbastien Practically speaking, I like being able to reuse current UTF8-to-JS-string functionality and these functions all provide a binary success/failure. Defining a replacement function seems like it would ultimately require a wholly new-with-wasm algorithm which feels a bit too ad hoc to me.

@lukewagner I'm not sure what part of my answer you're answering. I totally agree with your statement, and that's what I implemented as well. I have a mild preference for option 2, which matches well with what you say.

My suggestion of replacement character is simply adding one approach to the ones @rossberg-chromium documents. It's something that's done with Unicode, we should consider it.

Normalization is simply missing here, we should talk about it and document it. I have no preference for which approach we take and I'm documenting what I think is possible. Are you expressing a preference, and if so for which option?

@jfbastien Oh, I was just responding to the replacement proposal; I wasn't aware of existing spec-defined algorithms that we could reuse here, though.

Anyhow, I have a preference for 3.

@rossberg-chromium An environment unable to at least convert UTF-8 to a native format wouldn't be sophisticated enough to load WASM in the first place.

@lukewagner That isn't necessarily true. A good example would be a complex math library that has applications for WebGL games and scientific computing. To avoid the need for separate web and non-web builds, a cross-compatible import/export scheme would be required, which would gravitate toward UTF8 (or at least ASCII) to preserve full JS compatibility and the largest possible user base.

@lukewagner OK I forked the normalization question to #971. Let's focus this issue on @rossberg-chromium's 3 proposals + replacement character.

@lukewagner

"I expect that a wholly-non-web ecosystem would not have much interoperability with the web ecosystem (due to totally separate APIs) and thus would be able to define a completely different meaning for imports/exports since there wouldn't be an expectation that these modules were loadable via the web."

I am seeing reasons for the non-web ecosystem to interoperate with the web: libraries written in wasm that can be used on and off the web; code for non-web browser contexts that might be developed and tested in a web browser with an emulation API and run at high speed as compiled wasm code in the browser.

Assuming that in practice a user defined translation layer will be used (to decompress, or bake in code variations, or take advantage of browser specific features) then couldn't that implement character translation? It would appear to be able to implement option 1 (translate utf-8 to ascii), and option 2 by substituting a generic name from a sequence, and also to implement a substitution algorithm. So perhaps exposing the acceptable encodings is a key requirement in this area.

@Kardax That's a fair point. Then another way to think of the UTF8 requirement is that it's just part of the ABI, which is the set of all the conventions that statically- or dynamically-linked modules should obey if they want to work together, but which are not baked into the spec. Then, just as a non-C/C++ language ecosystem might have a totally different stack ABI, it may also want a more structured representation of imports/exports (e.g., that describes language types).

@lukewagner Are you referring to something like C++'s infamous name mangling? Custom sections would be a much better choice, as multiple wildly different platforms can share a single WASM file as long as they don't collide in their custom section names.

There is a fourth option.

When I designed LESv3 for WebAssembly (and I would still urge the CG to encode its flat instruction lists with an LES-compatible syntax), I had to solve the problem of round-tripping arbitrary byte strings to UTF-16 and back in a bijective way (no two byte sequences correspond to the same UTF-16 sequence, and all byte sequences are allowed).

Here's how it works. Any malformed UTF-8 sequence is encoded as a malformed UTF-16 sequence. Specifically, the bytes are normally treated as UTF-8. However, given an arbitrary byte in the range 0x80..0xff, the decoder stores it as 0xDC00 plus that byte (so the result is in the range 0xDC80..0xDCFF) under any of the following circumstances:

  1. The byte is not the beginning of a valid UTF-8 sequence
  2. The byte is the beginning of an overlong UTF-8 sequence
  3. The byte is the beginning of a valid UTF-8 sequence but the decoded value is in the range 0xDC00..0xDFFF (i.e. a low surrogate - second half of a surrogate pair).
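The decoding direction of this scheme can be sketched as below. This is a minimal illustration, not the LESv3 implementation, and `decodeWithEscapes` is a hypothetical name. The description does not say how UTF-8-encoded high surrogates (0xD800..0xDBFF) are handled. The sketch assumes they are escaped like low surrogates, since strict UTF-8 decoders reject both.

```javascript
// Hypothetical helper: decode bytes as UTF-8, but map every byte that cannot be
// decoded to the lone low surrogate 0xDC00 + byte instead of failing.
function decodeWithEscapes(bytes) {
  const units = []; // UTF-16 code units of the result
  let i = 0;
  while (i < bytes.length) {
    const b0 = bytes[i];
    if (b0 < 0x80) { // ASCII is always valid
      units.push(b0);
      i += 1;
      continue;
    }
    // Sequence length implied by the lead byte, and the smallest code point
    // that is not an overlong encoding at that length.
    let len = 0, min = 0;
    if (b0 >= 0xc0 && b0 <= 0xdf) { len = 2; min = 0x80; }
    else if (b0 >= 0xe0 && b0 <= 0xef) { len = 3; min = 0x800; }
    else if (b0 >= 0xf0 && b0 <= 0xf4) { len = 4; min = 0x10000; }

    let cp = -1; // stays -1 if the sequence is not structurally valid
    if (len > 0 && i + len <= bytes.length) {
      cp = b0 & (0xff >> (len + 1)); // payload bits of the lead byte
      for (let k = 1; k < len; k++) {
        const b = bytes[i + k];
        if ((b & 0xc0) !== 0x80) { cp = -1; break; } // not a continuation byte
        cp = (cp << 6) | (b & 0x3f);
      }
    }

    // Assumption: all encoded surrogates are escaped, not only 0xDC00..0xDFFF.
    const isSurrogate = cp >= 0xd800 && cp <= 0xdfff;
    if (cp < min || cp > 0x10ffff || isSurrogate) {
      units.push(0xdc00 + b0); // escape this one byte, then resume at the next
      i += 1;
    } else if (cp < 0x10000) {
      units.push(cp);
      i += len;
    } else {
      const v = cp - 0x10000; // encode as a surrogate pair
      units.push(0xd800 + (v >> 10), 0xdc00 + (v & 0x3ff));
      i += len;
    }
  }
  return String.fromCharCode(...units); // fine for short strings such as names
}
```

Unlike replacement with U+FFFD, distinct invalid bytes stay distinct: `[0xff]` decodes to `"\udcff"` and `[0xfe]` to `"\udcfe"`.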

(Note: This was not actually my idea.)

With https://github.com/WebAssembly/design/pull/1016, WebAssembly does now specify an encoding and restrictions on strings, so the main concern here is no longer applicable.

I filed #1028 to clarify the text in Web.md, which I believe addresses the remaining concerns.
