Emscripten: preamble.js: Pointer_stringify() does not respect length when string is non-ASCII

Created on 14 Dec 2017  路  5Comments  路  Source: emscripten-core/emscripten

Regarding: Pointer_stringify(ptr, length)

See: https://github.com/kripken/emscripten/blob/b1acff26c2c9ee802b20af01e53a6fe276a99335/src/preamble.js#L402-L435

When the string that ptr points is found to be non-ASCII (the local variable hasUtf is true), UTF8ToString(ptr) is called. Note that the length argument is disregarded in this case. If ptr is not zero terminated, UTF8ToString will basically read outside of the bounds of the buffer.

Related issue: #4693

good first bug help wanted wontfix

Most helpful comment

I think we probably don't want to fix Pointer_stringify at all, but deprecate it and fix the locations that call to it to instead call AsciiToString() or UTF8ToString() directly. Pointer_stringify has a bit of an antipattern of attempting to guess whether the string is 7-bit ASCII or not (and if not, it assumes it would be UTF-8). It's better to directly call either AsciiToString() or UTF8ToString() (or UTF16ToString() or UTF32ToString()) to convert a string with specific encoding to a JS string.

All 5 comments

Thanks for filing, this does indeed look wrong.

A concrete testcase showing the problem would be useful, if you or someone else wants to make one - we'll want to check that in with the fix anyhow, when we have that.

I think we probably don't want to fix Pointer_stringify at all, but deprecate it and fix the locations that call to it to instead call AsciiToString() or UTF8ToString() directly. Pointer_stringify has a bit of an antipattern of attempting to guess whether the string is 7-bit ASCII or not (and if not, it assumes it would be UTF-8). It's better to directly call either AsciiToString() or UTF8ToString() (or UTF16ToString() or UTF32ToString()) to convert a string with specific encoding to a JS string.

AsciiToString doesn't seem to be documented.

I can't quite tell from the code (and can't test now), but is AsciiToString safe for Latin 1 as well? If not could we get a function for that? I've been calling AsciiToString assuming it's safe for Latin 1, which is probably unwise.

How about PointerToString("ascii")? The function then can internally use TextDecoder to support various encodings, and if no TextDecoder then it can fallback to AsciiToString and UTF8ToString.

This issue has been automatically marked as stale because there has been no activity in the past year. It will be closed automatically if no further activity occurs in the next 7 days. Feel free to re-open at any time if this issue is still relevant.

Was this page helpful?
0 / 5 - 0 ratings