I have this simple utility function that takes a char* and a length and makes it a referenceable JS object (puts it in an array, and returns the index of that element). When converted from the C U8HEAP to JS, if the string contains a \0 character, the conversion is terminated, ignoring the length specified.
Also, related, but not critical, if the code contains a string with a \0 in it, (maybe I need to \0 it?) the compilation fails (see end)
Sample C code...
#include <stdio.h>
#include <emscripten.h>
int makeString( char const*string, int stringlen ) EMSCRIPTEN_KEEPALIVE;
// Wraps a (pointer, length) byte buffer as a JS string, stores it in
// Module.objects, and returns the index of the stored element.
// NOTE(review): UTF8ToString treats $1 as an "up to" limit but still stops
// at the first \0 byte (see maintainer reply below), so only "Hello" of
// "Hello\0World" survives -- this is the behavior being reported.
int makeString( char const*string, int stringlen ) {
// "\\0" prints the two literal characters '\' '0', marking where the
// embedded NUL sits; string + 6 skips past the NUL to print "World".
printf( "String: %5.5s\\0%5.5s len:%d\n", string, string + 6, stringlen );
int x = EM_ASM_INT( {
const string = UTF8ToString( $0, $1 );
return Module.objects.push( string )-1;
},string, stringlen);
return x;
}
int main( void ) {
// NOTE(review): the EM_ASM_INT body is wrapped in ( ) rather than the
// usual { }; combined with a "\0" string literal inside the block, the
// build fails while emcc parses the backend metadata (see the traceback
// below), which is why String.fromCodePoint(0) is used instead.
return EM_ASM_INT ( (
Module.objects = [];
// Build "Hello\0World" without writing \0 in a literal.
var s = "Hello" + String.fromCodePoint( 0 ) + "World";
// If the following line is used, the compile fails, which
// is why the above uses fromCodePoint.
//var s = "Hello\0World";
console.log( "String:", s, s.length );
var sa;
// Copy the JS string (NUL included) into the wasm heap as i8 bytes.
var si = allocate( sa= intArrayFromString(s), 'i8', ALLOC_NORMAL);
var index = Module._makeString( si, sa.length-1 ); // pass length of string, not byte size
console.log( "Resulting string:", Module.objects[index], Module.objects[index].length );
return 0;
) );
}
Compile with simple emcc -o ./nultest.js nultest.c
Generated output — the error is that the result should also be 11 characters long
(the string received on the C side is intact, as the second line shows):
String: Hello World 11
String: Hello\0World len:11
Resulting string: Hello 5
Expected output... the output should be the same as the input.
String: Hello World 11
String: Hello\0World len:11
Resulting string: Hello World 11
M:\sack\amalgamate\fs_vfs\nultest>call emcc -g -o ./nultest.js nultest.c
emscripten:ERROR: emscript: failure to parse metadata output from compiler backend. raw output is:
{
"staticBump": 3984,
"declares": ["abort", "sbrk", "memset", "memcpy", "__syscall140", "__syscall146", "__syscall6", "__syscall54", "__lock", "__unlock", "emscripten_asm_const_int", "memset"],"redirects": {},"externs": [],"implementedFunctions": ["_makeString", "_main", "_malloc", "_free", "___stdio_close", "___stdio_write", "___stdio_seek", "___syscall_ret", "___errno_location", "_dummy_731", "___stdout_write", "_isdigit", "_vfprintf", "_fmt_fp", "_pop_arg_long_double", "___vfprintf_internal", "_printf_core", "___lockfile", "___unlockfile", "_out", "_getint", "_pop_arg", "_fmt_x", "_fmt_o", "_fmt_u", "_memchr", "_pad_659", "_wctomb", "_wcrtomb", "___pthread_self_440", "_pthread_self", "___fwritex", "___towrite", "___DOUBLE_BITS_662", "_frexp", "___ofl_lock", "___ofl_unlock", "_fflush", "___fflush_unlocked", "_printf"],"tables": { "ii": "var FUNCTION_TABLE_ii = [0,___stdio_close];",
"iidiiii": "var FUNCTION_TABLE_iidiiii = [0,0,0,0,0,_fmt_fp,0,0];",
"iiii": "var FUNCTION_TABLE_iiii = [0,0,___stdout_write,0,___stdio_write,0,0,0];",
"jiji": "var FUNCTION_TABLE_jiji = [0,0,0,___stdio_seek];",
"vii": "var FUNCTION_TABLE_vii = [0,0,0,0,0,0,_pop_arg_long_double,0];"
},"initializers": [],"exports": ["_makeString"],"aliases": {},"cantValidate": "","simd": 0,"simdUint8x16": 0,"simdInt8x16": 0,"simdUint16x8": 0,"simdInt16x8": 0,"simdUint32x4": 0,"simdInt32x4": 0,"simdFloat32x4": 0,"simdFloat64x2": 0,"simdBool8x16": 0,"simdBool16x8": 0,"simdBool32x4": 0,"simdBool64x2": 0,"externUses": ["Int8Array","Int16Array","Int32Array","Int64Array","Float64Array","Math.imul"],"maxGlobalAlign": 16,"namedGlobals": {},"asmConsts": {"1": ["( Module.objects = []; var s = \"Hello\0World\"; console.log( \"String:\", s, s.length ); var sa; var si = allocate( sa= intArrayFromString(buf), 'i8', ALLOC_NORMAL); var index = Module.makeString( si, sa.length ); console.log( \"Resulting string:\", objects[index], objects[index].length ); return 0; )", ["i"], [""]], "0": ["{ const string = UTF8ToString( $0, $1 ); return Module.objects.push( string )-1; }", ["iii"], [""]]}, "invokeFuncs": []
}
Traceback (most recent call last):
File "H:\dev2\emsdk\fastcomp\emscripten\emcc.py", line 3344, in <module>
sys.exit(run(sys.argv))
File "H:\dev2\emsdk\fastcomp\emscripten\emcc.py", line 1953, in run
final = shared.Building.emscripten(final, target + '.mem', js_libraries)
File "H:\dev2\emsdk\fastcomp\emscripten\tools\shared.py", line 2228, in emscripten
emscripten.run(infile, outfile, memfile, js_libraries)
File "H:\dev2\emsdk\fastcomp\emscripten\emscripten.py", line 2604, in run
return temp_files.run_and_clean(lambda: emscripter(
File "H:\dev2\emsdk\fastcomp\emscripten\tools\tempfiles.py", line 105, in run_and_clean
return func()
File "H:\dev2\emsdk\fastcomp\emscripten\emscripten.py", line 2605, in <lambda>
infile, outfile_obj, memfile, libraries, shared.COMPILER_ENGINE, temp_files, shared.DEBUG)
File "H:\dev2\emsdk\fastcomp\emscripten\emscripten.py", line 92, in emscript_fastcomp
funcs, metadata, mem_init = parse_fastcomp_output(backend_output, DEBUG)
File "H:\dev2\emsdk\fastcomp\emscripten\emscripten.py", line 147, in parse_fastcomp_output
metadata = json.loads(metadata_raw, object_pairs_hook=OrderedDict)
File "H:\dev2\emsdk\python\2.7.13.1_64bit\python-2.7.13.amd64\LIB\json\__init__.py", line 352, in loads
return cls(encoding=encoding, **kw).decode(s)
File "H:\dev2\emsdk\python\2.7.13.1_64bit\python-2.7.13.amd64\LIB\json\decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "H:\dev2\emsdk\python\2.7.13.1_64bit\python-2.7.13.amd64\LIB\json\decoder.py", line 380, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Invalid \escape: line 9 column 500 (char 1689)
This is the expected behavior of UTF8ToString() and stringToUTF8(): these functions deal with marshalling null-terminated strings, so \0 always means the string terminator character. The length specifier is meant to denote length "up to" or "at most", and not the exact length.
If you want to perform marshalling of strings that contain null bytes in the middle, you can create your own JS library functions that perform that way, and they'd take in exact string lengths instead of using null as string terminators.
As for the \0 inside a .c file - not sure about how that should be treated as per C/C++ specifications, but gut feeling says it probably should be valid.
I took the version emitted into JS, and added test if maxlen <= 0 to then compute the length, otherwise trust the length specified (what I expected the behavior to be), and it works great.
UTF8 strings can contain NUL characters, so it's kind of a misnomer that it's called UTF8ToString ( I don't know what stringToUTF8 is; looks like a Kotlin utility; .. do you mean intArrayFromString ? because that has no problem saving the string length )
And EM_ASM( ( var str = "thing\0"; ) ) really isn't much of a NUL... since that gets wrapped into a String because it's JS, no?
UTF8ToString takes a UTF8 C string (aka char*) and returns a JS string. C strings are null terminated, so the behaviour that you're seeing is what I'd expect.
Eg: http://cpp.sh/43lfk
So JS, which is written in C/C++, doesn't have \0 in strings?
There's no way for a language to support strings beyond its own defined language constants?
That's really strange.
eg;
var a = "This has a \0 nul character"
console.log( a, a.codePointAt( 11 ) );
Now, I want that UTF8 String in a C array... which works, and has no issue. so C gets a UTF8 String with 0 codepoint in it.
And, again, fixing the existing function ( overriding the existing one) works fine; since things like printf() go through that, passing just a buffer start and 'undefined' as the max length; Actually all the internal usages don't actually pass a max buffer, but instead rely on a number being < NaN and having no limit... so really the behavior isn't changed at all.
You're passing in a C string to UTF8ToString. The C standard (7.1.1.1) defines a string as
A string is a contiguous sequence of characters terminated by and including the first null character.
So in C terms the char* "Hello\0World" is the string "Hello\0". That's why strlen("Hello\0World") returns 5.
I think the issue is that UTF8ToString accepts a C string, and you're expecting it to accept a char*.
No, I'm passing a UTF8 buffer that contains NULs, and has an associated length with it.
I'm not dealing with C strings at all.
The Call doesn't say "CToSTring" .
I'm not using any of the stock C API library for strings, because they are not UTF8 compatible.
I AM converting a buffer containing utf8 encoded strings into a string.
And the only barrier to this is the assumption that any time a string exists, it must be terminated by a \0. Internally in many places for dealing with UTF8 I use 0xFF, since there is no UTF8 code that doesn't have at least one bit off.
UTF8ToString accepts C strings, so even though you're passing in an array of chars it's being treated as a C string. The doco even says:
Given a pointer ptr to a null-terminated UTF8-encoded string [...]
Also the C API library is fairly UTF8 compatible. On Linux things like fopen take a UTF8 encoded char.
There's also nothing stopping you from storing UTF8 chars in a char.
You can even do things like this without issue:
std::cout << "Hello" << u8"\u23f1" << "World" << std::endl;
or
printf(u8"\u23f1");
So, broke as intended, and no support offered.
function UTF8ArrayToString(u8Array, idx, maxBytesToRead) {
+++ maxBytesToRead = maxBytesToRead | 0;
var endIdx = idx + maxBytesToRead;
var endPtr = idx;
// TextDecoder needs to know the byte length in advance, it doesn't stop on null terminator by itself.
// Also, use the length info to avoid running tiny strings through TextDecoder, since .subarray() allocates garbage.
// (As a tiny code save trick, compare endPtr against endIdx using a negation, so that undefined means Infinity)
--- while (u8Array[endPtr] && !(endPtr >= endIdx)) ++endPtr;
+++ if( !maxBytesToRead )
+++ while (u8Array[endPtr]) ++endPtr;
+++ else
+++ endPtr = endIdx;
Since you've (all) repeatedly stated strings must contain a \0, there's no point in having a max; and, again, internal functions don't use it, passing 'undefined' instead, which is no max.
If you know it's always going to measure the string anyway, why would you ever pass a max other than undefined/0?
The C API is 'fairly' UTF8 Compatible, except where it comes to support of codepoint 0.
For an alternative workaround in this specific platform's case (not an issue with any other), could overlong encode 0 as like \xC0\x80 but that is incompatible with any other converter, which would emit \u{fffd}instead.
the Sqlite API does a similar thing, and requires escaping strings with '+char(0)+' . ODBC doesn't have such an issue, taking a character pointer and a length, like most well behaved apis do.
I assume your C++ example (which is not C at all) does something like make that a String object which also contains a length?
V8 API has no issue converting strings... which really this is an implementation of v8.h that emits JS instead of doing the underlying calls; but it takes a char*, and an optional length, with the default of -1, otherwise using the length.
Because although C and it's standard library behaves a certain way, that's no reason for software implemented with C to be limited by that. (again, JS, LUA, TCL, ... )
And yes, I didn't go back and re-read the fine print of the documentation.
I saw 'CPointer' (or whatever that was) was deprecated, which only took a char*, so I just assumed that the next logical thing would be UTF8ArrayToString, which takes a length (was already using that in many places). Why do you pass a length? Because you want a string to be that long, instead of ending at the \0. I suppose, although I've never seen it, that one could use char buf[4096]; and think to pass `...ToString( buf, sizeof( buf ) )` instead of 0...
but really? It's not that this is an interop with C to C from C around C. This is TO JS. Why wouldn't you have a method to make a string that is the length I specify?
So, broke as intended, and no support offered.
Apologies for causing frustration from having chosen a poor function name. In hindsight a different name would have been clearer, the UTF8 there means "null-terminated UTF-8 string".
Since you've (all) repeatedly stated strings must contain a \0, there's no point in having a max; and, again, internal functions don't use it, passing 'undefined' instead, which is no max.
The rationale for having an "up-to" length in stringToUTF8 and UTF8ToString instead of exact length is two-fold:
1) Safety against buffer overflows, in the spirit of the _s variants of printf (https://en.cppreference.com/w/c/io/fprintf): when one is converting a string to a buffer in memory in the wasm heap, one can specify the size of the provided buffer, and via static reading it can be verified that the buffer will not overflow. Examples of such usage can be found e.g. in https://github.com/emscripten-core/emscripten/blob/1a8a7fd5ea301d6e60baa1634f5531f7146225b9/src/Fetch.js#L386 and https://github.com/emscripten-core/emscripten/blob/1a8a7fd5ea301d6e60baa1634f5531f7146225b9/src/library.js#L3917 . If an exact length were used, then this kind of buffer-overflow check handling would not be possible, but one would always first have to compute the length of the actual string, turning string marshalling of null-terminated strings into a two-pass algorithm that would have to scan through the string twice, and use min(stringLengthSerializedAsUtf8, sizeOfProvidedBuffer) as the buffer write limit. 2) Reading substrings: the caller would need a min(stringLength-startPos, subStringLength) computation to make sure not to overflow the read. If you want to convert UTF-8 strings containing \0s to JS strings, you can create a JS library with
// Decodes exactly `exactBytesToRead` bytes starting at u8Array[idx] as
// UTF-8 and returns the resulting JS string. Unlike UTF8ArrayToString it
// does NOT stop at a \0 byte, so embedded NULs survive the conversion.
// The #if TEXTDECODER / #if ASSERTIONS lines are emscripten JS-library
// preprocessor directives, resolved when the library is linked.
function exactUTF8ArrayToString(u8Array, idx, exactBytesToRead) {
var endIdx = idx + exactBytesToRead;
#if TEXTDECODER == 2
// TextDecoder path: decode the exact byte range in one call. slice() is
// the fallback for array-likes without subarray().
return UTF8Decoder.decode(
u8Array.subarray ? u8Array.subarray(idx, endIdx) : new Uint8Array(u8Array.slice(idx, endIdx))
);
#else // TEXTDECODER == 2
#if TEXTDECODER
if (endIdx - idx > 16 && u8Array.subarray && UTF8Decoder) {
return UTF8Decoder.decode(u8Array.subarray(idx, endIdx));
} else {
#endif // TEXTDECODER
// Manual decode path: walk the bytes and build the string by hand.
var str = '';
#if TEXTDECODER
// If building with TextDecoder, we have already computed the string length above, so test loop end condition against that
while (idx < endIdx) {
#else
while (!(idx >= endIdx)) {
#endif
// For UTF8 byte structure, see:
// http://en.wikipedia.org/wiki/UTF-8#Description
// https://www.ietf.org/rfc/rfc2279.txt
// https://tools.ietf.org/html/rfc3629
var u0 = u8Array[idx++];
if (!(u0 & 0x80)) { str += String.fromCharCode(u0); continue; }
var u1 = u8Array[idx++] & 63;
if ((u0 & 0xE0) == 0xC0) { str += String.fromCharCode(((u0 & 31) << 6) | u1); continue; }
var u2 = u8Array[idx++] & 63;
if ((u0 & 0xF0) == 0xE0) {
u0 = ((u0 & 15) << 12) | (u1 << 6) | u2;
} else {
#if ASSERTIONS
if ((u0 & 0xF8) != 0xF0) warnOnce('Invalid UTF-8 leading byte 0x' + u0.toString(16) + ' encountered when deserializing a UTF-8 string on the asm.js/wasm heap to a JS string!');
#endif
u0 = ((u0 & 7) << 18) | (u1 << 12) | (u2 << 6) | (u8Array[idx++] & 63);
}
if (u0 < 0x10000) {
str += String.fromCharCode(u0);
} else {
// Code points above the BMP are emitted as a UTF-16 surrogate pair.
var ch = u0 - 0x10000;
str += String.fromCharCode(0xD800 | (ch >> 10), 0xDC00 | (ch & 0x3FF));
}
}
#if TEXTDECODER
}
#endif // TEXTDECODER
return str;
#endif // TEXTDECODER == 2
}
function exactUTF8ToString(ptr, exactBytesToRead) {
#if TEXTDECODER == 2
if (!ptr) return '';
return UTF8Decoder.decode(HEAPU8.subarray(ptr, ptr + exactBytesToRead));
#else
return ptr ? ExactUTF8ArrayToString(HEAPU8, ptr, exactBytesToRead) : '';
#endif
}
For converting JS strings to UTF-8 strings without null termination, you can take from the existing stringToUTF8Array function, but drop the last part
// Null-terminate the pointer to the buffer.
outU8Array[outIdx] = 0;
to avoid appending a null terminator to the string.
Of particular note is that the TextDecoder API does not work with multithreaded WebAssembly, so if you are looking towards multithreading, it may be simplest to just drop the TEXTDECODER specific blocks for concise code.
This issue has been automatically marked as stale because there has been no activity in the past year. It will be closed automatically if no further activity occurs in the next 7 days. Feel free to re-open at any time if this issue is still relevant.
Was there any alternative function that lets me convert UTF8 strings containing \0 maybe added in the last year ?
Re : Making a new function - no need.
Again Just modify the existing function to use the length as specified works fine.
I think the conclusion is that we define a UTF8 string as being terminated by a NULL in emscripten (or at least in these functions). If you would like some other behaviour it's easy enough to add your own functions (in fact @juj even included the code above).
He also made a good argument for why we defined length as "up-to" , while still being terminated by NULL. I don't think changing that makes sense.
@sbc100
Found this issue — I'm using a high-performance data serialization library called HPS https://github.com/jl2922/hps and I send an array buffer over the wire to Emscripten, then convert it back to a std::string, and then convert that back to a struct using the hps::from_string function provided by this lib; but during this conversion JavaScript throws a warning which says, Invalid UTF-8 leading byte 0x-75 encountered when deserializing a UTF-8 string on the asm.js/wasm heap to a JS string!
When I log the data that was converted, it appears that it is corrupted.
That error does not seem to relate to nulls in the input data? (0x-75 is not null?)
If you are marshalling binary array data from JS over to Wasm, you should not be using string marshalling functions to pass the data over, but instead use direct binary data marshalling. Arbitrary binary arraybuffer data cannot be interpreted as UTF-8 strings, or one will get decoding errors.
This issue itself was resolved earlier, I'll close this one out.
@juj
The RTC lib that I use is using the following code to pass data over to WASM.
var byteArray = new Uint8Array(evt.data);
var size = byteArray.length;
var pBuffer = _malloc(size);
var heapBytes = new Uint8Array(Module['HEAPU8'].buffer, pBuffer, size);
heapBytes.set(byteArray);
Module['dynCall_viii'](messageCallback, pBuffer, size, userPointer);
_free(pBuffer);
On C++ side, I convert std::vector<std::byte> type to std::string after that I call hps::from_string<struct>(stringMessage) to cast data into struct but at this call the yellow warning appear that Invalid UTF-8 leading byte 0x-75 encountered when deserializing a UTF-8 string on the asm.js/wasm heap to a JS string!
The debugging experience is poor. Next to impossible.
https://developers.google.com/web/updates/2020/12/webassembly
I followed this tutorial but no source code appear on dev tools. The guy is in heaven :)
@juj
Just wanted to tell you that I figured out and eventually solved my issue. In my case this happen because I am trying to print what HPS produced. I shall compare size without printing it to string :(
There aren't a lot of standalone WASM environments, and all of them currently interact with JS. JS string can have \0 in them without an issue... it's just an issue of the C standard library that strings are nul terminated and don't just have an accompanying length (guess they couldn't think of everything in the 60's).
The interop really should work with JS strings, not C strings.
re -0x75 - that's just an invalid utf-8 code-unit.
'sides C isn't the only thing in the world... C++ and Rust are both compiling to wasm now... what's the function for utf8 to c++ string? (or vice versa)?