JSON.stringify is described as returning a "String in UTF-16 encoded JSON format representing an ECMAScript value", but because ECMAScript strings can contain sequences of code units that are not well-formed UTF-16, and QuoteJSONString does not account for this, the ill-formedness of the encoded value is carried over into the output.
The effect of this is that the JSON string cannot be losslessly converted to another UTF such as UTF-8: the conversion will either fail or involve inserting replacement characters.
Example of the strange behaviour when invoking js (SpiderMonkey) or node:
$ js -e 'print(JSON.stringify("\u{10000}".split("")));' | js -e 'print(JSON.parse(readline()).join("") === "\u{10000}");'
false
$ node -e 'console.log(JSON.stringify("\u{10000}".split("")));' | node -e 'require("readline").createInterface({ input: process.stdin, terminal: false }).on("line", l => { console.log(JSON.parse(l).join("") === "\u{10000}"); });'
false
I propose something similar to the fragment below (not written in formal ECMAScript specification style) be added to QuoteJSONString, before the final "Else" step that currently exists.
else if C is a trailing (low) surrogate (C >= 0xDC00 && C < 0xE000),
    concatenate the \uHHHH notation of C to product
else if C is a leading (high) surrogate (C >= 0xD800 && C < 0xDC00),
    if it is the last code unit,
        concatenate the \uHHHH notation of C to product
    else,
        let N be the next code unit
        if N is a trailing (low) surrogate (N >= 0xDC00 && N < 0xE000),
            concatenate C to product
            concatenate N to product
            progress iteration to the next code unit (N)
        else,
            concatenate the \uHHHH notation of C to product
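The proposed logic can be sketched in JavaScript. This is a hypothetical helper (escapeLoneSurrogates is not a real API), and it only shows the surrogate handling; the real QuoteJSONString also performs quoting and the other escapes, which are omitted here:

```javascript
// Sketch of the proposed escaping: lone surrogate code units are emitted
// as \uHHHH escapes, while well-formed pairs pass through unchanged.
function escapeLoneSurrogates(s) {
  let product = "";
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xDC00 && c < 0xE000) {
      // A trailing (low) surrogate here is unpaired, since pairs are
      // consumed together in the branch below.
      product += "\\u" + c.toString(16).padStart(4, "0");
    } else if (c >= 0xD800 && c < 0xDC00) {
      const n = i + 1 < s.length ? s.charCodeAt(i + 1) : -1;
      if (n >= 0xDC00 && n < 0xE000) {
        // Well-formed surrogate pair: emit both code units as-is.
        product += s[i] + s[i + 1];
        i++;
      } else {
        // Unpaired leading (high) surrogate.
        product += "\\u" + c.toString(16).padStart(4, "0");
      }
    } else {
      product += s[i];
    }
  }
  return product;
}
```

For example, escapeLoneSurrogates("\u{10000}") returns the string unchanged (it is a valid pair), while escapeLoneSurrogates("\uD800") returns the six ASCII characters "\ud800".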
Note that this change would only affect the encoding of strings that are not well-formed UTF-16 code unit sequences (which would most likely fail to translate to other UTFs anyway). Any well-formed UTF-16 code unit sequence (a sequence where every member is either part of a leading-surrogate/trailing-surrogate pair or is a non-surrogate code unit) is encoded in the same way as is currently specified.
Prior discussion: https://esdiscuss.org/topic/code-points-vs-unicode-scalar-values#content-14
I’m in favor of making JSON.stringify() produce ASCII-safe output through the use of escape sequences _if this change doesn’t break the Web_. Sadly, we don’t know if it does.
A couple of follow-up thoughts on the old discussion.
Perhaps it's time to consider extending JSON.stringify to accept an options object. Note that under the current spec (JSON.stringify step 4) the second argument is ignored if it is an object that is neither an array nor a function. So the API could be extended to accept an options object in that position.
This issue extends beyond JSON.stringify. Perhaps we should consider adding a String.prototype.escapeForUTF8 method (or escapeUnmatchedSurrogates).
I think it's at least worth pointing out the easy fix, which is to change the phrase "String in UTF-16 encoded JSON format" to "String in JSON format".
If you need UTF-8, you need to replace unpaired surrogates in a string with U+FFFD. Most platform APIs will do this for you. Perhaps that's a primitive we should expose somehow...
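That replacement can be sketched in userland today. The helper below is hypothetical (some newer engines also expose String.prototype.toWellFormed, which does the same thing):

```javascript
// Sketch: replace unpaired surrogate code units with U+FFFD, yielding a
// well-formed UTF-16 string that can be safely encoded as UTF-8.
function replaceLoneSurrogates(s) {
  // A leading surrogate not followed by a trailing one, or a trailing
  // surrogate not preceded by a leading one, is unpaired.
  return s.replace(
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g,
    "\uFFFD"
  );
}
```

Valid pairs such as "\u{10000}" are left intact; only lone surrogate code units become U+FFFD, so the result is lossy but always encodable.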
I’m curious, does escaping not just move the problem around? ES permits invalid Unicode, and even if unpaired surrogate code units are escaped, the result still _describes_ an invalid Unicode string, doesn’t it? Maybe that’s still better than nothing since other agents can then safely decode. Is that the only goal?
If you replace unpaired surrogates with U+FFFD, you no longer have code points that cannot be represented by UTF-8 (or UTF-16). I'm not really sure what you mean with "invalid Unicode".
@annevk Oh, I misunderstood — I thought the idea was to replace with \u{SOME_SURROGATE}. (Got mixed up when following a link into a related thread).
@bathos It describes a Unicode string that is not well-formed, but as mentioned in 3.9 of the standard (http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf), "Unicode strings need not contain well-formed code unit sequences under all conditions". It gives an example there of concatenating ill-formed UTF-16 sequences to create a well-formed UTF-16 sequence, like in my split/join examples.
Thanks, @Maxdamantus. "Ill-formed" is probably the right term for what I meant, though either way, my comment was an error — I got this confused with something proposed elsewhere.
Though re-reading the initial issue here, I guess my comment was still applicable after all? This does seem to be about escaping bad surrogate sequences to \u notation as well. I don’t know what higher-level goals inform this so my two cents are pretty worthless, but my naive expectation would be U+FFFD conversions to maximize interoperability, since JSON is largely a data interchange format; on the other hand there are reasons I can think of why one might not want it to ever be lossy (e.g. perhaps there are string encodings used to transmit binary data that might produce these sequences).
I imagine it would be undesirable at this point to produce replacement characters, since parsing the string that is currently produced works perfectly well for the relevant cases just as long as it never gets encoded into another UTF (or more accurately, gets interpreted as UTF-16) in the middle (i.e., JSON.parse(JSON.stringify(s)) === s, regardless of whether s is well-formed UTF-16), and there must be at least some code somewhere that relies on this.
I think generally it should be perfectly fine to encode arbitrary ES strings and that since the JSON encoding is supposedly a textual one based on Unicode characters (eg, {, }, \, u, ", [, …), it should produce an actual well-formed Unicode string. If JSON.stringify were able to be respecified disregarding existing applications, I'd actually be approximately in the middle of thinking it should be well-formed Unicode and thinking it should be printable ASCII (as @mathiasbynens mentioned). I'd probably lean slightly more towards Unicode, due to space efficiency and the fact that if something can't pass around a well-formed Unicode string nowadays, it's probably due to a bug that should be fixed.
As it currently is, encodeURIComponent(JSON.stringify("\u{10000}"[0])) is specified to throw a URIError.
There is now a stage 3 proposal fixing this. Once this lands this can be closed, I expect.