As the Buffer class currently only supports UTF-16 in little-endian ordering, Buffer cannot be used to read in data from things like OpenType fonts, which by definition are always in big-endian ordering, irrespective of the hardware or data reader (see https://www.microsoft.com/typography/OTSpec/otff.htm, "Data Types": "All OpenType fonts use Motorola-style byte ordering (Big Endian)").
Can Buffer be given a utf16be to match the already present utf16le, so that Buffer can be used with all utf16 data, rather than only with data that is in little endian ordering?
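To make the gap concrete, here is a minimal sketch (the byte values are a made-up UTF-16BE excerpt for illustration, not taken from a real font):

```js
// "Font" stored as UTF-16BE, as an OpenType 'name' table record would hold it.
const nameRecord = Buffer.from([0x00, 0x46, 0x00, 0x6f, 0x00, 0x6e, 0x00, 0x74]);

// Numeric fields are covered: Buffer has explicit big-endian readers.
const firstCodeUnit = nameRecord.readUInt16BE(0); // 0x0046 ('F')

// Strings are not: toString() only accepts 'utf16le'/'ucs2', so decoding the
// big-endian bytes directly produces garbage ("䘀漀渀琀") instead of "Font".
console.log(nameRecord.toString('utf16le'));

// There is no nameRecord.toString('utf16be') counterpart today.
```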
For what it's worth, buffer.swap16() is there pretty much for this kind of thing. Do you think that would be enough to cover your use case? (It should be, but if your requirements are not matched by it, it would be good to know why.)
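For illustration, a minimal sketch of that workaround (the byte values are a made-up UTF-16BE encoding of "Font", not something from the thread):

```js
// Swap the byte pairs in place, then decode with the existing 'utf16le' support.
const be = Buffer.from([0x00, 0x46, 0x00, 0x6f, 0x00, 0x6e, 0x00, 0x74]);

be.swap16();                         // mutates the buffer: the bytes are now UTF-16LE
console.log(be.toString('utf16le')); // "Font"
```

Note that swap16() mutates the buffer in place (and throws if its length is not a multiple of 2), so callers who still need the original big-endian bytes have to copy first.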
This would need to get called in every single place where buffers need to be turned into strings, which in a complex, pure big-endian data file like an OpenType program is "tons of places". So not having a utf16be serialization, but having a swap + toString instead, makes every instance of that construction a potential bug: new devs may forget to add the crucially important swap instruction, necessitating additional linting and build-chain steps just to verify that people remembered to call swap prior to toString, but only for big-endian data. That's basically an impossible task without relying on runtime errors.
That's a ton of work that can be avoided by offering utf16be in parallel with utf16le, so that people (and more importantly, code) can be explicit about what's happening. Adding a utf16be makes for code that is both safer and DRYer (is that a word?).
IIRC the only reason utf16le/ucs2 support exists is because that's (supposedly) how JS strings are stored internally in V8, so that encoding comes free.
I'm not particularly keen on adding more encodings to core. There are third-party modules like iconv-lite that are more suitable for converting to/from other encodings.
To be clear, if there is a swap16, then offering the encoding utf16be but implementing it under the bonnet as "read in the data, then call swap16 in place without the user having to type that" would be perfectly acceptable.
That is, as long as the docs of course have a little note for that encoding explaining that using utf16be incurs some processing overhead.
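For illustration, a hedged sketch of what such a shim could look like in userland today; decodeUtf16BE is a hypothetical helper, not part of the Buffer API:

```js
// Decode a UTF-16BE byte range by copying it, swapping the byte pairs,
// and reusing the built-in 'utf16le' decoder.
function decodeUtf16BE(buf, start = 0, end = buf.length) {
  // Copy so the caller's buffer is not mutated by swap16().
  const copy = Buffer.from(buf.subarray(start, end));
  // swap16() throws a RangeError if the length is not a multiple of 2.
  return copy.swap16().toString('utf16le');
}

// Usage with the same made-up "Font" bytes as above:
const be = Buffer.from([0x00, 0x46, 0x00, 0x6f, 0x00, 0x6e, 0x00, 0x74]);
console.log(decodeUtf16BE(be)); // "Font"
```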
And in the meantime it might be worth updating the docs to mention, in the valid encodings section, that there is no utf16be but that buffer.swap16() exists. Additionally, it's probably important to update the docs for the swap16() function itself, as most people won't find it if they're searching with terms that make sense in this context: the term utf16be (which people may be thinking of when searching, because they know utf16le exists) won't find any hits on the page, and the terms little endian, big endian, or even just endian (which people will use because they know the "correct" names for byte ordering) won't find any results either, as the swap16() documentation does not mention these at all. So even if you know what you should be finding, the docs currently don't let you find it.
I'm mainly concerned about the performance hit brought with having to check yet another encoding name, especially for an encoding that is not common (I would bet utf16le is just as uncommon). It's not so much the difficulty of the actual implementation of the encoding.
Relabeling this issue as a documentation one then.
@mscdex To be fair V8 internally stores strings as native-endian, and we swap it internally on creation when the machine is big endian: https://github.com/nodejs/node/blob/ff001c12b032c33dd54c6bcbb0bdba4fe549ec27/src/node_buffer.cc#L628-L629
I'm cool with doc improvements.
Anybody up for volunteering for these doc improvements?
Hi maintainer(s), I analyzed the details of this "_low-hanging_", "_good first issue_" issue. Seeing that the last comment's timestamp was as old as "_Aug 10, 2017_", I gave the simple PR a shot in #21111. Thanks in advance for review or issue comments here; I'll update the PR accordingly, or I'm happy to close the PR if it's irrelevant.
P.S.: I was trying to look out for a test or any simple change with "_good first issue_", but there are so many enthusiastic contributors out there that the "_traffic_" is not exactly "_light_" 😅 Will keep looking out for the labels on the issue list.
P.P.S.: If anyone has advice for any "_deeper_" but still "_good first issue_" with a tiny bit of challenge, it's greatly appreciated. But I have been having a lot of fun updating docs as well. 👍