Hi,
We are experiencing a very annoying encoding problem which started with LoopBack but seems to be Node.js related.
Basically, we just finished developing an API with LoopBack on top of an existing SQL_ASCII-encoded PostgreSQL database. Since the API has to be in UTF-8, we try to convert the data sent through our API routes to ISO-8859-15 in order to insert it correctly into our database.
No matter which iconv, utf8, iso-8859, etc. modules we tried, we couldn't manage to pass ISO-8859-15-converted strings; we ended up with very strange results. For example:
var Iconv = require('iconv').Iconv;
var iconv = new Iconv('UTF-8','ISO-8859-1');
var label = iconv.convert("bébé").toString();
If we then insert the "label" into our database, we end up with something like "b�b�"!
So we just tried to look directly in the terminal at how a basic Node.js application behaved (without LoopBack or any other framework), but it didn't turn out any better.
With the terminal encoding set to "ISO Latin 1", the following code:
console.log('bébé');
Was displayed this way in the Terminal :
bébé
As if Node.js were completely unable to handle ISO-8859 strings.
Are we missing something here?
Are we doomed to use UTF-8 strings in order to make this work?
Thanks in advance.
Node.js in general can handle encodings other than UTF-8 quite well.
At least one thing you are not using correctly is the return value of iconv.convert("bébé") – it’s a Buffer object, and it has to be, because nothing else would make sense. More concretely:
> iconv.convert("bébé");
<Buffer 62 e9 62 e9>
This is correctly encoded using ISO-8859-1. _However_, when you call .toString() on that buffer, there’s an implied default encoding argument of UTF-8, so decoding the 62 e9 62 e9 as UTF-8 messes it up. But even if you specified the correct encoding, all you would get back is the literal string 'bébé'… what else would you expect?
I am not sure what the database API you are working with looks like, but if it only accepts strings as input, then you need to tell the database which encoding you are using, because the encoding has to be done by it. If it accepts Buffers as input, then you shouldn’t call .toString() on these labels yourself, because the encoding has already been done by you.
As for your second question: if you change the default encoding of your terminal, you should use process.stdout.setDefaultEncoding(). (The encoding name you should use is binary. It’s the Node.js name for ISO-8859-1, and has this unfortunate name only for historical reasons.)
And FYI, bébé is the ISO-8859-1 interpretation of UTF-8-encoded 'bébé', and b�b� is the ISO-8859-1 interpretation of b�b� (with literal replacement characters). To get to the latter one, you’ll have to encode bébé as ISO-8859-1, interpret the result as a UTF-8-encoded string, convert it to a Buffer using UTF-8 again, and interpret _that_ as ISO-8859-1. Impressive.
For future reference, nodejs/help may be a better place for general questions about using Node.js (like this one). I’ll close this as not a bug in Node.js.
FWIW, using buf.toString() should be OK, but you'll need to explicitly pass in binary as the encoding, e.g. buf.toString('binary').
> Buffer.from('bébé','binary').toString('binary')
'bébé'
> Buffer.from('bébé','binary').toString()
'b�b�'
>
Thanks both of you for your quick answer.
First, sorry for posting in the wrong place. I'll definitely use nodejs/help for my next questions, but since this one is already open, I guess I should continue here.
@jasnell I tried that but unsuccessfully.
@addaleax Thanks for all this interesting info. It's still a little confusing, because I asked several questions and couldn't be sure which one you were addressing. Could you please be more specific about the only one that matters: how am I supposed to end up with an ISO-encoded string when I start with a UTF-8 string? A little snippet to illustrate would be welcome; I'm not familiar with buffers, so everything you explained is very impressive but a bit abstract to me. Again, my apologies...
@airzebeth ... running on Node.js v4 with iconv, I just ran the following test:
> const Iconv = require('iconv').Iconv
undefined
> var iconv = new Iconv('UTF-8', 'ISO-8859-1');
undefined
> var label = iconv.convert("bébé").toString();
undefined
> label
'b�b�'
> var label = iconv.convert("bébé").toString('binary');
undefined
> label
'bébé'
> iconv.convert("bébé")
<Buffer 62 e9 62 e9>
>
This indicates that the conversion is happening appropriately and that calling toString('binary') returns the correct results. We can further verify that by doing a quick round trip test:
> var buf = iconv.convert("bébé");
undefined
> var buf2 = new Buffer(buf.toString('binary'), 'binary')
undefined
> buf
<Buffer 62 e9 62 e9>
> buf2
<Buffer 62 e9 62 e9>
In terms of how the characters are actually displayed in your terminal, that's what @addaleax was getting at when mentioning process.stdout.setDefaultEncoding(). You'll need to set the correct value in order for the characters to display correctly.
Btw, in general, Node.js does not actually claim to support ISO-8859-15. It does support binary, which we treat as an alias for latin-1, which is an alias for ISO-8859-1, but even that is fairly tenuous. For the most part we treat binary as simply being a sequence of raw bytes.
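That "raw bytes" behavior is easy to demonstrate with a round trip using only built-in Buffer methods:

```javascript
// 'binary' maps every byte 0x00–0xff to the Unicode code point of the
// same value, so bytes survive a round trip through a string unchanged.
const bytes = Buffer.from([0x00, 0x7f, 0xa4, 0xe9, 0xff]);
const asString = bytes.toString('binary');    // decode: byte -> U+00xx
const back = Buffer.from(asString, 'binary'); // encode: U+00xx -> byte

console.log(back.equals(bytes)); // true
```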
For the most part we treat binary as simply being a sequence of raw bytes.
That’s also a bit confusing. :-) I think @trevnorris recently said something in IRC about wanting to deprecate binary as the name for it and adopt latin-1 instead, and I can only express my full support for that.
(i.e. I’ll propose that myself if he doesn’t.)
Yes, definitely agree with that. What I'd like for us to do even more is support using the canonical/standard names for these various encodings. ICU4C (which we already have as a dependency) includes code for determining the canonical name for an encoding given any number of aliases, but we turn that code off and do not expose it by default.
I'm not sure how you're attempting to print the string values to verify the output, but remember that your terminal has to be set to the right encoding if you want them to display properly. For example, if I set my terminal to ISO-8859-1 then I get the following:
$ node -e 'process.stdout.write(Buffer([0xa4, 0xa]).toString("binary"), "binary")'
¤
Then change the terminal encoding to ISO-8859-15 and I get this:
$ node -e 'process.stdout.write(Buffer([0xa4, 0xa]).toString("binary"), "binary")'
€
So to verify the data is correct you'll have to do a byte comparison. Whether from a TCP packet or a file, node reads those in exactly as provided by the system.
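A byte comparison can be done with buf.equals() (or buf.toJSON() for a readable dump). For example, checking the 0xa4 byte that renders as ¤ in ISO-8859-1 but € in ISO-8859-15:

```javascript
// 0xa4 is the same byte in both encodings; only the terminal's
// interpretation differs. Compare bytes, not rendered characters.
const expected = Buffer.from([0xa4, 0x0a]);
const actual = Buffer.from('¤\n', 'binary'); // '¤' is U+00A4 -> byte 0xa4

console.log(actual.equals(expected)); // true
console.log(actual.toJSON());         // { type: 'Buffer', data: [ 164, 10 ] }
```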
If we then insert the "label" into our database, we end up with something like "b�b�"!
Can you give me the buffer output? Or the output of buf.toJSON() will work.
Thanks a lot for your help everyone.
With all those suggestions, we understood a lot of things and stopped testing in the terminal, since that was of course confusing. We found out that the problem was not coming from Node.js after all, but from the node-postgres connector we are using: no matter what settings we change, it doesn't seem to be able to force the SQL_ASCII encoding.
We're still on it but at least we made some progress !
Thanks a lot again.