Hi,
We are experiencing a very annoying encoding problem which started with LoopBack but seems to be Node.js related.
Basically, we just finished developing an API with LoopBack on top of an existing SQL_ASCII-encoded PostgreSQL database. Since the API has to be in UTF-8, we try to convert the data sent through our API routes to ISO-8859-15 in order to insert it correctly into our database.
No matter which iconv, utf8, iso-8859, etc. modules we tried, we couldn't manage to pass ISO-8859-15-converted strings; we ended up with very strange results. For example:
var Iconv = require('iconv').Iconv;
var iconv = new Iconv('UTF-8','ISO-8859-1');
var label = iconv.convert("bébé").toString();
If we then insert the "label" into our database, we end up with something like "b�b�"!
So we just tried to look directly in the terminal at how a basic Node.js application behaved (without LoopBack or any other framework), but it didn't turn out any better.
With the terminal encoding set to "ISO Latin 1", the following code:
console.log('bébé');
Was displayed this way in the Terminal :
bébé
As if Node.js were completely unable to handle ISO-8859 strings.
Are we missing something here?
Are we doomed to use UTF-8 strings in order to make this work?
Thanks in advance.
Node.js in general can handle encodings other than UTF-8 quite well.
At least one thing you are not using correctly is the return value of iconv.convert("bébé") – it’s a Buffer object, and it has to be, because nothing else would make sense. More concretely:
> iconv.convert("bébé");
<Buffer 62 e9 62 e9>
This is correctly encoded using ISO-8859-1. _However_, when you call .toString() on that buffer, there’s an implied default encoding argument of UTF-8, so decoding the 62 e9 62 e9 as UTF-8 messes it up. But even if you specified the correct encoding, all you would get back is the literal string 'bébé'… what else would you expect?
I am not sure what the database API you are working with looks like, but if it only accepts strings as input, then you need to tell the database which encoding you are using, because the encoding has to be done by it. If it accepts Buffers as input, then you shouldn’t call .toString() on these labels yourself, because the encoding has already been done by you.
As for your second question: if you change the default encoding of your terminal, you should use process.stdout.setDefaultEncoding(). (The encoding name you should use is binary. It’s the Node.js name for ISO-8859-1, and has this unfortunate name only for historical reasons.)
And FYI, bébé is the ISO-8859-1 interpretation of UTF-8-encoded 'bébé', and b�b� is the ISO-8859-1 interpretation of b�b� (with literal replacement characters). To get to the latter one, you’ll have to encode bébé as ISO-8859-1, interpret the result as a UTF-8-encoded string, convert it to a Buffer using UTF-8 again, and interpret _that_ as ISO-8859-1. Impressive.
For future reference, nodejs/help may be a better place for general questions about using Node.js (like this one). I’ll close this as not a bug in Node.js.
FWIW, using buf.toString() should be OK, but you'll need to explicitly pass in binary as the encoding, e.g. buf.toString('binary').
> Buffer.from('bébé','binary').toString('binary')
'bébé'
> Buffer.from('bébé','binary').toString()
'b�b�'
>
Thanks both of you for your quick answer.
First, sorry for posting in the wrong place. I'll definitely use nodejs/help for my next questions, but since this one is already open, I guess I should continue here.
@jasnell I tried that but unsuccessfully.
@addaleax Thanks for all this interesting info. It's still a little confusing, because I asked several questions and couldn't be sure which one you were addressing. Could you please be more specific about the only one that matters: how am I supposed to end up with an ISO-encoded string when I start with a UTF-8 string? A little snippet to illustrate would be welcome; I'm not familiar with buffers, so everything you explained is very impressive but a bit abstract to me. Again, my apologies...
@airzebeth ... running on Node.js v4 with iconv, I just ran the following test:
> const Iconv = require('iconv').Iconv
undefined
> var iconv = new Iconv('UTF-8', 'ISO-8859-1');
undefined
> var label = iconv.convert("bébé").toString();
undefined
> label
'b�b�'
> var label = iconv.convert("bébé").toString('binary');
undefined
> label
'bébé'
> iconv.convert("bébé")
<Buffer 62 e9 62 e9>
>
This indicates that the conversion is happening appropriately and that calling toString('binary') returns the correct results. We can further verify that by doing a quick round trip test:
> var buf = iconv.convert("bébé");
undefined
> var buf2 = new Buffer(buf.toString('binary'), 'binary')
undefined
> buf
<Buffer 62 e9 62 e9>
> buf2
<Buffer 62 e9 62 e9>
In terms of how the characters are actually displayed in your terminal, that's what @addaleax was getting at when mentioning process.stdout.setDefaultEncoding(). You'll need to set the correct value in order for the characters to display correctly.
Btw, in general, Node.js does not actually claim to support ISO-8859-15. It does support binary, which we treat as an alias for latin-1, which is an alias for ISO-8859-1, but even that is fairly tenuous. For the most part we treat binary as simply being a sequence of raw bytes.
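That "raw bytes" behavior is easy to demonstrate with a round trip using only built-in Buffer methods:

```javascript
// 'binary' maps every byte 0x00–0xff to the Unicode code point of the
// same value, so bytes survive a round trip through a string unchanged.
const bytes = Buffer.from([0x00, 0x7f, 0xa4, 0xe9, 0xff]);
const asString = bytes.toString('binary');    // decode: byte -> U+00xx
const back = Buffer.from(asString, 'binary'); // encode: U+00xx -> byte

console.log(back.equals(bytes)); // true
```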
For the most part we treat binary as simply being a sequence of raw bytes.
That’s also a bit confusing. :-) I think @trevnorris recently said something in IRC about wanting to deprecate binary as the name for it and adopt latin-1 instead, and I can only express my full support for that.
(i.e. I’ll propose that myself if he doesn’t.)
Yes, definitely agree with that. What I'd like for us to do even more is support using the canonical/standard names for these various encodings. ICU4C (which we already have as a dependency) includes code for determining the canonical name for an encoding given any number of aliases, but we turn that code off and do not expose it by default.
I'm not sure how you're attempting to print the string values to verify the output, but remember that your terminal has to be set to the right encoding if you want them to display properly. For example, if I set my terminal to ISO-8859-1 then I get the following:
$ node -e 'process.stdout.write(Buffer([0xa4, 0xa]).toString("binary"), "binary")'
¤
Then change the terminal encoding to ISO-8859-15 and I get this:
$ node -e 'process.stdout.write(Buffer([0xa4, 0xa]).toString("binary"), "binary")'
€
So to verify the data is correct you'll have to do a byte comparison. Whether from a TCP packet or a file, node reads those in exactly as provided by the system.
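A byte comparison can be done with buf.equals() (or buf.toJSON() for a readable dump). For example, checking the 0xa4 byte that renders as ¤ in ISO-8859-1 but € in ISO-8859-15:

```javascript
// 0xa4 is the same byte in both encodings; only the terminal's
// interpretation differs. Compare bytes, not rendered characters.
const expected = Buffer.from([0xa4, 0x0a]);
const actual = Buffer.from('¤\n', 'binary'); // '¤' is U+00A4 -> byte 0xa4

console.log(actual.equals(expected)); // true
console.log(actual.toJSON());         // { type: 'Buffer', data: [ 164, 10 ] }
```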
If we then insert the "label" into our database, we end up with something like "b�b�"!
Can you give me the buffer output? Or the output of buf.toJSON() will work.
Thanks a lot for your help everyone.
With all those suggestions, we understood a lot of things and stopped testing in the terminal, since that was of course confusing. We found out that the problem was not coming from Node.js after all, but from the node-postgres connector we are using: no matter what settings we change, it doesn't seem to be able to force the SQL_ASCII encoding.
We're still on it but at least we made some progress !
Thanks a lot again.