Mysql: Encoding issues

Created on 29 Apr 2014  ·  19Comments  ·  Source: mysqljs/mysql

This module just fails so much with character encodings. If the server is not sending UTF-8 back, everything will just be garbage. Because of the parsing architecture of this module, making this work cannot be done in a backwards-compatible way.

bug feature

All 19 comments

This is a pretty big thing I am hoping will get fixed.

Can you elaborate? I'm not able to reproduce.

This module needs to decode the incoming strings from the row packets according to the charset specified in the field packets for that given sequence, not just using what was configured as the charset properties in the connection config (since this is just for the initial negotiation). A good example is to set a column to a latin1 charset, store a character that has the high bit set in that column, connect with this module using the default settings to a server that ignores the charset given by the client and instead uses the latin1 charset of that column and then try to query that row. The character will be decoded as if it was UTF-8, but it is not, typically resulting in the Unicode replacement character to be returned instead.

Since this is just a tracking issue for something the current maintainers are aware of as an issue, it was not necessary to go into great detail here. If you would like me to put together an actual reproduction, rather than just a description, I'd be happy to do so :)! But I'll need to get home from vacation in a week or so first so I have the time to actually do that (and then, it will be done after I complete all the tasks that have accumulated while I have been away on vacation).

I feel like the explanations is pretty clear above, so let me know if it isn't and what any specific questions are :)

When does the server ignore the charset given by the client? Can it enforce another one? If so, do you know how I can simulate that?

I just wanted to understand issue and created test example to reproduce ( mysql2 is also very dumb in treating fields charsets, it assumes that server is using utf8 after connection )

var mysql = require('../index.js');

var c = mysql.createConnection({ debug: true, user: 'user', password: 'pass', database: 'test'})
c.query('CREATE TABLE charset_test ( test varchar(500) ) CHARACTER SET latin1 COLLATE latin1_danish_ci', (err, rows, fields) => console.log(err, fields, rows))
c.query("delete from charset_test", (err, rows, fields) => console.log(err, fields, rows));
var textExamples = ['ютф восемь', 'test тест テスト փորձարկում পরীক্ষা kiểm tra  ', ' ტესტი પરીક્ષણ  מבחן פּרובירן اختبار परीक्षण   '];
textExamples.forEach( (val) => {
  c.query("insert into charset_test(test) values(?)", [val], (err, rows, fields) => console.log(err, fields, rows));
});

c.query("select *  from charset_test", (err, rows, fields) => console.log(err, fields, rows));

last query prints garbage

Thank you, will try it :)

The problem starts with the INSERT. I'm not sure the problem exists on SELECT. I'm trying to insert those test values using Sequel Pro and it prints exactly the same garbage 👊

so the problem is 1) default charset may be different from what client asked 2) result charset can be anything. When we read strings from network packets instead of always doing `readString(..., 'utf8') we need to use field charset ( always sent with column definition packet - http://dev.mysql.com/doc/internals/en/com-query-response.html#column-definition )

MySQL client on OSX (macports):

mysql> delete from charset_test;
Query OK, 6 rows affected (0.01 sec)

mysql> insert into charset_test values ('ютф восемь');
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> select * from charset_test;
+------------+
| test       |
+------------+
| ??? ?????? |
+------------+
1 row in set (0.00 sec)

@dresende maybe it's not a good example :) try to insert something non-utf8 ( but in the table encoding ) and then read with mysql client. I'll try to make a better one

I'm not sure your example is supposed to work :)

actually, I can't reproduce as well. Despite the table/field created with non-utf charset it's returned back as utf to me ( server version - 5.7.10 )

When does the server ignore the charset given by the client?

This is done by starting the server with the command line option --skip-character-set-client-handshake (see https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_character_set_client). I have encountered servers like this a few times, and it's really irritating, especially because there is no reason this module cannot just work out of the box with these servers by using the charset in the fields packet.

Can it enforce another one?

I think there is, but at least, combined with the above command line option and a default charset configured on the server will effectively enforce a different charset.

Also, on top of all this, it's still possible for field packets to use a different charset than the connection charset, even if it was setup correctly. I have encountered this before and can go back to that server and provide some information to help reproduce if you like.

I'm not really hearing an argument for why we should be ignoring the charset from the fields packets, though :)

Also, @dresende, for your tests above, I would be sure to inspect the packets with Wireshark to ensure they are actually sending the characters you are expecting. There has been many times I've pasted in non-latin1 characters into a command prompt or other program and that program actually just messed it up. Not saying that's what happened here, but could be possible. I usually find it best to only post what the underlying character bytes are over the wire when talking about encoding issues, because there are so many things that can play into issues.

looks like we need to depend on some kind of iconv-like module
https://www.npmjs.com/package/iconv-lite ?

@sidorares, yes, we'd have to depend on a library, even if we wanted the current encoding stuff to function correctly (right now, this module literally only works with UTF-8). i.e. if one set the charset connection config today to something like KOI8-R it would not work, AFAIK, since it would still encode/decode as UTF-8.

We might need to depend on iconv-lite (speed) or iconv (more encodings). About the encoding, I tested on command line and Sequel but I'm going to try your option to see how the server behaves.

I've managed to work this around. Here's how I got to it.

Test app to check which charsets are the client and the server using:

var mysql = require('mysql');

var c = mysql.createConnection({
    host: 'localhost',
    database: 'db',
    user: 'user',
    password: 'pwd',
    debug: true
})
c.query("SHOW VARIABLES LIKE  'char%'"); 

Execution:

$ node test.js
<-- HandshakeInitializationPacket
HandshakeInitializationPacket {
  ...
  serverVersion: '5.6.16-64.2-569.wheezy',
  ...
  serverCapabilities1: 63487,
  serverLanguage: 9,   <----
...
--> (120219398) ComQueryPacket
ComQueryPacket { command: 3, sql: 'SHOW VARIABLES LIKE  \'char%\'' } <---

<-- (120219398) ResultSetHeaderPacket
ResultSetHeaderPacket { fieldCount: 2, extra: undefined }

<-- (120219398) FieldPacket
FieldPacket {
  catalog: 'def',
  db: 'information_schema',
  table: 'VARIABLES',
  orgTable: 'VARIABLES',
  name: 'Variable_name',
  orgName: 'VARIABLE_NAME',
  charsetNr: 9,
  length: 64,
  type: 253,
  flags: 1,
  decimals: 0,
  default: undefined,
  zeroFill: false,
  protocol41: true }

<-- (120219398) FieldPacket
FieldPacket {
  catalog: 'def',
  db: 'information_schema',
  table: 'VARIABLES',
  orgTable: 'VARIABLES',
  name: 'Value',
  orgName: 'VARIABLE_VALUE',
  charsetNr: 9,
  length: 1024,
  type: 253,
  flags: 0,
  decimals: 0,
  default: undefined,
  zeroFill: false,
  protocol41: true }

<-- (120219398) EofPacket
EofPacket {
  fieldCount: 254,
  warningCount: 0,
  serverStatus: 34,
  protocol41: true }

<-- (120219398) RowDataPacket
RowDataPacket { Variable_name: 'character_set_client', Value: 'latin2' } <---

<-- (120219398) RowDataPacket
RowDataPacket { Variable_name: 'character_set_connection', Value: 'latin2' } <---

<-- (120219398) RowDataPacket
RowDataPacket { Variable_name: 'character_set_database', Value: 'utf8mb4' }

<-- (120219398) RowDataPacket
RowDataPacket { Variable_name: 'character_set_filesystem', Value: 'binary' }

<-- (120219398) RowDataPacket
RowDataPacket { Variable_name: 'character_set_results', Value: 'latin2' } <---

<-- (120219398) RowDataPacket
RowDataPacket { Variable_name: 'character_set_server', Value: 'latin2' } <---

<-- (120219398) RowDataPacket
RowDataPacket { Variable_name: 'character_set_system', Value: 'utf8' }

Some info:

character_set_client, _connection, _results - these are session-specific.
charsetNr: 9 - /lib/protocol/constants/charsets.js: exports.LATIN2_GENERAL_CI = 9;

According to the data above, the server is in latin2 and the client uses that instead of it's own charset... why? Maybe it's not even set? I don't have any more time to dig into that since I already found a workaround.

My workaround:

I set the charset to UTF8MB4 with a simple query - the first one after initiating connection.
My database all text columns already match the charset - yes they must match!

SET names 'utf8mb4'

New results of charset variables check:

<-- (120229886) RowDataPacket
RowDataPacket { Variable_name: 'character_set_client', Value: 'utf8mb4' }

<-- (120229886) RowDataPacket
RowDataPacket { Variable_name: 'character_set_connection', Value: 'utf8mb4' }

<-- (120229886) RowDataPacket
RowDataPacket { Variable_name: 'character_set_database', Value: 'utf8mb4' }

<-- (120229886) RowDataPacket
RowDataPacket { Variable_name: 'character_set_filesystem', Value: 'binary' }

<-- (120229886) RowDataPacket
RowDataPacket { Variable_name: 'character_set_results', Value: 'utf8mb4' }

<-- (120229886) RowDataPacket
RowDataPacket { Variable_name: 'character_set_server', Value: 'latin2' }

<-- (120229886) RowDataPacket
RowDataPacket { Variable_name: 'character_set_system', Value: 'utf8' }

All in all, the mysql module must use the database's charset! not the server's.
Sometimes there are specific columns with different collation, but I don't know why anyone would do that since UTF8 supports virtually all characters.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

winzig picture winzig  ·  4Comments

wahengchang picture wahengchang  ·  3Comments

skilbjo picture skilbjo  ·  3Comments

acefxlabs picture acefxlabs  ·  4Comments

Axxxx0n picture Axxxx0n  ·  3Comments