ClickHouse: convertCharset(s, 'utf-8', 'utf-16') doesn't seem to work properly.

Created on 6 Nov 2019  ·  5 Comments  ·  Source: ClickHouse/ClickHouse

Executed Query

SELECT javaHash(convertCharset('a1가', 'utf-8', 'utf-16'))

Result of convertCharset
I set a breakpoint in javaHash() and inspected the value passed in from convertCharset().
I expected the bytes 97 0 49 0 0 -84, with a size of 6.
The actual value is different.

    struct JavaHashImpl
    {
        static constexpr auto name = "javaHash";
        using ReturnType = Int32;

        static Int32 apply(const char * data, const size_t size)
        {
            UInt32 h = 0;
            for (size_t i = 0; i < size; ++i)
                h = 31 * h + static_cast<UInt32>(static_cast<Int8>(data[i]));
            return static_cast<Int32>(h);
        }

(gdb) p size
$132 = 8    

(gdb) x/8db data                                                           
0x7fff46814150: -1      -2      97      0       49      0       0       -84

Notice that the first two bytes are -1 and -2, which looks odd.
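
gdb's x/8db format prints each byte as a signed decimal, which can hide familiar hex patterns. As a small sketch (the helper name signedByteToHex is made up for illustration and is not part of ClickHouse or gdb), the values can be reinterpreted as unsigned bytes:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: format a signed byte value (as printed by gdb's x/db)
// as two uppercase hex digits.
std::string signedByteToHex(int value)
{
    char buf[3];
    // Reinterpret the value as an unsigned 8-bit byte before printing.
    std::snprintf(buf, sizeof(buf), "%02X",
                  static_cast<unsigned>(static_cast<uint8_t>(value)));
    return std::string(buf);
}
```

Under this conversion, -1 and -2 correspond to FF and FE, and -84 corresponds to AC.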

Versions

ClickHouse client version 19.17.1.1.
Connecting to localhost:19000 as user default.
Connected to ClickHouse server version 19.17.1 revision 54428.

All 5 comments

I found out what -1 -2 means.
They are the BOM (byte order mark) of UTF-16.
UTF-16 uses U+FEFF as its BOM; the bytes -1 -2 are FF FE, which is its little-endian form.
So javaHash() would have to know the charset of the source string, because Java computes hashCode differently depending on how the string is encoded internally:

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        hash = h = isLatin1() ? StringLatin1.hashCode(value)
                              : StringUTF16.hashCode(value);
    }
    return h;
}

But our strings carry no information about their encoding. The only way to solve this is to provide another function, javaHashUTF16, that calculates javaHash under the assumption that the string is in UTF-16.
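
A minimal sketch of what such a function could look like, assuming little-endian UTF-16 input without a BOM. The name javaHashUTF16LESketch and the handling of an odd trailing byte are assumptions made for illustration; this is not ClickHouse's actual implementation:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not ClickHouse code): Java-style hash computed over
// 16-bit code units instead of single bytes.
// Assumptions: input is UTF-16LE, carries no BOM, and has an even size;
// a trailing odd byte is simply ignored in this sketch.
int32_t javaHashUTF16LESketch(const char * data, size_t size)
{
    uint32_t h = 0;
    for (size_t i = 0; i + 1 < size; i += 2)
    {
        // Assemble one code unit from two little-endian bytes.
        uint32_t low = static_cast<uint8_t>(data[i]);
        uint32_t high = static_cast<uint8_t>(data[i + 1]);
        uint32_t unit = low | (high << 8);

        // Same recurrence as the byte-wise javaHash, applied per code unit.
        h = 31 * h + unit;
    }
    return static_cast<int32_t>(h);
}
```

For the six bytes 61 00 31 00 00 AC (the expected encoding of 'a1가'), this yields 138768, which equals 97·31² + 49·31 + 44032, the value the Java hashCode formula gives for those three code units.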

To avoid the BOM, you should specify utf-16be or utf-16le:

milovidov-Pro-P30 :) SELECT hex(convertCharset('1', 'utf-8', 'utf-16'))

SELECT hex(convertCharset('1', 'utf-8', 'utf-16'))

┌─hex(convertCharset('1', 'utf-8', 'utf-16'))─┐
│ FFFE3100                                    │
└─────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.011 sec. 

milovidov-Pro-P30 :) SELECT hex(convertCharset('1', 'utf-8', 'utf-16be'))

SELECT hex(convertCharset('1', 'utf-8', 'utf-16be'))

┌─hex(convertCharset('1', 'utf-8', 'utf-16be'))─┐
│ 0031                                          │
└───────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.010 sec. 

milovidov-Pro-P30 :) SELECT hex(convertCharset('1', 'utf-8', 'utf-16le'))

SELECT hex(convertCharset('1', 'utf-8', 'utf-16le'))

┌─hex(convertCharset('1', 'utf-8', 'utf-16le'))─┐
│ 3100                                          │
└───────────────────────────────────────────────┘

And probably we should add a function javaHashUTF16LE.
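
If the plain 'utf-16' charset has to be used anyway, the FF FE prefix seen in the first result above could be detected and skipped before hashing. The following is only a sketch; the names Utf16Order, BomInfo and detectUtf16Bom are hypothetical and do not exist in ClickHouse:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical types for illustration only.
enum class Utf16Order { Unknown, LittleEndian, BigEndian };

struct BomInfo
{
    Utf16Order order;       // byte order indicated by the BOM, if any
    size_t payload_offset;  // number of leading bytes to skip (0 or 2)
};

// Inspect the first two bytes of a buffer for a UTF-16 BOM (U+FEFF).
BomInfo detectUtf16Bom(const char * data, size_t size)
{
    if (size >= 2)
    {
        uint8_t b0 = static_cast<uint8_t>(data[0]);
        uint8_t b1 = static_cast<uint8_t>(data[1]);
        if (b0 == 0xFF && b1 == 0xFE)
            return {Utf16Order::LittleEndian, 2};  // FF FE: little-endian
        if (b0 == 0xFE && b1 == 0xFF)
            return {Utf16Order::BigEndian, 2};     // FE FF: big-endian
    }
    // No BOM found: the byte order cannot be inferred from the data alone.
    return {Utf16Order::Unknown, 0};
}
```

For the bytes FF FE 31 00 this reports little-endian with a payload offset of 2, so only 31 00 would be hashed. Requesting utf-16le or utf-16be explicitly, as shown above, avoids the need for this step.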

@alexey-milovidov
Thank you for your help.
I have already written a function like javaHashUTF16LE for my own work.
If you don't mind, I will open a PR for javaHashUTF16LE as soon as possible.
