Executed Query
SELECT javaHash(convertCharset('a1가', 'utf-8', 'utf-16'))
Result of convertCharset
I set a breakpoint in javaHash() and checked the value passed in from convertCharset().
I expected the bytes 97 0 49 0 0 -84 and a size of 6.
But the actual value is different.
struct JavaHashImpl
{
    static constexpr auto name = "javaHash";
    using ReturnType = Int32;

    static Int32 apply(const char * data, const size_t size)
    {
        UInt32 h = 0;
        for (size_t i = 0; i < size; ++i)
            h = 31 * h + static_cast<UInt32>(static_cast<Int8>(data[i]));
        return static_cast<Int32>(h);
    }
(gdb) p size
$132 = 8
(gdb) x/8db data
0x7fff46814150: -1 -2 97 0 49 0 0 -84
Notice that the first two bytes are -1 -2, which looks odd.
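To make the effect of the two extra leading bytes concrete, here is a minimal standalone sketch. It repeats the arithmetic of apply() from the listing above using standard fixed-width types, and feeds it the bytes seen in gdb and the bytes that were expected. The names javaHashBytes, observed_bytes and expected_bytes are made up for this illustration and are not part of ClickHouse.

```cpp
#include <cstddef>
#include <cstdint>

// Same arithmetic as JavaHashImpl::apply: h = 31 * h + (byte read as signed 8-bit).
inline int32_t javaHashBytes(const char * data, size_t size)
{
    uint32_t h = 0;
    for (size_t i = 0; i < size; ++i)
        h = 31 * h + static_cast<uint32_t>(static_cast<int8_t>(data[i]));
    return static_cast<int32_t>(h);
}

// Bytes observed in gdb: -1 -2 97 0 49 0 0 -84 (size 8).
const char observed_bytes[] = {'\xFF', '\xFE', 0x61, 0x00, 0x31, 0x00, 0x00, '\xAC'};

// Bytes that were expected: 97 0 49 0 0 -84 (size 6).
const char expected_bytes[] = {0x61, 0x00, 0x31, 0x00, 0x00, '\xAC'};
```

Calling javaHashBytes on the two arrays gives two different hashes, because the two leading bytes are mixed into every later step of the 31 * h recurrence.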
Versions
ClickHouse client version 19.17.1.1.
Connecting to localhost:19000 as user default.
Connected to ClickHouse server version 19.17.1 revision 54428.
I found out what -1 -2 means.
It is the BOM (Byte Order Mark) of UTF-16.
The BOM is the code point U+FEFF; in little-endian UTF-16 it is stored as the bytes 0xFF 0xFE, which gdb prints as -1 -2.
So the javaHash() function has to know the charset of the source string, because Java calculates the hashCode value based on the string's encoding:
public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        hash = h = isLatin1() ? StringLatin1.hashCode(value)
                              : StringUTF16.hashCode(value);
    }
    return h;
}
But our strings carry no information about their encoding. The only way to solve this is to provide another function, javaHashUTF16, that calculates javaHash under the assumption that the string is in UTF-16.
To avoid the BOM, you should specify utf-16be or utf-16le:
milovidov-Pro-P30 :) SELECT hex(convertCharset('1', 'utf-8', 'utf-16'))
SELECT hex(convertCharset('1', 'utf-8', 'utf-16'))
┌─hex(convertCharset('1', 'utf-8', 'utf-16'))─┐
│ FFFE3100                                    │
└─────────────────────────────────────────────┘
1 rows in set. Elapsed: 0.011 sec.

milovidov-Pro-P30 :) SELECT hex(convertCharset('1', 'utf-8', 'utf-16be'))
SELECT hex(convertCharset('1', 'utf-8', 'utf-16be'))
┌─hex(convertCharset('1', 'utf-8', 'utf-16be'))─┐
│ 0031                                          │
└───────────────────────────────────────────────┘
1 rows in set. Elapsed: 0.010 sec.

milovidov-Pro-P30 :) SELECT hex(convertCharset('1', 'utf-8', 'utf-16le'))
SELECT hex(convertCharset('1', 'utf-8', 'utf-16le'))
┌─hex(convertCharset('1', 'utf-8', 'utf-16le'))─┐
│ 3100                                          │
└───────────────────────────────────────────────┘
And probably we should add a function javaHashUTF16LE.
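As a rough illustration of what such a function could compute, here is a minimal sketch. It assumes the input is BOM-free UTF-16LE and relies on the documented definition of Java's String.hashCode(), which sums s[i] * 31^(n-1-i) over the string's 16-bit char values. The name javaHashUTF16LESketch is hypothetical, and silently ignoring a trailing odd byte is a simplification made for this example, not a statement about how ClickHouse does or should handle malformed input.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: hash a BOM-free UTF-16LE byte buffer the way Java hashes
// a String, i.e. over 16-bit code units instead of single bytes.
inline int32_t javaHashUTF16LESketch(const char * data, size_t size)
{
    uint32_t h = 0;
    // Assumption for this sketch: a trailing odd byte, if any, is ignored.
    for (size_t i = 0; i + 1 < size; i += 2)
    {
        // Little-endian: low byte first, then high byte.
        const uint32_t lo = static_cast<uint8_t>(data[i]);
        const uint32_t hi = static_cast<uint8_t>(data[i + 1]);
        h = 31 * h + ((hi << 8) | lo);
    }
    return static_cast<int32_t>(h);
}
```

For example, the six bytes 61 00 31 00 00 AC (the code units 0x0061, 0x0031, 0xAC00) are hashed as 97 * 31^2 + 49 * 31 + 44032. The same buffer with a leading FF FE would hash the BOM as an extra code unit, which is why the input should come from 'utf-16le' rather than 'utf-16'.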
@alexey-milovidov
Thank you for your help.
I have already written a function like javaHashUTF16LE for my work.
If you don't mind, I will open a PR for javaHashUTF16LE as soon as possible.