When inserting Float32 or Float64 values over HTTP, I noticed that _sometimes_ their decimal part is missing, or divided by 10000 or 100000, or the float becomes inf. This is not systematic and happens only for a few rows (11 out of 500000 in the example below).
The same thing doesn't happen when inserting the same data from a CSV file.
ClickHouse server version: 19.16.2.2; also reproduced with 19.17.3.7.
To reproduce:
DROP TABLE IF EXISTS strange_bug_http;
CREATE TABLE strange_bug_http
(
`row_id` Int32,
`longitude` Float32 CODEC(Gorilla, ZSTD),
`latitude` Float32 CODEC(Gorilla, ZSTD),
`geohash` FixedString(3) CODEC(ZSTD)
)
ENGINE = MergeTree()
ORDER BY row_id;
-- for comparison
DROP TABLE IF EXISTS strange_bug_csv;
CREATE TABLE strange_bug_csv
(
`row_id` Int32,
`longitude` Float32 CODEC(Gorilla, ZSTD),
`latitude` Float32 CODEC(Gorilla, ZSTD),
`geohash` FixedString(3) CODEC(ZSTD)
)
ENGINE = MergeTree()
ORDER BY row_id;
You can download the generated files from here.
# pip install python-geohash
# pip install numpy
import numpy as np
import geohash
NUM_POINTS = 500_000
longitude = np.random.rand(NUM_POINTS) * 360 - 180
latitude = np.random.rand(NUM_POINTS) * 180 - 90
@np.vectorize
def geohashEncode(longitude, latitude, precision=3):
    return geohash.encode(latitude, longitude, precision)

# bind the results to a new name so the geohash module is not shadowed
geohashes = geohashEncode(longitude, latitude)

data = b','.join(
    b'(' + b','.join(repr(x).encode() for x in (i,) + tup) + b')'
    for i, tup in enumerate(zip(longitude, latitude, geohashes))
)
with open('floats.bin', 'wb') as f:
    f.write(data)

csv = '\n'.join(
    ','.join(repr(x) for x in (i,) + tup)
    for i, tup in enumerate(zip(longitude, latitude, geohashes))
)
with open('floats.csv', 'w') as f:
    f.write(csv)
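As a sanity check on the generator (my addition, not part of the original repro): Python's repr() of a float is guaranteed to round-trip exactly, so the Values/CSV text written above is a lossless encoding of the Float64 inputs, and any corruption must happen downstream.

```python
import random

# repr() emits the shortest decimal string that parses back to the exact
# same float, so the text written to floats.bin / floats.csv carries the
# full Float64 coordinates without loss.
random.seed(0)
for _ in range(100_000):
    x = random.uniform(-180, 180)
    assert float(repr(x)) == x
```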
# pip install uvloop
# pip install aiohttp
import asyncio
import uvloop
import aiohttp
with open('floats.bin', 'rb') as f:
    data = f.read()

uvloop.install()
loop = asyncio.get_event_loop()

params = {"database": "default", "query": "INSERT INTO strange_bug_http VALUES"}

async def main(data, params, url="http://localhost:8123/"):
    async with aiohttp.ClientSession() as session:
        async with session.post(url=url, params=params, data=data) as resp:
            return await resp.read()

loop.run_until_complete(main(data, params))
cat floats.csv | clickhouse-client --query="INSERT INTO strange_bug_csv FORMAT CSV"
SELECT
row_id + 1,
longitude,
latitude,
geohashEncode(longitude, latitude, 3) AS computed_geohash,
geohash
FROM [strange_bug_csv/strange_bug_http]
WHERE computed_geohash != geohash;
On the table filled from CSV, it returns one row, which is probably due to a geohash miscalculation / float precision issue:
┌─plus(row_id, 1)─┬─longitude─┬──latitude─┬─computed_geohash─┬─geohash─┐
│          360307 │ -171.5625 │ -81.95313 │ 01q              │ 01m     │
└─────────────────┴───────────┴───────────┴──────────────────┴─────────┘
On the table filled via HTTP, it returns the following:
┌─plus(row_id, 1)─┬──longitude─┬───latitude─┬─computed_geohash─┬─geohash─┐
│          201272 │ -95.921555 │         52 │ cc5              │ cc7     │
│          241104 │ -14.000025 │  35.987846 │ ewq              │ ewm     │
│          261022 │   7.000061 │ -69.146805 │ h5s              │ h5t     │
└─────────────────┴────────────┴────────────┴──────────────────┴─────────┘
┌─plus(row_id, 1)─┬──longitude─┬───latitude─┬─computed_geohash─┬─geohash─┐
│           81429 │  174.77354 │  49.000008 │ zbs              │ zbu     │
│          101695 │   35.81316 │        inf │ uzc              │ uu9     │
│          121610 │ -109.74762 │ -28.000027 │ 3e1              │ 3dc     │
└─────────────────┴────────────┴────────────┴──────────────────┴─────────┘
┌─plus(row_id, 1)─┬─longitude─┬──latitude─┬─computed_geohash─┬─geohash─┐
│          300857 │ 157.00002 │ -69.10218 │ p7x              │ pe8     │
└─────────────────┴───────────┴───────────┴──────────────────┴─────────┘
┌─plus(row_id, 1)─┬─longitude─┬──latitude─┬─computed_geohash─┬─geohash─┐
│          360307 │ -171.5625 │ -81.95313 │ 01q              │ 01m     │
└─────────────────┴───────────┴───────────┴──────────────────┴─────────┘
┌─plus(row_id, 1)─┬─longitude─┬───latitude─┬─computed_geohash─┬─geohash─┐
│          360603 │      -171 │ -56.861885 │ 0jy              │ 0jv     │
│          380518 │ 119.77515 │ -46.000004 │ nxv              │ nxt     │
└─────────────────┴───────────┴────────────┴──────────────────┴─────────┘
┌─plus(row_id, 1)─┬──longitude─┬───latitude─┬─computed_geohash─┬─geohash─┐
│          420349 │ -113.02339 │ -11.000004 │ 3qp              │ 3mz     │
└─────────────────┴────────────┴────────────┴──────────────────┴─────────┘
┌─plus(row_id, 1)─┬─longitude─┬───latitude─┬─computed_geohash─┬─geohash─┐
│          480089 │ -177.2999 │ -42.000057 │ 209              │ 203     │
└─────────────────┴───────────┴────────────┴──────────────────┴─────────┘
For instance, the second and third rows in the results have longitudes of -14.000025 and 7.000061 respectively, which differ from the inserted values -14.250967871479986 and 7.608598672771308.
Checking the CSV data:
$ sed -n 241104p floats.csv
241103, -14.250967871479986, 35.98784501339779, 'ewm'
$ sed -n 261022p floats.csv
261021, 7.608598672771308, -69.14680134387808, 'h5t'
Checking the CSV table:
SELECT
row_id + 1,
longitude,
latitude,
geohashEncode(longitude, latitude, 3) AS computed_geohash,
geohash
FROM strange_bug_csv
WHERE (row_id + 1) IN (241104, 261022);
┌─plus(row_id, 1)─┬──longitude─┬───latitude─┬─computed_geohash─┬─geohash─┐
│          241104 │ -14.250968 │  35.987846 │ ewm              │ ewm     │
│          261022 │  7.6085987 │ -69.146805 │ h5t              │ h5t     │
└─────────────────┴────────────┴────────────┴──────────────────┴─────────┘
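For what it's worth, these CSV-loaded values are Float32 rounding, not truncation: -14.250968 and 7.6085987 are the shortest decimal renderings of the nearest Float32 to each inserted Float64. A quick NumPy check (my addition, assuming NumPy's float32 matches ClickHouse's Float32):

```python
import numpy as np

# Casting the inserted Float64 literals to Float32 yields exactly the
# values the CSV-loaded table displays; the decimals only look shorter
# because Float32 carries about 7 significant digits.
assert np.float32(-14.250967871479986) == np.float32(-14.250968)
assert np.float32(7.608598672771308) == np.float32(7.6085987)
```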
Checking the binary data:
def get_binary_row(row_id, data):
    p = data.find("({}".format(row_id).encode())
    return data[p : p + data[p:].find(b')') + 1]
>>> get_binary_row(241104-1, data)
b"(241103,-14.250967871479986,35.98784501339779,'ewm')"
>>> get_binary_row(261022-1, data)
b"(261021,7.608598672771308,-69.14680134387808,'h5t')"
Any workaround is much appreciated.
Thank you.
Not related to HTTP.
CREATE TABLE strange_bug_val
(
    a Float32,
    b Float32
)
ENGINE = Memory;
clickhouse-client -q "select '(-160.32605134916085,37.70584056842162),' from numbers(1000000)" >test.values
cat test.values |clickhouse-client -q "INSERT INTO strange_bug_val FORMAT Values"
SELECT DISTINCT *
FROM strange_bug_val
┌──────────a─┬─────────b─┐
│ -160.32605 │  37.70584 │
│ -160.32605 │ 37.000008 │
│       -160 │  37.70584 │
│ -160.32605 │       inf │
│       -inf │  37.70584 │
│ -160.32605 │        37 │
└────────────┴───────────┘
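Since every input row in this repro is the same tuple, a faithful parse can only produce one distinct row: the Float32 roundings of the two literals, i.e. the first row above. A small NumPy sketch (my addition) of the only legitimate result:

```python
import numpy as np

# All million input rows are identical, so the sole correct distinct row
# is the Float32 rounding of the two literals.
a = np.float32(-160.32605134916085)
b = np.float32(37.70584056842162)
assert a == np.float32(-160.32605)
assert b == np.float32(37.70584)

# Variants like (-160, ...), (..., 37) or (..., inf) cannot arise from a
# correct parse of this input.
assert a != np.float32(-160)
assert b != np.float32(37)
```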
Reproduced with 19.18.1.1667
No issue with 19.15.5.18
@den-crane Thanks for the concise example.
It seems that with fewer decimals in the input, wrong values such as 37.000008 disappear, but inf and values without decimals persist.
BTW, about CSV, the cause is Float32 vs Float64:
┌─plus(row_id, 1)─┬─longitude─┬──latitude─┬─computed_geohash─┬─geohash─┐
│          360307 │ -171.5625 │ -81.95313 │ 01q              │ 01m     │
└─────────────────┴───────────┴───────────┴──────────────────┴─────────┘
select geohashEncode(cast(-171.5625053737408 as Float64) ,cast(-81.95313575217848 as Float64)) computed_geohash
┌─computed_geohash─┐
│ 01mvzbzuru5y     │
└──────────────────┘
select geohashEncode(cast(-171.5625053737408 as Float32) ,cast(-81.95313575217848 as Float32)) computed_geohash
┌─computed_geohash─┐
│ 01qjb0bh2hb1     │
└──────────────────┘
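The 360307 mismatch is also explainable: the Float32 rounding of that longitude lands exactly on a geohash cell edge. Assuming the standard geohash bit layout (a 3-character hash is 15 bits, 8 of them longitude, so longitude cells are 360/2^8 = 1.40625° wide), a sketch of why the hash flips:

```python
import numpy as np

lon64 = -171.5625053737408
lon32 = float(np.float32(lon64))

# The nearest Float32 is exactly -171.5625 (representable as 2745/16).
assert lon32 == -171.5625

# With 8 longitude bits in a 3-char geohash, cells are 1.40625 degrees
# wide, and -171.5625 = -180 + 6 * 1.40625 is exactly a cell boundary.
cell = 360 / 2 ** 8
assert (lon32 + 180) / cell == 6.0

# Rounding moved the point from just below the boundary onto it, into
# the neighbouring cell -- hence computed '01q' vs stored '01m'.
assert lon64 < lon32
```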
I have also encountered this problem; my workaround is:
- convert the Float32 and Float64 columns to String
- convert the String values back to Float (using toFloat64OrNull or other functions)

JFYI fixed: 19.17.4.11-stable, 19.16.5.15-stable
Thank you guys for your reactivity!