I have data files where each line contains a record delimited by commas only. There are no commas within any of the field values, so there shouldn't be any escaping issues. The first 22 columns of the table look as follows:
CREATE TABLE trips (
    trip_id UInt32,
    vendor_id String,
    pickup_datetime DateTime,
    dropoff_datetime Nullable(DateTime),
    store_and_fwd_flag Nullable(FixedString(1)),
    rate_code_id Nullable(UInt8),
    pickup_longitude Nullable(Float64),
    pickup_latitude Nullable(Float64),
    dropoff_longitude Nullable(Float64),
    dropoff_latitude Nullable(Float64),
    passenger_count Nullable(UInt8),
    trip_distance Nullable(Float64),
    fare_amount Nullable(Float32),
    extra Nullable(Float32),
    mta_tax Nullable(Float32),
    tip_amount Nullable(Float32),
    tolls_amount Nullable(Float32),
    ehail_fee Nullable(Float32),
    improvement_surcharge Nullable(Float32),
    total_amount Nullable(Float32),
    payment_type Nullable(String),
    trip_type Nullable(UInt8),
    ...
) ENGINE = Log;
When I attempt to import the data with the following command:
time (gunzip -c /home/mark/trips/trips_x*.csv.gz | \
clickhouse-client \
--query="INSERT INTO trips FORMAT CSV")
I'm getting an error message on one of the records:
Code: 27. DB::Exception: Cannot parse input: expected , before: -6,52,,,,,,,,,,,,,,,,,,,,\n504769614,CMT,2011-11-05 06:07:07,2011-11-05 06:12:53,N,1,-73.990061999999995,40.756630000000001,-73.996763999999999,40.73723799999999: (at row 73452711)
Row 73452710:
Column 0, name: trip_id, type: UInt32, parsed text: "504769613"
Column 1, name: vendor_id, type: String, parsed text: "VTS"
Column 2, name: pickup_datetime, type: DateTime, parsed text: "2011-11-04 13:21:00"
Column 3, name: dropoff_datetime, type: Nullable(DateTime), parsed text: "2011-11-04 13:31:00"
Column 4, name: store_and_fwd_flag, type: Nullable(FixedString(1)), parsed text: <EMPTY>
Column 5, name: rate_code_id, type: Nullable(UInt8), parsed text: "1"
Column 6, name: pickup_longitude, type: Nullable(Float64), parsed text: "-74.005681999999993"
Column 7, name: pickup_latitude, type: Nullable(Float64), parsed text: "40.745690000000003"
Column 8, name: dropoff_longitude, type: Nullable(Float64), parsed text: "-73.996515000000002"
Column 9, name: dropoff_latitude, type: Nullable(Float64), parsed text: "40.732435000000002"
Column 10, name: passenger_count, type: Nullable(UInt8), parsed text: "1"
Column 11, name: trip_distance, type: Nullable(Float64), parsed text: "1.3799999999999999"
Column 12, name: fare_amount, type: Nullable(Float32), parsed text: "7.2999999999999998"
Column 13, name: extra, type: Nullable(Float32), parsed text: "0"
Column 14, name: mta_tax, type: Nullable(Float32), parsed text: "0.5"
Column 15, name: tip_amount, type: Nullable(Float32), parsed text: "2"
Column 16, name: tolls_amount, type: Nullable(Float32), parsed text: "0"
Column 17, name: ehail_fee, type: Nullable(Float32), ERROR: text ",,9.800000" is not like Nullable(Float32)
Is there a better way to define the table schema and/or the format parameters in the INSERT INTO statement so that the missing values don't cause such an issue when importing?
The CSV format doesn't treat empty values as NULL, to avoid ambiguity with empty strings.
We should add an option to treat empty as NULL.
Please note that NULL support is still shallow. We use NULLs only to import data and convert to non-Nullable types before any further data analysis.
I suggest using the TSV (TabSeparated) format. Postgres, MySQL, Hive, etc. print NULLs as \N in this format, and ClickHouse understands it without ambiguity.
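For illustration, the conversion this suggestion calls for can be sketched in a few lines of Python (just a sketch, assuming, as stated above, that no field values contain commas or tabs):

import sys

# Illustration only: rewrite comma-delimited input as TabSeparated,
# substituting \N for empty fields so ClickHouse reads them as NULL.
for line in sys.stdin:
    fields = line.rstrip('\n').split(',')
    print('\t'.join(field if field else r'\N' for field in fields))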
Thanks for the explanation. I'll use sed to convert the data to TSV format.
Just in case it's helpful for anyone looking at this issue in the future, this is the command I'm running:
time (gunzip -c /home/mark/trips/trips_x*.csv.gz | \
sed 's/,/\t/g' | \
clickhouse-client \
--query="INSERT INTO trips FORMAT TSV")
Looks like I'll need to transform the blanks into \N. Switching to tabs and using the TSV format for importing raises the same issue.
Yes.
Although we can introduce an option to treat empty as NULL rather quickly (a few days).
If you could, that would be great. In the meantime, I've put together a Python script to transform the data:
import sys

# Replace empty CSV fields with \N so ClickHouse parses them as NULL.
for line in sys.stdin:
    print(','.join(item if item.strip() else r'\N'
                   for item in line.strip().split(',')))
I'm running the import now and it seems to be running well.
for filename in /home/mark/trips/trips_x*.csv.gz; do
    gunzip -c "$filename" | \
        python trans.py | \
        clickhouse-client \
        --query="INSERT INTO trips FORMAT CSV"
done
This problem occurs specifically for floating-point types (Nullable(Float64), Nullable(Float32)). I don't think there is any ambiguity about what an empty floating-point value means. Also, this problem occurs in version 1.1.54343 but not in 1.1.54310.
My Python script:
https://gist.github.com/anjia0532/6db48b0886d91d9a663e5a9fd19f2aaa
Source CSV (note that some records are broken across lines):
val1,val2,val3
aa,bb,cc
a
a,bb,cc
aa
,bb,cc
aa,
bb,cc
a\a,bb,cc
Command: python clean_csv.py --src=src.csv --dest=dest.csv --chunksize=50000 --cols --encoding=utf-8 --delimiter=,
Cleaned CSV:
val1,val2,val3
aa,bb,cc
aa,bb,cc
aa,bb,cc
aa,bb,cc
aa,bb,cc
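The gist's code isn't reproduced here, but judging from the example above it rejoins records that were split across physical lines and strips stray backslashes. A rough sketch of that idea (the clean_broken_records helper and the fixed column count are assumptions for illustration, not the actual gist code):

import sys

EXPECTED_COLS = 3  # assumed from the val1,val2,val3 header above

def clean_broken_records(lines):
    # Glue physical lines together until a full record with the expected
    # number of fields has been assembled, dropping stray backslashes.
    buffered = ''
    for line in lines:
        buffered += line.rstrip('\n').replace('\\', '')
        if buffered.count(',') >= EXPECTED_COLS - 1:
            yield buffered
            buffered = ''

for record in clean_broken_records(sys.stdin):
    print(record)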
Resolved in #5625