Trying to read the following CSV snippet throws an exception in the Cython bindings.
It looks like dtype inference fails and throws an exception when there is a blank line between the header and the first data row.
sample.csv:
lpep_pickup_datetime,Lpep_dropoff_datetime,Store_and_fwd_flag,RateCodeID,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude,Passenger_count,Trip_distance,Fare_amount,Extra,MTA_tax,Tip_amount,Tolls_amount,Ehail_fee,Total_amount,Payment_type,Distance_between_service,Time_between_service,Trip_type
2015-01-01 00:34:42,2015-01-01 00:38:34,N,1,-73.922592163085938,40.754528045654297,-73.91363525390625,40.765522003173828,1,.88,5,0.5,0.5,0,0,,6.3,2,1.00,407,1
2015-01-01 00:34:46,2015-01-01 00:47:23,N,1,-73.952751159667969,40.677711486816406,-73.981529235839844,40.658977508544922,1,3.08,12,0.5,0.5,0,0,,13.3,2,.00,196,1
2015-01-01 00:34:44,2015-01-01 00:38:15,N,1,-73.843009948730469,40.71905517578125,-73.846580505371094,40.711566925048828,1,.90,5,0.5,0.5,1.8,0,,7.8,1,69.00,3241,1
2015-01-01 00:34:48,2015-01-01 00:38:08,N,1,-73.860824584960938,40.757793426513672,-73.854042053222656,40.749820709228516,1,.85,5,0.5,0.5,0,0,,6.3,2,.00,299,1
2015-01-01 00:34:53,2015-01-01 01:09:10,N,1,-73.945182800292969,40.783321380615234,-73.9896240234375,40.765449523925781,1,4.91,24.5,0.5,0.5,0,0,,25.8,2,5.00,1287,1
2015-01-01 00:34:55,2015-01-01 00:40:58,N,1,-73.966812133789063,40.714675903320313,-73.949409484863281,40.718437194824219,4,1.20,6.5,0.5,0.5,0,0,,7.8,2,3.00,125,1
2015-01-01 00:34:49,2015-01-01 00:53:10,N,1,-73.930488586425781,40.850131988525391,-73.978057861328125,40.789058685302734,1,6.60,22,0.5,0.5,0,0,,23.3,2,.00,31,1
2015-01-01 00:35:03,2015-01-01 00:35:08,N,5,-73.863899230957031,40.895439147949219,-73.86187744140625,40.894779205322266,1,.13,15,0,0,0,0,,15,1,.00,57,2
Repro:
import cudf
cudf.read_csv('sample.csv')
Result for `rapidsai-nightly: 0.7.0.dev0, py37_1113, rapidsai-nightly/label/cuda10.0`:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/lib/python3.7/site-packages/cudf/io/csv.py", line 164, in read_csv
index_col=index_col
File "cudf/bindings/csv.pyx", line 44, in cudf.bindings.csv.cpp_read_csv
File "cudf/bindings/csv.pyx", line 277, in cudf.bindings.csv.cpp_read_csv
File "cudf/bindings/cudf_cpp.pyx", line 271, in cudf.bindings.cudf_cpp.gdf_column_to_column_mem
File "cudf/bindings/cudf_cpp.pyx", line 69, in cudf.bindings.cudf_cpp.gdf_to_np_dtype
KeyError: 8
Result from tip of branch-0.7:
ERROR: 8 in read_csv: no data available for data type inference
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/conda/envs/rapids/lib/python3.7/site-packages/cudf-0.7.0.dev0+1206.g50bb9c1a.dirty-py3.7-linux-x86_64.egg/cudf/io/csv.py", line 164, in read_csv
index_col=index_col
File "cudf/bindings/csv.pyx", line 44, in cudf.bindings.csv.cpp_read_csv
File "cudf/bindings/csv.pyx", line 266, in cudf.bindings.csv.cpp_read_csv
File "cudf/bindings/cudf_cpp.pyx", line 349, in cudf.bindings.cudf_cpp.check_gdf_error
cudf.bindings.GDFError.GDFError: b'GDF_INVALID_API_CALL'
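I can't reproduce the issue on ToT branch-0.7. I tried with the following code:
~~~
from io import StringIO

from cudf import read_csv

# Header, a blank line, then the data rows from the report.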
lines = ['lpep_pickup_datetime,Lpep_dropoff_datetime,Store_and_fwd_flag,RateCodeID,Pickup_longitude,Pickup_latitude,Dropoff_longitude,Dropoff_latitude,Passenger_count,Trip_distance,Fare_amount,Extra,MTA_tax,Tip_amount,Tolls_amount,Ehail_fee,Total_amount,Payment_type,Distance_between_service,Time_between_service,Trip_type',
'',
'2015-01-01 00:34:42,2015-01-01 00:38:34,N,1,-73.922592163085938,40.754528045654297,-73.91363525390625,40.765522003173828,1,.88,5,0.5,0.5,0,0,,6.3,2,1.00,407,1',
'2015-01-01 00:34:46,2015-01-01 00:47:23,N,1,-73.952751159667969,40.677711486816406,-73.981529235839844,40.658977508544922,1,3.08,12,0.5,0.5,0,0,,13.3,2,.00,196,1',
'2015-01-01 00:34:44,2015-01-01 00:38:15,N,1,-73.843009948730469,40.71905517578125,-73.846580505371094,40.711566925048828,1,.90,5,0.5,0.5,1.8,0,,7.8,1,69.00,3241,1',
'2015-01-01 00:34:48,2015-01-01 00:38:08,N,1,-73.860824584960938,40.757793426513672,-73.854042053222656,40.749820709228516,1,.85,5,0.5,0.5,0,0,,6.3,2,.00,299,1',
'2015-01-01 00:34:53,2015-01-01 01:09:10,N,1,-73.945182800292969,40.783321380615234,-73.9896240234375,40.765449523925781,1,4.91,24.5,0.5,0.5,0,0,,25.8,2,5.00,1287,1',
'2015-01-01 00:34:55,2015-01-01 00:40:58,N,1,-73.966812133789063,40.714675903320313,-73.949409484863281,40.718437194824219,4,1.20,6.5,0.5,0.5,0,0,,7.8,2,3.00,125,1',
'2015-01-01 00:34:49,2015-01-01 00:53:10,N,1,-73.930488586425781,40.850131988525391,-73.978057861328125,40.789058685302734,1,6.60,22,0.5,0.5,0,0,,23.3,2,.00,31,1',
'2015-01-01 00:35:03,2015-01-01 00:35:08,N,5,-73.863899230957031,40.895439147949219,-73.86187744140625,40.894779205322266,1,.13,15,0,0,0,0,,15,1,.00,57,2']
buffer = '\n'.join(lines)
cu_df = read_csv(StringIO(buffer))
~~~
@randerzander can you please share with me the snippet as a file? There may be something atypical going on with the line terminators here.
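One way to check a file's line terminators before sharing it is to count them in the raw bytes. The following is a minimal sketch using only the Python standard library; `count_line_terminators` is a hypothetical helper name, not part of cudf.

```python
def count_line_terminators(data: bytes) -> dict:
    """Count CRLF, lone LF, and lone CR terminators in raw file bytes."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        # Every CRLF also contains one LF and one CR, so subtract those.
        "lf": data.count(b"\n") - crlf,
        "cr": data.count(b"\r") - crlf,
    }


# In-memory example; for a real file, pass open(path, "rb").read() instead.
sample = b"a,b\r\n\r\n1,2\r\n"
print(count_line_terminators(sample))  # {'crlf': 3, 'lf': 0, 'cr': 0}
```

A non-zero `crlf` count would indicate Windows-style line endings.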
It seems other recent changes in the code partially fixed this issue. Now I am getting something different.
I can read the file successfully, but I get a "bad" first row, which should have been skipped given that skip_blank_lines defaults to True.
import cudf
df = cudf.read_csv('sample.txt')
df.head().to_pandas()
Attached sample.csv:
sample.txt
Note, I uploaded it as "sample.txt" because GitHub will not let me upload a .csv file 🤷‍♂️
Now I see why the issue did not repro on our side. Your file has Windows-style newlines (\r\n), which the skip_blank_lines implementation does not currently support. I have the repro now and should be able to fix this today.
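To illustrate how CRLF terminators can defeat blank-line detection in general, the sketch below splits a buffer on LF only and treats a line as blank when it is empty. This is an assumption about the general failure mode, not a description of the actual cudf implementation; `naive_blank_lines` and `normalize_newlines` are hypothetical helpers.

```python
def naive_blank_lines(text: str) -> list:
    """Return indices of lines seen as blank when splitting on LF only."""
    return [i for i, line in enumerate(text.split("\n")) if line == ""]


def normalize_newlines(text: str) -> str:
    """Convert CRLF and lone CR terminators to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


lines = ["a,b", "", "1,2"]
unix_buffer = "\n".join(lines)
windows_buffer = "\r\n".join(lines)

print(naive_blank_lines(unix_buffer))     # [1]
print(naive_blank_lines(windows_buffer))  # [] because line 1 is "\r", not ""
print(naive_blank_lines(normalize_newlines(windows_buffer)))  # [1]
```

Normalizing the file to LF before reading it may serve as a workaround in the meantime, though that has not been verified against cudf here.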