Is your feature request related to a problem? Please describe.
Sometimes cudf.read_csv fails with
RuntimeError: cuDF failure at: /conda/conda-bld/libcudf_1581433420693/work/cpp/src/io/csv/legacy/csv_reader_impl.cu
when given the dtype=MY_TYPES argument. For example,
from io import StringIO
import cudf
import numpy as np
my_types = {
'frame_time': str,
'frame_number': int,
'ip_src': str,
'tcp_srcport': np.int32,
'ip_dst': str,
'tcp_dstport': np.int32,
'frame_len': int,
'tcp_flags_syn': bool,
'tcp_flags_fin': bool,
}
s = StringIO("""
"Jul 3, 2017 11:55:58.598308000 UTC","1","8.254.250.126","80","192.168.10.5","49188","60","0","1"
"Jul 3, 2017 11:55:58.598312000 UTC","2","8.254.250.126","80","192.168.10.5","49188","60","0","1"
"Jul 3, 2017 11:55:58.598313000 UTC","3","8.254.250.126","80","192.168.10.5","49188","60","0","1"
"Jul 3, 2017 11:55:58.598314000 UTC","4","8.254.250.126","80","192.168.10.5","49188","60","0","1"
"Jul 3, 2017 11:55:58.598315000 UTC","5","8.254.250.126","80","192.168.10.5","49188","60","0","1"
"Jul 3, 2017 11:55:58.598316000 UTC","6","8.254.250.126","80","192.168.10.5","49188","60","0","1"
"Jul 3, 2017 11:55:58.598317000 UTC","7","8.254.250.126","80","192.168.10.5","49188","60","0","1"
"Jul 3, 2017 11:55:58.598318000 UTC","8","8.254.250.126","80","192.168.10.5","49188","60","0","1"
"Jul 3, 2017 11:56:22.331018000 UTC","20","8.253.185.121","80","192.168.10.14","49486","60","0","1"
"Jul 3, 2017 11:56:22.331021000 UTC","21","8.253.185.121","80","192.168.10.14","49486","60","0","1"
""")
print(cudf.read_csv(s, header=None, names=list(my_types.keys()), dtype=my_types).dtypes)
gives
Traceback (most recent call last):
File "test.py", line 31, in <module>
print(cudf.read_csv(s, header=None, names=list(dtypes.keys()), dtype=dtypes).dtypes)
File "/home/wbadar/workspace/.miniconda3/envs/rapids14/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/home/wbadar/workspace/.miniconda3/envs/rapids14/lib/python3.7/site-packages/cudf/io/csv.py", line 84, in read_csv
index_col=index_col,
File "cudf/_lib/legacy/csv.pyx", line 37, in cudf._lib.legacy.csv.read_csv
File "cudf/_lib/legacy/csv.pyx", line 227, in cudf._lib.legacy.csv.read_csv
RuntimeError: cuDF failure at: /conda/conda-bld/libcudf_1587234373268/work/cpp/src/io/csv/legacy/csv_reader_impl.cu:663: Unsupported data type
While swapping in pandas gives:
frame_time object
frame_number int64
ip_src object
tcp_srcport int32
ip_dst object
tcp_dstport int32
frame_len int64
tcp_flags_syn object
tcp_flags_fin object
dtype: object
(I do wonder if this particular example is hitting a bug, or a problem in my data even; are any of bool, int64, int32 and str actually unsupported?)
Describe the solution you'd like
If it's possible, it would be nice to know _which_ type in MY_TYPES is unsupported. Can
and
be extended to support this?
(And I guess also https://github.com/rapidsai/cudf/blob/branch-0.14/cpp/src/io/csv/legacy/csv_reader_impl.cu#L624 and https://github.com/rapidsai/cudf/blob/branch-0.14/cpp/src/io/csv/legacy/csv_reader_impl.cu#L638. There might be more spots; this is just what I surfaced with some quick grepping around.)
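To make the request concrete, here is a minimal Python-side sketch of the kind of message I have in mind. It is not cuDF code: `check_dtypes` and `SUPPORTED_NAMES` are hypothetical names, and the set below is only an illustrative placeholder, not the reader's real list of supported types.

```python
# Hypothetical placeholder: NOT the real list of dtypes the CSV reader supports.
SUPPORTED_NAMES = frozenset({"str", "int32", "int64", "float64", "bool"})


def check_dtypes(dtypes):
    """Raise a ValueError naming every column whose dtype is not recognised."""
    # Anything that is not one of the exact strings above is reported.
    bad = {
        col: t
        for col, t in dtypes.items()
        if not (isinstance(t, str) and t in SUPPORTED_NAMES)
    }
    if bad:
        details = ", ".join(f"{col}={t!r}" for col, t in bad.items())
        raise ValueError(f"Unsupported data type for column(s): {details}")


try:
    check_dtypes({"ip_src": "str", "tcp_srcport": "int32", "frame_len": int})
except ValueError as err:
    # The message points at frame_len rather than leaving the caller to guess.
    print(err)
```

An error that carries the column name and the offending value, wherever it ends up being raised, would have pointed me straight at the problem.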
Describe alternatives you've considered
One alternative would be to simply document supported dtypes. If this exists already, I apologize for not finding it (though if this is the case, could we perhaps link or otherwise include the list in the read_csv documentation?).
Additional context
conda env export for the above example
name: rapids14
channels:
- rapidsai-nightly
- nvidia
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=1_llvm
- aiohttp=3.6.2=py37h516909a_0
- appdirs=1.4.3=py_1
- arrow-cpp=0.15.0=py37h090bef1_2
- async-timeout=3.0.1=py_1000
- attrs=19.3.0=py_0
- backcall=0.1.0=py_0
- bleach=3.1.4=pyh9f0ad1d_0
- bokeh=1.4.0=py37hc8dfbb8_1
- boost=1.70.0=py37h9de70de_1
- boost-cpp=1.70.0=h8e57a91_2
- brotli=1.0.7=he1b5a44_1001
- brotlipy=0.7.0=py37h8f50634_1000
- bzip2=1.0.8=h516909a_2
- c-ares=1.15.0=h516909a_1001
- ca-certificates=2020.4.5.1=hecc5488_0
- cairo=1.16.0=hcf35c78_1003
- certifi=2020.4.5.1=py37hc8dfbb8_0
- cffi=1.14.0=py37hd463f26_0
- cfitsio=3.470=hb60a0a2_2
- chardet=3.0.4=py37hc8dfbb8_1006
- click=7.1.1=pyh8c360ce_0
- click-plugins=1.1.1=py_0
- cligj=0.5.0=py_0
- cloudpickle=1.3.0=py_0
- colorcet=2.0.1=py_0
- cryptography=2.8=py37hb09aad4_2
- cudatoolkit=10.1.243=h6bb024c_0
- cudf=0.14.0a200418=py37_3339
- cudnn=7.6.0=cuda10.1_0
- cugraph=0.14.0a200418=py37_299
- cuml=0.14.0a200418=cuda10.1_py37_1429
- cupy=7.3.0=py37h0632833_0
- curl=7.69.1=h33f0ec9_0
- cusignal=0.14.0a200418=py37_179
- cuspatial=0.14.0a200418=py37_169
- cuxfilter=0.14.0a200418=py37_54
- cycler=0.10.0=py_2
- cytoolz=0.10.1=py37h516909a_0
- dask=2.14.0=py_0
- dask-core=2.14.0=py_0
- dask-cuda=0.14.0a200418=py37_43
- dask-cudf=0.14.0a200418=py37_3339
- dask-xgboost=0.2.0.dev28=cuda10.1py36_0
- datashader=0.10.0=py_0
- datashape=0.5.4=py_1
- decorator=4.4.2=py_0
- defusedxml=0.6.0=py_0
- distributed=2.14.0=py37hc8dfbb8_0
- dlpack=0.2=he1b5a44_1
- double-conversion=3.1.5=he1b5a44_2
- entrypoints=0.3=py37hc8dfbb8_1001
- expat=2.2.9=he1b5a44_2
- fastavro=0.23.1=py37h8f50634_0
- fastrlock=0.4=py37h3340039_1001
- fiona=1.8.9.post2=py37hdff7cfa_0
- fontconfig=2.13.1=h86ecdb6_1001
- freetype=2.10.1=he06d7ca_0
- freexl=1.0.5=h14c3975_1002
- fsspec=0.7.2=py_0
- gdal=2.4.4=py37h5f563d9_0
- geopandas=0.7.0=py_1
- geos=3.8.0=he1b5a44_1
- geotiff=1.5.1=h38872f0_8
- gettext=0.19.8.1=hc5be6a0_1002
- gflags=2.2.2=he1b5a44_1002
- giflib=5.1.7=h516909a_1
- glib=2.64.2=h6f030ca_0
- glog=0.4.0=h49b9bf7_3
- grpc-cpp=1.23.0=h18db393_0
- hdf4=4.2.13=hf30be14_1003
- hdf5=1.10.5=nompi_h3c11f04_1104
- heapdict=1.0.1=py_0
- icu=64.2=he1b5a44_1
- idna=2.9=py_1
- imageio=2.8.0=py_0
- importlib-metadata=1.6.0=py37hc8dfbb8_0
- importlib_metadata=1.6.0=0
- ipykernel=5.2.0=py37h43977f1_1
- ipython=7.13.0=py37hc8dfbb8_2
- ipython_genutils=0.2.0=py_1
- jedi=0.17.0=py37hc8dfbb8_0
- jinja2=2.11.2=pyh9f0ad1d_0
- joblib=0.14.1=py_0
- jpeg=9c=h14c3975_1001
- json-c=0.13.1=h14c3975_1001
- jsonschema=3.2.0=py37hc8dfbb8_1
- jupyter-server-proxy=1.3.2=py_0
- jupyter_client=6.1.3=py_0
- jupyter_core=4.6.3=py37hc8dfbb8_1
- kealib=1.4.13=hec59c27_0
- kiwisolver=1.2.0=py37h99015e2_0
- krb5=1.17.1=h2fd8d38_0
- ld_impl_linux-64=2.34=h53a641e_0
- libblas=3.8.0=16_openblas
- libcblas=3.8.0=16_openblas
- libcudf=0.14.0a200418=cuda10.1_3339
- libcugraph=0.14.0a200418=cuda10.1_299
- libcuml=0.14.0a200418=cuda10.1_1429
- libcumlprims=0.14.0a200417=cuda10.1_22
- libcurl=7.69.1=hf7181ac_0
- libcuspatial=0.14.0a200418=cuda10.1_169
- libdap4=3.20.4=hd3bb157_0
- libedit=3.1.20170329=hf8c457e_1001
- libevent=2.1.10=h72c5cf5_0
- libffi=3.2.1=he1b5a44_1007
- libgcc-ng=9.2.0=h24d8f2e_2
- libgdal=2.4.4=h2b6fda6_0
- libgfortran-ng=7.3.0=hdf63c60_5
- libhwloc=2.1.0=h3c4fd83_0
- libiconv=1.15=h516909a_1006
- libkml=1.3.0=h4fcabce_1010
- liblapack=3.8.0=16_openblas
- libllvm8=8.0.1=hc9558a2_0
- libnetcdf=4.7.3=nompi_h9f9fd6a_101
- libnvstrings=0.14.0a200418=cuda10.1_3339
- libopenblas=0.3.9=h5ec1e0e_0
- libpng=1.6.37=hed695b0_1
- libpq=12.2=h5513abc_1
- libprotobuf=3.8.0=h8b12597_0
- librmm=0.14.0a200418=cuda10.1_258
- libsodium=1.0.17=h516909a_0
- libspatialindex=1.9.3=he1b5a44_3
- libspatialite=4.3.0a=ha48a99a_1034
- libssh2=1.8.2=h22169c7_2
- libstdcxx-ng=9.2.0=hdf63c60_2
- libtiff=4.1.0=hfc65ed5_0
- libuuid=2.32.1=h14c3975_1000
- libxcb=1.13=h14c3975_1002
- libxgboost=1.0.2dev.rapidsai0.13=cuda10.1_6
- libxml2=2.9.10=hee79883_0
- llvm-openmp=10.0.0=hc9558a2_0
- llvmlite=0.31.0=py37h5202443_1
- locket=0.2.0=py_2
- lz4-c=1.8.3=he1b5a44_1001
- markdown=3.2.1=py_0
- markupsafe=1.1.1=py37h8f50634_1
- matplotlib-base=3.2.1=py37h30547a4_0
- mistune=0.8.4=py37h8f50634_1001
- msgpack-python=1.0.0=py37h99015e2_1
- multidict=4.7.5=py37h516909a_0
- multipledispatch=0.6.0=py_0
- munch=2.5.0=py_0
- nbconvert=5.6.1=py37hc8dfbb8_1
- nbformat=5.0.4=py_0
- nccl=2.5.7.1=h51cf6c1_0
- ncurses=6.1=hf484d3e_1002
- networkx=2.4=py_1
- notebook=6.0.3=py37_0
- numba=0.48.0=py37hb3f55d8_0
- numpy=1.18.1=py37h8960a57_1
- nvstrings=0.14.0a200418=py37_3339
- olefile=0.46=py_0
- openjpeg=2.3.1=h981e76c_3
- openssl=1.1.1f=h516909a_0
- packaging=20.1=py_0
- pandas=0.25.3=py37hb3f55d8_0
- pandoc=2.9.2.1=0
- pandocfilters=1.4.2=py_1
- panel=0.6.4=0
- param=1.9.3=py_0
- parquet-cpp=1.5.1=2
- parso=0.7.0=pyh9f0ad1d_0
- partd=1.1.0=py_0
- pcre=8.44=he1b5a44_0
- pexpect=4.8.0=py37hc8dfbb8_1
- pickleshare=0.7.5=py37hc8dfbb8_1001
- pillow=7.1.1=py37h718be6c_0
- pip=20.0.2=py_2
- pixman=0.38.0=h516909a_1003
- poppler=0.67.0=h14e79db_8
- poppler-data=0.4.9=1
- postgresql=12.2=h8573dbc_1
- proj=6.3.0=hc80f0dc_0
- prometheus_client=0.7.1=py_0
- prompt-toolkit=3.0.5=py_0
- psutil=5.7.0=py37h8f50634_1
- pthread-stubs=0.4=h14c3975_1001
- ptyprocess=0.6.0=py_1001
- py-xgboost=1.0.2dev.rapidsai0.13=cuda10.1py37_6
- pyarrow=0.15.0=py37h8b68381_1
- pycparser=2.20=py_0
- pyct=0.4.6=py_0
- pyct-core=0.4.6=py_0
- pyee=7.0.1=py_0
- pygments=2.6.1=py_0
- pynvml=8.0.4=py_0
- pyopenssl=19.1.0=py_1
- pyparsing=2.4.7=pyh9f0ad1d_0
- pyppeteer=0.0.25=py_1
- pyproj=2.5.0=py37h8ff28aa_0
- pyrsistent=0.16.0=py37h8f50634_0
- pysocks=1.7.1=py37hc8dfbb8_1
- python=3.7.6=h8356626_5_cpython
- python-dateutil=2.8.1=py_0
- python_abi=3.7=1_cp37m
- pytz=2019.3=py_0
- pyviz_comms=0.7.4=pyh8c360ce_0
- pywavelets=1.1.1=py37h03ebfcd_1
- pyyaml=5.3.1=py37h8f50634_0
- pyzmq=19.0.0=py37hac76be4_1
- rapids=0.14.0=cuda10.1_py37_150
- rapids-xgboost=0.14.0=cuda10.1_py37_150
- re2=2020.04.01=he1b5a44_0
- readline=8.0=hf8c457e_0
- requests=2.23.0=pyh8c360ce_2
- rmm=0.14.0a200418=py37_258
- rtree=0.9.4=py37h8526d28_1
- scikit-image=0.16.2=py37hb3f55d8_0
- scikit-learn=0.22.2.post1=py37hcdab131_0
- scipy=1.4.1=py37ha3d9a3c_3
- send2trash=1.5.0=py_0
- setuptools=46.1.3=py37hc8dfbb8_0
- shapely=1.7.0=py37hb106bac_1
- simpervisor=0.3=py_1
- six=1.14.0=py_1
- snappy=1.1.8=he1b5a44_1
- sortedcontainers=2.1.0=py_0
- sqlite=3.30.1=hcee41ef_0
- tblib=1.6.0=py_0
- terminado=0.8.3=py37hc8dfbb8_1
- testpath=0.4.4=py_0
- thrift-cpp=0.12.0=hf3afdfd_1004
- tk=8.6.10=hed695b0_0
- toolz=0.10.0=py_0
- tornado=6.0.4=py37h8f50634_1
- tqdm=4.45.0=pyh9f0ad1d_0
- traitlets=4.3.3=py37hc8dfbb8_1
- tzcode=2019a=h516909a_1002
- ucx=1.7.0+g9d06c3a=cuda10.1_0
- uriparser=0.9.3=he1b5a44_1
- urllib3=1.25.9=py_0
- wcwidth=0.1.9=pyh9f0ad1d_0
- webencodings=0.5.1=py_1
- websockets=8.1=py37h8f50634_1
- wheel=0.34.2=py_1
- xarray=0.15.1=py_0
- xerces-c=3.2.2=h8412b87_1004
- xgboost=1.0.2dev.rapidsai0.13=cuda10.1py37_6
- xorg-kbproto=1.0.7=h14c3975_1002
- xorg-libice=1.0.10=h516909a_0
- xorg-libsm=1.2.3=h84519dc_1000
- xorg-libx11=1.6.9=h516909a_0
- xorg-libxau=1.0.9=h14c3975_0
- xorg-libxdmcp=1.1.3=h516909a_0
- xorg-libxext=1.3.4=h516909a_0
- xorg-libxrender=0.9.10=h516909a_1002
- xorg-renderproto=0.11.1=h14c3975_1002
- xorg-xextproto=7.3.0=h14c3975_1002
- xorg-xproto=7.0.31=h14c3975_1007
- xz=5.2.5=h516909a_0
- yaml=0.2.3=h516909a_0
- yarl=1.3.0=py37h516909a_1000
- zeromq=4.3.2=he1b5a44_2
- zict=2.0.0=py_0
- zipp=3.1.0=py_0
- zlib=1.2.11=h516909a_1006
- zstd=1.4.3=h3b9ef0a_0
- pip:
- ucx-py==0.14.0a0+133.ge9a2c92
prefix: /home/wbadar/workspace/.miniconda3/envs/rapids14
From the code below, I think the problematic type name would be np.int32 (renaming it to int should work).
https://github.com/rapidsai/cudf/blob/branch-0.14/cpp/src/io/legacy/cuio_common.cpp#L23
I'm surprised to see that it's still going through the legacy reader path in branch-0.14, though.
@OlivierNV based on your triage is this a bug or a new feature request? (labeled as both)
Hi @OlivierNV - I'm not sure np.int32 is the culprit here. Even when I tell cudf to leave the types uninterpreted by setting them all to str, I still see the "Unsupported data type" exception.
Could this be due to the dtypes not being a list of strings? (Maybe try something like list(my_types.values()) instead of my_types.) For example, does it still fail with dtype=["str", "str", ..., "str"]?
@harrism At this point this is a feature request for more explicit error messages/doc, but a bug has not been ruled out yet, so intentionally added both labels.
Hmm, no dice switching to a list:
In [2]: cudf.read_csv(s, header=None, names=list(my_types), dtype=list(my_types.values()))
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-2-13b1f0db46da> in <module>
----> 1 cudf.read_csv(s, header=None, names=list(my_types), dtype=list(my_types.values()))
~/workspace/.miniconda3/envs/rapids14/lib/python3.7/contextlib.py in inner(*args, **kwds)
72 def inner(*args, **kwds):
73 with self._recreate_cm():
---> 74 return func(*args, **kwds)
75 return inner
76
~/workspace/.miniconda3/envs/rapids14/lib/python3.7/site-packages/cudf/io/csv.py in read_csv(filepath_or_buffer, lineterminator, quotechar, quoting, doublequote, header, mangle_dupe_cols, usecols, sep, delimiter, delim_whitespace, skipinitialspace, names, dtype, skipfooter, skiprows, dayfirst, compression, thousands, decimal, true_values, false_values, nrows, byte_range, skip_blank_lines, parse_dates, comment, na_values, keep_default_na, na_filter, prefix, index_col, **kwargs)
82 na_filter=na_filter,
83 prefix=prefix,
---> 84 index_col=index_col,
85 )
86
cudf/_lib/legacy/csv.pyx in cudf._lib.legacy.csv.read_csv()
cudf/_lib/legacy/csv.pyx in cudf._lib.legacy.csv.read_csv()
RuntimeError: cuDF failure at: /conda/conda-bld/libcudf_1587234373268/work/cpp/src/io/csv/legacy/csv_reader_impl.cu:638: Unsupported data type
What's the output of print(list(my_types.values()))? (I think this needs to be a list of strings, iirc.)
Ahh, I think I get your meaning now. Yeah, it's a list of classes (the str / np.int32 / bool objects themselves). Gimme one sec to try it with strings.
Boom:
In [2]: t = {'frame_time': 'str', 'frame_numer': 'int', 'ip_src': 'str', 'tcp_srcport': 'int', 'ip_dst': 'str', 'tcp_dstport': 'int', 'frame_len': 'int', 'tcp_flags_syn': 'bool', 'tcp_flags_fin': 'bool'}
In [3]: cudf.read_csv(s, header=None, names=list(t), dtype=list(t.values()))
Out[3]:
frame_time frame_numer ip_src tcp_srcport ip_dst tcp_dstport frame_len tcp_flags_syn tcp_flags_fin
0 "Jul 3, 2017 11:55:58.598308000 UTC" 1 8.254.250.126 80 192.168.10.5 49188 60 False True
1 "Jul 3, 2017 11:55:58.598312000 UTC" 2 8.254.250.126 80 192.168.10.5 49188 60 False True
2 "Jul 3, 2017 11:55:58.598313000 UTC" 3 8.254.250.126 80 192.168.10.5 49188 60 False True
3 "Jul 3, 2017 11:55:58.598314000 UTC" 4 8.254.250.126 80 192.168.10.5 49188 60 False True
4 "Jul 3, 2017 11:55:58.598315000 UTC" 5 8.254.250.126 80 192.168.10.5 49188 60 False True
5 "Jul 3, 2017 11:55:58.598316000 UTC" 6 8.254.250.126 80 192.168.10.5 49188 60 False True
6 "Jul 3, 2017 11:55:58.598317000 UTC" 7 8.254.250.126 80 192.168.10.5 49188 60 False True
7 "Jul 3, 2017 11:55:58.598318000 UTC" 8 8.254.250.126 80 192.168.10.5 49188 60 False True
8 "Jul 3, 2017 11:56:22.331018000 UTC" 20 8.253.185.121 80 192.168.10.14 49486 60 False True
9 "Jul 3, 2017 11:56:22.331021000 UTC" 21 8.253.185.121 80 192.168.10.14 49486 60 False True
In [4]: _.dtypes
Out[4]:
frame_time object
frame_numer int32
ip_src object
tcp_srcport int32
ip_dst object
tcp_dstport int32
frame_len int32
tcp_flags_syn bool
tcp_flags_fin bool
dtype: object
Thanks for the suggestion @OlivierNV. If you think it'd be appropriate, I'd be happy to contribute some documentation to clarify the expected use of read_csv's dtype parameter, to reflect our discussion. Let me know!
@wbadart Sounds good to me, that'd be great (you can open a doc PR and link to this issue)
I'll draft something up!
Also, here's our call to the legacy reader, since that came up:
Yeah, it looks like the legacy reader is still being used until the csv writer gets ported to libcudf++ (#4342), since they're both called from the same Python file.
Hi team, I've encountered a very similar issue on NGC's latest container (nvcr.io/nvidia/rapidsai/rapidsai:0.14-cuda10.2-runtime-ubuntu18.04). Has this issue already been resolved?
My error appears to come from the non-legacy library, so is my case the same as this issue?
Just in case, I'm sharing my reproduction code, the error message, and the launch command below.
import numpy as np
import cudf
def main():
    filepath = './test.csv'
    df = cudf.DataFrame()
    df['col1'] = list(range(10))
    df['col2'] = np.random.random(10)
    cudf.io.csv.to_csv(df, path=filepath, header=False, index=False)
    names = ['col1', 'col2']
    # dtype = {'col1': 'int64', 'col2': 'float64'} # <- It works!
    dtype = {'col1': np.int64, 'col2': np.float64}
    print(dtype)
    df = cudf.io.csv.read_csv(filepath, names=names, dtype=dtype, header=None)
    print(df)
if __name__ == "__main__":
    main()
Traceback (most recent call last):
File "smallest_read_csv_dtype.py", line 25, in <module>
main()
File "smallest_read_csv_dtype.py", line 20, in main
df = cudf.io.csv.read_csv(filepath, names=names, dtype=dtype, header=None)
File "/opt/conda/envs/rapids/lib/python3.6/contextlib.py", line 52, in inner
return func(*args, **kwds)
File "/opt/conda/envs/rapids/lib/python3.6/site-packages/cudf/io/csv.py", line 84, in read_csv
index_col=index_col,
File "cudf/_lib/csv.pyx", line 337, in cudf._lib.csv.read_csv
RuntimeError: cuDF failure at: /conda/conda-bld/libcudf_1591199376654/work/cpp/src/io/csv/reader_impl.cu:649: Unsupported data type
sudo docker run --gpus=all --rm -it -v $(pwd):/ws nvcr.io/nvidia/rapidsai/rapidsai:0.14-cuda10.2-runtime-ubuntu18.04
Can you try passing a string of int64 instead of np.int64 for the dtypes? Likely a bug on our end in handling the dtypes.
Thanks for the reply, @kkraus14 !
Can you try passing a string of int64 instead of np.int64 for the dtypes?
Yes. Although that line is commented out in my code, the program works when I pass the strings 'int64' and 'float64' instead of NumPy dtypes, as below.
dtype = {'col1': 'int64', 'col2': 'float64'}
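For anyone who already has a dict of type objects, here is a minimal sketch of turning it into the string names that worked in this thread before calling read_csv. The helpers `dtype_name` and `as_string_dtypes` are hypothetical, not part of cuDF, and whether a given name is accepted still depends on the reader version.

```python
def dtype_name(t):
    """Return a string name for a dtype given as a string, a class, or a dtype-like object."""
    if isinstance(t, str):
        return t
    if isinstance(t, type):
        # Classes such as str, int, or bool expose their name via __name__.
        return t.__name__
    # Objects such as numpy.dtype instances carry a .name attribute instead.
    name = getattr(t, "name", None)
    if name is None:
        raise TypeError(f"cannot derive a dtype name from {t!r}")
    return name


def as_string_dtypes(dtypes):
    """Map every column's dtype to its string name, leaving existing strings untouched."""
    return {col: dtype_name(t) for col, t in dtypes.items()}


print(as_string_dtypes({"frame_time": str, "frame_number": int, "tcp_flags_syn": bool}))
```

NumPy scalar classes such as np.int64 should come out as 'int64' by the same `__name__` rule, which matches the string form reported to work above.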
Hey all, I saw a similar issue come up as I was playing around with cuDF's read_csv function with RAPIDS on Kaggle. My code runs fine, but it kept on propagating the following error:
RuntimeError: cuDF failure at: /opt/conda/envs/rapids/conda-bld/libcudf_1598487636199/work/cpp/src/io/csv/reader_impl.cu:651: Unsupported data type
After looking at this issue and a lot of guessing, I got my code to the point where I realized that 'int64' and 'str' work as dtypes. But I'm struggling with the last column, which should be a datetime and won't render properly as a string or an integer.
checkout_list = []
for filename in all_checkout_files:
    cu = cudf.io.csv.read_csv(filename, index_col = None, header = 0, dtype ={'BibNumber': 'int64', 'ItemBarcode': 'int64', 'ItemType': 'str', 'Collection': 'str', 'CallNumber': 'int64', 'CheckoutDateTime': 'datetime64'})
    checkout_list.append(cu)
checkout = cudf.core.reshape.concat(checkout_list, axis=0, ignore_index = True)
The code is meant to append a bunch of cuDF dataframes together that all follow a common pattern. I know that if I replace 'datetime64' with 'int64', this runs properly, at least at first glance. I'm wondering what the proper way for the function to accept datetime as a reference would be.
A basic point of frustration on this has been guessing at the proper way to render datatypes which came on top of data validation errors (columns were being misread which is why I had to set the dtypes at the read_csv level in the first place). I think this problem could be resolved by fixing the underlying bug -- but in the absence of that, correcting this documentation to be more accurate would help a lot.
@Rogerh91 I believe if you use timestamp or something like timestamp[s] for example it should work.
We're actively working on refactoring this code and cleaning this up is definitely one of the things we're planning to tackle.
Hey @kkraus14, thanks for the tip -- just wanted to report that it worked the first time I tried it. It doesn't seem to be anywhere in the documentation which most people will consult when they're stuck on this, but appreciate that you all are refactoring and cleaning things up. That seems like it might be a quick fix in the meantime though (clearing up documentation), or a blog post that will show up on SEO maybe.
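For anyone who finds this through search, here is a small sketch of the workaround as I understand it: swap the pandas-style name for the one suggested above before handing the dict to read_csv. The `DTYPE_ALIASES` table and `translate_dtypes` helper are hypothetical, and the 'timestamp' name comes only from the tip in this thread, so it may not hold for other cuDF versions.

```python
# Assumption: taken from the suggestion in this thread, not from cuDF documentation.
DTYPE_ALIASES = {"datetime64": "timestamp"}


def translate_dtypes(dtypes):
    """Replace dtype names that have a known alias; pass every other name through."""
    return {col: DTYPE_ALIASES.get(name, name) for col, name in dtypes.items()}


checkout_dtypes = translate_dtypes({
    "BibNumber": "int64",
    "ItemType": "str",
    "CheckoutDateTime": "datetime64",
})
# The translated dict is what would be passed as dtype= to read_csv.
print(checkout_dtypes)
```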