Cudf: [FEA] Readers report which specified types are unsupported

Created on 20 Apr 2020 · 17 Comments · Source: rapidsai/cudf

Is your feature request related to a problem? Please describe.
Sometimes cudf.read_csv fails with

RuntimeError: cuDF failure at: /conda/conda-bld/libcudf_1581433420693/work/cpp/src/io/csv/legacy/csv_reader_impl.cu

when given the dtype=MY_TYPES argument. For example,

from io import StringIO
import cudf
import numpy as np

my_types = {
   'frame_time': str,
   'frame_number': int,
   'ip_src': str,
   'tcp_srcport': np.int32,
   'ip_dst': str,
   'tcp_dstport': np.int32,
   'frame_len': int,
   'tcp_flags_syn': bool,
   'tcp_flags_fin': bool,
}

s = StringIO("""
    "Jul  3, 2017 11:55:58.598308000 UTC","1","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598312000 UTC","2","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598313000 UTC","3","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598314000 UTC","4","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598315000 UTC","5","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598316000 UTC","6","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598317000 UTC","7","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:55:58.598318000 UTC","8","8.254.250.126","80","192.168.10.5","49188","60","0","1"
    "Jul  3, 2017 11:56:22.331018000 UTC","20","8.253.185.121","80","192.168.10.14","49486","60","0","1"
    "Jul  3, 2017 11:56:22.331021000 UTC","21","8.253.185.121","80","192.168.10.14","49486","60","0","1"
""")

print(cudf.read_csv(s, header=None, names=list(my_types.keys()), dtype=my_types).dtypes)

gives

Traceback (most recent call last):
  File "test.py", line 31, in <module>
    print(cudf.read_csv(s, header=None, names=list(dtypes.keys()), dtype=dtypes).dtypes)
  File "/home/wbadar/workspace/.miniconda3/envs/rapids14/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/home/wbadar/workspace/.miniconda3/envs/rapids14/lib/python3.7/site-packages/cudf/io/csv.py", line 84, in read_csv
    index_col=index_col,
  File "cudf/_lib/legacy/csv.pyx", line 37, in cudf._lib.legacy.csv.read_csv
  File "cudf/_lib/legacy/csv.pyx", line 227, in cudf._lib.legacy.csv.read_csv
RuntimeError: cuDF failure at: /conda/conda-bld/libcudf_1587234373268/work/cpp/src/io/csv/legacy/csv_reader_impl.cu:663: Unsupported data type

While swapping in pandas gives:

frame_time       object
frame_number      int64
ip_src           object
tcp_srcport       int32
ip_dst           object
tcp_dstport       int32
frame_len         int64
tcp_flags_syn    object
tcp_flags_fin    object
dtype: object

(I do wonder if this particular example is hitting a bug, or a problem in my data even; are any of bool, int64, int32 and str actually unsupported?)

Describe the solution you'd like
If it's possible, it would be nice to know _which_ type in MY_TYPES is unsupported. Can

https://github.com/rapidsai/cudf/blob/8e90792e58e6dc24dcae78d3806c0536003fd2bb/cpp/src/io/csv/reader_impl.cu#L627-L628

and

https://github.com/rapidsai/cudf/blob/8e90792e58e6dc24dcae78d3806c0536003fd2bb/cpp/src/io/csv/reader_impl.cu#L641-L642

be extended to support this?

(And I guess also https://github.com/rapidsai/cudf/blob/branch-0.14/cpp/src/io/csv/legacy/csv_reader_impl.cu#L624 and https://github.com/rapidsai/cudf/blob/branch-0.14/cpp/src/io/csv/legacy/csv_reader_impl.cu#L638. There might be more spots; this is just what I surfaced with some quick grepping around.)
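
The kind of message requested here can be approximated on the Python side before the reader is called. The following is only a minimal sketch, not the libcudf change asked for above: `find_non_string_dtypes` and `check_dtypes` are hypothetical helper names, and the rule "every dtype spec must be a string" is an assumption drawn from what the discussion later in this thread finds for cudf 0.14.

```python
def find_non_string_dtypes(dtype):
    """Return {column: spec} for every dtype spec that is not a plain string.

    Accepts either a dict (column name -> spec) or a list of specs, in which
    case the list position is used as the key.
    """
    items = dtype.items() if isinstance(dtype, dict) else enumerate(dtype)
    return {col: spec for col, spec in items if not isinstance(spec, str)}


def check_dtypes(dtype):
    """Raise a TypeError that names each offending column (hypothetical pre-check)."""
    bad = find_non_string_dtypes(dtype)
    if bad:
        details = ", ".join(f"{col!r}: {spec!r}" for col, spec in bad.items())
        raise TypeError(f"dtype specs must be strings; offending entries: {details}")


# Example: two of the three specs are type objects rather than strings.
my_types = {"frame_time": "str", "tcp_srcport": int, "tcp_flags_syn": bool}
try:
    check_dtypes(my_types)
except TypeError as err:
    print(err)
```

Calling `check_dtypes(my_types)` just before `cudf.read_csv(..., dtype=my_types)` would surface the column names instead of the bare "Unsupported data type" failure.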

Describe alternatives you've considered
One alternative would be to simply document supported dtypes. If this exists already, I apologize for not finding it (though if this is the case, could we perhaps link or otherwise include the list in the read_csv documentation?).

Additional context

conda env export for the above example

name: rapids14
channels:
  - rapidsai-nightly
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=1_llvm
  - aiohttp=3.6.2=py37h516909a_0
  - appdirs=1.4.3=py_1
  - arrow-cpp=0.15.0=py37h090bef1_2
  - async-timeout=3.0.1=py_1000
  - attrs=19.3.0=py_0
  - backcall=0.1.0=py_0
  - bleach=3.1.4=pyh9f0ad1d_0
  - bokeh=1.4.0=py37hc8dfbb8_1
  - boost=1.70.0=py37h9de70de_1
  - boost-cpp=1.70.0=h8e57a91_2
  - brotli=1.0.7=he1b5a44_1001
  - brotlipy=0.7.0=py37h8f50634_1000
  - bzip2=1.0.8=h516909a_2
  - c-ares=1.15.0=h516909a_1001
  - ca-certificates=2020.4.5.1=hecc5488_0
  - cairo=1.16.0=hcf35c78_1003
  - certifi=2020.4.5.1=py37hc8dfbb8_0
  - cffi=1.14.0=py37hd463f26_0
  - cfitsio=3.470=hb60a0a2_2
  - chardet=3.0.4=py37hc8dfbb8_1006
  - click=7.1.1=pyh8c360ce_0
  - click-plugins=1.1.1=py_0
  - cligj=0.5.0=py_0
  - cloudpickle=1.3.0=py_0
  - colorcet=2.0.1=py_0
  - cryptography=2.8=py37hb09aad4_2
  - cudatoolkit=10.1.243=h6bb024c_0
  - cudf=0.14.0a200418=py37_3339
  - cudnn=7.6.0=cuda10.1_0
  - cugraph=0.14.0a200418=py37_299
  - cuml=0.14.0a200418=cuda10.1_py37_1429
  - cupy=7.3.0=py37h0632833_0
  - curl=7.69.1=h33f0ec9_0
  - cusignal=0.14.0a200418=py37_179
  - cuspatial=0.14.0a200418=py37_169
  - cuxfilter=0.14.0a200418=py37_54
  - cycler=0.10.0=py_2
  - cytoolz=0.10.1=py37h516909a_0
  - dask=2.14.0=py_0
  - dask-core=2.14.0=py_0
  - dask-cuda=0.14.0a200418=py37_43
  - dask-cudf=0.14.0a200418=py37_3339
  - dask-xgboost=0.2.0.dev28=cuda10.1py36_0
  - datashader=0.10.0=py_0
  - datashape=0.5.4=py_1
  - decorator=4.4.2=py_0
  - defusedxml=0.6.0=py_0
  - distributed=2.14.0=py37hc8dfbb8_0
  - dlpack=0.2=he1b5a44_1
  - double-conversion=3.1.5=he1b5a44_2
  - entrypoints=0.3=py37hc8dfbb8_1001
  - expat=2.2.9=he1b5a44_2
  - fastavro=0.23.1=py37h8f50634_0
  - fastrlock=0.4=py37h3340039_1001
  - fiona=1.8.9.post2=py37hdff7cfa_0
  - fontconfig=2.13.1=h86ecdb6_1001
  - freetype=2.10.1=he06d7ca_0
  - freexl=1.0.5=h14c3975_1002
  - fsspec=0.7.2=py_0
  - gdal=2.4.4=py37h5f563d9_0
  - geopandas=0.7.0=py_1
  - geos=3.8.0=he1b5a44_1
  - geotiff=1.5.1=h38872f0_8
  - gettext=0.19.8.1=hc5be6a0_1002
  - gflags=2.2.2=he1b5a44_1002
  - giflib=5.1.7=h516909a_1
  - glib=2.64.2=h6f030ca_0
  - glog=0.4.0=h49b9bf7_3
  - grpc-cpp=1.23.0=h18db393_0
  - hdf4=4.2.13=hf30be14_1003
  - hdf5=1.10.5=nompi_h3c11f04_1104
  - heapdict=1.0.1=py_0
  - icu=64.2=he1b5a44_1
  - idna=2.9=py_1
  - imageio=2.8.0=py_0
  - importlib-metadata=1.6.0=py37hc8dfbb8_0
  - importlib_metadata=1.6.0=0
  - ipykernel=5.2.0=py37h43977f1_1
  - ipython=7.13.0=py37hc8dfbb8_2
  - ipython_genutils=0.2.0=py_1
  - jedi=0.17.0=py37hc8dfbb8_0
  - jinja2=2.11.2=pyh9f0ad1d_0
  - joblib=0.14.1=py_0
  - jpeg=9c=h14c3975_1001
  - json-c=0.13.1=h14c3975_1001
  - jsonschema=3.2.0=py37hc8dfbb8_1
  - jupyter-server-proxy=1.3.2=py_0
  - jupyter_client=6.1.3=py_0
  - jupyter_core=4.6.3=py37hc8dfbb8_1
  - kealib=1.4.13=hec59c27_0
  - kiwisolver=1.2.0=py37h99015e2_0
  - krb5=1.17.1=h2fd8d38_0
  - ld_impl_linux-64=2.34=h53a641e_0
  - libblas=3.8.0=16_openblas
  - libcblas=3.8.0=16_openblas
  - libcudf=0.14.0a200418=cuda10.1_3339
  - libcugraph=0.14.0a200418=cuda10.1_299
  - libcuml=0.14.0a200418=cuda10.1_1429
  - libcumlprims=0.14.0a200417=cuda10.1_22
  - libcurl=7.69.1=hf7181ac_0
  - libcuspatial=0.14.0a200418=cuda10.1_169
  - libdap4=3.20.4=hd3bb157_0
  - libedit=3.1.20170329=hf8c457e_1001
  - libevent=2.1.10=h72c5cf5_0
  - libffi=3.2.1=he1b5a44_1007
  - libgcc-ng=9.2.0=h24d8f2e_2
  - libgdal=2.4.4=h2b6fda6_0
  - libgfortran-ng=7.3.0=hdf63c60_5
  - libhwloc=2.1.0=h3c4fd83_0
  - libiconv=1.15=h516909a_1006
  - libkml=1.3.0=h4fcabce_1010
  - liblapack=3.8.0=16_openblas
  - libllvm8=8.0.1=hc9558a2_0
  - libnetcdf=4.7.3=nompi_h9f9fd6a_101
  - libnvstrings=0.14.0a200418=cuda10.1_3339
  - libopenblas=0.3.9=h5ec1e0e_0
  - libpng=1.6.37=hed695b0_1
  - libpq=12.2=h5513abc_1
  - libprotobuf=3.8.0=h8b12597_0
  - librmm=0.14.0a200418=cuda10.1_258
  - libsodium=1.0.17=h516909a_0
  - libspatialindex=1.9.3=he1b5a44_3
  - libspatialite=4.3.0a=ha48a99a_1034
  - libssh2=1.8.2=h22169c7_2
  - libstdcxx-ng=9.2.0=hdf63c60_2
  - libtiff=4.1.0=hfc65ed5_0
  - libuuid=2.32.1=h14c3975_1000
  - libxcb=1.13=h14c3975_1002
  - libxgboost=1.0.2dev.rapidsai0.13=cuda10.1_6
  - libxml2=2.9.10=hee79883_0
  - llvm-openmp=10.0.0=hc9558a2_0
  - llvmlite=0.31.0=py37h5202443_1
  - locket=0.2.0=py_2
  - lz4-c=1.8.3=he1b5a44_1001
  - markdown=3.2.1=py_0
  - markupsafe=1.1.1=py37h8f50634_1
  - matplotlib-base=3.2.1=py37h30547a4_0
  - mistune=0.8.4=py37h8f50634_1001
  - msgpack-python=1.0.0=py37h99015e2_1
  - multidict=4.7.5=py37h516909a_0
  - multipledispatch=0.6.0=py_0
  - munch=2.5.0=py_0
  - nbconvert=5.6.1=py37hc8dfbb8_1
  - nbformat=5.0.4=py_0
  - nccl=2.5.7.1=h51cf6c1_0
  - ncurses=6.1=hf484d3e_1002
  - networkx=2.4=py_1
  - notebook=6.0.3=py37_0
  - numba=0.48.0=py37hb3f55d8_0
  - numpy=1.18.1=py37h8960a57_1
  - nvstrings=0.14.0a200418=py37_3339
  - olefile=0.46=py_0
  - openjpeg=2.3.1=h981e76c_3
  - openssl=1.1.1f=h516909a_0
  - packaging=20.1=py_0
  - pandas=0.25.3=py37hb3f55d8_0
  - pandoc=2.9.2.1=0
  - pandocfilters=1.4.2=py_1
  - panel=0.6.4=0
  - param=1.9.3=py_0
  - parquet-cpp=1.5.1=2
  - parso=0.7.0=pyh9f0ad1d_0
  - partd=1.1.0=py_0
  - pcre=8.44=he1b5a44_0
  - pexpect=4.8.0=py37hc8dfbb8_1
  - pickleshare=0.7.5=py37hc8dfbb8_1001
  - pillow=7.1.1=py37h718be6c_0
  - pip=20.0.2=py_2
  - pixman=0.38.0=h516909a_1003
  - poppler=0.67.0=h14e79db_8
  - poppler-data=0.4.9=1
  - postgresql=12.2=h8573dbc_1
  - proj=6.3.0=hc80f0dc_0
  - prometheus_client=0.7.1=py_0
  - prompt-toolkit=3.0.5=py_0
  - psutil=5.7.0=py37h8f50634_1
  - pthread-stubs=0.4=h14c3975_1001
  - ptyprocess=0.6.0=py_1001
  - py-xgboost=1.0.2dev.rapidsai0.13=cuda10.1py37_6
  - pyarrow=0.15.0=py37h8b68381_1
  - pycparser=2.20=py_0
  - pyct=0.4.6=py_0
  - pyct-core=0.4.6=py_0
  - pyee=7.0.1=py_0
  - pygments=2.6.1=py_0
  - pynvml=8.0.4=py_0
  - pyopenssl=19.1.0=py_1
  - pyparsing=2.4.7=pyh9f0ad1d_0
  - pyppeteer=0.0.25=py_1
  - pyproj=2.5.0=py37h8ff28aa_0
  - pyrsistent=0.16.0=py37h8f50634_0
  - pysocks=1.7.1=py37hc8dfbb8_1
  - python=3.7.6=h8356626_5_cpython
  - python-dateutil=2.8.1=py_0
  - python_abi=3.7=1_cp37m
  - pytz=2019.3=py_0
  - pyviz_comms=0.7.4=pyh8c360ce_0
  - pywavelets=1.1.1=py37h03ebfcd_1
  - pyyaml=5.3.1=py37h8f50634_0
  - pyzmq=19.0.0=py37hac76be4_1
  - rapids=0.14.0=cuda10.1_py37_150
  - rapids-xgboost=0.14.0=cuda10.1_py37_150
  - re2=2020.04.01=he1b5a44_0
  - readline=8.0=hf8c457e_0
  - requests=2.23.0=pyh8c360ce_2
  - rmm=0.14.0a200418=py37_258
  - rtree=0.9.4=py37h8526d28_1
  - scikit-image=0.16.2=py37hb3f55d8_0
  - scikit-learn=0.22.2.post1=py37hcdab131_0
  - scipy=1.4.1=py37ha3d9a3c_3
  - send2trash=1.5.0=py_0
  - setuptools=46.1.3=py37hc8dfbb8_0
  - shapely=1.7.0=py37hb106bac_1
  - simpervisor=0.3=py_1
  - six=1.14.0=py_1
  - snappy=1.1.8=he1b5a44_1
  - sortedcontainers=2.1.0=py_0
  - sqlite=3.30.1=hcee41ef_0
  - tblib=1.6.0=py_0
  - terminado=0.8.3=py37hc8dfbb8_1
  - testpath=0.4.4=py_0
  - thrift-cpp=0.12.0=hf3afdfd_1004
  - tk=8.6.10=hed695b0_0
  - toolz=0.10.0=py_0
  - tornado=6.0.4=py37h8f50634_1
  - tqdm=4.45.0=pyh9f0ad1d_0
  - traitlets=4.3.3=py37hc8dfbb8_1
  - tzcode=2019a=h516909a_1002
  - ucx=1.7.0+g9d06c3a=cuda10.1_0
  - uriparser=0.9.3=he1b5a44_1
  - urllib3=1.25.9=py_0
  - wcwidth=0.1.9=pyh9f0ad1d_0
  - webencodings=0.5.1=py_1
  - websockets=8.1=py37h8f50634_1
  - wheel=0.34.2=py_1
  - xarray=0.15.1=py_0
  - xerces-c=3.2.2=h8412b87_1004
  - xgboost=1.0.2dev.rapidsai0.13=cuda10.1py37_6
  - xorg-kbproto=1.0.7=h14c3975_1002
  - xorg-libice=1.0.10=h516909a_0
  - xorg-libsm=1.2.3=h84519dc_1000
  - xorg-libx11=1.6.9=h516909a_0
  - xorg-libxau=1.0.9=h14c3975_0
  - xorg-libxdmcp=1.1.3=h516909a_0
  - xorg-libxext=1.3.4=h516909a_0
  - xorg-libxrender=0.9.10=h516909a_1002
  - xorg-renderproto=0.11.1=h14c3975_1002
  - xorg-xextproto=7.3.0=h14c3975_1002
  - xorg-xproto=7.0.31=h14c3975_1007
  - xz=5.2.5=h516909a_0
  - yaml=0.2.3=h516909a_0
  - yarl=1.3.0=py37h516909a_1000
  - zeromq=4.3.2=he1b5a44_2
  - zict=2.0.0=py_0
  - zipp=3.1.0=py_0
  - zlib=1.2.11=h516909a_1006
  - zstd=1.4.3=h3b9ef0a_0
  - pip:
    - ucx-py==0.14.0a0+133.ge9a2c92
prefix: /home/wbadar/workspace/.miniconda3/envs/rapids14

bug cuDF (Python) cuIO doc feature request

wbadart

All 17 comments

From the code below, I think the problematic type name would be np.int32 (renaming it to int should work).
https://github.com/rapidsai/cudf/blob/branch-0.14/cpp/src/io/legacy/cuio_common.cpp#L23
I'm surprised to see that it's still going through the legacy reader path in branch-0.14, though.

OlivierNV on 21 Apr 2020

@OlivierNV based on your triage is this a bug or a new feature request? (labeled as both)

harrism on 22 Apr 2020

Hi @OlivierNV - I'm not sure np.int32 is the culprit here. Even when I tell cudf to leave the types uninterpreted by setting them all to str, I still see the "Unsupported data type" exception.

wbadart on 22 Apr 2020

Could this be due to dtype not being a list of strings? (Maybe something like list(my_types.values()) instead of my_types.) For example, does it still fail with dtype=["str", "str", ..., "str"]?

@harrism At this point this is a feature request for more explicit error messages/doc, but a bug has not been ruled out yet, so intentionally added both labels.

OlivierNV on 22 Apr 2020

Hmm, no dice switching to a list:

In [2]: cudf.read_csv(s, header=None, names=list(my_types), dtype=list(my_types.values()))
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-2-13b1f0db46da> in <module>
----> 1 cudf.read_csv(s, header=None, names=list(my_types), dtype=list(my_types.values()))

~/workspace/.miniconda3/envs/rapids14/lib/python3.7/contextlib.py in inner(*args, **kwds)
     72         def inner(*args, **kwds):
     73             with self._recreate_cm():
---> 74                 return func(*args, **kwds)
     75         return inner
     76

~/workspace/.miniconda3/envs/rapids14/lib/python3.7/site-packages/cudf/io/csv.py in read_csv(filepath_or_buffer, lineterminator, quotechar, quoting, doublequote, header, mangle_dupe_cols, usecols, sep, delimiter, delim_whitespace, skipinitialspace, names, dtype, skipfooter, skiprows, dayfirst, compression, thousands, decimal, true_values, false_values, nrows, byte_range, skip_blank_lines, parse_dates, comment, na_values, keep_default_na, na_filter, prefix, index_col, **kwargs)
     82         na_filter=na_filter,
     83         prefix=prefix,
---> 84         index_col=index_col,
     85     )
     86

cudf/_lib/legacy/csv.pyx in cudf._lib.legacy.csv.read_csv()

cudf/_lib/legacy/csv.pyx in cudf._lib.legacy.csv.read_csv()

RuntimeError: cuDF failure at: /conda/conda-bld/libcudf_1587234373268/work/cpp/src/io/csv/legacy/csv_reader_impl.cu:638: Unsupported data type

wbadart on 22 Apr 2020

What's the output of print(list(my_types.values()))? (I think this needs to be a list of strings, iirc.)

OlivierNV on 22 Apr 2020

Ahh, I think I get your meaning now. Yeah, it's a list of classes (the str / np.int32 / bool objects themselves). Gimme one sec to try it with strings.

wbadart on 22 Apr 2020

Boom:

In [2]: t = {'frame_time': 'str', 'frame_numer': 'int', 'ip_src': 'str', 'tcp_srcport': 'int', 'ip_dst': 'str', 'tcp_dstport': 'int', 'frame_len': 'int', 'tcp_flags_syn': 'bool', 'tcp_flags_fin': 'bool'}

In [3]: cudf.read_csv(s, header=None, names=list(t), dtype=list(t.values()))
Out[3]:
                                  frame_time  frame_numer         ip_src  tcp_srcport         ip_dst  tcp_dstport  frame_len  tcp_flags_syn  tcp_flags_fin
0      "Jul  3, 2017 11:55:58.598308000 UTC"            1  8.254.250.126           80   192.168.10.5        49188         60          False           True
1      "Jul  3, 2017 11:55:58.598312000 UTC"            2  8.254.250.126           80   192.168.10.5        49188         60          False           True
2      "Jul  3, 2017 11:55:58.598313000 UTC"            3  8.254.250.126           80   192.168.10.5        49188         60          False           True
3      "Jul  3, 2017 11:55:58.598314000 UTC"            4  8.254.250.126           80   192.168.10.5        49188         60          False           True
4      "Jul  3, 2017 11:55:58.598315000 UTC"            5  8.254.250.126           80   192.168.10.5        49188         60          False           True
5      "Jul  3, 2017 11:55:58.598316000 UTC"            6  8.254.250.126           80   192.168.10.5        49188         60          False           True
6      "Jul  3, 2017 11:55:58.598317000 UTC"            7  8.254.250.126           80   192.168.10.5        49188         60          False           True
7      "Jul  3, 2017 11:55:58.598318000 UTC"            8  8.254.250.126           80   192.168.10.5        49188         60          False           True
8      "Jul  3, 2017 11:56:22.331018000 UTC"           20  8.253.185.121           80  192.168.10.14        49486         60          False           True
9      "Jul  3, 2017 11:56:22.331021000 UTC"           21  8.253.185.121           80  192.168.10.14        49486         60          False           True

In [4]: _.dtypes
Out[4]:
frame_time       object
frame_numer       int32
ip_src           object
tcp_srcport       int32
ip_dst           object
tcp_dstport       int32
frame_len         int32
tcp_flags_syn      bool
tcp_flags_fin      bool
dtype: object

Thanks for the suggestion @OlivierNV. If you think it'd be appropriate, I'd be happy to contribute some documentation to clarify the expected use of read_csv's dtype parameter, to reflect our discussion. Let me know!

wbadart on 22 Apr 2020
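
The workaround found above (type names as strings instead of type objects) can be applied mechanically to an existing mapping. This is a minimal sketch under stated assumptions: `dtype_to_name` and `normalize_dtypes` are hypothetical helpers, and whether every resulting name (for example 'int32') is accepted by the reader is an assumption; the thread only confirms 'str', 'int', 'bool', 'int64' and 'float64'.

```python
def dtype_to_name(spec):
    """Turn a dtype spec into a string name (hypothetical helper).

    - strings pass through unchanged
    - classes such as int, str, bool or np.int32 map to their __name__
    - objects exposing a string `.name` (e.g. a numpy.dtype instance) use it
    """
    if isinstance(spec, str):
        return spec
    if isinstance(spec, type):
        return spec.__name__
    name = getattr(spec, "name", None)
    if isinstance(name, str):
        return name
    raise TypeError(f"cannot derive a dtype name from {spec!r}")


def normalize_dtypes(dtype):
    """Return a new dict with every spec replaced by its string name."""
    return {col: dtype_to_name(spec) for col, spec in dtype.items()}


my_types = {"frame_time": str, "frame_number": int, "tcp_flags_syn": bool}
print(normalize_dtypes(my_types))
```

The normalized dict (or `list(normalize_dtypes(my_types).values())`) could then be passed as the dtype argument in place of the original mapping.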

@wbadart Sounds good to me, that'd be great (you can open a doc PR and link to this issue)

OlivierNV on 22 Apr 2020

I'll draft something up!

Also, here's our call to the legacy reader, since that came up:

https://github.com/rapidsai/cudf/blob/fff2bedc878f37b54fae61c7b4511f1a023556c3/python/cudf/cudf/io/csv.py#L52

wbadart on 23 Apr 2020

Yeah, it looks like the legacy reader is still being used until the csv writer gets ported to libcudf++ (#4342), since they're both called from the same Python file.

OlivierNV on 23 Apr 2020

Hi team, I've encountered a very similar issue on NGC's latest container (nvcr.io/nvidia/rapidsai/rapidsai:0.14-cuda10.2-runtime-ubuntu18.04). Has this issue already been resolved?
My error appears to come from the non-legacy reader. Is my case the same as this issue?

Just in case, let me share reproduction code and error message below.

Code



import numpy as np
import cudf


def main():
    filepath = './test.csv'

    df = cudf.DataFrame()
    df['col1'] = list(range(10))
    df['col2'] = np.random.random(10)

    cudf.io.csv.to_csv(df, path=filepath, header=False, index=False)

    names = ['col1', 'col2']
    # dtype = {'col1': 'int64', 'col2': 'float64'}  # <- It works!
    dtype = {'col1': np.int64, 'col2': np.float64}
    print(dtype)
    df = cudf.io.csv.read_csv(filepath, names=names, dtype=dtype, header=None)
    print(df)


if __name__ == "__main__":
    main()

Error

Traceback (most recent call last):
  File "smallest_read_csv_dtype.py", line 25, in <module>
    main()
  File "smallest_read_csv_dtype.py", line 20, in main
    df = cudf.io.csv.read_csv(filepath, names=names, dtype=dtype, header=None)
  File "/opt/conda/envs/rapids/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/opt/conda/envs/rapids/lib/python3.6/site-packages/cudf/io/csv.py", line 84, in read_csv
    index_col=index_col,
  File "cudf/_lib/csv.pyx", line 337, in cudf._lib.csv.read_csv
RuntimeError: cuDF failure at: /conda/conda-bld/libcudf_1591199376654/work/cpp/src/io/csv/reader_impl.cu:649: Unsupported data type

Launch command

sudo docker run --gpus=all --rm -it -v $(pwd):/ws nvcr.io/nvidia/rapidsai/rapidsai:0.14-cuda10.2-runtime-ubuntu18.04

lazykyama on 19 Jun 2020

Can you try passing a string of int64 instead of np.int64 for the dtypes? Likely a bug on our end in handling the dtypes.

kkraus14 on 19 Jun 2020

Thanks for the reply, @kkraus14 !

Can you try passing a string of int64 instead of np.int64 for the dtypes?

Yes. Although that line is commented out in my code, the program works when I pass the strings 'int64' and 'float64' instead of NumPy dtypes, as below.

dtype = {'col1': 'int64', 'col2': 'float64'}

lazykyama on 19 Jun 2020

Hey all, I ran into a similar issue while playing around with cuDF's read_csv function with RAPIDS on Kaggle. My code otherwise runs fine, but it kept raising the following error:

RuntimeError: cuDF failure at: /opt/conda/envs/rapids/conda-bld/libcudf_1598487636199/work/cpp/src/io/csv/reader_impl.cu:651: Unsupported data type

After looking at this issue and guessing a lot, I got my code to the point where I realized that 'int64' and 'str' work as dtypes. But I'm struggling with the last column, which should be a datetime and won't render properly as a string or an integer.

checkout_list = []

for filename in all_checkout_files:
    cu = cudf.io.csv.read_csv(filename, index_col = None, header = 0, dtype ={'BibNumber': 'int64', 'ItemBarcode': 'int64', 'ItemType': 'str', 'Collection': 'str', 'CallNumber': 'int64', 'CheckoutDateTime': 'datetime64'})
    checkout_list.append(cu)

checkout = cudf.core.reshape.concat(checkout_list, axis=0, ignore_index = True)

The code is meant to concatenate a bunch of cuDF dataframes that all follow a common pattern. I know that if I replace 'datetime64' with 'int64', this runs properly, at least at first glance. I'm wondering what the proper way to specify a datetime dtype for this function would be.

A basic point of frustration on this has been guessing at the proper way to render datatypes which came on top of data validation errors (columns were being misread which is why I had to set the dtypes at the read_csv level in the first place). I think this problem could be resolved by fixing the underlying bug -- but in the absence of that, correcting this documentation to be more accurate would help a lot.

Rogerh91 on 9 Oct 2020

@Rogerh91 I believe if you use timestamp or something like timestamp[s] for example it should work.

We're actively working on refactoring this code and cleaning this up is definitely one of the things we're planning to tackle.

kkraus14 on 9 Oct 2020
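
Applied to the loop from the previous comment, the suggestion might look like the sketch below. It assumes 'timestamp[s]' is accepted as a dtype string, as suggested above; `read_checkouts` is a hypothetical helper, and the `read_csv` and `concat` parameters are injection points added only so the sketch can run without a GPU.

```python
# dtype mapping from the earlier comment, with the datetime column switched
# from 'datetime64' to 'timestamp[s]' (assumed to be accepted by the reader).
CHECKOUT_DTYPES = {
    "BibNumber": "int64",
    "ItemBarcode": "int64",
    "ItemType": "str",
    "Collection": "str",
    "CallNumber": "int64",
    "CheckoutDateTime": "timestamp[s]",
}


def read_checkouts(filenames, read_csv, concat):
    """Read each file with the shared dtype mapping and concatenate the results."""
    frames = [
        read_csv(name, index_col=None, header=0, dtype=CHECKOUT_DTYPES)
        for name in filenames
    ]
    return concat(frames, ignore_index=True)


# With cuDF installed, the call would be along the lines of:
#   checkout = read_checkouts(all_checkout_files, cudf.read_csv, cudf.concat)
```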

Hey @kkraus14, thanks for the tip -- just wanted to report that it worked the first time I tried it. It doesn't seem to be anywhere in the documentation which most people will consult when they're stuck on this, but appreciate that you all are refactoring and cleaning things up. That seems like it might be a quick fix in the meantime though (clearing up documentation), or a blog post that will show up on SEO maybe.

Rogerh91 on 10 Oct 2020
