Cudf: [BUG] "NaT" string literal needs to be recognized as `null` in to_timestamps method

Created on 7 May 2020 · 3 Comments · Source: rapidsai/cudf

Describe the bug
We can convert date-time values stored as strings to real datetime values by type-casting. But when a "NaT" string is present in the string column, we should return a null value in its place. Since to_timestamps is the underlying libcudf API that is called to convert strings to datetime types, this needs to be handled in that method. Code sample below.

Steps/Code to reproduce bug

>>> import pandas as pd
>>> s = pd.Series(["2001-01-01", "2002-02-02", "2000-01-05", "NaT"])
>>> s.astype('datetime64[s]')
0   2001-01-01
1   2002-02-02
2   2000-01-05
3          NaT
dtype: datetime64[ns]
>>> s = pd.Series(["2001-01-01", "2002-02-02", "2000-01-05", "None"])
>>> s.astype('datetime64[s]')
Traceback (most recent call last):
  File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 1979, in objects_to_datetime64ns
    values, tz_parsed = conversion.datetime_to_datetime64(data)
  File "pandas/_libs/tslibs/conversion.pyx", line 200, in pandas._libs.tslibs.conversion.datetime_to_datetime64
TypeError: Unrecognized value type: <class 'str'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/generic.py", line 5882, in astype
    dtype=dtype, copy=copy, errors=errors, **kwargs
  File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 581, in astype
    return self.apply("astype", dtype=dtype, **kwargs)
  File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 438, in apply
    applied = getattr(b, f)(**kwargs)
  File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 559, in astype
    return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
  File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 643, in _astype
    values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
  File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 715, in astype_nansafe
    return astype_nansafe(to_datetime(arr).values, dtype, copy=copy)
  File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/util/_decorators.py", line 208, in wrapper
    return func(*args, **kwargs)
  File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/tools/datetimes.py", line 794, in to_datetime
    result = convert_listlike(arg, box, format)
  File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/tools/datetimes.py", line 463, in _convert_listlike_datetimes
    allow_object=True,
  File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 1984, in objects_to_datetime64ns
    raise e
  File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 1975, in objects_to_datetime64ns
    require_iso8601=require_iso8601,
  File "pandas/_libs/tslib.pyx", line 465, in pandas._libs.tslib.array_to_datetime
  File "pandas/_libs/tslib.pyx", line 688, in pandas._libs.tslib.array_to_datetime
  File "pandas/_libs/tslib.pyx", line 822, in pandas._libs.tslib.array_to_datetime_object
  File "pandas/_libs/tslib.pyx", line 813, in pandas._libs.tslib.array_to_datetime_object
  File "pandas/_libs/tslibs/parsing.pyx", line 225, in pandas._libs.tslibs.parsing.parse_datetime_string
  File "/conda/envs/cudf/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 1374, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/conda/envs/cudf/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 649, in parse
    raise ParserError("Unknown string format: %s", timestr)
dateutil.parser._parser.ParserError: Unknown string format: None


>>> import cudf
>>> s = cudf.Series(["2001-01-01", "2002-02-02", "2000-01-05", "NaT"])
>>> s.astype('datetime64[s]')
0   2001-01-01
1   2002-02-02
2   2000-01-05
3   1970-01-01
dtype: datetime64[s]
>>> s = cudf.Series(["2001-01-01", "2002-02-02", "2000-01-05", "None"])
>>> s.astype('datetime64[s]')
0   2001-01-01
1   2002-02-02
2   2000-01-05
3   1970-01-01
dtype: datetime64[s]
>>> 

Expected behavior
We will have to convert the "NaT" to null in to_timestamps to match pandas behavior.

Also, a follow-up to this issue: what should we do in the case of "None" strings? As we see above, pandas actually raises an error, but since we support nulls in our column types, could we also null the "None" values in the to_timestamps method?
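
To make the requested semantics concrete, here is a minimal pure-Python sketch. It is not libcudf's to_timestamps; the function name `strings_to_timestamps`, the `null_literals` parameter, and the (values, validity) return shape are all hypothetical, chosen only to mirror a column with a null mask.

```python
from datetime import datetime, timezone

# Hypothetical default: only "NaT" is treated as null, which matches pandas.
NULL_LITERALS = frozenset({"NaT"})


def strings_to_timestamps(values, fmt="%Y-%m-%d", null_literals=NULL_LITERALS):
    """Return (epoch_seconds, validity) lists for a list of date strings.

    Null rows get a placeholder value of 0 and a validity flag of False,
    loosely imitating a column that carries a separate null mask.
    """
    seconds, validity = [], []
    for text in values:
        if text is None or text in null_literals:
            seconds.append(0)       # placeholder, meaningless without the mask
            validity.append(False)  # marks the row as null
            continue
        # Any other unparseable string raises ValueError here.
        parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        seconds.append(int(parsed.timestamp()))
        validity.append(True)
    return seconds, validity


secs, valid = strings_to_timestamps(
    ["2001-01-01", "2002-02-02", "2000-01-05", "NaT"]
)
```

With the default set, a "None" string raises an error, as pandas does; passing `null_literals={"NaT", "None"}` would null it instead, which is the choice the follow-up question is about. A placeholder of 0 with no validity flag would display as 1970-01-01, the same value the cuDF output above shows for these rows.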

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of cuDF install: from source
Labels: bug, cuDF (Python), strings

All 3 comments

Is NaT a Pandas specific thing? If so, this should be done as a separate preprocessing step in Python to replace "NaT" with a null.
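
A rough sketch of what such a preprocessing step could look like, written against pandas as a stand-in because it runs without a GPU. The `NULL_LITERALS` list is an assumption, and whether the same calls behave identically on a cudf.Series is not verified here.

```python
import pandas as pd

# Assumed list of string literals to null out before the cast.
NULL_LITERALS = ["NaT", "None"]

s = pd.Series(["2001-01-01", "2002-02-02", "2000-01-05", "NaT", "None"])

# Replace the sentinel strings with missing values first...
cleaned = s.mask(s.isin(NULL_LITERALS))

# ...so the string-to-datetime conversion never sees them.
result = pd.to_datetime(cleaned)
```

Doing this in the Python layer keeps the libcudf conversion free of pandas-specific literals.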

Okay, Understood.

It's a numpy specific equivalent of np.nan for times: https://docs.scipy.org/doc/numpy/reference/generated/numpy.isnat.html

Yup, they also use min(int64) as the sentinel value so we should have the tools to handle this already from the Python side.
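
The sentinel can be checked directly in NumPy. This small sketch only inspects NumPy's own representation; how cuDF would map the sentinel to a null is not shown and is left as an assumption.

```python
import numpy as np

# NumPy parses the "NaT" literal itself when building a datetime64 array.
arr = np.array(["2001-01-01", "NaT"], dtype="datetime64[s]")

# Reinterpret the same buffer as int64 to expose the raw stored values.
raw = arr.view("int64")
int64_min = np.iinfo(np.int64).min

# NaT is stored as the smallest int64; np.isnat gives a boolean mask of NaT rows.
is_sentinel = raw == int64_min
nat_mask = np.isnat(arr)
```

Either `is_sentinel` or `nat_mask` could serve as the basis for a null mask when data arrives from NumPy or pandas.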
