Describe the bug
We can cast date-time values stored as strings to real datetime values using astype. But when a "NaT" string is present in the string column, we should return a null value in its place. Since to_timestamps is the underlying libcudf API that is called to convert strings to datetime types, this needs to be handled in that method. Code sample below.
Steps/Code to reproduce bug
>>> import pandas as pd
>>> s = pd.Series(["2001-01-01", "2002-02-02", "2000-01-05", "NaT"])
>>> s.astype('datetime64[s]')
0 2001-01-01
1 2002-02-02
2 2000-01-05
3 NaT
dtype: datetime64[ns]
>>> s = pd.Series(["2001-01-01", "2002-02-02", "2000-01-05", "None"])
>>> s.astype('datetime64[s]')
Traceback (most recent call last):
File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 1979, in objects_to_datetime64ns
values, tz_parsed = conversion.datetime_to_datetime64(data)
File "pandas/_libs/tslibs/conversion.pyx", line 200, in pandas._libs.tslibs.conversion.datetime_to_datetime64
TypeError: Unrecognized value type: <class 'str'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/generic.py", line 5882, in astype
dtype=dtype, copy=copy, errors=errors, **kwargs
File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 581, in astype
return self.apply("astype", dtype=dtype, **kwargs)
File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 438, in apply
applied = getattr(b, f)(**kwargs)
File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 559, in astype
return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 643, in _astype
values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 715, in astype_nansafe
return astype_nansafe(to_datetime(arr).values, dtype, copy=copy)
File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/util/_decorators.py", line 208, in wrapper
return func(*args, **kwargs)
File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/tools/datetimes.py", line 794, in to_datetime
result = convert_listlike(arg, box, format)
File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/tools/datetimes.py", line 463, in _convert_listlike_datetimes
allow_object=True,
File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 1984, in objects_to_datetime64ns
raise e
File "/conda/envs/cudf/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 1975, in objects_to_datetime64ns
require_iso8601=require_iso8601,
File "pandas/_libs/tslib.pyx", line 465, in pandas._libs.tslib.array_to_datetime
File "pandas/_libs/tslib.pyx", line 688, in pandas._libs.tslib.array_to_datetime
File "pandas/_libs/tslib.pyx", line 822, in pandas._libs.tslib.array_to_datetime_object
File "pandas/_libs/tslib.pyx", line 813, in pandas._libs.tslib.array_to_datetime_object
File "pandas/_libs/tslibs/parsing.pyx", line 225, in pandas._libs.tslibs.parsing.parse_datetime_string
File "/conda/envs/cudf/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 1374, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "/conda/envs/cudf/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 649, in parse
raise ParserError("Unknown string format: %s", timestr)
dateutil.parser._parser.ParserError: Unknown string format: None
>>> import cudf
>>> s = cudf.Series(["2001-01-01", "2002-02-02", "2000-01-05", "NaT"])
>>> s.astype('datetime64[s]')
0 2001-01-01
1 2002-02-02
2 2000-01-05
3 1970-01-01
dtype: datetime64[s]
>>> s = cudf.Series(["2001-01-01", "2002-02-02", "2000-01-05", "None"])
>>> s.astype('datetime64[s]')
0 2001-01-01
1 2002-02-02
2 2000-01-05
3 1970-01-01
dtype: datetime64[s]
>>>
Expected behavior
We will have to convert the "NaT" to null in to_timestamps to match pandas behavior.
Also, a follow-up to this issue: what should we do in the case of "None" strings? As seen above, pandas actually raises an error, but since we support nulls in our column types, could we also null out the "None" values in the to_timestamps method?
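To pin down the behavior being requested, here is a minimal host-side sketch in plain Python. It is not the libcudf to_timestamps API; the function name `to_timestamps_reference`, the `null_tokens` parameter, and the fixed `%Y-%m-%d` format are all hypothetical and exist only to illustrate the two options discussed above: nulling only "NaT", or nulling "None" as well.

```python
from datetime import datetime
from typing import Iterable, List, Optional


def to_timestamps_reference(
    values: Iterable[Optional[str]],
    null_tokens: frozenset = frozenset({"NaT"}),
    fmt: str = "%Y-%m-%d",
) -> List[Optional[datetime]]:
    """Hypothetical reference for the expected semantics (not libcudf code).

    Strings listed in `null_tokens` become None (standing in for a null row);
    every other string must parse with `fmt`, otherwise ValueError is raised.
    """
    out: List[Optional[datetime]] = []
    for value in values:
        if value is None or value in null_tokens:
            out.append(None)
        else:
            out.append(datetime.strptime(value, fmt))
    return out


dates = ["2001-01-01", "2002-02-02", "2000-01-05"]

# Option 1: only "NaT" is treated as null; a "None" string would raise.
strict = to_timestamps_reference(dates + ["NaT"])

# Option 2: "None" is also treated as null (the follow-up question above).
lenient = to_timestamps_reference(
    dates + ["None"], null_tokens=frozenset({"NaT", "None"})
)
```

With the default `null_tokens`, a "None" string raises `ValueError`, which roughly mirrors the pandas error shown above; widening `null_tokens` gives the more permissive behavior.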
Is NaT a Pandas specific thing? If so, this should be done as a separate preprocessing step in Python to replace "NaT" with a null.
Okay, Understood.
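A rough sketch of what such a preprocessing step might look like is shown here with pandas, only because it runs without a GPU. Whether `isin` and `mask` behave identically on cudf string columns is an assumption that has not been verified here, and treating "None" as a null token is likewise an assumption taken from the follow-up question in the issue.

```python
import pandas as pd

# Strings to treat as null before casting. Including "None" is an assumption;
# the issue only settles on "NaT".
NULL_TOKENS = ["NaT", "None"]

s = pd.Series(["2001-01-01", "2002-02-02", "2000-01-05", "NaT", "None"])

# mask() replaces every entry where the condition is True with a missing value,
# so the sentinel strings never reach the string-to-datetime conversion.
cleaned = s.mask(s.isin(NULL_TOKENS))

# Missing values in the input come out as NaT in the datetime result.
result = pd.to_datetime(cleaned)
```

Doing the replacement on the string column first keeps the conversion routine itself free of any pandas- or NumPy-specific sentinel handling.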
It's the NumPy-specific equivalent of np.nan for times: https://docs.scipy.org/doc/numpy/reference/generated/numpy.isnat.html
Yup, they also use min(int64) as the sentinel value so we should have the tools to handle this already from the Python side.
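The sentinel mentioned above can be checked directly in NumPy. This small sketch reinterprets a datetime64 array as int64 and compares the NaT slot against the minimum int64 value; the variable names are illustrative only.

```python
import numpy as np

# A datetime64[s] array with one real timestamp and one NaT.
stamps = np.array(["2001-01-01", "NaT"], dtype="datetime64[s]")

# Reinterpret the same 8-byte values as int64 (no copy, no conversion).
raw = stamps.view("int64")

# NaT is stored as the smallest representable int64.
int64_min = np.iinfo(np.int64).min
is_null = raw == int64_min
```

Comparing against `int64_min` flags the same positions as `np.isnat(stamps)`, so a null mask could in principle be derived from the integer representation alone.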