Describe the bug
When trying to call quantile with datetime64[ns] data, I get the following exception:
RuntimeError: cuDF failure at: /opt/conda/envs/rapids/conda-bld/libcudf_1598487738985/work/cpp/src/quantiles/quantile.cu:45: quantile does not support non-numeric types
Steps to Reproduce
import numpy as np
import pandas as pd
import cudf
rng = pd.date_range('2015-02-24', periods=5, freq='D')
df = pd.DataFrame({ 'date': rng, 'Val' : np.random.randn(len(rng))})
cdf = cudf.DataFrame.from_pandas(df)
cdf.quantile(0.8, **{})
I tried to work around this by casting:
cdf['date'].astype('int64').quantile(0.8, **{}).astype('datetime64[ns]')
But this gives me an error related to the first parameter (q):
AttributeError: 'Scalar' object has no attribute 'astype'
Expected behavior
Quantile should accept a float (in addition to an array-like) for the input q parameter. It should also accept datetime data and return the appropriate quantile result to match pandas quantile: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html
Throws:
RuntimeError: cuDF failure at: /opt/conda/envs/rapids/conda-bld/libcudf_1598487738985/work/cpp/src/quantiles/quantile.cu:45: quantile does not support non-numeric types
@kkraus14 I don't understand your comment.
The reproducer code above throws the exception in my comment.
OK, I was just confused because you repeated what the OP wrote. So are we really expected to compute quantiles of non-numeric types?
Woops, missed it from not being in a code block, sorry!
I'm not sure it makes sense for all non-numeric types, but it does make sense for datetimes, because you can interpolate between them when a quantile doesn't fall on an exact value.
Yeah, that seems doable. Just a matter of when to prioritize it...
Changed to feature request and edited title.
I do not think quantiles at the C++ level should support datetime types, as doing so loses type information, i.e., quantiles always returns double. We can't represent floating point datetime values. The caller should opt into losing the type information by first casting to an integer type (ideally via logical_cast to avoid a deep copy).
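To make the trade-off above concrete, here is a minimal plain-Python sketch of the cast-to-integer round trip. It is not cudf or libcudf code: `to_epoch_ns`, `quantile_as_double`, and `from_epoch_ns` are hypothetical helpers, and linear interpolation is assumed. The point is only that the quantile step hands back a double, and the caller has to opt into turning it back into a timestamp.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def to_epoch_ns(ts):
    """Hypothetical helper: naive datetime -> integer nanoseconds since the epoch."""
    delta = ts - EPOCH
    return (delta.days * 86400 + delta.seconds) * 10**9 + delta.microseconds * 1000

def quantile_as_double(ints, q):
    """Linearly interpolated quantile of integers; the result is always a float."""
    ordered = sorted(ints)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return float(ordered[lo]) + (ordered[hi] - ordered[lo]) * (pos - lo)

def from_epoch_ns(ns):
    """Hypothetical helper: the explicit cast back, rounding to microseconds."""
    return EPOCH + timedelta(microseconds=round(ns / 1000))

dates = [datetime(2015, 2, 24) + timedelta(days=i) for i in range(5)]
as_double = quantile_as_double([to_epoch_ns(d) for d in dates], 0.8)
as_datetime = from_epoch_ns(as_double)
print(type(as_double).__name__, as_datetime)
```

The intermediate value is a float count of nanoseconds, so any sub-resolution precision is lost at the final cast; that is the type information the caller agrees to give up.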
Updated to just be a Python feature request, thanks!
I wasn't sure if I should add this clarification here or edit the first post, but I'd also like to make sure that quantile returns a float when q is a float. This is the expected behavior for pandas.Series.quantile: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.quantile.html
Let me know if I should add that to the initial post. Thanks.
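For reference, the scalar-versus-array-like behavior being asked for could look roughly like the sketch below. This is plain Python, not the cudf implementation; the function name `quantile` and the helper `_interp` are illustrative only, and linear interpolation is assumed.

```python
from numbers import Real

def _interp(ordered, q):
    """Linear interpolation between the two nearest order statistics."""
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def quantile(values, q):
    """Return a float for a scalar q and a list of floats for an array-like q."""
    scalar_q = isinstance(q, Real)
    qs = [q] if scalar_q else list(q)
    if any(not 0 <= x <= 1 for x in qs):
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(float(v) for v in values)
    results = [_interp(ordered, x) for x in qs]
    # Unwrap only when the caller passed a scalar, mirroring the requested behavior.
    return results[0] if scalar_q else results

data = [1, 2, 3, 4, 5]
print(quantile(data, 0.5))         # scalar in, scalar out
print(quantile(data, [0.5, 1.0]))  # list in, list out
```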
@programmylife leaving in this issue is perfectly fine, thanks!
I can work on this
So one small issue with this is that when given a single value for q pandas returns a series like:
date    2015-02-27 04:48:00
Val                 1.36864
Name: 0.8, dtype: object
where each original column is represented as a row.
However, cudf doesn't seem to support multiple data types in a series, so in cases where there's at least one datetime column and one numeric column, cudf can't mimic pandas. Do you think it'd be best to return the result as a pandas series, or to return a cudf dataframe formatted like:
                    date       val
0.8  2015-02-27 04:48:00  1.368641
which is what pandas returns when a list is given as q?
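As a rough illustration of that second layout (one row per q, one column per original column, so no single series has to mix dtypes), here is a plain-Python sketch. `column_quantile` and `quantile_table` are hypothetical names, the numeric values are made up, and nothing here reflects how cudf would actually build the frame.

```python
from datetime import datetime, timedelta

def column_quantile(values, q):
    """Linearly interpolated quantile; works for floats and datetimes alike."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    # datetime + timedelta * float is valid Python, so one formula serves both types.
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def quantile_table(columns, qs):
    """Hypothetical helper: rows are keyed by q, each column keeps its own type."""
    return {
        "index": list(qs),
        "columns": {
            name: [column_quantile(vals, q) for q in qs]
            for name, vals in columns.items()
        },
    }

start = datetime(2015, 2, 24)
frame = {
    "date": [start + timedelta(days=i) for i in range(5)],
    "val": [1.0, 2.0, 3.0, 4.0, 5.0],  # made-up numbers, not the random data above
}
table = quantile_table(frame, [0.8])
print(table["index"], table["columns"]["date"], table["columns"]["val"])
```

Because every column holds values of a single type, a scalar q can simply be treated as a one-element list.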
I just realized my repro code is slightly different than what we have in our codebase. We are supplying a Series as input (and a float for q), so we get back just the datetime value from Pandas. A better example that shows this is:
import pandas as pd
import cudf
rng = pd.date_range('2015-02-24', periods=5, freq='D')
df = pd.Series(rng)
cdf = cudf.Series.from_pandas(df)  # the input is a Series, so build a cudf Series
cdf.quantile(0.8, **{})
As for your proposed solution, I don't have a strong opinion but your solution seems reasonable if there indeed is no way to have multiple types in a series in cudf.