Hi the Pandas dream team.
I think it would be nice if rolling could accept strings as well (see https://stackoverflow.com/questions/52657429/rolling-with-string-variables)
With the abundance of textual data nowadays, we want Pandas to stay at the top of the curve!
import pandas as pd
import numpy as np
df = pd.DataFrame({'mytime' : [pd.to_datetime('2018-01-01 14:34:12.340'),
pd.to_datetime('2018-01-01 14:34:13.0'),
pd.to_datetime('2018-01-01 14:34:15.342'),
pd.to_datetime('2018-01-01 14:34:16.42'),
pd.to_datetime('2018-01-01 14:34:28.742')],
'myvalue' : [1,2,np.NaN,3,1],
'mychart' : ['a','b','c','d','e']})
df.set_index('mytime', inplace = True)
df
Out[15]:
mychart myvalue
mytime
2018-01-01 14:34:12.340 a 1.0
2018-01-01 14:34:13.000 b 2.0
2018-01-01 14:34:15.342 c NaN
2018-01-01 14:34:16.420 d 3.0
2018-01-01 14:34:28.742 e 1.0
Here I want to concatenate the strings in mychart using the values in the last 2 seconds (not the last two observations).
Unfortunately, both attempts below fail miserably
df.mychart.rolling(window = '2s', closed = 'right').apply(lambda x: ' '.join(x), raw = False)
df.mychart.rolling(window = '2s', closed = 'right').apply(lambda x: (x + ' ').cumsum(), raw = False)
TypeError: cannot handle this type -> object
What do you think?
Thanks!
Related to #18129 and #20244 may be of note. Rolling with object types is not supported in a lot of cases (most ops require or cast to float). IMO can be theoretically supported though may cause conflicts with other issues like #12537
+1 to this!
Even if most ops require a cast to float, apply should work on strings. I have a table with timestamps and strings, and I was hoping to group records with time windows and process the strings using apply and a custom function. While there may be a workaround, this seems to me the most natural way of doing it.
Hopefully this can be fixed at some point :)
In the stackoverflow page mentioned above, it was suggested to use resample as workaround. If rolling on a time-based window, that's a fine workaround. However, if it's rolling on an offset-based window, resample can't help you...
+1
I came across this issue today when trying to apply aggregates as below.
I managed to find another workaround for my use-case of count unique instances in a time-based rolling window, by mapping each unique string to a numeric value.
It would be great to have a fix for this issue, as my workaround is not scalable for large numbers of distinct strings. Thanks!
df.groupby('group_col').rolling('90d', on='time_col').agg({'num_col': 'sum', 'string_col': pd.Series.nunique})
with exception:
/opt/conda/lib/python3.6/site-packages/pandas/core/window.py in _prep_values(self, values, kill_inf)
224 raise TypeError("cannot handle this type -> {0}"
--> 225 "".format(values.dtype))
226
TypeError: cannot handle this type -> object
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/pandas/core/window.py in _prep_values(self, values, kill_inf)
221 try:
--> 222 values = _ensure_float64(values)
223 except (ValueError, TypeError):
pandas/_libs/algos_common_helper.pxi in pandas._libs.algos.ensure_float64()
pandas/_libs/algos_common_helper.pxi in pandas._libs.algos.ensure_float64()
ValueError: could not convert string to float: '<string>'
...
/opt/conda/lib/python3.6/site-packages/pandas/core/window.py in _prep_values(self, values, kill_inf)
223 except (ValueError, TypeError):
224 raise TypeError("cannot handle this type -> {0}"
--> 225 "".format(values.dtype))
226
227 if kill_inf:
TypeError: cannot handle this type -> object
Another example where this could be useful:
import random, string
In [101]: strings = np.random.choice([''.join(random.choices(string.ascii_uppercase + string.digits, k=5)) for _ in range(20)], 100)
In [102]: s = pd.Series(strings, index=pd.date_range("2019-01-01", freq='H', periods=100))
In [103]: s.rolling("1D").apply(pd.Series.nunique, raw=False)
...
TypeError: cannot handle this type -> object
Now, you first need to convert it to numerical values:
In [105]: s.astype('category').cat.codes.rolling('1D').apply(pd.Series.nunique, raw=False)
Out[105]:
2019-01-01 00:00:00 1.0
2019-01-01 01:00:00 2.0
2019-01-01 02:00:00 2.0
...
I'll happily pick this issue up, I'll open up a PR soon
Need to find out fist/last on a string column in a rolling window.
Most helpful comment
+1 to this!
Even if most ops require a cast to float, apply should work on strings. I have a table with timestamps and strings, and I was hoping to group records with time windows and process the strings using apply and a custom function. While there may be a workaround, this seems to me the most natural way of doing it.
Hopefully this can be fixed at some point :)