Pandas: ENH: allow rolling with non-numerical (eg string) data

Created on 5 Oct 2018  路  7Comments  路  Source: pandas-dev/pandas

Hi the Pandas dream team.

I think it would be nice if rolling could accept strings as well (see https://stackoverflow.com/questions/52657429/rolling-with-string-variables)

With the abundance of textual data nowadays, we want Pandas to stay at the top of the curve!

import pandas as pd
import numpy as np

df = pd.DataFrame({'mytime' : [pd.to_datetime('2018-01-01 14:34:12.340'),
                             pd.to_datetime('2018-01-01 14:34:13.0'),
                             pd.to_datetime('2018-01-01 14:34:15.342'),
                             pd.to_datetime('2018-01-01 14:34:16.42'),
                             pd.to_datetime('2018-01-01 14:34:28.742')],
                    'myvalue' : [1,2,np.NaN,3,1],
                    'mychart' : ['a','b','c','d','e']})

df.set_index('mytime', inplace = True)

df
Out[15]: 
                        mychart  myvalue
mytime                                  
2018-01-01 14:34:12.340       a      1.0
2018-01-01 14:34:13.000       b      2.0
2018-01-01 14:34:15.342       c      NaN
2018-01-01 14:34:16.420       d      3.0
2018-01-01 14:34:28.742       e      1.0

Here I want to concatenate the strings in mychart using the values in the last 2 seconds (not the last two observations).

Unfortunately, both attempts below fail miserably

df.mychart.rolling(window = '2s', closed = 'right').apply(lambda x: ' '.join(x), raw = False)
df.mychart.rolling(window = '2s', closed = 'right').apply(lambda x: (x + ' ').cumsum(), raw = False)

TypeError: cannot handle this type -> object

What do you think?
Thanks!

Apply Enhancement Window

Most helpful comment

+1 to this!

Even if most ops require a cast to float, apply should work on strings. I have a table with timestamps and strings, and I was hoping to group records with time windows and process the strings using apply and a custom function. While there may be a workaround, this seems to me the most natural way of doing it.

Hopefully this can be fixed at some point :)

All 7 comments

Related to #18129 and #20244 may be of note. Rolling with object types is not supported in a lot of cases (most ops require or cast to float). IMO can be theoretically supported though may cause conflicts with other issues like #12537

+1 to this!

Even if most ops require a cast to float, apply should work on strings. I have a table with timestamps and strings, and I was hoping to group records with time windows and process the strings using apply and a custom function. While there may be a workaround, this seems to me the most natural way of doing it.

Hopefully this can be fixed at some point :)

In the stackoverflow page mentioned above, it was suggested to use resample as workaround. If rolling on a time-based window, that's a fine workaround. However, if it's rolling on an offset-based window, resample can't help you...

+1

I came across this issue today when trying to apply aggregates as below.

I managed to find another workaround for my use-case of count unique instances in a time-based rolling window, by mapping each unique string to a numeric value.

It would be great to have a fix for this issue, as my workaround is not scalable for large numbers of distinct strings. Thanks!

df.groupby('group_col').rolling('90d', on='time_col').agg({'num_col': 'sum', 'string_col': pd.Series.nunique})

with exception:

/opt/conda/lib/python3.6/site-packages/pandas/core/window.py in _prep_values(self, values, kill_inf)
    224                 raise TypeError("cannot handle this type -> {0}"
--> 225                                 "".format(values.dtype))
    226 

TypeError: cannot handle this type -> object

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/pandas/core/window.py in _prep_values(self, values, kill_inf)
    221             try:
--> 222                 values = _ensure_float64(values)
    223             except (ValueError, TypeError):

pandas/_libs/algos_common_helper.pxi in pandas._libs.algos.ensure_float64()

pandas/_libs/algos_common_helper.pxi in pandas._libs.algos.ensure_float64()

ValueError: could not convert string to float: '<string>'

...

/opt/conda/lib/python3.6/site-packages/pandas/core/window.py in _prep_values(self, values, kill_inf)
    223             except (ValueError, TypeError):
    224                 raise TypeError("cannot handle this type -> {0}"
--> 225                                 "".format(values.dtype))
    226 
    227         if kill_inf:

TypeError: cannot handle this type -> object

Another example where this could be useful:

import random, string

In [101]: strings = np.random.choice([''.join(random.choices(string.ascii_uppercase + string.digits, k=5)) for _ in range(20)], 100) 

In [102]: s = pd.Series(strings, index=pd.date_range("2019-01-01", freq='H', periods=100)) 

In [103]: s.rolling("1D").apply(pd.Series.nunique, raw=False) 
...
TypeError: cannot handle this type -> object

Now, you first need to convert it to numerical values:

In [105]: s.astype('category').cat.codes.rolling('1D').apply(pd.Series.nunique, raw=False)
Out[105]: 
2019-01-01 00:00:00     1.0
2019-01-01 01:00:00     2.0
2019-01-01 02:00:00     2.0
...

I'll happily pick this issue up, I'll open up a PR soon

Need to find out fist/last on a string column in a rolling window.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

ericdf picture ericdf  路  3Comments

MatzeB picture MatzeB  路  3Comments

Ashutosh-Srivastav picture Ashutosh-Srivastav  路  3Comments

amelio-vazquez-reina picture amelio-vazquez-reina  路  3Comments

marcelnem picture marcelnem  路  3Comments