Pandas: ENH: add regex to pandas.DataFrame.isin function (possible speed up?)

Created on 8 May 2014 · 6 Comments · Source: pandas-dev/pandas

I think it would be nice if regex were compatible with pandas.DataFrame.isin.

Let's say I am recording how snakes attack in a DataFrame df. I would tag a column df['Attack method'] by using a list poison and a list squeeze to check for matches in df['Genus']:

df['Attack method'] = (df.Genus.isin(poison).map({True:'Poison',False:''}) +
                df.Genus.isin(squeeze).map({True:'Squeeze',False:''}))

If I were using regex, I would have to somehow get rid of this user warning:

UserWarning: This pattern has match groups. To actually get the groups, use str.extract.

and create for loops like this:

# The column must start out empty so that the first matching pattern wins.
df['Attack method'] = ''
for pattern in poison:
    df.loc[df.Genus.str.contains(pattern) & (df['Attack method'] == ''), 'Attack method'] = 'Poison'
for pattern in squeeze:
    df.loc[df.Genus.str.contains(pattern) & (df['Attack method'] == ''), 'Attack method'] = 'Squeeze'

Unless there are other ways that I missed?
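One alternative to the loops, sketched here under assumptions rather than taken from the thread: join each list into a single alternation pattern and call str.contains once per list. Wrapping each entry in a non-capturing group (?:...) keeps the combined pattern free of match groups, which is what triggers the UserWarning, provided the individual entries contain no capturing groups themselves. The sample data and the helper name combine are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the issue does not show the real frame or lists.
df = pd.DataFrame({"Genus": ["Naja", "Python", "Boa", "Crotalus", "Lampropeltis"]})
poison = ["Naja", "Crotal"]      # regex patterns for venomous genera
squeeze = ["^Python$", "Boa"]    # regex patterns for constrictors


def combine(patterns):
    """Join patterns into one alternation using non-capturing groups."""
    return "|".join(f"(?:{p})" for p in patterns)


# One vectorized str.contains call per list instead of one per pattern.
is_poison = df["Genus"].str.contains(combine(poison), regex=True)
is_squeeze = df["Genus"].str.contains(combine(squeeze), regex=True)

# np.select picks the first condition that is True, so 'Poison' takes
# priority over 'Squeeze', matching the order of the original loops.
df["Attack method"] = np.select(
    [is_poison, is_squeeze], ["Poison", "Squeeze"], default=""
)
print(df)
```

Rows that match neither list keep an empty string, as in the loop version.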

Performance Strings

ccsv

Most helpful comment

It would be great to have regex capabilities in isin, instead of just exact matches. Because str.contains only accepts strings, it's not possible to do something like df1.isin(df2), where df2 could be a dataframe or some other type that contains regex patterns instead of exact matches.

This would make isin much more powerful, and a very short and succinct way to find patterns.

alphaCTzo7G on 5 Jul 2018

👍4
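pandas has no regex option on isin, but a rough approximation of the requested behaviour can be sketched with existing calls. The helper below, isin_regex, is hypothetical and not part of pandas; it assumes the patterns arrive as a plain list of strings and that every column can reasonably be cast to str (missing values would become the string "nan").

```python
import pandas as pd


def isin_regex(frame, patterns):
    """Hypothetical helper: True where a cell matches any of the patterns."""
    combined = "|".join(f"(?:{p})" for p in patterns)
    # Apply the vectorized string match column by column.
    return frame.apply(
        lambda col: col.astype(str).str.contains(combined, regex=True)
    )


# Hypothetical sample data for illustration.
df1 = pd.DataFrame(
    {"Genus": ["Naja", "Python"], "Family": ["Elapidae", "Pythonidae"]}
)
mask = isin_regex(df1, [r"^Py", r"idae$"])
print(mask)
```

The result has the same shape as df1, like the output of DataFrame.isin, so it can be used with the same boolean-indexing idioms.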

All 6 comments

It may be easier (and faster) to spread your list over several columns. So the 'Attack Method' column would become is_poison0, is_poison1, is_poison2, is_squeeze0, ... The value of each element would be True or 1 for the Genus that uses that particular poison or squeeze method.

With this setup you can use the existing isin method pretty easily, I think.

TomAugspurger on 8 May 2014
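One possible reading of this suggestion, as a minimal sketch: create one boolean column per list entry and reduce them with any. The column naming follows the comment (is_poison0, is_squeeze0, ...); the sample data and the exact-match comparison are assumptions.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"Genus": ["Naja", "Python", "Boa", "Crotalus"]})
poison = ["Naja", "Crotalus"]
squeeze = ["Python", "Boa"]

# One indicator column per list entry.
for i, genus in enumerate(poison):
    df[f"is_poison{i}"] = df["Genus"].eq(genus)
for i, genus in enumerate(squeeze):
    df[f"is_squeeze{i}"] = df["Genus"].eq(genus)

# Collapse the indicator columns back into one flag per attack method.
poison_cols = [c for c in df.columns if c.startswith("is_poison")]
squeeze_cols = [c for c in df.columns if c.startswith("is_squeeze")]
uses_poison = df[poison_cols].any(axis=1)
uses_squeeze = df[squeeze_cols].any(axis=1)
print(df.assign(uses_poison=uses_poison, uses_squeeze=uses_squeeze))
```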

@TomAugspurger While what you say is true, I made an example and timed it here: http://www.reddit.com/r/learnpython/comments/250e0m/list_of_strings_vs_using_regex_patterns_which_is/

It looks like regex is 2x as fast if I use the for loops. So by adding it to the isin method I think it would speed up code by the same amount.

ccsv on 9 May 2014

@ccsv your example is way too short to be a good benchmark. str.contains is vectorized and IS much faster and more readable; along with boolean indexing, it is the way to go.

jreback on 9 May 2014

Just to expand on your example, if you replicate the values until you have a frame that's (1000 x 3):

Vectorized boolean indexing:

In [78]: %timeit df.loc[df.Users.isin(ban_list), 'Banned'] = 'Yes'
1000 loops, best of 3: 429 µs per loop

Regex with for loop:

In [87]: %%timeit
   ....: for i in range(len(ban_list)):
   ....:     df.loc[df.Users.str.contains(ban_list[i]) & (df.Banned == ''),'Banned'] = 'Yes'
   ....: for i in range(len(Adminlist)):
   ....:     df.loc[df.Users.str.contains(Adminlist[i]) & (df.Banned == ''),'Banned'] = 'Admin'
   ....: 

100 loops, best of 3: 6.85 ms per loop

It's not likely that allowing regexes will speed the first version up at all. isin uses Cython code that works on numpy arrays. I suspect that it will be slower, but you're welcome to try to implement it and post some benchmarks!

TomAugspurger on 9 May 2014
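For anyone who wants to repeat this comparison outside IPython, here is a minimal sketch using the standard timeit module. The names Users, Banned and ban_list come from the transcript above; the user names, the filler Posts column and the repeat count are assumptions, and the timings will differ from the figures quoted in the thread.

```python
import timeit

import pandas as pd

# Hypothetical data replicated to a (1000 x 3) frame.
ban_list = ["troll", "spammer"]
users = ["alice", "troll", "bob", "spammer", "carol"] * 200
base = pd.DataFrame({"Users": users, "Posts": 0, "Banned": ""})


def with_isin():
    """Exact matching with isin and boolean indexing."""
    df = base.copy()
    df.loc[df["Users"].isin(ban_list), "Banned"] = "Yes"
    return df


def with_regex_loop():
    """One str.contains call per pattern, as in the looped version."""
    df = base.copy()
    for pattern in ban_list:
        matches = df["Users"].str.contains(pattern) & (df["Banned"] == "")
        df.loc[matches, "Banned"] = "Yes"
    return df


isin_seconds = timeit.timeit(with_isin, number=100)
loop_seconds = timeit.timeit(with_regex_loop, number=100)
print(f"isin:       {isin_seconds / 100 * 1e3:.3f} ms per call")
print(f"regex loop: {loop_seconds / 100 * 1e3:.3f} ms per call")
```

Both functions copy the frame on every call so that each run starts from an empty Banned column; the copy cost is included in both timings.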

@ccsv closing for now, if you have a different implementation, pls reopen

thanks for the suggestions!

jreback on 9 May 2014