I think it would be nice if regex were compatible with pandas.DataFrame.isin.
Let's say I am recording the way snakes attack in a dataframe df. I would tag a column df['Attack method'] using a list poison and a list squeeze to check whether there are matches in df['Genus']:
df['Attack method'] = (df.Genus.isin(poison).map({True:'Poison',False:''}) +
df.Genus.isin(squeeze).map({True:'Squeeze',False:''}))
If I were using regex, I would have to somehow get rid of the user warning
UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
and create for loops like this:
for i in range(len(poison)):
    df.loc[df.Genus.str.contains(poison[i]) & (df['Attack method'] == ''), 'Attack method'] = 'Poison'
for i in range(len(squeeze)):
    df.loc[df.Genus.str.contains(squeeze[i]) & (df['Attack method'] == ''), 'Attack method'] = 'Squeeze'
unless there are other ways I missed out?
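One possible way to avoid both the loops and the warning, sketched here under assumptions: join each list into a single alternation, wrapping every pattern in a non-capturing group `(?:...)` so that str.contains sees no match groups. The sample genus names, the patterns, and the helper name `any_of` are hypothetical and only there for illustration. Note that, like the loops above, this gives 'Poison' priority over 'Squeeze' instead of concatenating the two labels.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the genus names and patterns are illustrative only.
df = pd.DataFrame({"Genus": ["Naja", "Python", "Boa", "Vipera", "Lampropeltis"]})
poison = [r"^Naj", r"^Vip"]
squeeze = [r"^Pyth", r"^Boa"]


def any_of(patterns):
    """Join patterns into one alternation of non-capturing groups (hypothetical helper)."""
    return "|".join(f"(?:{p})" for p in patterns)


# One vectorized str.contains call per label instead of one per pattern.
is_poison = df["Genus"].str.contains(any_of(poison), regex=True, na=False)
is_squeeze = df["Genus"].str.contains(any_of(squeeze), regex=True, na=False)

# np.select uses the first condition that is True, so 'Poison' wins over 'Squeeze'.
df["Attack method"] = np.select(
    [is_poison.to_numpy(dtype=bool), is_squeeze.to_numpy(dtype=bool)],
    ["Poison", "Squeeze"],
    default="",
)
```

If the lists hold literal strings rather than regex patterns, passing each entry through re.escape before joining would keep special characters from being interpreted.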
It may be easier (and faster) to spread your list over several columns. So the 'Attack method' column would become is_poison0, is_poison1, is_poison2, is_squeeze0, ... The value of each element would be True or 1 for the Genus that uses that particular poison or squeeze method.
With this setup you can use the existing isin method pretty easily, I think.
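One possible reading of this layout, as a rough sketch: build one boolean column per entry in each list, then combine them row-wise. The sample data and the patterns are hypothetical, and the column-building loop is an assumption about what is meant, not a confirmed recipe.

```python
import pandas as pd

# Hypothetical sample data; the genus names and patterns are illustrative only.
df = pd.DataFrame({"Genus": ["Naja", "Python", "Boa"]})
poison = ["Naj", "Vip"]
squeeze = ["Pyth", "Boa"]

# One indicator column per pattern: is_poison0, is_poison1, is_squeeze0, ...
for i, pattern in enumerate(poison):
    df[f"is_poison{i}"] = df["Genus"].str.contains(pattern, regex=True, na=False)
for i, pattern in enumerate(squeeze):
    df[f"is_squeeze{i}"] = df["Genus"].str.contains(pattern, regex=True, na=False)

# A genus uses a method if any of that method's indicator columns is True.
uses_poison = df.filter(like="is_poison").any(axis=1)
uses_squeeze = df.filter(like="is_squeeze").any(axis=1)
```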
@TomAugspurger While what you say is true, I made an example and timed it here: http://www.reddit.com/r/learnpython/comments/250e0m/list_of_strings_vs_using_regex_patterns_which_is/
It looks like regex is 2x as fast if I use the for loops. So by adding it to the isin method I think it would speed up code by the same amount.
@ccsv your example is way too short to be a good benchmark. str.contains is vectorized and IS much faster, more readable and along with boolean indexing the way to go.
Just to expand on your example, if you replicate the values until you have a frame that's (1000 x 3):
In [78]: %timeit df.loc[df.Users.isin(ban_list), 'Banned'] = 'Yes'
1000 loops, best of 3: 429 µs per loop
In [87]: %%timeit
   ....: for i in range(len(ban_list)):
   ....:     df.loc[df.Users.str.contains(ban_list[i]) & (df.Banned == ''), 'Banned'] = 'Yes'
   ....: for i in range(len(Adminlist)):
   ....:     df.loc[df.Users.str.contains(Adminlist[i]) & (df.Banned == ''), 'Banned'] = 'Admin'
   ....:
100 loops, best of 3: 6.85 ms per loop
It's not likely that allowing regexes will speed the first version up at all. isin uses Cython code that works on numpy arrays. I suspect that it will be slower, but you're welcome to try to implement it and post some benchmarks!
@ccsv closing for now, if you have a different implementation, pls reopen
thanks for the suggestions!
It would be great to have regex capabilities in isin, instead of just exact matches. Because str.contains only accepts strings, it's not possible to do something like df1.isin(df2), where df2 could be a dataframe or some other type that contains regex patterns instead of exact values.
This would make isin much more powerful, and a short and succinct way to find patterns.
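Until something like that exists, the behaviour can be approximated outside pandas. The sketch below assumes a hypothetical helper named `isin_regex` (it is not a pandas API) that accepts a Series or DataFrame plus any iterable of pattern strings, and marks a cell True when the whole cell value matches at least one pattern. It works element by element in Python, so it will not be as fast as the Cython-backed isin.

```python
import re

import pandas as pd


def isin_regex(values, patterns):
    """Hypothetical helper, not part of pandas: regex-based analogue of isin.

    Returns a boolean Series/DataFrame that is True where the whole cell
    value matches at least one of the given regex patterns.
    """
    # Non-capturing groups keep each pattern isolated inside the alternation.
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))

    def matches(value):
        # Non-string cells (numbers, NaN, None) never match.
        return isinstance(value, str) and combined.fullmatch(value) is not None

    if isinstance(values, pd.DataFrame):
        return values.apply(lambda col: col.map(matches))
    return values.map(matches)


# Hypothetical example data; the patterns could equally come from another frame's column.
df1 = pd.DataFrame({"Genus": ["Naja", "Python"], "Family": ["Elapidae", "Pythonidae"]})
patterns = pd.Series([r"Naj.*", r".*idae"])
mask = isin_regex(df1, patterns)
```

Using fullmatch keeps the semantics close to the exact-match behaviour of isin; swapping it for search would give substring-style matching like str.contains.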