Pandas: Support for partial string matching in query

Created on 7 Nov 2014  ·  7 Comments  ·  Source: pandas-dev/pandas

It would be nice to have the query method support partial string matching, so you could do the equivalent of df[df['A'].str.contains("abc")] using query: df.query("A contains 'abc'").

Labels: API Design, Strings

All 7 comments

Sure. Pull requests welcome!

Would it support regex? Or would it be more like the standard Python in operator?

Because I was thinking, similar to

In [13]: df = pd.DataFrame({'a': ['abcde', 'fghij']})

In [14]: 'a' in 'abcd'
Out[14]: True

you could also do:

df.query("'a' in a")

However, I don't know whether this would conflict with the current use of in inside query.
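For context on that question, here is a small sketch of how in behaves inside query today, as far as I understand it: it tests whole cell values for membership, much like isin, rather than looking for substrings. The list values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": ["abcde", "fghij"]})

# Inside query, `in` compares whole cell values against the list,
# which mirrors Series.isin.
whole_value = df.query("a in ['abcde', 'zzzzz']")
same_as_isin = df[df["a"].isin(["abcde", "zzzzz"])]

# A fragment of a cell value is not a member of the column,
# so this selects no rows.
fragment = df.query("a in ['abc']")

print(whole_value)
print(len(fragment))
```

So a substring meaning for in would overlap with syntax that already does something else, which may be why a separate contains keyword was suggested.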

Maybe I'm missing something, but it seems this is still an open issue. I ended up here via a lot of googling. I'm having the same sort of challenge. I've tried in both pandas 0.18.1 and 0.20.1.

I would love to be able to do df.query("A contains 'abc'") as @johanekholm suggested. It's understood that this would be a slower operation than a simpler condition such as == or != but I don't see any downside to having the option.

@rea725 This is an open issue, as the tag indicates.
If you want this implemented, the quickest route would be a pull request.

It looks like I found a solution, by reading the pandas documentation. I get the behavior I seek by passing engine='python'. The explanation totally makes sense, including the recommendation to avoid doing so unless you really need to, since this would be slow compared to the default option. I'm not sure any additional action is merited.

More specifically, I had to do df.query('A.str.contains("abc")', engine='python'), which is maybe not quite as elegant as df.query("A contains 'abc'"), but it is good enough for my purposes.
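A minimal sketch of that workaround next to the boolean-mask form from the original request. The sample data is invented for illustration; only the column name A comes from the thread.

```python
import pandas as pd

# Hypothetical sample data; the column name A follows the thread.
df = pd.DataFrame({"A": ["abcde", "fghij", "xxabc"]})

# Boolean-mask form from the original request.
via_mask = df[df["A"].str.contains("abc")]

# query form: engine="python" evaluates the expression with plain Python,
# so the .str accessor can be called inside the query string.
via_query = df.query('A.str.contains("abc")', engine="python")

print(via_query)
```

Both forms should select the same rows. Note that str.contains returns missing values for missing cells, so a column with NaN would probably need na=False passed to contains.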

Not sure whether I need to open a new issue. If needed, will do.

So I've been using the Series string methods to do some comparisons with an input string. I'm using

series.str.contains(word,case=False)

to create a new dataframe with the results. What I've observed is that if there is a plus sign (+) in the word I supply for the search, the method returns zero results.
Below is a snippet:

import pandas as pd

# Four sample descriptions; the last one contains a literal plus sign.
datadf = pd.DataFrame({'description': ["i am good boy", "i am a bad boy", "i am an ugly boy", "i am a + boy"]})
print(datadf)

word = "i am a + boy"
# Keep only the rows whose description contains word (case-insensitive).
resultdf = datadf[datadf['description'].str.contains(word, case=False)]
print(len(resultdf.index))  # 0 rows, although the last description equals word

The issue is not only with +, but also with *

Also, does this have anything to do with the note below from the official docs?

image
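A likely explanation, sketched under the assumption that the default regex=True is in effect: str.contains treats the search word as a regular expression, so + and * act as quantifiers instead of literal characters. Passing regex=False, or escaping the word with re.escape, should make the match literal.

```python
import re

import pandas as pd

datadf = pd.DataFrame({"description": ["i am good boy", "i am a bad boy",
                                       "i am an ugly boy", "i am a + boy"]})
word = "i am a + boy"

# Default behaviour: word is a regex, so " +" means "one or more spaces".
as_regex = datadf[datadf["description"].str.contains(word, case=False)]

# Option 1: treat word as a plain substring.
literal = datadf[datadf["description"].str.contains(word, case=False, regex=False)]

# Option 2: keep regex matching but escape the metacharacters first.
escaped = datadf[datadf["description"].str.contains(re.escape(word), case=False)]

print(len(as_regex.index), len(literal.index), len(escaped.index))
```

The regex=False form is probably the simpler choice when the search word comes from user input and no pattern matching is wanted.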

Closing. Contributions welcome
