Pandas: ENH/API: check for full rows of DataFrame with isin?

Created on 28 May 2014  ·  8 Comments  ·  Source: pandas-dev/pandas

Seeing this SO question: http://stackoverflow.com/questions/23896088/dropping-rows-in-dataframe-if-found-in-another-one/, I was wondering if this is something that could be provided as functionality to the .isin() method.

There are currently two problems with using isin to check for the occurrence of a full row in a DataFrame: a) isin checks the values in each column separately, independently of the values in the other columns (so you cannot check whether the values occur together in the same row), and b) when passed a DataFrame, isin also requires the index label to match.

Or are there better ways to check for the occurrence of a full row in a DataFrame?

Labels: API Design, Enhancement, Reshaping, isin

All 8 comments

Wasn't there a discussion of an ignore_index=False kwarg for isin? @TomAugspurger

I think that would be a useful addition. It solves my point b), but it is not enough for this row checking, since the columns are still handled separately.

With an example:

In [250]: df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': ['a', 'b', 'c', 'd']})
In [251]: df
Out[251]: 
   A  B
0  1  a
1  2  b
2  3  c
3  4  d

In [252]: other = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'c', 'd']})
In [253]: other
Out[253]: 
   A  B
0  1  a
1  2  c
2  3  d

isin currently also matches on the index label, so this gives:

In [254]: df.isin(other)
Out[254]: 
       A      B
0   True   True
1   True  False
2   True  False
3  False  False

The effect of something like ignore_index=True can already be achieved by passing a dict (via to_dict) instead of a DataFrame, which gives something like this:

In [258]: df.isin(other.to_dict('list'))   # or df.isin(other, ignore_index=True)
Out[258]: 
       A      B
0   True   True
1   True  False
2   True   True
3  False   True


In [259]: df.isin(other.to_dict('list')).all(1)
Out[259]: 
0     True
1    False
2     True
3    False
dtype: bool

However, if you want to check for entire rows, the result should be something like this:

In [260]: df.isin(other, check_entire_row=True, ignore_index=True)   # proposed kwargs, not implemented
Out[259]: 
0     True
1    False
2    False
3    False
dtype: bool

The problem with this is that it also changes the output shape: the result is a Series with one value per row instead of a DataFrame.
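Until such a keyword exists, one way to get the row-wise result shown above is a left merge with indicator=True. This is only a sketch under assumptions, not a proposed implementation: the helper name rows_in is hypothetical, both frames are assumed to share the same column names, and merge treats NaN keys as equal to each other, which may or may not be what you want.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': ['a', 'b', 'c', 'd']})
other = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'c', 'd']})

def rows_in(left, right):
    """Hypothetical helper: True where a full row of `left` occurs in `right`."""
    cols = list(left.columns)
    # Drop duplicate rows in `right` so the left merge cannot multiply rows of `left`.
    right_unique = right[cols].drop_duplicates()
    merged = left.merge(right_unique, on=cols, how='left', indicator=True)
    # With unique right-hand keys, a left merge keeps one output row per row of
    # `left`, in the same order, so the flags can be re-attached to left.index.
    return pd.Series((merged['_merge'] == 'both').to_numpy(), index=left.index)

mask = rows_in(df, other)
print(mask)
```

Only row 0, (1, 'a'), occurs as a complete row in other, so the mask selects just that row; df[~mask] then gives the "drop rows found in another frame" behaviour from the SO question.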

There was a discussion about an ignore_index kwarg. I'll look back at the PR, but I think I just left it as a todo, nothing against it in principle.

This seems useful, I could look into doing a PR next week maybe.

Yes, this was a todo, but I think we were trying for only one kwarg (and struggling). I like your argument that there should be two.

This would be very useful to have, the workaround isn't entirely obvious.

Any updates?

@JurijsNazarovs This is an open issue, so code contributions to make this actually happen are very welcome.

My workaround (Python 3):

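import pandas as pd
from functools import reduce

a = pd.DataFrame([[1, 2], [1, 2], [3, 4], [3, 4]])
b = a.sample(1)  # one random row of `a` to look up

def isin_row(a, b, cols=None):
    # `cols or a.columns` raises ValueError when cols is an Index, so test for None explicitly.
    cols = a.columns if cols is None else cols
    # AND together the per-column membership tests. Columns are still checked
    # independently, so this is exact only when `b` holds a single row.
    return reduce(lambda x, y: x & y, [a[f].isin(b[f]) for f in cols])

print(isin_row(a, b))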

The result is something like this:

0    False
1    False
2     True
3     True
dtype: bool

which can be used to select rows in the original dataframe.
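Note that the workaround above tests each column on its own, so when b holds more than one row it can flag a row of a whose values come from different rows of b. A possible variant that compares whole rows is to turn both frames into a MultiIndex and use MultiIndex.isin. This is a sketch, not an agreed API: the name isin_full_row is hypothetical, and the example data is chosen to show the difference.

```python
import pandas as pd

a = pd.DataFrame([[1, 2], [1, 4], [3, 4]])
b = pd.DataFrame([[1, 2], [3, 4]])

def isin_full_row(a, b, cols=None):
    """Hypothetical helper: True where the row of `a` (restricted to `cols`) occurs in `b`."""
    cols = list(a.columns) if cols is None else list(cols)
    # Each row becomes one tuple-like entry of a MultiIndex, so rows are compared as a whole.
    left = pd.MultiIndex.from_frame(a[cols])
    right = pd.MultiIndex.from_frame(b[cols])
    return pd.Series(left.isin(right), index=a.index)

mask = isin_full_row(a, b)
print(mask)
```

Here the row (1, 4) is not flagged, although 1 occurs in the first column of b and 4 in the second; a column-by-column test would report it as present.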
