Pandas: ENH/API: check for full rows of DataFrame with isin?

Created on 28 May 2014 · 8 Comments · Source: pandas-dev/pandas

Seeing this SO question: http://stackoverflow.com/questions/23896088/dropping-rows-in-dataframe-if-found-in-another-one/, I was wondering if this is something that could be provided as functionality to the .isin() method.

The problem at the moment with using isin to check for the occurrence of a full row in a DataFrame is that a) isin checks the values in each column separately, independently of the values in other columns (so you cannot check whether the values occur together in the same row), and b) isin also checks whether the index label matches.

Or are there better ways to check for the occurrence of a full row in a DataFrame?

API Design Enhancement Reshaping isin

jorisvandenbossche

👍2

All 8 comments

Wasn't there discussion of an ignore_index=False kw for isin? @TomAugspurger

jreback on 28 May 2014

I think that would be a useful addition. It solves my point b), but it is not enough for this row checking, since the columns are still handled separately.

With an example:

In [250]: df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': ['a', 'b', 'c', 'd']})
In [251]: df
Out[251]: 
   A  B
0  1  a
1  2  b
2  3  c
3  4  d

In [252]: other = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'c', 'd']})
In [253]: other
Out[253]: 
   A  B
0  1  a
1  2  c
2  3  d

isin now also checks for the index label, so this gives:

In [254]: df.isin(other)
Out[254]: 
       A      B
0   True   True
1   True  False
2   True  False
3  False  False

Something like ignore_index=True can be emulated at the moment with to_dict, which gives something like this:

In [258]: df.isin(other.to_dict('list'))   # or df.isin(other, ignore_index=True)
Out[258]: 
       A      B
0   True   True
1   True  False
2   True   True
3  False   True


In [259]: df.isin(other.to_dict('list')).all(axis=1)
Out[259]: 
0     True
1    False
2     True
3    False
dtype: bool

However, if you want to check for entire rows, the result should be something like this:

In [260]: df.isin(other, check_entire_row=True, ignore_index=True)
Out[260]: 
0     True
1    False
2    False
3    False
dtype: bool

The problem with this is that it also changes the output shape: the result is a boolean Series rather than a DataFrame.
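
Until something like this exists in isin, the row-wise result described above can be approximated with a left merge and indicator=True. This is only a sketch: rows_in is a hypothetical helper name, not a pandas function, and it assumes both frames share the same column names with compatible dtypes.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': ['a', 'b', 'c', 'd']})
other = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'c', 'd']})


def rows_in(left, right):
    """Hypothetical helper: True where a full row of `left` occurs in `right`.

    Index labels are ignored; only the column values are compared.
    """
    cols = list(left.columns)
    # Drop duplicate rows of `right` so the join cannot multiply rows of `left`.
    right_unique = right[cols].drop_duplicates()
    # indicator=True adds a `_merge` column: 'both' marks rows found in `right`.
    merged = left.merge(right_unique, on=cols, how='left', indicator=True)
    # A left join keeps the row order of `left`, so its index can be reattached.
    return pd.Series((merged['_merge'] == 'both').to_numpy(), index=left.index)


mask = rows_in(df, other)
print(mask)
```

Here only row 0 is flagged, and df[~mask] keeps the rows of df that do not occur in other.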

jorisvandenbossche on 28 May 2014

There was a discussion about an ignore_index kwarg. I'll look back at the PR, but I think I just left it as a todo, nothing against it in principle.

This seems useful, I could look into doing a PR next week maybe.

TomAugspurger on 29 May 2014

👍1

Yes, this was a todo, but I think we were trying for only one kwarg (and struggling). I like your argument that there should be two.

hayd on 29 May 2014

👍1

This would be very useful to have; the workaround isn't entirely obvious.

pemontto on 27 Jun 2017

Any updates?

JurijsNazarovs on 5 Mar 2018

@JurijsNazarovs This is an open issue, so code contributions to make this actually happen are very welcome.

jorisvandenbossche on 5 Mar 2018

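My workaround (Python 3):

import pandas as pd
from functools import reduce

a = pd.DataFrame([[1, 2], [1, 2], [3, 4], [3, 4]])
b = a.sample(1)  # pick one random row of a to look up

def isin_row(a, b, cols=None):
    # Default to all columns of a; the explicit None check avoids
    # truth-testing cols when it is passed as an Index or array.
    cols = a.columns if cols is None else cols
    # AND together the per-column isin masks
    return reduce(lambda x, y: x & y, [a[f].isin(b[f]) for f in cols])

print(isin_row(a, b))

The result is something like this (here the sampled row was [3, 4]):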

0    False
1    False
2     True
3     True
dtype: bool

which can be used to select rows in the original dataframe.
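
One caveat with combining per-column isin masks: it matches full rows when b holds a single row, but with several rows in b the columns are still checked independently. A minimal sketch of the difference follows, using row tuples through a MultiIndex as one possible alternative. The data is illustrative and not taken from the thread.

```python
import pandas as pd

a = pd.DataFrame([[1, 2], [1, 4], [3, 4], [3, 2]])
b = pd.DataFrame([[1, 2], [3, 4]])

# Column-wise check: every value of `a` occurs somewhere in the same column
# of `b`, so all four rows are flagged, including (1, 4) and (3, 2).
columnwise = a[0].isin(b[0]) & a[1].isin(b[1])

# Row-wise check: treat each row as a tuple and test tuple membership.
b_rows = list(b.itertuples(index=False, name=None))
rowwise = pd.Series(pd.MultiIndex.from_frame(a).isin(b_rows), index=a.index)

print(columnwise.tolist())
print(rowwise.tolist())
```

Only rows 0 and 2 of a occur as complete rows in b, which is what the tuple-based mask reports.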

fingertap on 9 Aug 2018

👍6
