Seeing this SO question: http://stackoverflow.com/questions/23896088/dropping-rows-in-dataframe-if-found-in-another-one/, I was wondering whether this could be provided as functionality of the .isin() method.
The problem with using isin at the moment to check for the occurrence of a full row in a DataFrame is that a) isin checks the values in each column separately, independently of the values in other columns (so you cannot check whether the values occur together in the same row), and b) isin also checks whether the index label matches.
Or are there better ways to check for the occurrence of a full row in a DataFrame?
Wasn't there discussion of an ignore_index=False kw for isin? @TomAugspurger
I think that would be a useful addition. It solves my point b), but it is not enough for this row checking, since the columns are still handled separately.
With an example:
In [250]: df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': ['a', 'b', 'c', 'd']})
In [251]: df
Out[251]:
A B
0 1 a
1 2 b
2 3 c
3 4 d
In [252]: other = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'c', 'd']})
In [253]: other
Out[253]:
A B
0 1 a
1 2 c
2 3 d
isin currently also checks the index label, so this gives:
In [254]: df.isin(other)
Out[254]:
A B
0 True True
1 True False
2 True False
3 False False
With something like ignore_index=True (which can be achieved at the moment with to_dict), this would give something like:
In [258]: df.isin(other.to_dict('list')) # or df.isin(other, ignore_index=True)
Out[258]:
A B
0 True True
1 True False
2 True True
3 False True
In [259]: df.isin(other.to_dict('list')).all(1)
Out[259]:
0 True
1 False
2 True
3 False
dtype: bool
However, if you want to check for entire rows, the result should be something like this:
In [260]: df.isin(other, check_entire_row=True, ignore_index=True)
Out[260]:
0 True
1 False
2 False
3 False
dtype: bool
The problem with this is that it also changes the output shape.
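For reference, the row-wise result described above can be approximated today without any new keyword. The following is only a sketch, not a proposed implementation: isin_rows is a hypothetical helper name, and it assumes both frames share the same column names with compatible dtypes. It uses a left merge with indicator=True and ignores the index labels.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': ['a', 'b', 'c', 'd']})
other = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'c', 'd']})

def isin_rows(left, right):
    """Hypothetical helper: True where an entire row of `left` also occurs in `right`."""
    cols = list(left.columns)
    # Drop duplicate rows in `right` so the left merge yields exactly one
    # output row per row of `left`, in the original order.
    merged = left.merge(right[cols].drop_duplicates(), on=cols,
                        how='left', indicator=True)
    # '_merge' is 'both' when the full row was found in `right`.
    return pd.Series((merged['_merge'] == 'both').to_numpy(), index=left.index)

mask = isin_rows(df, other)
print(mask)       # only row 0 (1, 'a') occurs in `other` as a full row
print(df[~mask])  # rows of df that are not found in `other`
```

As with the hypothetical check_entire_row, the result is a Series with one value per row rather than a DataFrame of the original shape.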
There was a discussion about an ignore_index kwarg. I'll look back at the PR, but I think I just left it as a todo, nothing against it in principle.
This seems useful, I could look into doing a PR next week maybe.
Yes, this was a todo, but I think we were trying for only one kwarg (and struggling). I like your argument that there should be two.
This would be very useful to have, the workaround isn't entirely obvious.
Any updates?
@JurijsNazarovs This is an open issue, so code contributions to make this actually happen are very welcome.
My workaround (Python 3):
import pandas as pd
from functools import reduce

a = pd.DataFrame([[1, 2], [1, 2], [3, 4], [3, 4]])
b = a.sample(1)  # pick one random row of a

def isin_row(a, b, cols=None):
    # Default to all columns of a
    cols = a.columns if cols is None else cols
    # AND together the per-column isin masks
    return reduce(lambda x, y: x & y, [a[f].isin(b[f]) for f in cols])

print(isin_row(a, b))
The result is something like this:
0 False
1 False
2 True
3 True
dtype: bool
which can be used to select rows in the original dataframe.
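One caveat, as far as I can tell: this workaround still checks each column on its own and then combines the masks, so once b has more than one row it can flag rows of a whose values never occur together in b. A possible stricter variant, sketched here under the assumption that MultiIndex.from_frame is available (pandas 0.24 or later) and using the hypothetical name isin_row_strict, compares whole rows as tuples:

```python
import pandas as pd

a = pd.DataFrame([[1, 2], [1, 4], [3, 4]])
b = pd.DataFrame([[1, 4], [3, 2]])

def isin_row_strict(a, b, cols=None):
    """Hypothetical helper: True where the whole row of `a` (over `cols`) occurs in `b`."""
    cols = list(a.columns) if cols is None else list(cols)
    # Turn each row into one tuple-like MultiIndex entry, ignoring the index labels.
    left = pd.MultiIndex.from_frame(a[cols])
    right = pd.MultiIndex.from_frame(b[cols])
    return pd.Series(left.isin(right), index=a.index)

# Column-by-column check, as in the workaround above
columnwise = a[0].isin(b[0]) & a[1].isin(b[1])
rowwise = isin_row_strict(a, b)

print(columnwise.tolist())  # [True, True, True]
print(rowwise.tolist())     # [False, True, False]
```

Here only the row (1, 4) really exists in b, yet the column-wise check marks every row because each value appears somewhere in the matching column of b.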