Premise: I tried to look for similar issues (closed or not), but (to my surprise) I couldn't find any.
I often filter my data by comparing it with given values; this breaks when using pd.NA.
import pandas as pd
from numpy import nan
from pandas import NA
NA == 1, nan == 1
>> (<NA>, False)
NA != 1, nan != 1
>> (<NA>, True)
NA > 1, nan > 1
>> (<NA>, False)
NA < 1, nan < 1
>> (<NA>, False)
Which implies:
import pandas as pd
df = pd.DataFrame([1,2,NA], dtype="Int8")
df == 2
>>
0
0 1
1 2
2 <NA>
and, even worse:
df[df == 2]
>>
Traceback (most recent call last):
[ ... ]
File "../lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 2861, in _extract_bool_array
assert mask.dtype == bool, mask.dtype
AssertionError: object
As you surely know, this would have worked flawlessly with numpy.nan.
pd.NA and numpy.nan should behave the same, especially with regard to comparisons.
As far as I know, pd.NA has been declared experimental, so changing this should not break much, and it may greatly simplify the transition to pd.NA for performance and type consistency purposes (which are, from my point of view, its main advantages).
I'm not entirely sure pd.NA should be the same as nan in all regards. The documentation itself does not imply it at all.
That being said, I'd still like to be able to filter my data without so much pain. :)
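For reference, the difference under discussion can be reproduced in a few lines. This is only a minimal sketch, assuming a pandas version that provides pd.NA (1.0 or later); the variable names are illustrative and not part of the original report.

```python
import numpy as np
import pandas as pd

# A comparison with pd.NA propagates the missing value ("unknown"),
# whereas a comparison with np.nan returns an ordinary bool.
na_result = pd.NA == 1
nan_result = np.nan == 1

print(na_result)
print(nan_result)

# Because the pd.NA result is "unknown", it cannot be coerced to a plain bool.
try:
    bool(na_result)
    coerced = True
except TypeError:
    coerced = False

# pd.isna is a reliable way to test whether a comparison came back missing.
is_missing = pd.isna(na_result)
print(coerced, is_missing)
```

This is why a mask built from such comparisons can contain missing entries, and why code that expects a plain bool mask may fail.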
The differences in behaviour between np.nan and pd.NA are on purpose (see the discussions in https://github.com/pandas-dev/pandas/issues/28095 and https://github.com/pandas-dev/pandas/issues/28778; those are long discussions, but the first issue links to a summary design document that explains this as well).
But ..
I'd still like to be able to filter my data without so much pain. :)
.. it's certainly still the goal that this works intuitively. Using a Series as an example (taken from your example dataframe):
In [10]: s = df[0]
In [11]: s
Out[11]:
0 1
1 2
2 <NA>
Name: 0, dtype: Int8
In [12]: s == 2
Out[12]:
0 False
1 True
2 <NA>
Name: 0, dtype: boolean
In [13]: s[s == 2]
Out[13]:
1 2
Name: 0, dtype: Int8
In [14]: df[s == 2]
Out[14]:
0
1 2
The fact that this raises when using a boolean dataframe as a filter (df[df == 2]) can certainly be considered a bug (or an oversight in the initial implementation).
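Until the DataFrame case is fixed, one possible way to make the intent explicit is to fill the missing entries of the mask before filtering. The following is only a sketch on a Series, assuming pandas 1.0 or later; the names mask, explicit_mask and filtered are illustrative, and whether the same approach helps for a DataFrame mask depends on the dtype that the comparison returns in the installed version.

```python
import pandas as pd

s = pd.Series([1, 2, pd.NA], dtype="Int8")

# The comparison yields a nullable "boolean" Series that may contain <NA>.
mask = s == 2

# Explicitly treat "unknown" as False before using the mask as a filter.
explicit_mask = mask.fillna(False)

# Only the rows where the comparison is definitely True are kept.
filtered = s[explicit_mask]
print(filtered)
```

Filling with False mirrors how a comparison against numpy.nan would have behaved, while keeping the choice visible in the code.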
I suspected as much, and I can see (and agree with) the reasons that led to choosing a behavior different from that of numpy.nan.
I will wait patiently (but eagerly) for updates. :D
Indexing with a nullable boolean DataFrame works on the branch in PR https://github.com/pandas-dev/pandas/pull/36201:
[ins] In [1]: import pandas as pd
...:
...:
...: arr = pd.array([1, 2, None])
...: df = pd.DataFrame({"a": arr, "b": arr})
...: print(df)
...: print(df[df == 1])
...:
a b
0 1 1
1 2 2
2 <NA> <NA>
a b
0 1 1
1 <NA> <NA>
2 <NA> <NA>
@dsaxton thanks for noting! Can you add a test to that PR for this case as well? (And indicate that it will close this issue.) I will also try to take a look at that PR ;)