Pandas: BUG: indexing with DataFrame with nullable boolean dtype

Created on 16 Sep 2020 · 4 Comments · Source: pandas-dev/pandas

Premise: I tried to look for similar issues (closed or not), but (to my surprise) I couldn't find any.

Problem

I often filter my data by comparing it with given values. This breaks when the data contains pd.NA.

import pandas as pd
from numpy import nan
from pandas import NA

NA == 1, nan == 1
>> (<NA>, False)

NA != 1, nan != 1
>> (<NA>, True)

NA > 1, nan > 1
>> (<NA>, False)

NA < 1, nan < 1
>> (<NA>, False)

Which implies:

import pandas as pd

df = pd.DataFrame([1,2,NA], dtype="Int8")
df == 2
>>
       0
0  False
1   True
2   <NA>

and, even worse:

df[df == 2]
>>
Traceback (most recent call last):
  [ ... ]
  File "../lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 2861, in _extract_bool_array
    assert mask.dtype == bool, mask.dtype
AssertionError: object

As you surely know, this would have worked flawlessly with numpy.nan.
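For comparison, here is a minimal sketch of the numpy.nan case mentioned above. It assumes a plain float64 column; the names df_float, mask and result are only illustrative and do not come from the original report.

```python
import numpy as np
import pandas as pd

# Float column that uses np.nan as the missing-value marker.
df_float = pd.DataFrame([1, 2, np.nan])

# nan == 2 evaluates to False, so the mask is an ordinary NumPy bool frame.
mask = df_float == 2

# Boolean-DataFrame indexing keeps matching cells and sets the rest to NaN.
result = df_float[mask]
print(mask)
print(result)
```

Because the mask never contains a missing value, the indexing step has nothing ambiguous to resolve.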

The solution I'd like

pd.NA and numpy.nan should behave the same, especially with regard to comparisons.

API breaking implications

As far as I know, pd.NA has been declared experimental, so this should not break much, and it may greatly simplify the transition to it for performance and type consistency purposes (which are, from my point of view, its main advantages).

Describe alternatives you've considered

I'm not entirely sure pd.NA should be the same as nan in all regards. The documentation itself does not imply it at all.
That being said, I'd still like to be able to filter my data without so much pain. :)
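As a rough illustration of how the two markers differ, the sketch below contrasts scalar comparisons and logical operations. It relies only on the documented three-valued logic of pd.NA; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# np.nan comparisons always produce a definite bool.
nan_eq = np.nan == 1          # False
nan_ne = np.nan != 1          # True

# pd.NA comparisons propagate "unknown".
na_eq = pd.NA == 1            # <NA>
na_ne = pd.NA != 1            # <NA>

# Logical operators give a definite answer only when the unknown
# operand cannot change the outcome.
na_or_true = pd.NA | True     # True
na_and_false = pd.NA & False  # False
na_or_false = pd.NA | False   # <NA>

print(nan_eq, nan_ne, na_eq, na_ne, na_or_true, na_and_false, na_or_false)
```

Under these rules, a comparison against a missing value is neither a match nor a non-match, which is why a filter built from it has to decide how to treat the unknown entries.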

Bug NA - MaskedArrays

All 4 comments

The differences in behaviour between np.nan and pd.NA are on purpose (see the discussions in https://github.com/pandas-dev/pandas/issues/28095 and https://github.com/pandas-dev/pandas/issues/28778, although those are long discussions; the first issue links to a summary design document that explains this as well).
But ..

I'd still like to be able to filter my data without so much pain. :)

.. it's certainly still the goal that this works intuitively. Using a Series as an example (taken from your example DataFrame):

In [10]: s = df[0]  

In [11]: s   
Out[11]: 
0       1
1       2
2    <NA>
Name: 0, dtype: Int8

In [12]: s == 2  
Out[12]: 
0    False
1     True
2     <NA>
Name: 0, dtype: boolean

In [13]: s[s == 2] 
Out[13]: 
1    2
Name: 0, dtype: Int8

In [14]: df[s == 2]  
Out[14]: 
   0
1  2

The fact that this raises when using a boolean DataFrame as the filter (df[df == 2]) can certainly be considered a bug (or an oversight in the initial implementation).
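Until the DataFrame case is handled, one possible workaround is to turn the nullable mask into a plain bool mask before indexing. This is only a sketch, and it assumes that treating <NA> as "no match" is acceptable for the filter at hand; the names nullable_mask, plain_mask, masked and rows are illustrative.

```python
import pandas as pd

df = pd.DataFrame([1, 2, pd.NA], dtype="Int8")

# Nullable comparison: the result has dtype "boolean" and keeps <NA>.
nullable_mask = df == 2

# Treat "unknown" as "no match", then fall back to a plain NumPy bool mask.
plain_mask = nullable_mask.fillna(False).astype(bool)

# Element-wise masking: non-matching cells become <NA>.
masked = df[plain_mask]

# Row filtering: keep only the rows where column 0 matched.
rows = df[plain_mask[0]]

print(masked)
print(rows)
```

Filling with False mirrors what the Series example above does implicitly, where the <NA> row is dropped from s[s == 2].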

I suspected as much, and I can see (and agree with) the reasons that led to choosing a different behavior from numpy.nan.
I will wait patiently (but eagerly) for updates. :D

Indexing with a nullable boolean DataFrame works on the branch in PR https://github.com/pandas-dev/pandas/pull/36201:

[ins] In [1]: import pandas as pd
         ...:
         ...:
         ...: arr = pd.array([1, 2, None])
         ...: df = pd.DataFrame({"a": arr, "b": arr})
         ...: print(df)
         ...: print(df[df == 1])
         ...:
      a     b
0     1     1
1     2     2
2  <NA>  <NA>
      a     b
0     1     1
1  <NA>  <NA>
2  <NA>  <NA>

@dsaxton thanks for noting! Can you add a test to that PR for this case as well then (and indicate that it will close this issue)? I will also try to take a look at that PR ;)
