If you have None
in a series, it will not be considered equal to None
(even in the same series).
A = pd.Series([None]*3)
A == A
gives
0 False
1 False
2 False
dtype: bool
but, of course
A.values == A.values
gives
array([ True, True, True])
I can't find any reference to this behaviour in the documentation, and it looks quite unnatural.
Here is the output of pd.show_versions():
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.2
Cython: None
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.0
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Also a bit strange that ==
is inconsistent with the eq
method in this case:
In [2]: pd.__version__
Out[2]: '0.23.0.dev0+658.g17c1fad'
In [3]: s = pd.Series([None]*3)
In [4]: s
Out[4]:
0 None
1 None
2 None
dtype: object
In [5]: s == s
Out[5]:
0 False
1 False
2 False
dtype: bool
In [6]: s.eq(s)
Out[6]:
0 True
1 True
2 True
dtype: bool
I believe this occurs because this comparison hits this block:
And the checknull
function resolves down this path which considers None as null:
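For illustration, the same treatment is visible through the public API (this is only an analogue of the internal checknull path referenced above, not the code itself):
import numpy as np
import pandas as pd

# Both None and np.nan are considered null by pandas' null check,
# which is why the comparison above returns False for None entries.
pd.isna(None)    # True
pd.isna(np.nan)  # True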
see the warnings box: http://pandas.pydata.org/pandas-docs/stable/missing_data.html
This was done quite a while ago to make the behavior of nulls consistent, in that they don't compare equal. This puts None
and np.nan
on an equal (though not consistent with python, BUT consistent with numpy) footing.
So this is not a bug, rather a consequence of straddling 2 conventions.
I suppose the documentation could be slightly enhanced.
However, this should be consistent with this:
In [7]: s = pd.Series([np.nan]*3)
In [8]: s == s
Out[8]:
0 False
1 False
2 False
dtype: bool
In [9]: s.eq(s)
Out[9]:
0 False
1 False
2 False
dtype: bool
so [6] from @jschendel should match [9] here. The incorrect case is actually .eq
here's the dichotomy in scalars
In [11]: None == None
Out[11]: True
In [12]: np.nan == np.nan
Out[12]: False
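As a tangential note for whole-series comparison: Series.equals is documented to treat NaNs in the same locations as equal, a convention separate from the elementwise == / eq behaviour discussed here. A quick sketch:
import numpy as np
import pandas as pd

s = pd.Series([np.nan] * 3)
s.equals(s)     # True: missing values in the same positions count as equal
(s == s).all()  # False: elementwise == follows the NaN convention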
I do not agree with @jreback and the current status of equality for None
in pandas.
I've seen the warning box and I completely agree that, according to IEEE floating-point standard, "a NaN is never equal to any other number (including another NaN)" (cited from What Every Computer Scientist Should Know About Floating-Point Arithmetic).
I have no problem, if the data type of a series is numeric, accepting that NaNs are not equal and even that None
gets converted to NaN.
But if a series contains objects (be they strings, lists, or whatever), equality should be consistent with Python equality for such objects.
It's a basic application of the principle of least astonishment. I don't see any benefit in mangling None
if the data type is non-numeric.
@mapio you are missing the point. None
is NOT the default missing indicator for anything. np.nan
IS. None
is accepted for compatibility.
Where in the documentation is it stated that None
is the "missing indicator"? And, if this is the case, I find it incredibly confusing that
>>> pd.Series([1,None,2])
0 1.0
1 NaN
2 2.0
dtype: float64
but
>>> pd.Series(['a', None, 'b'])
0 a
1 None
2 b
dtype: object
If you pick np.nan
as the missing data indicator, then you must convert every None
in a series to it, not only in numeric series, but also in series of objects. To be clear, I expect the second example to look like this:
>>> pd.Series(['a', None, 'b'])
0 a
1 NaN
2 b
dtype: object
Given such a convention, I would probably still have asked why you decided to translate a None
to a NaN, but I would not have been surprised at all to get False
when comparing the second entries of the two series.
Probably I'm missing the point, but the current situation is, at best, inconsistent (and, as such, very confusing).
http://pandas-docs.github.io/pandas-docs-travis/missing_data.html#values-considered-missing
@mapio how familiar are you with the NumPy / pandas type system? When you do Series([1, None, 2])
, we infer it as floats. If you want python objects, specify that.
In [5]: s = pd.Series([1, None, 2], dtype=object)
In [6]: s
Out[6]:
0 1
1 None
2 2
dtype: object
But as you've seen, many places in pandas, especially in ops like ==
, treat None as missing.
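For instance, an illustrative continuation of the session above (not from the original comment; the output follows the behaviour shown earlier in this thread):
import pandas as pd

s = pd.Series([1, None, 2], dtype=object)
s == s
# 0     True
# 1    False   <- the None entry still compares unequal under ==
# 2     True
# dtype: bool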
I'm quite familiar with NumPy. My point is the exact opposite of what you say.
Since, when you build a series from objects, pandas correctly infers the object
data type, None
values are left as such. But when you compare them (and only then), they are silently treated as NaNs and (due to the IEEE way of handling equality among NaNs) they turn out to be different.
It makes complete sense for numeric series, where None
values are clearly converted to NaNs.
My point is: if you keep None
as elements of object
series and silently convert them to NaNs before comparing them (with ==
or eq
) you are surprising your users (and also your developers, given that this is reported as a bug with regard to eq
).
No implicit conversion to NaN should happen. If you store objects, you should compare them as Python does.
this is a long ago argument. np.nan
is the missing value indicator for ALL dtypes, including object
. This is not going to change. Yes we allow None
to be explicitly passed for object
dtypes and be preserved (but only in this case), as object
can essentially hold anything.
So comparisons must follow the convention that np.nan
behavior is what we do.
leaving this issue open for the sole purpose of fixing [9] above.
If you choose to convert missing values to NaN for ALL (all caps yours) dtypes (a decision I can understand and second), I still don't see why this does not hold for the object dtype!
Please consider converting None
to NaN also for object series: this will make both missing-value handling and equality consistent. I can't see any drawback…
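For what it's worth, a user can already do this normalization explicitly; a minimal sketch (one of several equivalent spellings, not an official recommendation):
import pandas as pd

s = pd.Series(['a', None, 'b'])
# Series.where keeps values where the condition holds and fills the rest
# with NaN (the default `other`), effectively turning None into NaN.
s.where(s.notna())
# 0      a
# 1    NaN
# 2      b
# dtype: object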
I think this will be hard to change as pandas is currently structured. We allow both None and NaN as the missing value indicator in object columns.
In [14]: a = pd.Series(['a', None, float('nan')])
In [15]: a
Out[15]:
0 a
1 None
2 NaN
dtype: object
In [16]: a == a
Out[16]:
0 True
1 False
2 False
dtype: bool
I can't see any drawback…
IIRC, people found it useful to not immediately convert None
to NaN
there. They wanted to keep their python scalars. I don't recall details though.
If we wanted to change this, which I'm not sure we do, I don't see a deprecation path to changing None
to compare equal to itself, though perhaps you do.
How is this affecting you negatively?
I think that people find it useful to not immediately convert to NaN and keep their Python types because they want the objects they put in the series to behave like Python objects. This benefit is lost if you change the way equality works for some of the Python types (like None
)!
Consider this piece of code:
In [1]: import pandas as pd
In [2]: A = pd.Series(['a', None, 2])
In [3]: B = pd.Series(['a', None, 2])
In [4]: all(A == B)
Out[4]: False
In [5]: all(a == b for a, b in zip(A, B))
Out[5]: True
In [6]: all(A.values == B.values)
Out[6]: True
In [7]: all(a == b for a, b in zip(A.values, B.values))
Out[7]: True
I expect that In [4]
should tell me whether the two series are equal member by member, vectorizing the equality in some sense. Currently they turn out not to be! But if I test equality member by member as in In [5]
instead, I get (as expected) that the series are equal member by member.
Now, if I turn the series into numpy.array
s as in In [6]
I get the expected result. Here NumPy is consistent with the member-by-member comparison in In [7]
.
As you can clearly see, there is an inconsistency that is very hard to understand (you have to know, and there is no place in the docs where this is clearly stated) that:
- pandas converts all missing values except for None to NaN,
- pandas (according to IEEE) treats NaNs as different from themselves.
Having an inconsistent behaviour, leaving None
as such but comparing them as different, violates the principle of least astonishment.
I spent quite some time (with two dataframes of tens of thousands of elements of different dtypes) to figure out why there was a difference when, looking at the corresponding rows, all the values were equal!
I don't think anyone can really think that an inconsistent behaviour is a good thing…
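For readers who hit the same surprise when diffing frames, a null-aware elementwise comparison can be written by hand; a small sketch (a workaround, not an API endorsed in this thread):
import pandas as pd

A = pd.Series(['a', None, 2])
B = pd.Series(['a', None, 2])
# Treat two entries as equal if they compare equal OR are both missing.
equal_or_both_na = (A == B) | (A.isna() & B.isna())
equal_or_both_na.all()  # True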
Do you see a way to deprecate this cleanly (along with the behavior of isna
/ notna, groupby, etc.)?
I am just a Pandas user, not a developer. What I can suggest is to look for the code that converts None
to NaN in the numeric series constructors and copy it to the object series constructors, and to document it in the section of the documentation that covers the handling of missing values.
Observe that this is not likely to cause any particular harm. If one puts a None
in a series, one is not likely to invoke methods on it, so changing it to np.nan
will not make much of a difference. Especially since Series.isnull
is (as stated in the documentation) an alias of Series.isna
, code that was looking for None
will still find it, whichever method it calls to test for it.
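Indeed, the public null checks already catch both forms today, which supports the point that isnull / isna based code would keep working; for example:
import numpy as np
import pandas as pd

s = pd.Series(['a', None, np.nan])
s.isna()
# 0    False
# 1     True
# 2     True
# dtype: bool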
If some developer offers to fix the code, I volunteer to fix the docs 😄
There is no way to address this without a major point release, since any change here will break backwards compatibility.
Nevertheless, this issue does imply a clear need for documentation improvement: None
values selectively behave as NaNs.