If you have None
in a series, it will not be considered equal to None
(even in the same series).
A = pd.Series([None]*3)
A == A
gives
0 False
1 False
2 False
dtype: bool
but, of course
A.values == A.values
gives
array([ True, True, True])
I can't find any reference to this behaviour in the documentation, and it looks quite unnatural.
Here is the output of pd.show_versions():
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.2
Cython: None
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.0
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Also a bit strange that ==
is inconsistent with the eq
method in this case:
In [2]: pd.__version__
Out[2]: '0.23.0.dev0+658.g17c1fad'
In [3]: s = pd.Series([None]*3)
In [4]: s
Out[4]:
0 None
1 None
2 None
dtype: object
In [5]: s == s
Out[5]:
0 False
1 False
2 False
dtype: bool
In [6]: s.eq(s)
Out[6]:
0 True
1 True
2 True
dtype: bool
I believe this occurs because this comparison hits this block:
And the checknull
function resolves down this path which considers None as null:
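For illustration, the same treatment is visible through the public API (this is only an analogue of the internal checknull path referenced above, not the code itself):
import numpy as np
import pandas as pd

# Both None and np.nan are considered null by pandas' null check,
# which is why the comparison above returns False for None entries.
pd.isna(None)    # True
pd.isna(np.nan)  # True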
see the warnings box: http://pandas.pydata.org/pandas-docs/stable/missing_data.html
This was done quite a while ago to make the behavior of nulls consistent, in that they don't compare equal. This puts None
and np.nan
on an equal (though not consistent with python, BUT consistent with numpy) footing.
So this is not a bug, rather a consequence of straddling 2 conventions.
I suppose the documentation could be slightly enhanced.
However, this should be consistent with this:
In [7]: s = pd.Series([np.nan]*3)
In [8]: s == s
Out[8]:
0 False
1 False
2 False
dtype: bool
In [9]: s.eq(s)
Out[9]:
0 False
1 False
2 False
dtype: bool
so [6] from @jschendel should match [9] here. The incorrect case is actually .eq
here's the dichotomy in scalars
In [11]: None == None
Out[11]: True
In [12]: np.nan == np.nan
Out[12]: False
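As a tangential note for whole-series comparison: Series.equals is documented to treat NaNs in the same locations as equal, a convention separate from the elementwise == / eq behaviour discussed here. A quick sketch:
import numpy as np
import pandas as pd

s = pd.Series([np.nan] * 3)
s.equals(s)     # True: missing values in the same positions count as equal
(s == s).all()  # False: elementwise == follows the NaN convention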
I do not agree with @jreback and the current status of equality for None
in pandas.
I've seen the warning box and I completely agree that, according to IEEE floating-point standard, "a NaN is never equal to any other number (including another NaN)" (cited from What Every Computer Scientist Should Know About Floating-Point Arithmetic).
I have no problem, if the data type of a series is numeric, accepting that NaNs are not equal and even that None
gets converted to NaN.
But if a series contains objects (be they strings, lists, or whatever), equality should be consistent with Python equality for such objects.
It's a basic application of the principle of least astonishment. I don't see any benefit in mangling None
if the data type is non-numeric.
@mapio you are missing the point. None
is NOT the default missing indicator for anything. np.nan
IS. None
is accepted for compatibility.
Where in the documentation is it stated that None
is the "missing indicator"? And, if this is the case, I find it incredibly confusing that
>>> pd.Series([1,None,2])
0 1.0
1 NaN
2 2.0
dtype: float64
but
>>> pd.Series(['a', None, 'b'])
0 a
1 None
2 b
dtype: object
If you pick np.nan
as the missing data indicator, then you must convert every None
in a series to it, not only in numeric series, but also in series of objects. To be clear, I expect the second example to look like this:
>>> pd.Series(['a', None, 'b'])
0 a
1 NaN
2 b
dtype: object
Given such a convention, I would probably still have asked why you decided to translate a None
to a NaN, but I would not have been surprised at all to get False
when comparing the second entries of the two series.
Probably I'm missing the point, but the current situation is, at best, inconsistent (and, as such, very confusing).
http://pandas-docs.github.io/pandas-docs-travis/missing_data.html#values-considered-missing
@mapio how familiar are you with the NumPy / pandas type system? When you do Series([1, None, 2])
, we infer it as floats. If you want python objects, specify that.
In [5]: s = pd.Series([1, None, 2], dtype=object)
In [6]: s
Out[6]:
0 1
1 None
2 2
dtype: object
But as you've seen, many places in pandas, especially in ops like ==
, treat None as missing.
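For instance, an illustrative continuation of the session above (not from the original comment; the output follows the behaviour shown earlier in this thread):
import pandas as pd

s = pd.Series([1, None, 2], dtype=object)
s == s
# 0     True
# 1    False   <- the None entry still compares unequal under ==
# 2     True
# dtype: bool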
I'm quite familiar with NumPy. My point is the exact opposite of what you say.
Since, when you build a series from objects, pandas correctly infers the object
data type, None
values are left as such. But when you compare them (and only then), they are silently treated as NaNs and (due to the IEEE way of handling equality among NaNs) they turn out to be different.
It makes complete sense for numeric series, where None
values are clearly converted to NaNs.
My point is: if you keep None
as elements of object
series and silently convert them to NaNs before comparing them (with ==
or eq
) you are surprising your users (and also your developers, given that this is reported as a bug with regard to eq
).
No implicit conversion to NaN should happen. If you store objects, you should compare them as Python does.
this is a long ago argument. np.nan
is the missing value indicator for ALL dtypes, including object
. This is not going to change. Yes we allow None
to be explicitly passed for object
dtypes and be preserved (but only in this case), as object
can essentially hold anything.
So comparisons must follow the convention that np.nan
behavior is what we do.
leaving this issue open for the sole purpose of fixing [9] above.
If you choose to convert missing values to NaN for ALL (all caps yours) dtypes (a decision I can understand and second), I still don't see why this does not hold for the object dtype!
Please consider converting None
to NaN also for object series: this will make both missing-value handling and equality consistent. I can't see any drawback…
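For what it's worth, a user can already do this normalization explicitly; a minimal sketch (one of several equivalent spellings, not an official recommendation):
import pandas as pd

s = pd.Series(['a', None, 'b'])
# Series.where keeps values where the condition holds and fills the rest
# with NaN (the default `other`), effectively turning None into NaN.
s.where(s.notna())
# 0      a
# 1    NaN
# 2      b
# dtype: object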
I think this will be hard to change as pandas is currently structured. We allow both None and NaN as the missing value indicator in object columns.
In [14]: a = pd.Series(['a', None, float('nan')])
In [15]: a
Out[15]:
0 a
1 None
2 NaN
dtype: object
In [16]: a == a
Out[16]:
0 True
1 False
2 False
dtype: bool
I can't see any drawback…
IIRC, people found it useful to not immediately convert None
to NaN
there. They wanted to keep their python scalars. I don't recall details though.
If we wanted to change this, which I'm not sure we do, I don't see a deprecation path to changing None
to compare equal to itself, though perhaps you do.
How is this affecting you negatively?
I think that people find it useful to not immediately convert to NaN and keep their Python types because they want the objects they put in the series to behave like Python objects. This benefit is lost if you change the way equality works for some of the Python types (like None
)!
Consider this piece of code:
In [1]: import pandas as pd
In [2]: A = pd.Series(['a', None, 2])
In [3]: B = pd.Series(['a', None, 2])
In [4]: all(A == B)
Out[4]: False
In [5]: all(a == b for a, b in zip(A, B))
Out[5]: True
In [6]: all(A.values == B.values)
Out[6]: True
In [7]: all(a == b for a, b in zip(A.values, B.values))
Out[7]: True
I expect that In [4]
should tell me whether the two series are equal member by member, vectorizing the equality in some sense. Currently they turn out not to be! But if I test equality member by member as in In [5]
instead, I get (as expected) that the series are equal member by member.
Now, if I turn the series into numpy.array
s as in In [6]
I get the expected result. Here NumPy is consistent with the member-by-member comparison in In [7]
.
As you can clearly see, there is an inconsistency that is very hard to understand (you have to know, and there is no place in the docs where this is clearly stated) that:
- pandas converts all missing values except for None to NaN,
- pandas (according to IEEE) treats NaNs as different from themselves.
Having an inconsistent behaviour, leaving None
as such but comparing them as different, violates the principle of least astonishment.
I spent quite some time (with two dataframes of tens of thousands of elements of different dtypes) to figure out why there was a difference when, looking at the corresponding rows, all the values were equal!
I don't think anyone can really think that an inconsistent behaviour is a good thing…
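For readers who hit the same surprise when diffing frames, a null-aware elementwise comparison can be written by hand; a small sketch (a workaround, not an API endorsed in this thread):
import pandas as pd

A = pd.Series(['a', None, 2])
B = pd.Series(['a', None, 2])
# Treat two entries as equal if they compare equal OR are both missing.
equal_or_both_na = (A == B) | (A.isna() & B.isna())
equal_or_both_na.all()  # True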
Do you see a way to deprecate this cleanly (along with the behavior of isna
/ notna, groupby, etc.)?
I am just a Pandas user, not a developer. What I can suggest is to look for the code that converts None
to NaN in the numeric series constructors and copy it to the object series constructors, and to document it in the section of the documentation that covers the handling of missing values.
Observe that this is not likely to cause any particular harm. If one puts a None
in a series, one is not likely to invoke methods on it, so changing it to np.nan
will not make much of a difference. Especially since Series.isnull
is (as stated in the documentation) an alias of Series.isna
, code that was looking for None
will still find it, whichever method it calls to test for it.
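Indeed, the public null checks already catch both forms today, which supports the point that isnull / isna based code would keep working; for example:
import numpy as np
import pandas as pd

s = pd.Series(['a', None, np.nan])
s.isna()
# 0    False
# 1     True
# 2     True
# dtype: bool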
If some developer offers to fix the code, I volunteer to fix the docs 😄
There is no way to address this without a major point release, since any change here will break backwards compatibility.
Nevertheless, this issue does imply a clear need for documentation improvement: None
values selectively behave as NaNs.