import pandas as pd
print(pd.__version__)

# Example 1: the result differs depending on how the Series is passed
d = pd.DataFrame({'a': [200, 300, '', 'NaN', 10000000000000000000]})

# passing the whole Series returns dtype object
a = pd.to_numeric(d['a'], errors='coerce')
print(a.dtype)

# applying to_numeric element-wise returns dtype float64
b = d['a'].apply(pd.to_numeric, errors='coerce')
print(b.dtype)

# Example 2: why not float64?
d = pd.DataFrame({'a': [200, 300, '', 'NaN', 30000000000000000000]})

# passing the whole Series returns dtype object
a = pd.to_numeric(d['a'], errors='coerce')
print(a.dtype)

# applying to_numeric element-wise raises OverflowError
b = d['a'].apply(pd.to_numeric, errors='coerce')
print(b.dtype)
Hi guys, I realized that the result of to_numeric changes depending on how you pass a Series to it; please see the example above. When I call to_numeric with the Series passed as a parameter, it returns "object", but when I apply to_numeric to that Series element-wise, it returns "float64". Moreover, I'm a bit confused about the correct behavior of to_numeric: why doesn't it convert a very long int-like number to float64? It throws an exception from which I can't even deduce which number (position, index) caused it...
I'm pretty sure my issue is already being discussed somewhere; I tried to search for the proper issue but only found bits and pieces about to_numeric and conversions in general. Please feel free to move my issue to a more appropriate thread.
pd.show_versions()
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.13.1
scipy: None
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
This case is clearly a bug:
In [18]: pd.to_numeric([200, 300, '', 'NaN', 10000000000000000000], errors='coerce')
Out[18]: array([200, 300, '', 'NaN', 10000000000000000000], dtype=object)
But it's not entirely clear how we should be handling "big" integers. numpy does not cast to float on default construction, but we could here.
In [22]: np.array([10000000000000000000])
Out[22]: array([10000000000000000000], dtype=uint64)
In [23]: np.array([30000000000000000000])
Out[23]: array([30000000000000000000], dtype=object)
@chris-b1 : I can explain your first example. You can't hold both NaN and uint64 together in the same numeric dtype without losing precision, so we refuse to cast when that's the case. That's why you get the output you get, so no, that is not a bug IMO.
As for your point about "big" integers, keep in mind that you can't hold integers above np.iinfo(np.uint64).max in any numpy integer dtype. That's just the way numpy dtypes are currently designed. Thus, the second example is moot at this point.
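For reference, a quick check of that ceiling (an illustrative snippet, not part of the original comment):
import numpy as np

# the largest integer any numpy integer dtype can hold
print(np.iinfo(np.uint64).max)                          # 18446744073709551615
# the value from the second example does not fit
print(30000000000000000000 > np.iinfo(np.uint64).max)   # True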
@mficek : My explanation about not holding NaN and uint64 together in the same numeric dtype applies for your two examples where you called pd.to_numeric on the entire Series.
For your first .apply example, try printing the output of the to_numeric call for each element along with its data type. What you will find is that the numeric types do not match (you have uint64, int64, and float). In this case, we just collate the outputs together into an array-like (we have no idea that the function being applied is to_numeric, so precision is moot), leading to the forced cast to float64.
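As an illustrative sketch of that element-wise view (not from the original comment), you can print what each scalar conversion returns before .apply reassembles the results:
import pandas as pd

s = pd.Series([200, 300, '', 'NaN', 10000000000000000000])

# convert each element on its own and look at the scalar that comes back;
# .apply then has to find a common dtype for these mixed results
for x in s:
    converted = pd.to_numeric(x, errors='coerce')
    print(repr(x), '->', repr(converted), type(converted))

print(s.apply(pd.to_numeric, errors='coerce').dtype)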
Of the four examples, only the last one would I consider a bug. When we coerce, we should not crash, because any "bad" elements are supposed to be replaced with NaN. That means we need to catch OverflowError somewhere in the implementation.
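To sketch the kind of handling being suggested (a hypothetical helper, not the pandas implementation): a value past the uint64 ceiling raises OverflowError, and coercion would swallow that and emit NaN instead.
import numpy as np

def coerce_to_uint64(value):
    # hypothetical illustration: anything that overflows uint64 becomes NaN
    # instead of propagating the exception
    try:
        return np.uint64(value)
    except OverflowError:
        return np.nan

print(coerce_to_uint64(10000000000000000000))   # fits in uint64
print(coerce_to_uint64(30000000000000000000))   # overflows uint64 -> nan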
The bug is that we are returning the object unchanged with errors='coerce' - anything unconvertible should be NaN
@mficek was suggesting that we upcast to float in that scenario (which happens to be lossless with this particular value)
In [27]: np.array([30000000000000000000], dtype='float64')
Out[27]: array([ 3.00000000e+19])
The bug is that we are returning the object unchanged with errors='coerce' - anything unconvertible should be NaN
Ah, I see...I thought it was erroring out on an earlier element. Yes, those unconvertible strings should just come back as NaN.
@mficek was suggesting that we upcast to float in that scenario (which happens to be lossless with this particular value)
We made a deliberate decision not to do that. uint64 --> float64 casting is not something we want to do because more often than not we are losing precision. I suppose you could check each number to see whether it would lose precision, but I imagine that would be a performance hit.
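As a quick illustration of that precision loss (values just above the int64 maximum cannot round-trip through float64):
import numpy as np

big = np.uint64(2**63 + 1)     # representable exactly as uint64
as_float = np.float64(big)     # float64 has only a 53-bit mantissa
print(int(big))                # 9223372036854775809
print(int(as_float))           # 9223372036854775808 -- last bit lost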
Thanks for your explanation! I figured out a workaround that works for me with pandas==0.20.3: basically, I convert to string and add '.' at the end of every "number" (which always leads to a conversion to float64), then check for elements bigger than np.iinfo(np.uint64).max and replace them with np.nan. Then I drop the NaNs and convert everything to uint64 (which is the type I need in my code).
# append '.' so every value parses as float64 under coercion
d[col] = (d[col].astype(str) + '.').apply(pd.to_numeric, errors='coerce')
# values too large for uint64 become NaN
d.loc[d[col] > np.iinfo(np.uint64).max, col] = np.nan
# drop the rows that could not be converted, then cast to the target dtype
d = d.dropna(subset=[col])
d[col] = d[col].astype(np.uint64)
Nothing nice, but it works. Can I be of any help in resolving this issue? Despite being a big fan of pandas, I've never contributed to its code...
@chris-b1 : I can actually explain why the first and third examples behave as they do. When we try to convert the "NaN" string to nan, it conflicts with the uint64 value at the end. Thus, we refuse to coerce (or mutate) any values because of that potential loss of precision.
To be fair, this is not well-documented, and we should at least document this (unless people really think this behavior should be changed).
Can I be anyhow helpful in resolving this issue?
@mficek : Absolutely! The following code locations will help you out here:
1) to_numeric can be found here
2) maybe_convert_numeric can be found here (you will encounter this function in the implementation of to_numeric)
Try to figure out why the OverflowError is being raised in your last example. It seems to be triggered from maybe_convert_objects (you can find that here), but walk through the code and see if you can see why.
Finally, here is the link to the documentation for contributing to pandas - that will also give you instructions on how to set up a development environment so that you don't overwrite your existing pandas installation.
I think that errors='coerce' should force everything to float64 if anything is missing. It's not fundamentally different from our normal treatment of ints in the presence of missing data, and it's an opt-in, deliberately lossy kwarg.
In [6]: pd.to_numeric([1, 2, ''], errors='coerce')
Out[6]: array([ 1., 2., nan])
That said, I do see your point that in the range of int64max to uint64max it becomes "lossier"
It's not fundamentally different from our normal treatment of ints in the presence of missing data, and it's an opt-in, deliberately lossy kwarg.
@chris-b1 : I would disagree with you on that. uint is a different beast from int, which is why we have had problems handling it for some time. For starters, it has no nan like int does, which makes it trickier to handle in conjunction with missing data.
Also, having worked on to_numeric before, I can say that previous discussion on this topic has centered around making conversion operations like these as precision-preserving as possible. Thus, I would push back against making coercion "lossy" for uint64, given the issues I've had to patch to prevent that from happening in the first place.
For starters, it has no nan like int does,
int cannot represent nan either
In [1]: np.array([1, 2, np.nan]).dtype
Out[1]: dtype('float64')
int cannot represent nan either
Sorry, what I meant is that we don't have a placeholder for missing uint like we do for int (I don't count nan because of the precision lossiness).
I agree with @chris-b1 here.
It doesn't matter what is there when we have errors='coerce'. If it's not convertible to the lowest common dtype, then it gets a NaN. Now, during the conversion you may be inferring and see a uint64, then a NaN, so we go back to object; but I agree that is buggy, it should return float64 if it has to coerce.
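A minimal sketch of the rule being proposed here (a hypothetical stand-in, not the pandas implementation): convert what you can, turn everything unconvertible into NaN, and return float64 whenever a NaN had to be introduced.
import numpy as np

def coerce_like(values):
    # hypothetical illustration of the proposed errors='coerce' behavior:
    # unparseable or overflowing values become NaN, and the result is
    # float64 rather than object
    out = []
    for v in values:
        try:
            out.append(float(v))
        except (TypeError, ValueError, OverflowError):
            out.append(np.nan)
    return np.array(out, dtype='float64')

print(coerce_like([200, 300, '', 'NaN', 30000000000000000000]))
# all values come back as float64, with the unconvertible entries as nan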