Pandas: combine_first not retaining dtypes

Created on 19 Jun 2014 · 9 Comments · Source: pandas-dev/pandas

I found a number of issues that seemed related, all closed over a year ago, but there still seem to be some inconsistencies here:

In [1]: from datetime import datetime
In [2]: import pandas as pd
In [3]: pd.__version__
Out[3]: '0.13.1'
In [4]: dfa = pd.DataFrame([[datetime.now(), 2]], columns=['a','b'])
In [5]: dfb = pd.DataFrame([[4],[5]], columns=['b'])
In [6]: dfa.dtypes
Out[6]: 
a    datetime64[ns]
b             int64
dtype: object
In [7]: dfb.dtypes
Out[7]: 
b    int64
dtype: object
In [8]: # int64 becomes float64 when combining the two frames
In [9]: dfa.combine_first(dfb).dtypes
Out[9]: 
a    datetime64[ns]
b           float64
dtype: object
In [10]: # datetime64[ns] becomes float64 if the first frame is empty
In [11]: dfa.iloc[:0].combine_first(dfb).dtypes
Out[11]: 
a    float64
b      int64
dtype: object

Bug Dtypes Missing-data

altaurog

👍2

All 9 comments

can you ref the issues?

jreback on 19 Jun 2014

My search turned up issues #3041, #3043, #3552, and #3555

(whoa, github's auto-complete on those issue numbers seems totally unrelated...)

altaurog on 19 Jun 2014

None of those address empty types (and thus there are probably no tests).

care to do a pull-request to put in tests/fix?

jreback on 19 Jun 2014

I haven't looked under the hood; I've no clue how to go about fixing this. Also, there seem to be two separate problems here (perhaps I should have opened two issues):

1. When both DataFrames are NOT empty, combine_first changes int64 to float64.
2. When the DataFrame with the datetime column IS empty, the datetime dtype is not preserved.

altaurog on 19 Jun 2014

The first is not feasible to fix, since int CANNOT hold NaN. Converting to float and then back again where possible is very tricky (and hurts performance).

The second could be fixed, since datetime64[ns] CAN hold missing values (via NaT).

Why don't you write up some tests then... that gets you started :)

jreback on 19 Jun 2014
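
The distinction drawn above (int64 has no slot for NaN, while datetime64[ns] has NaT) can be seen without combine_first at all, using a plain reindex. This is only a minimal sketch: the Series names are made up for illustration, and the last part assumes a pandas version that ships the nullable "Int64" extension dtype, which did not exist in the 0.13.1 release reported above.

```python
import pandas as pd

# An int64 Series: reindexing onto a label it lacks introduces a missing value.
ints = pd.Series([4, 5], index=[0, 1])
ints_reindexed = ints.reindex([0, 1, 2])
print(ints_reindexed.dtype)  # float64, because NaN is a float

# A datetime Series fills the same gap with NaT and keeps a datetime dtype.
stamps = pd.Series(pd.to_datetime(["2014-06-19", "2014-06-20"]), index=[0, 1])
stamps_reindexed = stamps.reindex([0, 1, 2])
print(stamps_reindexed.dtype)

# Assumption: the nullable "Int64" extension dtype is available.
# It stores missing values as pd.NA, so no upcast to float is needed.
nullable = pd.Series([4, 5], index=[0, 1], dtype="Int64")
nullable_reindexed = nullable.reindex([0, 1, 2])
print(nullable_reindexed.dtype)  # Int64
```

Whether combine_first itself preserves "Int64" depends on the pandas version, so the sketch deliberately stops at reindex.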

Oh, of course. The float64 is needed for the nan. But if nan does not result, the int64 is retained:

In [3]: dfa = pd.DataFrame([[datetime.now(), 2]], columns=['a','b'])
In [4]: dfb = pd.DataFrame([[4]], columns=['b'])
In [5]: dfa.combine_first(dfb).dtypes
Out[5]: 
a    datetime64[ns]
b             int64
dtype: object

I should have thought of that.

altaurog on 19 Jun 2014
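
Building on the observation above that the upcast is only needed when NaN actually appears, one possible user-side workaround is to cast columns back after combining. This is a sketch, not a pandas API: `combine_first_restore_dtypes` is a hypothetical helper name, it only handles numeric columns, and it skips any column where the cast would change a value.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def combine_first_restore_dtypes(left, right):
    """Hypothetical helper: combine_first, then undo avoidable upcasts."""
    combined = left.combine_first(right)
    for col in combined.columns:
        # Prefer the calling frame's dtype; fall back to the other frame's.
        source = left if col in left.columns else right
        target = source[col].dtype
        if combined[col].dtype == target or not is_numeric_dtype(target):
            continue
        # A column that still holds missing values must stay upcast,
        # since e.g. int64 cannot represent NaN.
        if not combined[col].notna().all():
            continue
        restored = combined[col].astype(target)
        # Keep the cast only if it is lossless (no truncated fractions).
        if (restored == combined[col]).all():
            combined[col] = restored
    return combined


dfa = pd.DataFrame({"A": [0, 1, 2]}, index=[0, 1, 2])
dfb = pd.DataFrame({"A": [7, 8, 9]}, index=[1, 2, 3])
result = combine_first_restore_dtypes(dfa, dfb)
print(result.dtypes)
```

The extra pass costs a full scan per changed column, which is the kind of performance hit mentioned earlier in the thread, and very large integers may already have lost precision during the intermediate float64 step.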

When you call combine_first, the original float32 column is upcast to float64:

"""Example of Pandas bug."""
from pandas import DataFrame
from numpy import float32

print('-' * 15)
d1 = DataFrame(index=[0])
d1['A'] = [3.5]
d1['A'] = d1['A'].astype(float32)
print(d1.dtypes)
print('-' * 15)
d2 = DataFrame(index=[0])
d2['B'] = [35]  # if this line is commented out, the result is correct, which makes no sense to me
d2 = d2.combine_first(d1)
print(d2.dtypes)

Current output:

---------------
A    float32
dtype: object
---------------
A    float64
B      int64
dtype: object

Expected output:

---------------
A    float32
dtype: object
---------------
A    float32
B      int64
dtype: object

VelizarVESSELINOV on 25 Feb 2016

👍1

This is still a bug in 2020:

>>> dfa = pd.DataFrame({"A": [0, 1, 2]}, index=[0, 1, 2])
>>> dfb = pd.DataFrame({"A": [7, 8, 9]}, index=[1, 2, 3])
>>> dfa.combine_first(dfb).dtypes  # Expect to see int64
A    float64
dtype: object

Any plans on addressing this issue?

lsorber on 23 Mar 2020

👍2

pandas is an all volunteer project.

if you would like to submit a PR then one of the volunteers would be able to code review

jreback on 23 Mar 2020
