Pandas: combine_first not retaining dtypes

Created on 19 Jun 2014 · 9 Comments · Source: pandas-dev/pandas

I found a number of issues that seemed related, all closed over a year ago, but there still seem to be some inconsistencies here:

In [1]: from datetime import datetime
In [2]: import pandas as pd
In [3]: pd.__version__
Out[3]: '0.13.1'
In [4]: dfa = pd.DataFrame([[datetime.now(), 2]], columns=['a','b'])
In [5]: dfb = pd.DataFrame([[4],[5]], columns=['b'])
In [6]: dfa.dtypes
Out[6]: 
a    datetime64[ns]
b             int64
dtype: object
In [7]: dfb.dtypes
Out[7]: 
b    int64
dtype: object
In [8]: # int64 becomes float64 when combining the two frames
In [9]: dfa.combine_first(dfb).dtypes
Out[9]: 
a    datetime64[ns]
b           float64
dtype: object
In [10]: # datetime64[ns] becomes float64 if the first frame is empty
In [11]: dfa.iloc[:0].combine_first(dfb).dtypes
Out[11]: 
a    float64
b      int64
dtype: object
Labels: Bug, Dtypes, Missing-data

All 9 comments

can you ref the issues?

My search turned up issues #3041, #3043, #3552, and #3555

(whoa, github's auto-complete on those issue numbers seems totally unrelated...)

None of those address empty frames (and thus there are probably no tests).

care to do a pull-request to put in tests/fix?

I haven't looked under the hood; I've no clue how to go about fixing this. Also, there seem to be two separate problems here (perhaps I should have opened two issues):

  1. Specifically when the two DataFrames are both NOT empty, it changes int64 to float64.
  2. When the one with the datetime column IS empty, the datetime dtype is not preserved.

The first is not feasible to fix, since an int column CANNOT hold NaN. Converting to float and then back to int when no NaN results is very tricky (and hurts performance).

The second could be fixed, as datetime64[ns] CAN hold NA (via NaT).
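
To illustrate the point about NaN versus NaT, here is a minimal sketch. It uses `Series.reindex` as a stand-in for the index alignment that happens when two frames are combined. That is an assumption made for illustration, not a trace of what `combine_first` does internally, and the series names are made up for this example.

```python
import numpy as np
import pandas as pd

# Two small series sharing the index [0, 1] (illustrative data only).
ints = pd.Series([4, 5], index=[0, 1])
dates = pd.Series(pd.to_datetime(["2014-06-19", "2014-06-20"]), index=[0, 1])

# Reindexing onto a larger index introduces a missing value at label 2.
ints_aligned = ints.reindex([0, 1, 2])
dates_aligned = dates.reindex([0, 1, 2])

# The integer series is upcast because NaN is a float value.
print(ints_aligned.dtype)   # float64

# The datetime series keeps a datetime64 dtype; the gap is filled with NaT.
print(dates_aligned.dtype)
print(dates_aligned.iloc[2] is pd.NaT)
```

The integer upcast is unavoidable with NumPy-backed integer columns, whereas the datetime column has a native missing-value marker, which is why only the second problem is treated as fixable.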

Why don't you write up some tests then... that gets you started :)

Oh, of course. The float64 is needed for the nan. But if nan does not result, the int64 is retained:

In [3]: dfa = pd.DataFrame([[datetime.now(), 2]], columns=['a','b'])
In [4]: dfb = pd.DataFrame([[4]], columns=['b'])
In [5]: dfa.combine_first(dfb).dtypes
Out[5]: 
a    datetime64[ns]
b             int64
dtype: object

I should have thought of that.

When you call combine_first, the original float32 column is converted to float64:

"""Example of Pandas bug."""
from pandas import DataFrame
from numpy import float32

print('-' * 15)
d1 = DataFrame(index=[0])
d1['A'] = [3.5]
d1['A'] = d1['A'].astype(float32)
print(d1.dtypes)
print('-' * 15)
d2 = DataFrame(index=[0])
d2['B'] = [35]  # if this line is commented out, the result is correct, which makes no sense to me
d2 = d2.combine_first(d1)
print(d2.dtypes)

Current output:

---------------
A    float32
dtype: object
---------------
A    float64
B      int64
dtype: object

Expected output:

---------------
A    float32
dtype: object
---------------
A    float32
B      int64
dtype: object

This is still a bug in 2020:

>>> dfa = pd.DataFrame({"A": [0, 1, 2]}, index=[0, 1, 2])
>>> dfb = pd.DataFrame({"A": [7, 8, 9]}, index=[1, 2, 3])
>>> dfa.combine_first(dfb).dtypes  # Expect to see int64
A    float64
dtype: object

Any plans on addressing this issue?
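
Until this is addressed, one possible workaround is to cast columns back to their original dtype after combining, but only when the result contains no missing values. The sketch below uses the frames from the comment above. The helper name `restore_dtypes` is hypothetical and not part of pandas.

```python
import pandas as pd


def restore_dtypes(combined, original):
    """Cast columns of `combined` back to the dtypes they had in `original`.

    Hypothetical helper (not a pandas API). Columns that contain missing
    values are left untouched, since e.g. int64 cannot represent NaN.
    """
    result = combined.copy()
    for col in result.columns:
        if col in original.columns and not result[col].isna().any():
            result[col] = result[col].astype(original[col].dtype)
    return result


dfa = pd.DataFrame({"A": [0, 1, 2]}, index=[0, 1, 2])
dfb = pd.DataFrame({"A": [7, 8, 9]}, index=[1, 2, 3])

combined = restore_dtypes(dfa.combine_first(dfb), dfa)
print(combined.dtypes)
print(combined["A"].tolist())  # [0, 1, 2, 9]
```

This costs an extra pass over the data, which is the performance concern raised earlier in the thread, so it is better suited to user code than to a default behaviour.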

pandas is an all-volunteer project.

If you would like to submit a PR, one of the volunteers would be able to review it.
