I found a number of issues that seemed related, all closed over a year ago, but there still seem to be some inconsistencies here:
In [1]: from datetime import datetime
In [2]: import pandas as pd
In [3]: pd.__version__
Out[3]: '0.13.1'
In [4]: dfa = pd.DataFrame([[datetime.now(), 2]], columns=['a','b'])
In [5]: dfb = pd.DataFrame([[4],[5]], columns=['b'])
In [6]: dfa.dtypes
Out[6]:
a datetime64[ns]
b int64
dtype: object
In [7]: dfb.dtypes
Out[7]:
b int64
dtype: object
In [8]: # int64 becomes float64 when combining the two frames
In [9]: dfa.combine_first(dfb).dtypes
Out[9]:
a datetime64[ns]
b float64
dtype: object
In [10]: # datetime64[ns] becomes float64 if the first frame is empty
In [11]: dfa.iloc[:0].combine_first(dfb).dtypes
Out[11]:
a float64
b int64
dtype: object
can you ref the issues?
My search turned up issues #3041, #3043, #3552, and #3555
(whoa, github's auto-complete on those issue numbers seems totally unrelated...)
None of those address empty frames (and thus there are probably no tests for that case).
Care to do a pull request to put in tests/a fix?
I haven't looked under the hood; I've no clue how to go about fixing this. Also, there seem to be two separate problems here (perhaps I should have opened two issues):
The first is not feasible to fix, since an int column CANNOT hold NaN. Converting it to float and then back to int where possible is very tricky (and hurts performance).
The second could be fixed, since datetime64[ns] CAN hold missing values (via NaT).
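The two cases above can be seen without combine_first at all. The following is a minimal sketch that uses Series.reindex to introduce a missing label; the variable names and sample dates are made up for illustration and are not from the original report.

```python
import numpy as np
import pandas as pd

# An int64 column has no way to represent a missing value, so adding a
# label that has no data forces an upcast to float64 (NaN is a float).
ints = pd.Series([4, 5], index=[0, 1], dtype="int64")
widened_ints = ints.reindex([0, 1, 2])
print(widened_ints.dtype)

# A datetime column can represent the missing value as NaT, so the
# column stays a datetime64 dtype after the same operation.
stamps = pd.Series(pd.to_datetime(["2014-01-01", "2014-01-02"]), index=[0, 1])
widened_stamps = stamps.reindex([0, 1, 2])
print(widened_stamps.dtype)
```

This is why the int64 to float64 change is expected whenever a NaN ends up in the result, while the datetime64[ns] to float64 change on an empty frame looks like an avoidable loss of dtype.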
Why don't you write up some tests then... that gets you started :)
Oh, of course. The float64 is needed for the nan. But if nan does not result, the int64 is retained:
In [3]: dfa = pd.DataFrame([[datetime.now(), 2]], columns=['a','b'])
In [4]: dfb = pd.DataFrame([[4]], columns=['b'])
In [5]: dfa.combine_first(dfb).dtypes
Out[5]:
a datetime64[ns]
b int64
dtype: object
I should have thought of that.
When you call combine_first, the original float32 column is converted to float64:
"""Example of Pandas bug."""
from pandas import DataFrame
from numpy import float32
print('-' * 15)
d1 = DataFrame(index=[0])
d1['A'] = [3.5]
d1['A'] = d1['A'].astype(float32)
print(d1.dtypes)
print('-' * 15)
d2 = DataFrame(index=[0])
d2['B'] = [35]  # if you comment out this line the result is correct, which makes no sense to me
d2 = d2.combine_first(d1)
print(d2.dtypes)
Current output:
---------------
A float32
dtype: object
---------------
A float64
B int64
dtype: object
Expected output:
---------------
A float32
dtype: object
---------------
A float32
B int64
dtype: object
This is still a bug in 2020:
>>> dfa = pd.DataFrame({"A": [0, 1, 2]}, index=[0, 1, 2])
>>> dfb = pd.DataFrame({"A": [7, 8, 9]}, index=[1, 2, 3])
>>> dfa.combine_first(dfb).dtypes # Expect to see int64
A float64
dtype: object
Any plans on addressing this issue?
pandas is an all-volunteer project.
If you would like to submit a PR, one of the volunteers would be able to review the code.
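In the meantime, a user-side workaround is possible. The sketch below is not part of pandas: combine_first_keep_dtypes is a hypothetical helper that calls combine_first and then casts a column back only when every input frame that has the column agrees on its dtype and the result contains no missing values. It assumes unique column names.

```python
import pandas as pd


def combine_first_keep_dtypes(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: combine_first, then undo upcasts that were not needed."""
    result = left.combine_first(right)
    for col in result.columns:
        # Collect the dtype of this column from every input frame that has it.
        dtypes = {frame[col].dtype for frame in (left, right) if col in frame.columns}
        if len(dtypes) != 1:
            # The inputs disagree, so there is no single dtype to restore.
            continue
        wanted = dtypes.pop()
        # Cast back only if nothing is missing, so no NaN has to be represented.
        if result[col].dtype != wanted and not result[col].isna().any():
            result[col] = result[col].astype(wanted)
    return result


dfa = pd.DataFrame({"A": [0, 1, 2]}, index=[0, 1, 2])
dfb = pd.DataFrame({"A": [7, 8, 9]}, index=[1, 2, 3])
combined = combine_first_keep_dtypes(dfa, dfb)
print(combined.dtypes)  # column A has the same integer dtype as dfa["A"]
```

The same helper restores float32 in the earlier float32 example, because that column has no missing values after combining. Columns that really do end up with NaN are left as combine_first returned them.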