Pandas: datetime64[ns] dtype coerced to object after pd.concat(axis=1)

Created on 4 Oct 2019 · 8 comments · Source: pandas-dev/pandas

Code example using pandas 0.25.1

import pandas as pd

df1 = pd.DataFrame([], index=[], columns=["foo"])
df2 = pd.DataFrame(data=list(range(20)), index=pd.date_range(start='2000', end='2020', freq='A-DEC'), columns=["bar"])
print(df2.index.dtype.name)
# Output: `datetime64[ns]`

df = pd.concat([df1, df2], axis=1)
print(df.index.dtype.name)
# Output: `object`

Problem description

The index dtype of the concatenated DataFrame df should be datetime64[ns]. With pandas 0.24.2 this was the case; after upgrading to pandas 0.25.1 it changed to object.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-64-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_ZA.UTF-8
LOCALE : en_ZA.UTF-8

pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 40.4.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.10
IPython : None
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.14.1
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : 1.2.1

Labels: Bug, Dtypes, Reshaping

All 8 comments

@mroeschke I would like to work on this. Could you tell me which files to look into?

I think that you probably meant to ping someone else. I am not involved with this issue and don't know the Pandas codebase well.

@techytushar probably pandas/core/reshape/concat.py

I wonder what the best workaround is?

We have some existing code that stopped working because of this; the constructs typically look like:

# create an empty dataframe
df = pd.DataFrame()
# add series to this dataframe using pd.concat
s0 = pd.Series([1, 2, 3], index=pd.date_range('2019-01-01', periods=3), name='s0')
s1 = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2019-01-01', periods=5), name='s1')
df = pd.concat([df, s0], axis=1)
df = pd.concat([df, s1], axis=1)

# len(df) is now 5 but alas, the index now has dtype object

I have tried the following workarounds (in all code samples, df is an empty dataframe):

a) Assign columns directly:

df[s0.name] = s0
df[s1.name] = s1

# len(df) is 3, so this is not equivalent (but the index has the correct dtype)

b) Use df.join(how='outer'):

df = df.join(s0, how='outer')
df = df.join(s1, how='outer')

# len(df) is 5, so df.join(how='outer') seems to be equivalent to pd.concat([df, series], axis=1)

c) Use df.assign():

df = df.assign(**{s0.name: s0})
df = df.assign(**{s1.name: s1})

# len(df) is 3, so this is not equivalent, but the index has the correct dtype

Based on the code above, it seems to me that pd.concat([df, series], axis=1) is equivalent to df.join(series, how='outer').
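
To check that, here is a minimal sketch (assuming pandas 0.25.x; the equals() comparison is just my own way of testing the equivalence, not anything from the docs):

import pandas as pd

# Compare the two constructions on an empty starter DataFrame.
df = pd.DataFrame()
s = pd.Series([1, 2, 3], index=pd.date_range('2019-01-01', periods=3), name='s')

via_concat = pd.concat([df, s], axis=1)
via_join = df.join(s, how='outer')

# If the two are indeed equivalent, this should print True and two
# matching index dtypes (object on 0.25.x, given this issue).
print(via_concat.equals(via_join))
print(via_concat.index.dtype, via_join.index.dtype)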

Thoughts?

Is this confirmed to be a bug? If I understand issue #23525 and PR #23538 correctly, this is the expected behavior. @ArtinSarraf?

pd.concat() behaves as expected if the initial empty dataframe is created with an (empty) DatetimeIndex:

import pandas as pd

df = pd.DataFrame(index=pd.DatetimeIndex(data=[], freq=None))
s = pd.Series([1, 2, 3], index=pd.date_range('2019-01-01', periods=3))
df = pd.concat([df, s], axis=1)
# df.index is now a DatetimeIndex as expected

@codeape2 - this is the expected and (at least previously) desired behavior. As documented here:
https://pandas.pydata.org/pandas-docs/version/0.25/whatsnew/v0.25.0.html#incompatible-index-type-unions
"The dtype of empty Index objects will now be evaluated before performing union operations rather than simply returning the other Index object." (see Index.union())

Desired behavior raised here:
https://github.com/pandas-dev/pandas/issues/23525#issuecomment-436473763 (and subsequent response from @jreback)

@codeape2 for workarounds to your issue: is there any reason you need to start with an empty df?
You can just do pd.concat([s1, s2], axis=1).
Or, for example, if you were trying to build a frame in a loop, it would probably be better to append the individual items to a list and do a single concat at the end.
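
A minimal sketch of both suggestions (reusing the series from the example further up the thread):

import pandas as pd

s1 = pd.Series([1, 2, 3], index=pd.date_range('2019-01-01', periods=3), name='s1')
s2 = pd.Series([1, 2, 3, 4, 5], index=pd.date_range('2019-01-01', periods=5), name='s2')

# Concatenate the series directly, without an empty starter DataFrame,
# so no empty index takes part in the union.
df = pd.concat([s1, s2], axis=1)

# Or, when building a frame in a loop, collect the pieces first and
# concatenate once at the end.
pieces = []
for s in (s1, s2):
    pieces.append(s)
df = pd.concat(pieces, axis=1)

print(df.index.dtype)  # expected: datetime64[ns]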
