Pandas: BUG: Nullable integer type cols become 'object' dtype by concatenation

Created on 13 Nov 2019 · 3 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: foo = pd.DataFrame({'a': [1]}).astype('Int64')

In [3]: bar = pd.DataFrame({'a': [2], 'b': [3]}).astype('Int64')

In [4]: pd.concat((foo, bar), sort=False)
Out[4]: 
   a    b
0  1  NaN
0  2    3

In [5]: pd.concat((foo, bar), sort=False).dtypes
Out[5]: 
a     Int64
b    object
dtype: object

Problem description

As shown in the code above, pd.concat((foo, bar)) adds column 'b' to the rows coming from foo, fills it with NaN, and stacks the two frames.
I expected column 'b' to keep the Int64 dtype, since Int64 can hold missing values, but it becomes object instead.

Expected Output

In [5]: pd.concat((foo, bar), sort=False).dtypes
Out[5]: 
a     Int64
b    Int64
dtype: object
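
A possible workaround, sketched here rather than taken from the issue itself: give each frame the missing columns as all-missing Int64 arrays before concatenating, so that every column has the same extension dtype on both sides. The helper name `align` and its `dtype` parameter are hypothetical, and the sketch assumes every missing column should be Int64.

```python
import pandas as pd

foo = pd.DataFrame({"a": [1]}).astype("Int64")
bar = pd.DataFrame({"a": [2], "b": [3]}).astype("Int64")


def align(frame, columns, dtype="Int64"):
    """Hypothetical helper: return a copy of frame that has every column in columns."""
    frame = frame.copy()
    for col in columns:
        if col not in frame.columns:
            # Fill the absent column with missing values of the nullable dtype.
            frame[col] = pd.array([None] * len(frame), dtype=dtype)
    return frame[columns]


# Union of the column names, in first-seen order.
columns = list(dict.fromkeys(list(foo.columns) + list(bar.columns)))

result = pd.concat([align(foo, columns), align(bar, columns)], sort=False)
print(result.dtypes)
```

Because both inputs now carry column 'b' as Int64, the concatenation never has to invent a block for it, and both columns should report Int64.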

Output of `pd.show_versions()`

In [2]: pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.3
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.0.3
setuptools : 40.8.0
Cython : None
pytest : 3.10.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.9.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

Labels: Bug, Duplicate, ExtensionArray

Source

cafeal

All 3 comments

Thanks for the report!

This is related to https://github.com/pandas-dev/pandas/issues/22994, although that issue is about concatenating multiple extension dtype blocks with different dtypes, whereas here the concatenation involves a block that does not exist in the other frame.
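
Since the column is simply absent from one frame, another option is to cast back after concatenating. The sketch below is an illustration, not something proposed in the thread: it records the dtype each column had in the first input frame that defines it (the dict name `wanted` is made up for this example) and reapplies those dtypes to the result. It assumes that a column which fell back to object still holds only integers and missing values, so the cast to Int64 can succeed.

```python
import pandas as pd

foo = pd.DataFrame({"a": [1]}).astype("Int64")
bar = pd.DataFrame({"a": [2], "b": [3]}).astype("Int64")

combined = pd.concat((foo, bar), sort=False)

# Remember the dtype each column had in the first frame that defines it.
wanted = {}
for frame in (foo, bar):
    for col, dtype in frame.dtypes.items():
        wanted.setdefault(col, dtype)

# Reapply the original dtypes; this is a no-op for columns that kept theirs.
restored = combined.astype(wanted)
print(restored.dtypes)
```

This keeps the concat call untouched, at the cost of a second pass over the data.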