Pandas: Left join on an integer column casts the column to float if it is in one of the indexes

Created on 21 Apr 2015 · 11Comments · Source: pandas-dev/pandas

When I left-join two data frames on an int64 column, this column is cast to a float if it is the index of one of the two data frames:

>>> id = [1, 143349558619761320]
>>> a = pd.DataFrame({'a': [1, 2]}, index=pd.Int64Index(id, name='id'))
>>> b = pd.DataFrame({'id': id[1], 'b': [2]})
>>> merge_index = pd.merge(a, b, left_index=True, right_on='id', how='left')
>>> print(merge_index.dtypes)
a       int64
b     float64
id    float64
dtype: object

In this case this is a problem because casting back to an int changes the value of this column:

>>> int(float(143349558619761320))
143349558619761312

If the int64 column is not in the index, the column is not cast to a float:

>>> merge_column = pd.merge(a.reset_index(), b, on='id', how='left')
>>> print(merge_column.dtypes)
id      int64
a       int64
b     float64
dtype: object

I understand that integer columns get cast to a float if NaNs are introduced (like column b here), but in this case the final column contains no missing values, so casting to a float can be avoided.

Output from pd.show_versions():

INSTALLED VERSIONS

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-5-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.0
nose: 1.3.4
Cython: None
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.0.0
sphinx: None
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.2
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None

Bug Dtypes Duplicate Reshaping

Source

bembom

👍3

Most helpful comment

Edit: What I see happening is actually a join casting ints to floats if the result of the join contains NaN. I'm not 100% sure, but I think this is the expected behavior. Sorry for the confusion.

I see this still happening in 0.23.2. Did it sneak in again? (Left join with int index as described above)

teoguso on 7 Jul 2018

👍14

All 11 comments

I just met exactly the same problem with having casted float64 after left joining with two int64 fields which one of them is an index of left dataframe.

And here's output from show_versions.

INSTALLED VERSIONS

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-49-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.2
nose: 1.3.6
Cython: 0.22
numpy: 1.9.2
scipy: 0.15.1
statsmodels: None
IPython: 3.1.0
sphinx: 1.3.1
patsy: None
dateutil: 2.4.2
pytz: 2015.2
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: 0.9.1
apiclient: 1.4.0
sqlalchemy: 1.0.8
pymysql: None
psycopg2: None

midnightradio on 8 Oct 2015

merge has the same error:

testb = pd.DataFrame([{"a":2,"c":2}])
testa = pd.DataFrame([{"a":1,"b":2}])
result = pd.merge(testa, testb, on=["a"], how="outer")

jiamo on 27 Apr 2016

👍2

Just hit the same problem. However, if how='inner, it's not an issue.

wendykan on 19 May 2016

Just encountered the same issue. Here is how I fixed it

# old code - casts 'eCountryID' to float64
merged = pd.merge(
    analyzed, ecountry_data[['CountryCode', 'eCountryID']],
    how='left', on=['CountryCode'])


# new code
merged_country_id = pd.merge(
    analyzed, ecountry_data[['CountryCode', 'eCountryID']],
    how='inner', on=['CountryCode']).astype(object)

no_country_id = analyzed[pd.isnull(analyzed['CountryCode'])]

merged = pd.concat([merged_country_id, no_country_id])

PedroMDuarte on 12 Jun 2016

this is a dupe of #8596, closed by #13170 which will be in 0.18.2

see http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#merging-changes

jreback on 12 Jun 2016

Should a new issue be opened for the outer join issue highlighted by @jiamo ?

flipdazed on 22 May 2017

@flipdazed this looks ok

In [1]: testb = pd.DataFrame([{"a":2,"c":2}])
   ...: testa = pd.DataFrame([{"a":1,"b":2}])
   ...: result = pd.merge(testa, testb, on=["a"], how="outer")
   ...: 

In [2]: testa
Out[2]: 
   a  b
0  1  2

In [3]: testb
Out[3]: 
   a  c
0  2  2

In [4]: result
Out[4]: 
   a    b    c
0  1  2.0  NaN
1  2  NaN  2.0

In [5]: result.dtypes
Out[5]: 
a      int64
b    float64
c    float64
dtype: object

jreback on 2 Jun 2017

How to compare result['a'][1] and result['c'][1] after outer merge? they should be equal.

In [1]: testb = pd.DataFrame([{"a":2,"c":2}])
 ...: testa = pd.DataFrame([{"a":1,"b":2}])
 ...: result = pd.merge(testa, testb, on=["a"], how="outer")
 ...: 

In [2]: testa
Out[2]: 
   a  b
0  1  2

In [3]: testb
Out[3]: 
   a  c
0  2  2

In [4]: result
Out[4]: 
   a    b    c
0  1  2.0  NaN
1  2  NaN  2.0

verahan on 24 Jul 2017

Edit: What I see happening is actually a join casting ints to floats if the result of the join contains NaN. I'm not 100% sure, but I think this is the expected behavior. Sorry for the confusion.

I see this still happening in 0.23.2. Did it sneak in again? (Left join with int index as described above)

teoguso on 7 Jul 2018

👍14

It is happening for me in version 0.23.3, using outer join.

Edit: I agree with teoguso's edit, the casting only occurs if the join subsequently contains NaNs. That explains the casting then.

Edit2: https://pandas.pydata.org/pandas-docs/stable/missing_data.html explains the casting rules for anyone (such as myself) who was unaware of them.