When I left-join two data frames on an int64 column, this column is cast to a float if it is the index of one of the two data frames:
>>> id = [1, 143349558619761320]
>>> a = pd.DataFrame({'a': [1, 2]}, index=pd.Int64Index(id, name='id'))
>>> b = pd.DataFrame({'id': id[1], 'b': [2]})
>>> merge_index = pd.merge(a, b, left_index=True, right_on='id', how='left')
>>> print(merge_index.dtypes)
a int64
b float64
id float64
dtype: object
In this case this is a problem because casting back to an int changes the value of this column:
>>> int(float(143349558619761320))
143349558619761312
If the int64 column is not in the index, the column is not cast to a float:
>>> merge_column = pd.merge(a.reset_index(), b, on='id', how='left')
>>> print(merge_column.dtypes)
id int64
a int64
b float64
dtype: object
I understand that integer columns get cast to a float if NaNs are introduced (like column b here), but in this case the final column contains no missing values, so casting to a float can be avoided.
Output from pd.show_versions():
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-5-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.16.0
nose: 1.3.4
Cython: None
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.0.0
sphinx: None
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.2
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
I just met exactly the same problem with having casted float64 after left joining with two int64 fields which one of them is an index of left dataframe.
And here's output from show_versions.
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-49-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.16.2
nose: 1.3.6
Cython: 0.22
numpy: 1.9.2
scipy: 0.15.1
statsmodels: None
IPython: 3.1.0
sphinx: 1.3.1
patsy: None
dateutil: 2.4.2
pytz: 2015.2
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: 0.9.1
apiclient: 1.4.0
sqlalchemy: 1.0.8
pymysql: None
psycopg2: None
merge has the same error:
testb = pd.DataFrame([{"a":2,"c":2}])
testa = pd.DataFrame([{"a":1,"b":2}])
result = pd.merge(testa, testb, on=["a"], how="outer")
Just hit the same problem. However, if how='inner
, it's not an issue.
Just encountered the same issue. Here is how I fixed it
# old code - casts 'eCountryID' to float64
merged = pd.merge(
analyzed, ecountry_data[['CountryCode', 'eCountryID']],
how='left', on=['CountryCode'])
# new code
merged_country_id = pd.merge(
analyzed, ecountry_data[['CountryCode', 'eCountryID']],
how='inner', on=['CountryCode']).astype(object)
no_country_id = analyzed[pd.isnull(analyzed['CountryCode'])]
merged = pd.concat([merged_country_id, no_country_id])
this is a dupe of #8596, closed by #13170 which will be in 0.18.2
see http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#merging-changes
Should a new issue be opened for the outer
join issue highlighted by @jiamo ?
@flipdazed this looks ok
In [1]: testb = pd.DataFrame([{"a":2,"c":2}])
...: testa = pd.DataFrame([{"a":1,"b":2}])
...: result = pd.merge(testa, testb, on=["a"], how="outer")
...:
In [2]: testa
Out[2]:
a b
0 1 2
In [3]: testb
Out[3]:
a c
0 2 2
In [4]: result
Out[4]:
a b c
0 1 2.0 NaN
1 2 NaN 2.0
In [5]: result.dtypes
Out[5]:
a int64
b float64
c float64
dtype: object
How to compare result['a'][1] and result['c'][1] after outer merge? they should be equal.
In [1]: testb = pd.DataFrame([{"a":2,"c":2}])
...: testa = pd.DataFrame([{"a":1,"b":2}])
...: result = pd.merge(testa, testb, on=["a"], how="outer")
...:
In [2]: testa
Out[2]:
a b
0 1 2
In [3]: testb
Out[3]:
a c
0 2 2
In [4]: result
Out[4]:
a b c
0 1 2.0 NaN
1 2 NaN 2.0
Edit: What I see happening is actually a join casting ints to floats if the result of the join contains NaN. I'm not 100% sure, but I think this is the expected behavior. Sorry for the confusion.
I see this still happening in 0.23.2. Did it sneak in again? (Left join with int index as described above)
It is happening for me in version 0.23.3, using outer join.
Edit: I agree with teoguso's edit, the casting only occurs if the join subsequently contains NaNs. That explains the casting then.
Edit2: https://pandas.pydata.org/pandas-docs/stable/missing_data.html explains the casting rules for anyone (such as myself) who was unaware of them.
@Muff2n please see my edit above.
Most helpful comment
Edit: What I see happening is actually a join casting ints to floats if the result of the join contains NaN. I'm not 100% sure, but I think this is the expected behavior. Sorry for the confusion.
I see this still happening in 0.23.2. Did it sneak in again? (Left join with int index as described above)