Pandas: DataFrame.replace() overwrites when values are non-numeric

Created on 18 Apr 2017 · 5 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

In [1]: pd.Series([1,2,3]).replace({1 : 2, 2 : 3, 3 : 4})
Out[1]: 
0    2
1    3
2    4
dtype: int64

In [2]: pd.Series(['1','2','3']).replace({'1' : '2', '2' : '3', '3' : '4'})
Out[2]:
0    4
1    4
2    4
dtype: object

Problem description

I'd expect replacement over the values in a DataFrame to be non-transitive. Suppose we want to replace a with b, and b with c. When this replacement is applied to an entry containing the value a, the rules are chained, so c is returned instead of b. The same replacement is not transitive for numeric values, as the example code shows.

I think this default behavior should be mentioned explicitly in the documentation. It would also be nice to have a Boolean option to set the transitivity on/off.
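To make the distinction concrete, here is a minimal pure-Python sketch of the two semantics. It is only a model of the observed behavior, not the pandas internals; the helper names `replace_single_pass` and `replace_sequential` are made up for illustration, and the chained result assumes the rules are applied in insertion order (dicts preserve it on Python 3.7+).

```python
def replace_single_pass(values, mapping):
    """Look each value up once; a replaced value is never looked up again."""
    return [mapping.get(v, v) for v in values]


def replace_sequential(values, mapping):
    """Apply the rules one after another, so later rules see earlier results."""
    out = list(values)
    for old, new in mapping.items():
        out = [new if v == old else v for v in out]
    return out


mapping = {'1': '2', '2': '3', '3': '4'}
values = ['1', '2', '3']

single = replace_single_pass(values, mapping)   # non-transitive: ['2', '3', '4']
chained = replace_sequential(values, mapping)   # transitive:     ['4', '4', '4']
```

The single-pass result matches the expected output; the chained result matches what the string example currently returns.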

Expected Output

Out[2]:
0    2
1    3
2    4
dtype: object

Output of pd.show_versions()


INSTALLED VERSIONS


commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

Needs Tests good first issue

All 5 comments

Yeah, I'd consider this more a bug than intended behavior. Deep in the replace code, when the values being replaced have dtype=object, a recursive path is used. I'm not entirely sure why, but it could probably be changed to do only one pass, like you're expecting.

https://github.com/pandas-dev/pandas/blob/2522efa9e687e777d966f49af70b325922699bea/pandas/core/internals.py#L3271
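On affected versions, a possible workaround is to bypass `replace` and do the single pass explicitly. The sketch below uses `Series.map` with a plain dict lookup; it is one way to get the expected result, not a description of how `replace` itself should be fixed.

```python
import pandas as pd

mapping = {'1': '2', '2': '3', '3': '4'}
s = pd.Series(['1', '2', '3'])

# Each element is looked up exactly once; values that are not keys of
# the mapping are returned unchanged.
result = s.map(lambda v: mapping.get(v, v))
```

For a DataFrame, the same function could be applied column by column.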

xref #5541, #5338

Today, when replacing values in a DataFrame with a nested mapping {col: {key: value}}, an error is raised when keys and values overlap. However, that error is really a consequence of this bug, which occurs whether or not the mapping is nested.

Wouldn't it be more consistent to raise the error when the keys and values are overlapping and values are non-numeric, instead of only raising when the mapping is nested?

This would make DataFrame replacement with a nested mapping {col: {key: value}} usable wherever a loop could be used, while raising the error everywhere it is necessary.
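The overlap condition itself is cheap to detect. A rough sketch, where `overlapping_keys_and_values` is a hypothetical helper (not a pandas function) and the replacement values are assumed to be hashable:

```python
def overlapping_keys_and_values(mapping):
    """Return the items that appear both as a key and as a value."""
    return set(mapping) & set(mapping.values())


# '2' and '3' are both replaced and used as replacements, so a chained
# implementation would give a different result than a single pass.
overlap = overlapping_keys_and_values({'1': '2', '2': '3', '3': '4'})

# No overlap: single-pass and chained replacement agree.
no_overlap = overlapping_keys_and_values({'a': 'x', 'b': 'y'})
```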

I don't know if I'm hitting the same bug, but replace corrupts a specific DataFrame I tried it against, running Pandas 0.21 on Python 3.6, Linux. To wit:

In [51]: tmp.iloc[15771]
Out[51]: ''

In [52]: tmp.replace('').iloc[15771]
Out[52]: '[email protected]'

however,

In [55]: tmp.replace([''], [None]).iloc[15771] is None
Out[55]: True

I have no idea how [email protected], which is the value of another row, appeared here. This result is repeated for several rows.

Looks like this is fixed on master. Could use a test.

In [88]: pd.Series(['1','2','3']).replace({'1' : '2', '2' : '3', '3' : '4'})
Out[88]:
0    2
1    3
2    4
dtype: object
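A regression test for this could look roughly like the following. This is only a sketch: the test name is made up, and it assumes a pandas version where the fix is present.

```python
import pandas as pd
import pandas.testing as tm


def test_replace_dict_is_single_pass():
    # Keys and values overlap on purpose; each element should be
    # replaced exactly once instead of being chained through to '4'.
    s = pd.Series(['1', '2', '3'])
    result = s.replace({'1': '2', '2': '3', '3': '4'})
    expected = pd.Series(['2', '3', '4'])
    tm.assert_series_equal(result, expected)
```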