Pandas: DataFrame.replace() overwrites when values are non-numeric

Created on 18 Apr 2017 · 5 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

In [1]: pd.Series([1,2,3]).replace({1 : 2, 2 : 3, 3 : 4})
Out[1]: 
0    2
1    3
2    4
dtype: int64

In [2]: pd.Series(['1','2','3']).replace({'1' : '2', '2' : '3', '3' : '4'})
Out[2]:
0    4
1    4
2    4
dtype: object

Problem description

I'd expect replacement over the values in a DataFrame to be non-transitive. Suppose we want to replace a with b, and b with c. When this mapping is applied to an entry containing the value a, the replacement rules are chained, so c is returned instead of b. The same replacement is not transitive for numeric values, as the example code shows.

I think this default behavior should be mentioned explicitly in the documentation. It would also be nice to have a Boolean option to set the transitivity on/off.
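To make the two semantics concrete, here is a minimal pure-Python sketch. The function names `replace_chained` and `replace_single_pass` are hypothetical and only illustrate the difference; this is not the pandas implementation.

```python
def replace_chained(values, mapping):
    """Apply each rule in turn to the whole list (transitive)."""
    result = list(values)
    for old, new in mapping.items():
        # A value rewritten by an earlier rule can be rewritten again here.
        result = [new if v == old else v for v in result]
    return result


def replace_single_pass(values, mapping):
    """Look each original value up exactly once (non-transitive)."""
    return [mapping.get(v, v) for v in values]


mapping = {'1': '2', '2': '3', '3': '4'}
values = ['1', '2', '3']

chained = replace_chained(values, mapping)
single = replace_single_pass(values, mapping)
print(chained)  # ['4', '4', '4']
print(single)   # ['2', '3', '4']
```

As a possible workaround on affected versions, the single-pass behaviour can be had on a Series with `s.map(lambda v: mapping.get(v, v))`, which looks up each element once and leaves unmapped values unchanged.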

Expected Output

Out[2]:
0    2
1    3
2    4
dtype: object

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

Needs Tests good first issue

Source

cx0

All 5 comments

Yeah, I'd consider this more a bug than intended behavior. Deep in the replace code, when values of dtype=object are being replaced, a recursive path is used. I'm not entirely sure why, but it could probably be changed to do only one pass, as you're expecting.

https://github.com/pandas-dev/pandas/blob/2522efa9e687e777d966f49af70b325922699bea/pandas/core/internals.py#L3271

chris-b1 on 18 Apr 2017

👍2

xref #5541, #5338

chris-b1 on 18 Apr 2017

👍2

Today, when replacing values in a DataFrame with a nested mapping {col: {key: value}}, an error is raised when keys and values overlap. However, the underlying problem is this bug, which occurs whether or not the mapping is nested.

Wouldn't it be more consistent to raise the error whenever keys and values overlap and the values are non-numeric, instead of raising only when the mapping is nested?

This would make DataFrame replacement with a nested mapping {col: {key: value}} usable wherever a loop could be used, while raising the error in every place it is necessary.
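The overlap condition discussed above could be checked with something like the following sketch. The helper `overlapping_keys` is hypothetical and not part of pandas. It only shows what "keys and values are overlapping" means for a flat mapping.

```python
def overlapping_keys(mapping):
    """Return the keys that also appear as replacement values."""
    return set(mapping) & set(mapping.values())


flat = {'1': '2', '2': '3', '3': '4'}
safe = {'a': 'x', 'b': 'y'}

print(sorted(overlapping_keys(flat)))  # ['2', '3']
print(overlapping_keys(safe))          # set()
```

A non-empty result marks a mapping where chained and single-pass replacement can give different answers.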

prcastro on 16 Oct 2017

I don't know if I'm hitting the same bug, but replace corrupts a specific DataFrame I tried it against, running Pandas 0.21 on Python 3.6, Linux. To wit:

In [51]: tmp.iloc[15771]
Out[51]: ''

In [52]: tmp.replace('').iloc[15771]
Out[52]: '[email protected]'

however,

In [55]: tmp.replace([''], [None]).iloc[15771] is None
Out[55]: True

I have no idea how [email protected], which is the value of another row, appeared here. This result is repeated for several rows.

petroswork on 9 Dec 2017

Looks like this is fixed on master. Could use a test.

In [88]: pd.Series(['1','2','3']).replace({'1' : '2', '2' : '3', '3' : '4'})
Out[88]:
0    2
1    3
2    4
dtype: object

mroeschke on 27 Oct 2019
