Pandas: BUG: inconsistent replace

Created on 22 Jul 2020 · 21Comments · Source: pandas-dev/pandas

[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandas.
[x] (optional) I have confirmed this bug exists on the master branch of pandas.

Problem description

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  1  5.0
1  2  2.0

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1, 5)
   0    1
0  5  5.0
1  2  2.0

Problem description

Maybe I don't understand somethink or this is just non-sens

Expected Output

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  1  5.0
1  2  2.0

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1, 5)
   0    1
0  5  1.0
1  2  2.0

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  5  5.0
1  2  2.0

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1, 5)
   0    1
0  5  5.0
1  2  2.0

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-62-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.5
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.2.0
Cython : None
pytest : 5.4.3
hypothesis : None
sphinx : 3.1.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : 0.6.2
lxml.etree : None
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : 5.4.3
pyxlsb : None
s3fs : None
scipy : 1.5.1
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.50.1

Bug Regression replace

Source

qlieumontadv

All 21 comments

Can confirm this occurs on master as well.

erfannariman on 22 Jul 2020

I'm guessing this has something to do with downcasting? floats can be downcasted to ints which is why the replace using floats works for both cases but for ints it only replaces ints

asishm on 22 Jul 2020

Thanks @asishm for the report. This definitely looks like a bug. from https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.replace.html

numeric: numeric values equal to to_replace will be replaced with value

so the definition of equal needs clarification.

but if we consider pandas' notion of equals, just like python, 1 == 1.0

>>> pd.DataFrame([[1, 1.0], [2, 2.0]]) == 1
       0      1
0   True   True
1  False  False
>>>
>>> pd.DataFrame([[1, 1.0], [2, 2.0]]) == 1.0
       0      1
0   True   True
1  False  False
>>>

I would therefore expect, for consistency, the result to be

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  5  5.0
1  2  2.0

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1, 5)
   0    1
0  5  5.0
1  2  2.0

simonjayhawkins on 22 Jul 2020

0.25.3 was giving the expected output, so marking as regression for now pending further invesitigation on whether the change was intentional

>>> pd.__version__
'0.25.3'
>>>
>>> pd.DataFrame([[1, 1.0], [2, 2.0]]).replace(1.0, 5)
   0    1
0  5  5.0
1  2  2.0
>>>
>>> pd.DataFrame([[1, 1.0], [2, 2.0]]).replace(1, 5)
   0    1
0  5  5.0
1  2  2.0
>>>

simonjayhawkins on 22 Jul 2020

👍1

the behaviour changed with #27768, so not intentional.

01f90c187f0eec0e8178371d7c066e600c9e105b is the first bad commit
commit 01f90c187f0eec0e8178371d7c066e600c9e105b
Author: jbrockmendel jbrockmendel@gmail.com
Date: Mon Aug 12 11:58:42 2019 -0700

CLN: short-circuit case in Block.replace (#27768)

cc @jbrockmendel

simonjayhawkins on 22 Jul 2020

😕1

After a few checks, I've find that this lines are at the bug source (from @simonjayhawkins ) :

if not isinstance(to_replace, list):
    if inplace:
        return [self]
    return [self.copy()]

With this lines, I have :

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  1  5.0
1  2  2.0

Without this lines, I have :

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  5  5.0
1  2  2.0

qlieumontadv on 17 Aug 2020

In fact with the 4 lines, I can force changes by placing 1.0 between brackets :

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace([1.0], 5)
   0    1
0  5  5.0
1  2  2.0

qlieumontadv on 17 Aug 2020

Edit: qlieumontadv was my company account, I will do a PR with my personal account (this one :)

QuentinN42 on 9 Sep 2020

@jbrockmendel can we remove this lines without affecting the results ?

if not isinstance(to_replace, list):
    if inplace:
        return [self]
    return [self.copy()]

I have the impression that it allows to increase performance by not testing some cases as explained in the comment that was deleted during this commit :

# TODO: we should be able to infer at this point that there is
#  nothing to replace

QuentinN42 on 9 Sep 2020

can we remove this lines without affecting the results ?

The line before the ones you quoted is if not self._can_hold_element(to_replace):, so we should only get here if the array doesn't contain to_replace. If we are getting here with the example from the OP, that suggests a problem in _can_hold_element

jbrockmendel on 9 Sep 2020

👀1

After some tests, I may have found the problem:
In the pandas/core/internals/blocks.py file, in the IntBlock class, if the maybe_infer_dtype_type function returns None (which is normal), it will return is_integer(element).
By replacing it with is_integer(element) or is_float(element) this case is covered.

I'm not very satisfied with this fix, I feel it's more of a DIY fix.
May I open a PR ?

QuentinN42 on 17 Sep 2020

Here is the changes I made :
is_float added to IntBlock._can_hold_element gist

QuentinN42 on 17 Sep 2020

By replacing it with is_integer(element) or is_float(element) this case is covered.

is_float(element) is a start, but don't we only want to include floats that are int-like? so (is_float(element) and element.is_integer()). to be even more careful we should check that casting to the target dtype is lossless

jbrockmendel on 18 Sep 2020

👍1

I've create a PR for this.
I've NOT run any tests or code style checks !

QuentinN42 on 18 Sep 2020

It might be usefull to add some unit tests for this.

QuentinN42 on 18 Sep 2020

It might be usefull to add some unit tests for this.

This is required as part of any PR. use the example in the OP as a test.

simonjayhawkins on 18 Sep 2020

Sorry but what is an OP ? :disappointed:

QuentinN42 on 18 Sep 2020

it normally means 'original poster'. I actually meant the first post. so use the code sample in https://github.com/pandas-dev/pandas/issues/35376#issue-663620223

simonjayhawkins on 18 Sep 2020

Ok, thx :smile:

QuentinN42 on 18 Sep 2020

I'm working on the test for the PR and I want to put # GH xxx in comment, is it the issue or the PR number ?
Aka # GH 35376 or # GH 36444 ?

QuentinN42 on 18 Sep 2020

the issue number, i.e. 35376

simonjayhawkins on 18 Sep 2020

👍1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Incompatibility between pandas.infer_freq and pandas.to_timedelta

idanivanov · 3Comments

Suffixes ignored on second merge

MatzeB · 3Comments

read_csv(filename_with_asian_locale) failed in python 3.6 for windows

mfmain · 3Comments

ValueError plotting bar plot from DataFrame with existing Axes

swails · 3Comments

Dataframe creation: Specifying dtypes with a dictionary

amelio-vazquez-reina · 3Comments

Pandas: BUG: inconsistent replace

Problem description

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

All 21 comments

Related issues

Output of `pd.show_versions()`