Pandas: BUG: inconsistent replace

Created on 22 Jul 2020  路  21Comments  路  Source: pandas-dev/pandas

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [x] (optional) I have confirmed this bug exists on the master branch of pandas.


Problem description

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  1  5.0
1  2  2.0

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1, 5)
   0    1
0  5  5.0
1  2  2.0

Problem description

Maybe I don't understand somethink or this is just non-sens

Expected Output

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  1  5.0
1  2  2.0

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1, 5)
   0    1
0  5  1.0
1  2  2.0

Or

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  5  5.0
1  2  2.0

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1, 5)
   0    1
0  5  5.0
1  2  2.0

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-62-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.5
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.2.0
Cython : None
pytest : 5.4.3
hypothesis : None
sphinx : 3.1.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : 0.6.2
lxml.etree : None
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : 5.4.3
pyxlsb : None
s3fs : None
scipy : 1.5.1
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.50.1

Bug Regression replace

All 21 comments

Can confirm this occurs on master as well.

I'm guessing this has something to do with downcasting? floats can be downcasted to ints which is why the replace using floats works for both cases but for ints it only replaces ints

Thanks @asishm for the report. This definitely looks like a bug. from https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.replace.html

numeric: numeric values equal to to_replace will be replaced with value

so the definition of equal needs clarification.

but if we consider pandas' notion of equals, just like python, 1 == 1.0

>>> pd.DataFrame([[1, 1.0], [2, 2.0]]) == 1
       0      1
0   True   True
1  False  False
>>>
>>> pd.DataFrame([[1, 1.0], [2, 2.0]]) == 1.0
       0      1
0   True   True
1  False  False
>>>

I would therefore expect, for consistency, the result to be

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  5  5.0
1  2  2.0

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1, 5)
   0    1
0  5  5.0
1  2  2.0

0.25.3 was giving the expected output, so marking as regression for now pending further invesitigation on whether the change was intentional

>>> pd.__version__
'0.25.3'
>>>
>>> pd.DataFrame([[1, 1.0], [2, 2.0]]).replace(1.0, 5)
   0    1
0  5  5.0
1  2  2.0
>>>
>>> pd.DataFrame([[1, 1.0], [2, 2.0]]).replace(1, 5)
   0    1
0  5  5.0
1  2  2.0
>>>

the behaviour changed with #27768, so not intentional.

01f90c187f0eec0e8178371d7c066e600c9e105b is the first bad commit
commit 01f90c187f0eec0e8178371d7c066e600c9e105b
Author: jbrockmendel jbrockmendel@gmail.com
Date: Mon Aug 12 11:58:42 2019 -0700

CLN: short-circuit case in Block.replace (#27768)

cc @jbrockmendel

After a few checks, I've find that this lines are at the bug source (from @simonjayhawkins ) :

if not isinstance(to_replace, list):
    if inplace:
        return [self]
    return [self.copy()]

With this lines, I have :

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  1  5.0
1  2  2.0

Without this lines, I have :

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  5  5.0
1  2  2.0

In fact with the 4 lines, I can force changes by placing 1.0 between brackets :

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace([1.0], 5)
   0    1
0  5  5.0
1  2  2.0

Edit: qlieumontadv was my company account, I will do a PR with my personal account (this one :)

@jbrockmendel can we remove this lines without affecting the results ?

if not isinstance(to_replace, list):
    if inplace:
        return [self]
    return [self.copy()]

I have the impression that it allows to increase performance by not testing some cases as explained in the comment that was deleted during this commit :

# TODO: we should be able to infer at this point that there is
#  nothing to replace

can we remove this lines without affecting the results ?

The line before the ones you quoted is if not self._can_hold_element(to_replace):, so we should only get here if the array doesn't contain to_replace. If we are getting here with the example from the OP, that suggests a problem in _can_hold_element

After some tests, I may have found the problem:
In the pandas/core/internals/blocks.py file, in the IntBlock class, if the maybe_infer_dtype_type function returns None (which is normal), it will return is_integer(element).
By replacing it with is_integer(element) or is_float(element) this case is covered.

I'm not very satisfied with this fix, I feel it's more of a DIY fix.
May I open a PR ?

By replacing it with is_integer(element) or is_float(element) this case is covered.

is_float(element) is a start, but don't we only want to include floats that are int-like? so (is_float(element) and element.is_integer()). to be even more careful we should check that casting to the target dtype is lossless

I've create a PR for this.
I've NOT run any tests or code style checks !

It might be usefull to add some unit tests for this.

It might be usefull to add some unit tests for this.

This is required as part of any PR. use the example in the OP as a test.

Sorry but what is an OP ? :disappointed:

it normally means 'original poster'. I actually meant the first post. so use the code sample in https://github.com/pandas-dev/pandas/issues/35376#issue-663620223

Ok, thx :smile:

I'm working on the test for the PR and I want to put # GH xxx in comment, is it the issue or the PR number ?
Aka # GH 35376 or # GH 36444 ?

the issue number, i.e. 35376

Was this page helpful?
0 / 5 - 0 ratings