Pandas: REGR: setitem with integer slices on Int/RangeIndex is broken (label instead of positional)

Created on 30 Jan 2020 · 15 Comments · Source: pandas-dev/pandas

There's a backward-incompatible change in pandas 1.0 that I didn't find in the changelog. I might have just overlooked it though.

import numpy as np
import pandas as pd
X = pd.DataFrame(np.zeros((100, 1)))
X[-4:] = 1
X

In pandas 0.25.3 or lower, this sets the last four entries of X to 1 and leaves all the others at zero. In pandas 1.0, it sets every entry of X to 1.
I assume the change is in whether the slice indexes axis 0 or axis 1?

Indexing Regression

amueller

All 15 comments

I wonder if it's related to #31449 but I'm not using a multi-index.

amueller on 30 Jan 2020

Thanks for the report.

Seems this doesn't affect .iloc:

In [26]: import numpy as np
    ...: import pandas as pd
    ...: X = pd.DataFrame(np.zeros((5, 1)))
    ...: X.iloc[-4:] = 1
    ...: X
Out[26]: 
     0
0  0.0
1  1.0
2  1.0
3  1.0
4  1.0

will look into it

MarcoGorelli on 30 Jan 2020

you are label indexing with a slice with loc
since none of the labels exist nothing is set

did this actually work previously?

this should never have worked with .loc

it might have with [] which has fallback integer indexing

jreback on 30 Jan 2020

I do not remember any specific discussion about this, so I think it is definitely a regression.

Slicing rows in [] has always worked positionally if there is an integer index (surprising, yes, but longstanding behaviour; see e.g. my summary of this from 5 years ago: https://github.com/pandas-dev/pandas/issues/9595)

jorisvandenbossche on 30 Jan 2020

since none of the labels exist nothing is set

@jreback if I've understood correctly, the issue is that _everything_ is being set

>>> import numpy as np
>>> import pandas as pd

>>> X = pd.DataFrame(np.zeros((5, 1)))
>>> X                                                                     
     0
0  0.0
1  0.0
2  0.0
3  0.0
4  0.0

>>> X[-4:]  # only prints the last 4 rows of X...
     0
1  0.0
2  0.0
3  0.0
4  0.0

>>> X[-4:] = 1
>>> X  # ...but everything (including the first row) has now been set
     0
0  1.0
1  1.0
2  1.0
3  1.0
4  1.0

MarcoGorelli on 30 Jan 2020

I do not remember any specific discussion about this, so I think it is definitely a regression.

Slicing rows in [] has always worked positionally if there is an integer index (surprising, yes, but longstanding behaviour; see e.g. my summary of this from 5 years ago: #9595)

maybe but indexing with an out of range label on both sides should return nothing

so the results are correct

jreback on 30 Jan 2020

since none of the labels exist nothing is set

@jreback if I've understood correctly, the issue is that _everything_ is being set

ahh ok that is not correct; i would expect this indexer to return nothing

jreback on 30 Jan 2020

might be https://github.com/pandas-dev/pandas/pull/31393

jreback on 30 Jan 2020

ahh ok that is not correct; i would expect this indexer to return nothing

Asking for the shape, both in 0.25 and 1.0, you get

>>> X[-4:].shape
(4, 1)

but assignment in version 1.0 assigns to everything.

amueller on 30 Jan 2020

maybe but indexing with an out of range label on both sides should return nothing

This is about positional indexing, so there is no "out of range label". The -4 means start from the fourth last element to the end.

Again, I agree this is surprising behaviour. You would think it is label-based indexing, but it is not. I already described this 5 years ago in #9595.

Some examples to illustrate this:

In [21]: df = pd.DataFrame({'a': [0., 1., 2., 3.]}, index=[2, 3, 4, 5])

In [22]: df 
Out[22]: 
     a
2  0.0
3  1.0
4  2.0
5  3.0

In [23]: df[2:] 
Out[23]: 
     a
4  2.0
5  3.0

In [24]: df[:3]  
Out[24]: 
     a
2  0.0
3  1.0
4  2.0

Those examples are for __getitem__, and they clearly work positionally if you look at the index of the results (on both 0.25 and 1.0, and for both Int64Index and RangeIndex).
So it is __setitem__ that is broken in 1.0.0.
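
For reference, here is a small sketch that puts the three spellings side by side on the same frame. It only illustrates the positional-versus-label distinction described above; the variable names are made up for this example, and the `.iloc` assignment at the end is just one explicit way to write a positional update, not a statement about what the fix will look like.

```python
import pandas as pd

# Same frame as in the examples above: integer labels that do not start at 0.
df = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0]}, index=[2, 3, 4, 5])

# Explicit positional slice: skips the first two rows, whatever their labels are.
iloc_labels = list(df.iloc[2:].index)

# Explicit label slice: starts at the row labelled 2, which is the first row.
loc_labels = list(df.loc[2:].index)

# Plain [] slice: per the discussion above, this is expected to match .iloc.
getitem_labels = list(df[2:].index)

print("iloc:", iloc_labels)
print("loc: ", loc_labels)
print("[]:  ", getitem_labels)

# Writing through .iloc makes the positional intent explicit for assignment too.
df.iloc[2:] = -1.0
print(df["a"].tolist())
```

Using `.iloc` for the assignment sidesteps the question of how `[]` interprets an integer slice.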

jorisvandenbossche on 31 Jan 2020

👍1

This is caused by https://github.com/pandas-dev/pandas/pull/27383 I think (cc @jbrockmendel ), specifically:

     def _setitem_slice(self, key, value):
         self._check_setitem_copy()
-        self.loc._setitem_with_indexer(key, value)
+        self.loc[key] = value
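
My reading of why this changes the result (an interpretation of the diff, not something confirmed in the PR): once the slice goes through `self.loc[key]`, `-4` is treated as a label rather than a position. On a sorted integer index every label is greater than `-4`, so the label slice covers all rows. A minimal sketch of the difference, using only public `.loc` / `.iloc`:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(np.zeros((5, 1)))

# Label-based: -4 is a label here. No such label exists, but the index is
# sorted, so the slice selects every label >= -4, i.e. all five rows.
label_rows = X.loc[-4:]

# Positional: -4 counts from the end, so only the last four rows are selected.
positional_rows = X.iloc[-4:]

print(label_rows.shape, positional_rows.shape)

# The same difference shows up on assignment.
via_loc = X.copy()
via_loc.loc[-4:] = 1      # every row is overwritten

via_iloc = X.copy()
via_iloc.iloc[-4:] = 1    # only the last four rows are overwritten

print(via_loc[0].tolist())
print(via_iloc[0].tolist())
```

That matches the symptom in the report: reading with `X[-4:]` gives four rows, while the assignment touches all of them.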

jorisvandenbossche on 31 Jan 2020

👍1

Thanks for investigating @jorisvandenbossche

amueller on 31 Jan 2020

BTW, I think this is a rather serious regression, since it doesn't give an error, but rather silently modifies/corrupts your data, and thus can silently lead to wrong results. We should probably try to do a 1.0.1 quickly.
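
Since nothing raises, a check along these lines could be used to detect whether an installed version is affected. This is only a sketch: `setitem_slice_is_positional` is a hypothetical helper name, not an existing pandas function or test, and it assumes that `.iloc` assignment is the correct reference behaviour (as shown earlier in this thread).

```python
import numpy as np
import pandas as pd


def setitem_slice_is_positional(n_rows: int = 100, tail: int = 4) -> bool:
    """Hypothetical check: does `frame[-tail:] = 1` agree with `.iloc`?"""
    via_getitem = pd.DataFrame(np.zeros((n_rows, 1)))
    via_iloc = via_getitem.copy()

    # The assignment under test: integer slice through plain [].
    via_getitem[-tail:] = 1

    # Reference: explicit positional assignment.
    via_iloc.iloc[-tail:] = 1

    return via_getitem.equals(via_iloc)


print(pd.__version__, setitem_slice_is_positional())
```

Based on the report above, this should come out False on 1.0.0 and True on 0.25.x.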

jorisvandenbossche on 31 Jan 2020

Agreed. I won't be able to this weekend, but perhaps Monday?

I'm hoping to fix up a bunch of the reported regressions today.