Pandas: REGR: __setitem__ with integer slices on Int/RangeIndex is broken (label instead of positional)

Created on 30 Jan 2020  ·  15 Comments  ·  Source: pandas-dev/pandas

There's a backward-incompatible change in pandas 1.0 that I couldn't find in the changelog. I might have just overlooked it, though.

import numpy as np
import pandas as pd
X = pd.DataFrame(np.zeros((100, 1)))
X[-4:] = 1
X

In pandas 0.25.3 or lower, this sets the last four entries of X to 1 and leaves all the others at zero. In pandas 1.0, it sets all entries of X to 1.
Is this perhaps a change in whether axis 0 or axis 1 is indexed?
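A minimal sketch of how the reported behaviour could be checked on any installed version. It compares the `[]` assignment from the report against the same assignment done with `.iloc`, which is explicitly positional. The helper name `slice_setitem_is_positional` is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd


def slice_setitem_is_positional() -> bool:
    """Return True if X[-4:] = 1 touches only the last four rows.

    Hypothetical helper for illustration; not part of pandas.
    """
    X = pd.DataFrame(np.zeros((100, 1)))
    X[-4:] = 1  # the assignment from the report

    # Build the expected result with .iloc, which is explicitly positional.
    expected = pd.DataFrame(np.zeros((100, 1)))
    expected.iloc[-4:] = 1

    return X.equals(expected)


print(pd.__version__, slice_setitem_is_positional())
```

Going by the report, this should print True on 0.25.3 and False on 1.0.0.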

Indexing Regression

All 15 comments

I wonder if it's related to #31449 but I'm not using a multi-index.

Thanks for the report.

Seems this doesn't affect .iloc:

In [26]: import numpy as np
    ...: import pandas as pd
    ...: X = pd.DataFrame(np.zeros((5, 1)))
    ...: X.iloc[-4:] = 1
    ...: X
Out[26]: 
     0
0  0.0
1  1.0
2  1.0
3  1.0
4  1.0

will look into it

you are label indexing with a slice with loc
since none of the labels exist nothing is set

did this actually work previously?

this should never have worked with .loc

it might have with [] which has fallback integer indexing

I do not remember any specific discussion about this, so I think it is definitely a regression.

Slicing rows in [] has always worked positionally if there is an integer index (surprising, yes, but longstanding behaviour; see e.g. my summary of this from 5 years ago: https://github.com/pandas-dev/pandas/issues/9595)

> since none of the labels exist nothing is set

@jreback if I've understood correctly, the issue is that _everything_ is being set

>>> import numpy as np
>>> import pandas as pd

>>> X = pd.DataFrame(np.zeros((5, 1)))
>>> X                                                                     
     0
0  0.0
1  0.0
2  0.0
3  0.0
4  0.0

>>> X[-4:]  # only prints the last 4 rows of X...
     0
1  0.0
2  0.0
3  0.0
4  0.0

>>> X[-4:] = 1
>>> X  # ...but everything (including the first row) has now been set
     0
0  1.0
1  1.0
2  1.0
3  1.0
4  1.0

maybe but indexing with an out of range label on both sides should return nothing

so the results are correct

ahh ok that is not correct; i would expect this indexer to return nothing

Asking for the shape, both in 0.25 and 1.0, you get

>>> X[-4:].shape
(4, 1)

but assignment in version 1.0 assigns to everything.

> maybe but indexing with an out of range label on both sides should return nothing

This is about positional indexing, so there is no "out of range label". The -4 means start from the fourth last element to the end.

Again, I agree this is surprising behaviour. You would think it is label-based indexing, but it is not. I already described this 5 years ago in #9595.

Some examples to illustrate this:

In [21]: df = pd.DataFrame({'a': [0., 1., 2., 3.]}, index=[2, 3, 4, 5])

In [22]: df 
Out[22]: 
     a
2  0.0
3  1.0
4  2.0
5  3.0

In [23]: df[2:] 
Out[23]: 
     a
4  2.0
5  3.0

In [24]: df[:3]  
Out[24]: 
     a
2  0.0
3  1.0
4  2.0

Those examples are for __getitem__, and they clearly work positionally if you look at the index of the results (on both 0.25 and 1.0, and for both Int64Index and RangeIndex).
So it is __setitem__ that is broken in 1.0.0.
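To make the contrast explicit, here is a small sketch on the same frame that puts the plain `[]` slice next to `.iloc` (positional) and `.loc` (label-based). The variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0]}, index=[2, 3, 4, 5])

by_brackets = df[2:]       # plain [] with an integer slice
by_position = df.iloc[2:]  # explicitly positional: rows at positions 2 and 3
by_label = df.loc[2:]      # explicitly label-based: labels 2 through 5

print(list(by_brackets.index))  # matches .iloc, as in Out[23] above
print(list(by_position.index))  # [4, 5]
print(list(by_label.index))     # [2, 3, 4, 5]

# Label-based slices include the end label; positional slices exclude the end.
print(list(df.iloc[:3].index))  # [2, 3, 4]
print(list(df.loc[:3].index))   # [2, 3]
```

The `[]` results line up with `.iloc`, not `.loc`, which is the positional behaviour that __setitem__ is expected to share.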

This is caused by https://github.com/pandas-dev/pandas/pull/27383 I think (cc @jbrockmendel ), specifically:

     def _setitem_slice(self, key, value):
         self._check_setitem_copy()
-        self.loc._setitem_with_indexer(key, value)
+        self.loc[key] = value
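A sketch of why routing the slice through `.loc[key] = value` would set every row, assuming the key is then interpreted as labels: -4 is not a label in the index, but on a sorted index a label slice simply starts at the first label that is not smaller than -4, which is label 0.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(np.zeros((5, 1)))  # default RangeIndex with labels 0..4

# Positional: -4 counts from the end, so this selects 4 rows.
print(len(X.iloc[-4:]))  # 4

# Label-based: -4 is not a label, but the index is sorted, so the slice
# starts at the first label >= -4, which is label 0. All 5 rows match.
print(len(X.loc[-4:]))  # 5

# The same label-based slice used for assignment therefore hits every row.
Y = X.copy()
Y.loc[-4:] = 1
print(Y[0].tolist())  # [1.0, 1.0, 1.0, 1.0, 1.0]
```

This reproduces the "everything is set" result from the report using `.loc` directly, which is consistent with the changed line being the cause.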

Thanks for investigating @jorisvandenbossche

BTW, I think this is a rather serious regression, since it doesn't give an error, but rather silently modifies/corrupts your data, and thus can silently lead to wrong results. We should probably try to do a 1.0.1 quickly.
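Until a fix is released, one way to avoid depending on the `[]` behaviour is to assign through `.iloc`, which was reported above as unaffected, and to check the result explicitly. The helper `set_last_rows` below is a hypothetical illustration, not a pandas API.

```python
import numpy as np
import pandas as pd


def set_last_rows(df: pd.DataFrame, n: int, value) -> None:
    """Assign `value` to the last `n` rows of `df` in place, by position.

    Hypothetical helper for illustration; not part of pandas.
    """
    if n <= 0:
        return  # nothing to set; avoids df.iloc[-0:], which selects every row
    df.iloc[-n:] = value


X = pd.DataFrame(np.zeros((100, 1)))
set_last_rows(X, 4, 1)

# Check the write landed where intended instead of trusting it silently.
assert (X.iloc[:-4] == 0).all().all()
assert (X.iloc[-4:] == 1).all().all()
```

The explicit assertions are the point here: since the regression raises no error, a check like this is what turns silent corruption into a visible failure.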

Agreed. I won't be able to get to it this weekend, but perhaps Monday?

I'm hoping to fix up a bunch of the reported regressions today.

I'll start a branch reverting the lines @jorisvandenbossche identified and open a PR after confirming that fixes this.

After this is fixed for 1.0.1, we should discuss deprecating the surprising behavior.
