Pandas: BUG: unstack() does not always sort index in 0.23

Created on 29 Jun 2018 · 10Comments · Source: pandas-dev/pandas

Code Sample

import pandas as pd
index = pd.MultiIndex(levels=[['A','B','C','D','E']] * 2, labels=[[4,4,4,3], [4,2,0,1]])
pd.Series(0, index).unstack()

Problem description

In Pandas 0.20, 0.21, and 0.22, this gave the expected result:

     A    B    C    E
D  NaN  0.0  NaN  NaN
E  0.0  NaN  0.0  0.0

But in Pandas 0.23, the result is not sorted:

     E    C    A    B
E  0.0  0.0  0.0  NaN
D  NaN  NaN  NaN  0.0

The documentation says "The level involved will automatically get sorted", and while I've seen the explanation of confusing implementation details leaking out in #15105 and some other outright bugs in #9514, this seems to be a different bug, and a regression.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.4.final.0
python-bits: 64
OS: Linux
machine: x86_64
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.1
numpy: 1.13.1

Docs Testing good first issue

Source

jzwinck

Most helpful comment

@WillAyd, #15105 asks for a new sort option to be added to unstack. The documented, specified, and actual behaviors have for many years been that it sorts the index. Suddenly in Pandas 0.23, the behavior changed, without a FutureWarning, without a documentation change, and without a mention in #15105, which suggests the change was an accident. Significant functional changes should be made deliberately, with discussion, not slipped in without notice and then documented a few releases later.

I suggest that the longstanding and clearly documented behavior (sorted unstack) should be restored, and then #15105 can continue to explore new ideas such as adding an option to not sort. If the default behavior is to be changed, a FutureWarning could be used to help users transition.

You have marked this as a Docs issue. But it is a regression in Pandas 0.23, and a functional bug in the code, not a cosmetic one in the docs.

jzwinck on 30 Jun 2018

👍2

All 10 comments

As mentioned in the first linked comment there's not any explicit ordering going on with an unstack operation and I think it's preferable NOT to do that (i.e. previous behavior was wrong). You as the end user can easily sort after the fact if you wanted to, but wouldn't as easily be able to maintain the original ordering if this operation implicitly sorted for you.

I certainly see where the documentation is misleading on that though - any interest in submitting a PR to update the docs and add a test case to ensure this behavior doesn't regress going forward?

WillAyd on 30 Jun 2018

You have marked this as a Docs issue. But it is a regression in Pandas 0.23, and a functional bug in the code, not a cosmetic one in the docs.

jzwinck on 30 Jun 2018

👍2

Hi, @jzwinck @WillAyd
I tried to dig into this issue and realized the following:
1) actual behaviour of unstack() was not to sort, it was to keep order consistent with declaration of levels, e.g. pandas 0.22.0

import pandas as pd
index = pd.MultiIndex(levels=[['B','A','C','D','E']] * 2, labels=[[4,4,4,3], [4,2,0,1]])
pd.Series(0, index).unstack()

will result in

     B    A    C    E
D  NaN  0.0  NaN  NaN
E  0.0  NaN  0.0  0.0

2) in the previous version empty columns were removed in the _Unstacker.get_result()

        # filter out missing levels
        if values.shape[1] > 0:
            col_inds, obs_ids = compress_group_index(self.sorted_labels[-1])
            # rare case, level values not observed
            if len(obs_ids) < self.full_shape[1]:
                inds = (value_mask.sum(0) > 0).nonzero()[0]
                values = algos.take_nd(values, inds, axis=1)
                columns = columns[inds]

columns order has been kept consistent, but in pandas 0.23.0 empty columns removing is a part of _Unstacker.__init___()

self.index = index.remove_unused_levels()

and here is the issue: this method will place rows and columns in the order of provided labels, e.g.

import pandas as pd
index = pd.MultiIndex(levels=[['A','B','C','D','E']] * 2, labels=[[4,4,3,4], [0,1,2,4]])
s = pd.Series(0, index).unstack()

will result in

     A    B    C    E
E  0.0  0.0  NaN  0.0
D  NaN  NaN  0.0  NaN

rows are not consistent any more because labels[0] starts from 4, but columns looks sorted because labels[1] is sorted.

I think that this behaviour is really a regression issue and should be fixed as a bug. Initially this code was introduced in #18460 as a part of bug fix solution. I would like to work on this issue, and I'll appreciate any recommendations. As of now I see the following options:
1) To rollback from using remove_unused_levels(), looks less efficient but more safe, will require to find another way to fix solved bugs.
2) To update remove_unused_levels(), could lead to behavior changes in other cases, however reshape.py is the only file in pandas lib using this method.

deisdenis on 3 Aug 2018

@jreback Would you mind weighing in on @deisdenis's question just above? Should we rollback from using remove_unused_levels() to fix this regression, or change its implementation?

jzwinck on 30 May 2019

a lot has changed since 0.23
try this on master and see

jreback on 30 May 2019

@jreback The behaviour is still the same as 0.23 on master

jiangyue12392 on 26 Jun 2019

Hi @deisdenis, I have also looked into this issue. It seems that the function descriptor of remove_unused_levels specifies that the multiIndex order needs to be preserved so no sorting should be done in this function. Maybe an alternative is to add a sort argument for unstack. Are you still interested in this issue? What do you think?

jiangyue12392 on 27 Jun 2019

👍1

Hi @jiangyue12392, sounds cool, I like your idea! I'll dig into it.

deisdenis on 28 Jun 2019

@deisdenis Did you make any progress?

jzwinck on 28 Apr 2020

Hey @jzwinck , I'm sorry, but I don't think we need to complicate unstack with an additional parameter after all. I think it's better to keep the initial order and use df.sort_index() explicitly afterward. This approach is also covered by the existing documentation.

deisdenis on 9 May 2020

Was this page helpful?

0 / 5 - 0 ratings