import pandas as pd
index = pd.MultiIndex(levels=[['A','B','C','D','E']] * 2, labels=[[4,4,4,3], [4,2,0,1]])
pd.Series(0, index).unstack()
In Pandas 0.20, 0.21, and 0.22, this gave the expected result:
A B C E
D NaN 0.0 NaN NaN
E 0.0 NaN 0.0 0.0
But in Pandas 0.23, the result is not sorted:
E C A B
E 0.0 0.0 0.0 NaN
D NaN NaN NaN 0.0
The documentation says "The level involved will automatically get sorted", and while I've seen the explanation of confusing implementation details leaking out in #15105 and some other outright bugs in #9514, this seems to be a different bug, and a regression.
pd.show_versions()commit: None
python: 3.5.4.final.0
python-bits: 64
OS: Linux
machine: x86_64
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.21.1
numpy: 1.13.1
As mentioned in the first linked comment there's not any explicit ordering going on with an unstack operation and I think it's preferable NOT to do that (i.e. previous behavior was wrong). You as the end user can easily sort after the fact if you wanted to, but wouldn't as easily be able to maintain the original ordering if this operation implicitly sorted for you.
I certainly see where the documentation is misleading on that though - any interest in submitting a PR to update the docs and add a test case to ensure this behavior doesn't regress going forward?
@WillAyd, #15105 asks for a new sort option to be added to unstack. The documented, specified, and actual behaviors have for many years been that it sorts the index. Suddenly in Pandas 0.23, the behavior changed, without a FutureWarning, without a documentation change, and without a mention in #15105, which suggests the change was an accident. Significant functional changes should be made deliberately, with discussion, not slipped in without notice and then documented a few releases later.
I suggest that the longstanding and clearly documented behavior (sorted unstack) should be restored, and then #15105 can continue to explore new ideas such as adding an option to not sort. If the default behavior is to be changed, a FutureWarning could be used to help users transition.
You have marked this as a Docs issue. But it is a regression in Pandas 0.23, and a functional bug in the code, not a cosmetic one in the docs.
Hi, @jzwinck @WillAyd
I tried to dig into this issue and realized the following:
1) actual behaviour of unstack() was not to sort, it was to keep order consistent with declaration of levels, e.g. pandas 0.22.0
import pandas as pd
index = pd.MultiIndex(levels=[['B','A','C','D','E']] * 2, labels=[[4,4,4,3], [4,2,0,1]])
pd.Series(0, index).unstack()
will result in
B A C E
D NaN 0.0 NaN NaN
E 0.0 NaN 0.0 0.0
2) in the previous version empty columns were removed in the _Unstacker.get_result()
# filter out missing levels
if values.shape[1] > 0:
col_inds, obs_ids = compress_group_index(self.sorted_labels[-1])
# rare case, level values not observed
if len(obs_ids) < self.full_shape[1]:
inds = (value_mask.sum(0) > 0).nonzero()[0]
values = algos.take_nd(values, inds, axis=1)
columns = columns[inds]
columns order has been kept consistent, but in pandas 0.23.0 empty columns removing is a part of _Unstacker.__init___()
self.index = index.remove_unused_levels()
and here is the issue: this method will place rows and columns in the order of provided labels, e.g.
import pandas as pd
index = pd.MultiIndex(levels=[['A','B','C','D','E']] * 2, labels=[[4,4,3,4], [0,1,2,4]])
s = pd.Series(0, index).unstack()
will result in
A B C E
E 0.0 0.0 NaN 0.0
D NaN NaN 0.0 NaN
rows are not consistent any more because labels[0] starts from 4, but columns looks sorted because labels[1] is sorted.
I think that this behaviour is really a regression issue and should be fixed as a bug. Initially this code was introduced in #18460 as a part of bug fix solution. I would like to work on this issue, and I'll appreciate any recommendations. As of now I see the following options:
1) To rollback from using remove_unused_levels(), looks less efficient but more safe, will require to find another way to fix solved bugs.
2) To update remove_unused_levels(), could lead to behavior changes in other cases, however reshape.py is the only file in pandas lib using this method.
@jreback Would you mind weighing in on @deisdenis's question just above? Should we rollback from using remove_unused_levels() to fix this regression, or change its implementation?
a lot has changed since 0.23
try this on master and see
@jreback The behaviour is still the same as 0.23 on master
Hi @deisdenis, I have also looked into this issue. It seems that the function descriptor of remove_unused_levels specifies that the multiIndex order needs to be preserved so no sorting should be done in this function. Maybe an alternative is to add a sort argument for unstack. Are you still interested in this issue? What do you think?
Hi @jiangyue12392, sounds cool, I like your idea! I'll dig into it.
@deisdenis Did you make any progress?
Hey @jzwinck , I'm sorry, but I don't think we need to complicate unstack with an additional parameter after all. I think it's better to keep the initial order and use df.sort_index() explicitly afterward. This approach is also covered by the existing documentation.
Most helpful comment
@WillAyd, #15105 asks for a new
sortoption to be added to unstack. The documented, specified, and actual behaviors have for many years been that it sorts the index. Suddenly in Pandas 0.23, the behavior changed, without a FutureWarning, without a documentation change, and without a mention in #15105, which suggests the change was an accident. Significant functional changes should be made deliberately, with discussion, not slipped in without notice and then documented a few releases later.I suggest that the longstanding and clearly documented behavior (sorted unstack) should be restored, and then #15105 can continue to explore new ideas such as adding an option to not sort. If the default behavior is to be changed, a FutureWarning could be used to help users transition.
You have marked this as a Docs issue. But it is a regression in Pandas 0.23, and a functional bug in the code, not a cosmetic one in the docs.