It would be really nice if there was a sort=False option on stack/unstack and pivot. (Preferably the default)
It is reasonably common to have data in non-standard order that actually provides information (in my case, I have model names, and the order of the names denotes complexity of the models). Stacking or unstacking currently loses all of this information, with no way to retrieve it. That does not seem like a sensible default to me.
It would be relatively easy to work around a non-sorted stack/unstack method (using .sort_index). To go the other way is less trivial, requiring the user to store a list of the values in the necessary order.
I actually find it hard to think of a situation where a sort on unstack would be more useful...
can you show an example. These are ordered by the index and not sorted.
Indeed, as @jreback points out we don't sort when stacking or unstacking. Rather, levels are sorted internally in a MultiIndex when a MultiIndex is constructed (e.g., with from_arrays or from_product).
This is a confusing implementation detail that leaks into the public API. The levels of a MultiIndex only look sorted if you look at the low-level MultiIndex repr, not the values in a series/dataframe:
In [4]: index = pd.MultiIndex.from_arrays([['b', 'a'], [1, 0]])
In [5]: index
Out[5]:
MultiIndex(levels=[['a', 'b'], [0, 1]],
labels=[[1, 0], [1, 0]])
In [6]: index.to_series()
Out[6]:
b 1 (b, 1)
a 0 (a, 0)
dtype: object
See https://github.com/pandas-dev/pandas/issues/14903 and https://github.com/pandas-dev/pandas/issues/14672 for related discussion.
I see a few alternatives for cleaning this up:
sort_levels=False argument to from_arrays and from_product, to allow not sorting levels at MultiIndex construction time. The downside is that this means that some MultiIndex indexing operations will be slow, which defeats part of the purpose of a MultiIndex, but it could still be nice to have the option.sort=False or reorder=False option (maybe make this the default?) to stack/unstack/pivot, which would ensure that unstacked columns appear in order of appearance in the MultiIndex rather than sorted order (as you suggest here).Categorical dtype to preserve level order, e.g., use pd.Categorical(['b', 'a'], categories=['b', 'a']) rather than just ['b', 'a']. But this is pretty cumbersome and doesn't make the default behavior any more intuitive.2 - that would basically imply that the stack/unstack/pivot operation would record the order of index elements of the input, and re-order the output based on that, is that correct? That would be fine, I think.
2 - that would basically imply that the stack/unstack/pivot operation would record the order of index elements of the input, and re-order the output based on that, is that correct? That would be fine, I think.
Correct, yes. One downside of this approach is that it is slightly slower to construct the new labels. It requires a pass over the full index using pd.unique, rather than just using the unique set of labels.
There's a good example of this here, I think: https://stackoverflow.com/questions/28686053/pandas-pivot-table-reoganize-order-of-multi-index
So this is quite straightforward to provide categorical orderings.
This currently has a small bug, see #15058, but does work (even though indexing generally requires lexsorting, this is compatible). So This is a reasonable soln.
In [31]: index = pd.MultiIndex.from_tuples(list(zip(['a', 'a', 'a', 'b', 'b', 'b'],
...: [0, 0, 0, 1, 1, 1],
...: ['x', 'xx', 'xxx', 'x', 'xx', 'xxx'])),
...: names=['A', 'B', 'C'])
...: df = pd.DataFrame(np.random.rand(6, 3), index = index)
...:
...:
In [32]: df
Out[32]:
0 1 2
A B C
a 0 x 0.179126 0.320740 0.048524
xx 0.448321 0.755726 0.685202
xxx 0.058184 0.984778 0.432928
b 1 x 0.279140 0.876715 0.249856
xx 0.185775 0.403420 0.729933
xxx 0.133858 0.652050 0.960392
In [33]: df = df.reset_index()
In [34]: df = df.assign(C=df.C.astype('category', categories=['xxx', 'xx', 'x']))
In [35]: df
Out[35]:
A B C 0 1 2
0 a 0 x 0.179126 0.320740 0.048524
1 a 0 xx 0.448321 0.755726 0.685202
2 a 0 xxx 0.058184 0.984778 0.432928
3 b 1 x 0.279140 0.876715 0.249856
4 b 1 xx 0.185775 0.403420 0.729933
5 b 1 xxx 0.133858 0.652050 0.960392
In [36]: df.dtypes
Out[36]:
A object
B int64
C category
0 float64
1 float64
2 float64
dtype: object
In [37]: df.sort_values(['A', 'B', 'C'])
Out[37]:
A B C 0 1 2
2 a 0 xxx 0.058184 0.984778 0.432928
1 a 0 xx 0.448321 0.755726 0.685202
0 a 0 x 0.179126 0.320740 0.048524
5 b 1 xxx 0.133858 0.652050 0.960392
4 b 1 xx 0.185775 0.403420 0.729933
3 b 1 x 0.279140 0.876715 0.249856
The point is that stack/unstack really sorts the the index. Let's continue with the session by @jreback .
In [7]: df = df.sort_values(['A', 'B', 'C'], ascending=False)
In [8]: df
Out[8]:
A B C 0 1 2
3 b 1 x 0.674635 0.607875 0.356187
4 b 1 xx 0.062986 0.675858 0.895372
5 b 1 xxx 0.974950 0.448647 0.178389
0 a 0 x 0.683399 0.652522 0.549847
1 a 0 xx 0.168813 0.891502 0.385490
2 a 0 xxx 0.881862 0.143344 0.867280
In [9]: df = df.set_index(['A', 'B', 'C'])
In [10]: df
Out[10]:
0 1 2
A B C
b 1 x 0.674635 0.607875 0.356187
xx 0.062986 0.675858 0.895372
xxx 0.974950 0.448647 0.178389
a 0 x 0.683399 0.652522 0.549847
xx 0.168813 0.891502 0.385490
xxx 0.881862 0.143344 0.867280
In [11]: df.index
Out[11]:
MultiIndex(levels=[['a', 'b'], [0, 1], ['xxx', 'xx', 'x']],
codes=[[1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0], [2, 1, 0, 2, 1, 0]],
names=['A', 'B', 'C'])
Now, this is the bug mentioned above. Never mind this, let's continue.
In [12]: df = df.unstack()
In [13]: df
Out[13]:
0 1 2
C xxx xx x xxx xx x xxx xx x
A B
a 0 0.881862 0.168813 0.683399 0.143344 0.891502 0.652522 0.867280 0.385490 0.549847
b 1 0.974950 0.062986 0.674635 0.448647 0.675858 0.607875 0.178389 0.895372 0.356187
In [14]: df.index
Out[14]:
MultiIndex(levels=[['a', 'b'], [0, 1]],
codes=[[0, 1], [0, 1]],
names=['A', 'B'])
Now the data frame is sorted by the remaining levels in the index!
This is undocumented behavior. The documentation says only: "The level involved will automatically get sorted."
This is quite aggravating with column multiindex. It doesn't seem like there is a work around.
Firstly,there doesn't seem to be an easy way to convert an existing multindex into a categorical one. The best I could come up with is to select a df row, reset_index on the resultant series, and manually create a categorical multi-index index out of all the relevant columns, and then re-assign that to the original. This works, but it would be nice if there was an easy way to do something like df.columns['blah'] = pd.CategoricalIndex(df.columns['blah'], ordered=True) or something.
Secondly, even with a categorical column multi index, the categories are completely ignored. For example, I would like to stack the 'year' level of the columns, and retain the order of the other levels, but the Categoricalness gets lost in transit:
In [1]: final_df.columns
Out[1]:
MultiIndex([('value', 'sum_risk_cost', 2020, 'national'),
('value', 'sum_risk_cost', 2100, 'national'),
('value', 'avg_risk_fraction', 2020, 'national'),
('value', 'avg_risk_fraction', 2100, 'national'),
('value', 'count_uninsurable', 2020, 'national'),
('value', 'count_uninsurable', 2100, 'national'),
('value', 'percent_uninsurable', 2020, 'national'),
('value', 'percent_uninsurable', 2100, 'national'),
( 'rank', 'sum_risk_cost', 2020, 'national'),
( 'rank', 'sum_risk_cost', 2020, 'nat_10k'),
( 'rank', 'sum_risk_cost', 2020, 'state'),
( 'rank', 'sum_risk_cost', 2100, 'national'),
( 'rank', 'sum_risk_cost', 2100, 'nat_10k'),
( 'rank', 'sum_risk_cost', 2100, 'state'),
( 'rank', 'avg_risk_fraction', 2020, 'national'),
( 'rank', 'avg_risk_fraction', 2020, 'nat_10k'),
( 'rank', 'avg_risk_fraction', 2020, 'state'),
( 'rank', 'avg_risk_fraction', 2100, 'national'),
( 'rank', 'avg_risk_fraction', 2100, 'nat_10k'),
( 'rank', 'avg_risk_fraction', 2100, 'state'),
( 'rank', 'count_uninsurable', 2020, 'national'),
( 'rank', 'count_uninsurable', 2020, 'nat_10k'),
( 'rank', 'count_uninsurable', 2020, 'state'),
( 'rank', 'count_uninsurable', 2100, 'national'),
( 'rank', 'count_uninsurable', 2100, 'nat_10k'),
( 'rank', 'count_uninsurable', 2100, 'state'),
( 'rank', 'percent_uninsurable', 2020, 'national'),
( 'rank', 'percent_uninsurable', 2020, 'nat_10k'),
( 'rank', 'percent_uninsurable', 2020, 'state'),
( 'rank', 'percent_uninsurable', 2100, 'national'),
( 'rank', 'percent_uninsurable', 2100, 'nat_10k'),
( 'rank', 'percent_uninsurable', 2100, 'state')],
names=['type', 'stat', 'year', 'subset'])
In [2]: final_df.columns.get_level_values('type')
Out[2]:
CategoricalIndex(['value', 'value', 'value', 'value', 'value', 'value',
'value', 'value', 'rank', 'rank', 'rank', 'rank', 'rank',
'rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank',
'rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank',
'rank', 'rank', 'rank', 'rank', 'rank'],
categories=['rank', 'value'], ordered=True, name='type', dtype='category')
In [3]: final_df = final_df.stack('year').sort_index()
In [4]: final_df.columns.get_level_values('type')
Out[4]:
Index(['rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank',
'rank', 'rank', 'rank', 'value', 'value', 'value', 'value'],
dtype='object', name='type')
It also doesn't seem easy to manually store the column orders and re-use them afterwards, die to the missing 'year' level.
Anyone have a suggested work-around for this?