Pandas: `sort=False` option to stack/unstack/pivot

Created on 11 Jan 2017  路  8Comments  路  Source: pandas-dev/pandas

It would be really nice if there was a sort=False option on stack/unstack and pivot. (Preferably the default)

It is reasonably common to have data in non-standard order that actually provides information (in my case, I have model names, and the order of the names denotes complexity of the models). Stacking or unstacking currently loses all of this information, with no way to retrieve it. That does not seem like a sensible default to me.

It would be relatively easy to work around a non-sorted stack/unstack method (using .sort_index). To go the other way is less trivial, requiring the user to store a list of the values in the necessary order.

I actually find it hard to think of a situation where a sort on unstack would be more useful...

API Design Bug MultiIndex Reshaping

All 8 comments

can you show an example. These are ordered by the index and not sorted.

Indeed, as @jreback points out we don't sort when stacking or unstacking. Rather, levels are sorted internally in a MultiIndex when a MultiIndex is constructed (e.g., with from_arrays or from_product).

This is a confusing implementation detail that leaks into the public API. The levels of a MultiIndex only look sorted if you look at the low-level MultiIndex repr, not the values in a series/dataframe:

In [4]: index = pd.MultiIndex.from_arrays([['b', 'a'], [1, 0]])

In [5]: index
Out[5]:
MultiIndex(levels=[['a', 'b'], [0, 1]],
           labels=[[1, 0], [1, 0]])

In [6]: index.to_series()
Out[6]:
b  1    (b, 1)
a  0    (a, 0)
dtype: object

See https://github.com/pandas-dev/pandas/issues/14903 and https://github.com/pandas-dev/pandas/issues/14672 for related discussion.

I see a few alternatives for cleaning this up:

  1. Add a sort_levels=False argument to from_arrays and from_product, to allow not sorting levels at MultiIndex construction time. The downside is that this means that some MultiIndex indexing operations will be slow, which defeats part of the purpose of a MultiIndex, but it could still be nice to have the option.
  2. Add a sort=False or reorder=False option (maybe make this the default?) to stack/unstack/pivot, which would ensure that unstacked columns appear in order of appearance in the MultiIndex rather than sorted order (as you suggest here).
  3. Alternatively, we could do nothing and encourage using Categorical dtype to preserve level order, e.g., use pd.Categorical(['b', 'a'], categories=['b', 'a']) rather than just ['b', 'a']. But this is pretty cumbersome and doesn't make the default behavior any more intuitive.

2 - that would basically imply that the stack/unstack/pivot operation would record the order of index elements of the input, and re-order the output based on that, is that correct? That would be fine, I think.

2 - that would basically imply that the stack/unstack/pivot operation would record the order of index elements of the input, and re-order the output based on that, is that correct? That would be fine, I think.

Correct, yes. One downside of this approach is that it is slightly slower to construct the new labels. It requires a pass over the full index using pd.unique, rather than just using the unique set of labels.

So this is quite straightforward to provide categorical orderings.
This currently has a small bug, see #15058, but does work (even though indexing generally requires lexsorting, this is compatible). So This is a reasonable soln.

In [31]: index = pd.MultiIndex.from_tuples(list(zip(['a', 'a', 'a', 'b', 'b', 'b'],
    ...:                                       [0, 0, 0, 1, 1, 1],
    ...:                                       ['x', 'xx', 'xxx', 'x', 'xx', 'xxx'])),
    ...:                                       names=['A', 'B', 'C'])
    ...: df = pd.DataFrame(np.random.rand(6, 3), index = index)
    ...: 
    ...: 

In [32]: df
Out[32]: 
                0         1         2
A B C                                
a 0 x    0.179126  0.320740  0.048524
    xx   0.448321  0.755726  0.685202
    xxx  0.058184  0.984778  0.432928
b 1 x    0.279140  0.876715  0.249856
    xx   0.185775  0.403420  0.729933
    xxx  0.133858  0.652050  0.960392

In [33]: df = df.reset_index()

In [34]: df = df.assign(C=df.C.astype('category', categories=['xxx', 'xx', 'x']))

In [35]: df
Out[35]: 
   A  B    C         0         1         2
0  a  0    x  0.179126  0.320740  0.048524
1  a  0   xx  0.448321  0.755726  0.685202
2  a  0  xxx  0.058184  0.984778  0.432928
3  b  1    x  0.279140  0.876715  0.249856
4  b  1   xx  0.185775  0.403420  0.729933
5  b  1  xxx  0.133858  0.652050  0.960392

In [36]: df.dtypes
Out[36]: 
A      object
B       int64
C    category
0     float64
1     float64
2     float64
dtype: object

In [37]: df.sort_values(['A', 'B', 'C'])
Out[37]: 
   A  B    C         0         1         2
2  a  0  xxx  0.058184  0.984778  0.432928
1  a  0   xx  0.448321  0.755726  0.685202
0  a  0    x  0.179126  0.320740  0.048524
5  b  1  xxx  0.133858  0.652050  0.960392
4  b  1   xx  0.185775  0.403420  0.729933
3  b  1    x  0.279140  0.876715  0.249856

The point is that stack/unstack really sorts the the index. Let's continue with the session by @jreback .

In [7]: df = df.sort_values(['A', 'B', 'C'], ascending=False) 

In [8]: df    
Out[8]: 
   A  B    C         0         1         2
3  b  1    x  0.674635  0.607875  0.356187
4  b  1   xx  0.062986  0.675858  0.895372
5  b  1  xxx  0.974950  0.448647  0.178389
0  a  0    x  0.683399  0.652522  0.549847
1  a  0   xx  0.168813  0.891502  0.385490
2  a  0  xxx  0.881862  0.143344  0.867280

In [9]: df = df.set_index(['A', 'B', 'C'])  

In [10]: df  
Out[10]: 
                0         1         2
A B C                                
b 1 x    0.674635  0.607875  0.356187
    xx   0.062986  0.675858  0.895372
    xxx  0.974950  0.448647  0.178389
a 0 x    0.683399  0.652522  0.549847
    xx   0.168813  0.891502  0.385490
    xxx  0.881862  0.143344  0.867280

In [11]: df.index     
Out[11]: 
MultiIndex(levels=[['a', 'b'], [0, 1], ['xxx', 'xx', 'x']],
           codes=[[1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0], [2, 1, 0, 2, 1, 0]],
           names=['A', 'B', 'C'])

Now, this is the bug mentioned above. Never mind this, let's continue.

In [12]: df = df.unstack()  

In [13]: df    
Out[13]: 
            0                             1                             2                    
C         xxx        xx         x       xxx        xx         x       xxx        xx         x
A B                                                                                          
a 0  0.881862  0.168813  0.683399  0.143344  0.891502  0.652522  0.867280  0.385490  0.549847
b 1  0.974950  0.062986  0.674635  0.448647  0.675858  0.607875  0.178389  0.895372  0.356187

In [14]: df.index  
Out[14]: 
MultiIndex(levels=[['a', 'b'], [0, 1]],
           codes=[[0, 1], [0, 1]],
           names=['A', 'B'])

Now the data frame is sorted by the remaining levels in the index!

This is undocumented behavior. The documentation says only: "The level involved will automatically get sorted."

Also requested in multiple SO questions.

This is quite aggravating with column multiindex. It doesn't seem like there is a work around.

Firstly,there doesn't seem to be an easy way to convert an existing multindex into a categorical one. The best I could come up with is to select a df row, reset_index on the resultant series, and manually create a categorical multi-index index out of all the relevant columns, and then re-assign that to the original. This works, but it would be nice if there was an easy way to do something like df.columns['blah'] = pd.CategoricalIndex(df.columns['blah'], ordered=True) or something.

Secondly, even with a categorical column multi index, the categories are completely ignored. For example, I would like to stack the 'year' level of the columns, and retain the order of the other levels, but the Categoricalness gets lost in transit:

In [1]: final_df.columns                                                                                  
Out[1]: 
MultiIndex([('value',       'sum_risk_cost', 2020, 'national'),
            ('value',       'sum_risk_cost', 2100, 'national'),
            ('value',   'avg_risk_fraction', 2020, 'national'),
            ('value',   'avg_risk_fraction', 2100, 'national'),
            ('value',   'count_uninsurable', 2020, 'national'),
            ('value',   'count_uninsurable', 2100, 'national'),
            ('value', 'percent_uninsurable', 2020, 'national'),
            ('value', 'percent_uninsurable', 2100, 'national'),
            ( 'rank',       'sum_risk_cost', 2020, 'national'),
            ( 'rank',       'sum_risk_cost', 2020,  'nat_10k'),
            ( 'rank',       'sum_risk_cost', 2020,    'state'),
            ( 'rank',       'sum_risk_cost', 2100, 'national'),
            ( 'rank',       'sum_risk_cost', 2100,  'nat_10k'),
            ( 'rank',       'sum_risk_cost', 2100,    'state'),
            ( 'rank',   'avg_risk_fraction', 2020, 'national'),
            ( 'rank',   'avg_risk_fraction', 2020,  'nat_10k'),
            ( 'rank',   'avg_risk_fraction', 2020,    'state'),
            ( 'rank',   'avg_risk_fraction', 2100, 'national'),
            ( 'rank',   'avg_risk_fraction', 2100,  'nat_10k'),
            ( 'rank',   'avg_risk_fraction', 2100,    'state'),
            ( 'rank',   'count_uninsurable', 2020, 'national'),
            ( 'rank',   'count_uninsurable', 2020,  'nat_10k'),
            ( 'rank',   'count_uninsurable', 2020,    'state'),
            ( 'rank',   'count_uninsurable', 2100, 'national'),
            ( 'rank',   'count_uninsurable', 2100,  'nat_10k'),
            ( 'rank',   'count_uninsurable', 2100,    'state'),
            ( 'rank', 'percent_uninsurable', 2020, 'national'),
            ( 'rank', 'percent_uninsurable', 2020,  'nat_10k'),
            ( 'rank', 'percent_uninsurable', 2020,    'state'),
            ( 'rank', 'percent_uninsurable', 2100, 'national'),
            ( 'rank', 'percent_uninsurable', 2100,  'nat_10k'),
            ( 'rank', 'percent_uninsurable', 2100,    'state')],
           names=['type', 'stat', 'year', 'subset'])

In [2]: final_df.columns.get_level_values('type')                                                         
Out[2]: 
CategoricalIndex(['value', 'value', 'value', 'value', 'value', 'value',
                  'value', 'value', 'rank', 'rank', 'rank', 'rank', 'rank',
                  'rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank',
                  'rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank',
                  'rank', 'rank', 'rank', 'rank', 'rank'],
                 categories=['rank', 'value'], ordered=True, name='type', dtype='category')

In [3]: final_df = final_df.stack('year').sort_index()                                                    

In [4]: final_df.columns.get_level_values('type')                                                         
Out[4]: 
Index(['rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank', 'rank',
       'rank', 'rank', 'rank', 'value', 'value', 'value', 'value'],
      dtype='object', name='type')

It also doesn't seem easy to manually store the column orders and re-use them afterwards, die to the missing 'year' level.

Anyone have a suggested work-around for this?

Was this page helpful?
0 / 5 - 0 ratings