Pandas: loc() does not swap two rows in multi-index pandas dataframe

Created on 21 Sep 2018 · 7 Comments · Source: pandas-dev/pandas

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'],
    ['Green', 'Red', 'Green']])

df.loc[['b','a'],:]  # does not swap

# df.reindex(index=['b', 'a'], level=0) # This works

Problem description

df.loc[['b','a'],:] does not swap the rows 'a' and 'b', and the same happens for a Series.

Expected Output

The output should be the same as the one obtained via df.reindex(index=['b', 'a'], level=0)
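
A minimal sketch of the expected result, built with the reindex call named above. The variable names `expected`, `level0` and `via_loc` are illustrative and not part of the report. Whether `.loc` returns the same order depends on the installed pandas version, so its result is printed rather than asserted.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)

# Reorder the outer index level explicitly; this is the output the report expects.
expected = df.reindex(index=["b", "a"], level=0)
level0 = list(expected.index.get_level_values(0))
print(level0)

# The order returned by .loc is what this issue is about; compare it on your version.
via_loc = df.loc[["b", "a"], :]
print(list(via_loc.index.get_level_values(0)))
```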

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: 3.2.1
pip: 10.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.14.0
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
pandas_gbq: None
pandas_datareader: 0.7.0

Indexing MultiIndex

Source

jungwu

Most helpful comment

For a single-indexed table, we can reindex using .loc[] instead of .reindex():

df = pd.DataFrame([[1,2,3],[4,5,6]], index=['a','b'])

df.loc[['b','a']]

Out[9]: 
   0  1  2
b  4  5  6
a  1  2  3

In his book "Python for Data Analysis" (2nd edition), Wes McKinney says "you can reindex more succinctly by label-indexing with loc, and many users prefer to use it", although he used only a single-index table for illustration.

This may not be a bug. However, it is natural to expect the same effect for a multi-index table.

jungwu on 22 Sep 2018

👍2
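
For the single-index case described in this comment, the equivalence between the two calls can be checked directly. This is a small sketch using the same two-row frame; the names `by_loc` and `by_reindex` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=["a", "b"])

# Label-indexing with a list returns the rows in the order of the list.
by_loc = df.loc[["b", "a"]]
# reindex with the same labels gives the same rows in the same order.
by_reindex = df.reindex(["b", "a"])

print(by_loc.equals(by_reindex))
```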

All 7 comments

I don't think we make guarantees about the order of returned values from a .loc operation, so I am inclined to say this is not a bug, but let's see what others say.

WillAyd on 21 Sep 2018

I personally think that this inconsistency between single- and multi-index dataframes is dangerous.

In Python, a list not only represents a collection of objects, but also sets an ordering of these objects. Not respecting this ordering under certain conditions (single- vs. multi-index) is non-intuitive.

The inconsistency is certainly not emphasized enough in the documentation. I didn't come across a corresponding warning in the article on advanced indexing, nor is there any note in the documentation of .loc.

By the way: SO brought me here, see my related SO question.

normanius on 30 Dec 2018

I got mixed up too: using .loc[mylist] on a multi-index dataframe did not preserve the order of _mylist_, and I realized it much later than I should have.

It is surely strange that the order is preserved for a single index but not for a multi-index.

Even if it doesn't qualify as a bug, would it be possible to add a warning? It would be raised only when .loc[key] is used on a multi-index structure and _key_ is an iterable, just to warn that the order might not be preserved (and perhaps to recommend using .reindex(index=key, level=0) instead).

I tried to locate where in the code the warning should be added, without success. :(

pilou-K75VJ on 23 Jul 2019

👍1
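
Until the behaviour is settled, the order can be made explicit in user code. The sketch below is one possible workaround, not a pandas API: `select_in_key_order` is a hypothetical helper that selects each outer-level key separately and concatenates the pieces in the order requested.

```python
import pandas as pd


def select_in_key_order(frame, keys):
    """Hypothetical helper: rows for each outer-level key, in the order of `keys`."""
    # frame.loc[[key]] keeps the full MultiIndex for the rows of that key.
    return pd.concat([frame.loc[[key]] for key in keys])


df = pd.DataFrame(
    {"value": [0, 1, 2, 3]},
    index=pd.MultiIndex.from_product([["a", "b"], [1, 2]]),
)

result = select_in_key_order(df, ["b", "a"])
print(result)
```

The .reindex(index=key, level=0) call suggested in the comment is shorter; the helper only spells out the intended order step by step.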

It's a bug. Not sure where though, so investigation would be welcome.

TomAugspurger on 30 Jul 2019

So, after investigation (thanks pdb), what happens is that we obtain the indexes to return in the order the keys are given, but the way the indexes are stitched together reorders them.

More precisely, the culprit seems to be the __or__ operator in pandas/core/indexes/multi.py:3043, which sorts the results.

I managed to get the expected result for the given example by using union(..., sort=False) instead, but there is still something to do so that the following works when multiple levels are to be ordered.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'],
    ['Green', 'Red', 'Green']])

df.loc[(['b','a'],[2, 1]),:]

# out
     Ohio     Colorado
    Green Red    Green
b 1     6   7        8
  2     9  10       11
a 1     0   1        2
  2     3   4        5

nrebena on 29 Sep 2019
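
The effect of sorting while stitching positions together can be reproduced on plain Index objects. This is an illustration of the idea only, not the code in pandas/core/indexes/multi.py; the names `outer`, `pos_a` and `pos_b` are made up for the example.

```python
import numpy as np
import pandas as pd

# Outer level of the example frame, one entry per row.
outer = pd.Index(["a", "a", "b", "b"])

# Row positions for each requested key, collected in key order: 'b' first.
pos_b = pd.Index(np.flatnonzero(outer == "b"))
pos_a = pd.Index(np.flatnonzero(outer == "a"))

# Default union sorts the combined positions, so the key order is lost.
sorted_union = pos_b.union(pos_a)
# sort=False keeps the positions of 'b' ahead of those of 'a'.
ordered_union = pos_b.union(pos_a, sort=False)

print(list(sorted_union), list(ordered_union))
```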

So, while working on this issue I ran into the following problem: if we want to maintain the order given for each level, how should we treat slice(None)? An example follows:

df = pd.DataFrame(
      np.arange(12).reshape((4, 3)),
      index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
      columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
      )

df.loc[(slice(None), [2,1]), :]
# actual output, same as df: the order of the slice(None) level takes absolute precedence
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11

# Second level takes priority over first level
     Ohio     Colorado
    Green Red    Green
a 2     3   4        5
b 2     9  10       11
a 1     0   1        2
b 1     6   7        8

# or
# Keep order of first level, then second level
     Ohio     Colorado
    Green Red    Green
a 2     3   4        5
  1     0   1        2
b 2     9  10       11
  1     6   7        8

Ideas are welcome.

nrebena on 16 Dec 2019
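
The two candidate orderings above can be produced today by reindexing against an explicitly built MultiIndex. This is a sketch for comparing the options, assuming the same example frame; the variable names are illustrative and this is not a proposal for how .loc should be implemented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)

# "Keep order of first level, then second level": (a,2), (a,1), (b,2), (b,1)
first_then_second = pd.MultiIndex.from_product([["a", "b"], [2, 1]])

# "Second level takes priority": (a,2), (b,2), (a,1), (b,1)
second_then_first = pd.MultiIndex.from_product([[2, 1], ["a", "b"]]).swaplevel(0, 1)

# Reindexing returns the rows in the order of the target index.
opt_first = df.reindex(index=first_then_second)
opt_second = df.reindex(index=second_then_first)

print(opt_first)
print(opt_second)
```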
