import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'],
                           ['Green', 'Red', 'Green']])
df.loc[['b', 'a'], :]  # does not swap
# df.reindex(index=['b', 'a'], level=0)  # this works
df.loc[['b', 'a'], :] does not swap the rows 'a' and 'b', and the equivalent call on a Series does not swap them either.
The output should be the same as the one obtained via df.reindex(index=['b', 'a'], level=0).
Output of pd.show_versions():
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.3
pytest: 3.2.1
pip: 10.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.14.0
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
pandas_gbq: None
pandas_datareader: 0.7.0
I don't think we make guarantees about the order of returned values from a .loc operation, so I am inclined to say this is not a bug, but let's see what others say.
For a single-indexed table, we can reindex it using .loc[] instead of .reindex():
df = pd.DataFrame([[1,2,3],[4,5,6]], index=['a','b'])
df.loc[['b','a']]
Out[9]:
   0  1  2
b  4  5  6
a  1  2  3
In his book "Python for Data Analysis" (2nd ed.), Wes McKinney says "you can reindex more succinctly by label-indexing with loc, and many users prefer to use it", although he used only a single-index table for illustration.
This may not be a bug. However, it is natural to expect the same effect for a multi-index table.
I personally think that this inconsistency between single- and multi-index dataframes is dangerous.
In Python, a list not only represents a collection of objects but also defines an ordering of those objects. Respecting this ordering for a single index but not for a multi-index is non-intuitive.
The inconsistency is certainly not emphasized enough in the documentation. I didn't come across a corresponding warning in the article on advanced indexing, nor is there any note in the documentation of .loc.
By the way: SO brought me here, see my related SO question.
I got caught by this too: using .loc[mylist] on a multi-index dataframe did not preserve the order of _mylist_, and I realized it much later than I should have.
It is surely strange that the order is preserved for a single index but not for a multi-index.
Even if it doesn't qualify as a bug, would it be possible to add a warning?
It would be raised only when .loc[key] is used on a multi-index structure and _key_ is an iterable, just to warn that the order might not be preserved (and perhaps to recommend .reindex(index=key, level=0) instead).
I tried to locate where in the code the warning should be added, without success. :(
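Until such a warning exists, the recommendation above can be wrapped in a small helper. This is only a sketch: select_level0_in_order is a hypothetical name, not a pandas function, and it relies solely on the reindex(index=..., level=0) call already mentioned in this thread.

```python
import numpy as np
import pandas as pd


def select_level0_in_order(frame, keys):
    """Hypothetical helper: reorder rows so that the first index level
    follows the order of `keys` (uses reindex rather than .loc)."""
    return frame.reindex(index=list(keys), level=0)


df = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)

ordered = select_level0_in_order(df, ["b", "a"])
# First-level labels of the result, in row order
level0_order = list(ordered.index.get_level_values(0))
print(level0_order)
```

The helper sidesteps .loc entirely, so its result does not depend on how a given pandas version orders list-based .loc selections on a MultiIndex.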
It's a bug. Not sure where though, so investigation would be welcome.
So, after investigation (thanks, pdb), here is what happens: we obtain the indexers to return in the order the keys are given, but the way those indexers are stitched together reorders them.
More precisely, the culprit seems to be the __or__ operator in pandas/core/indexes/multi.py:3043, which sorts the results.
I managed to get the expected result for the given example by using union(..., sort=False) instead, but there is still work to do before the following works, where multiple levels are to be ordered.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'],
                           ['Green', 'Red', 'Green']])
df.loc[(['b', 'a'], [2, 1]), :]
# out
     Ohio     Colorado
    Green Red    Green
b 1     6   7        8
  2     9  10       11
a 1     0   1        2
  2     3   4        5
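The effect of sorting while combining indexers can be illustrated outside pandas internals. The snippet below is not the code from multi.py; it is a minimal sketch using the public Index.union method on two made-up integer indexers (idx_b and idx_a are hypothetical names), and it assumes a pandas version whose Index.union accepts the sort argument.

```python
import pandas as pd

# Positional indexers as they might be collected for keys ['b', 'a']:
# rows 2 and 3 belong to 'b', rows 0 and 1 belong to 'a'.
idx_b = pd.Index([2, 3])
idx_a = pd.Index([0, 1])

# Default union sorts the combined values, losing the requested key order.
sorted_union = idx_b.union(idx_a)

# sort=False keeps the values of the left operand first, then the new ones.
ordered_union = idx_b.union(idx_a, sort=False)

print(list(sorted_union))
print(list(ordered_union))
```

With sort=False the 'b' positions stay ahead of the 'a' positions, which matches the order in which the keys were given.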
While working on this issue, I ran into the following problem: if we want to maintain the order given for each level, how should we treat slice(None)? An example follows:
df = pd.DataFrame(
np.arange(12).reshape((4, 3)),
index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)
df.loc[(slice(None), [2,1]), :]
# Actual output, same as df: the order of the slice(None) level takes absolute precedence
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11
# Option 1: the second level takes priority over the first level
     Ohio     Colorado
    Green Red    Green
a 2     3   4        5
b 2     9  10       11
a 1     0   1        2
b 1     6   7        8
# Option 2: keep the order of the first level, then apply the order of the second level
     Ohio     Colorado
    Green Red    Green
a 2     3   4        5
  1     0   1        2
b 2     9  10       11
  1     6   7        8
Ideas are welcome.
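For comparison, both candidate orderings can already be spelled out explicitly by the user. The following is a sketch, not a proposal for how .loc should behave: it builds the target row order with MultiIndex.from_product and passes it to reindex, and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)

first_level = ["a", "b"]  # order that slice(None) yields for level 0
second_level = [2, 1]     # order requested for level 1

# Second level takes priority: iterate level 1 on the outside,
# then swap the levels back into their original positions.
target_second_first = pd.MultiIndex.from_product(
    [second_level, first_level]
).swaplevel(0, 1)
by_second = df.reindex(target_second_first)

# First level keeps its order, second level is ordered within each group.
target_first_then_second = pd.MultiIndex.from_product([first_level, second_level])
by_first = df.reindex(target_first_then_second)

print(by_second)
print(by_first)
```

Writing the target index out this way makes the intended precedence between levels explicit, which is exactly the ambiguity that a tuple of lists and slices leaves open.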