Pandas: MultiIndex row indexing with .loc fails with a tuple but works with a list of indices

Created on 15 Jul 2017 · 14 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import pandas as pd

data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}

df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])
desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2)) # the rows to be extracted

print(df)

Out[3]: 
              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2
    1002 1        9
2   1001 1        3
    1002 2        4

Problem description

Extracting the desired rows with loc fails here: it returns only a single row instead of the three requested:

In [5]: df.loc[desired_rows, :]
Out[5]: 
              Value
ID1 ID2  ID3       
1   1001 2        2

Expected Output

One solution would be to convert the tuple to a list internally, because a list of indices works correctly:

In [6]: df.loc[list(desired_rows), :]
Out[6]: 
              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2
2   1002 2        4

Another solution is to raise an error if a tuple of indices is provided as the row indexer of loc, in order to prevent unexpected results.
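
As a user-side workaround in the meantime, the conversion can be done before calling loc. The following is only a minimal sketch: `as_key_list` is a hypothetical helper name, not part of pandas, and it assumes every element of the input is a complete key (one value per index level).

```python
import pandas as pd

data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}
df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])


def as_key_list(keys):
    """Hypothetical helper: turn any iterable of complete keys into a list of tuples."""
    return [tuple(key) for key in keys]


desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2))

# A list of tuples is read as "several complete MultiIndex keys".
selected = df.loc[as_key_list(desired_rows), :]
print(selected)
```

Passing the outer tuple directly is deliberately avoided here, since how it is interpreted is exactly what this issue is about.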

Output of `pd.show_versions()`

In [8]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.8.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Indexing MultiIndex

mansenfranzen

Most helpful comment

I could add a statement that tuples are needed in the case of multiple indexers on a multiindex:

Great idea! I think such a statement should actually go at the beginning of http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-indexing-with-hierarchical-index

You could maybe start by stating that MultiIndex keys take the form of tuples, then you could swap the first two examples currently provided (move the _complete indexing_ one first), then introduce partial indexing, and mention that when doing partial indexing on the first level, you are allowed to pass only the first element of the tuple ('bar' stands for ('bar',)). Finally, I think a warning box _could_ then clarify that (for the reasons above), tuples and lists are not equivalent in pandas, and in particular, tuples should not be used as lists of keys (for MultiIndexes, and not only).

You _might_ want to show examples of the fact that lists of tuples in general refer to multiple complete (MultiIndex) keys, while tuples of lists in general refer to multiple values on each, that is something like

In [2]: s = pd.Series(-1, index=pd.MultiIndex.from_product([[1, 2], [3, 4]]))

In [3]: s.loc[[(1, 3), (2, 4)]]
Out[3]: 
1  3   -1
2  4   -1
dtype: int64

In [4]: s.loc[([1, 2], [3, 4])]
Out[4]: 
1  3   -1
   4   -1
2  3   -1
   4   -1
dtype: int64

Aside from the possible docs improvements: yes, in some cases we interpret tuples as lists, but I think it should be seen as an undesired implementation legacy. Vice versa, I see no harm (in general; caveats clearly can apply to specific cases) in interpreting generators, dicts or other list-likes as lists.

toobaz on 1 Feb 2018

All 14 comments

One solution would be to convert the tuple to a list

I suspect this would break other tests, where tuples have different meanings than lists for slicing. I may be wrong though, if you want to give it a shot.

TomAugspurger on 15 Jul 2017

this technically is not covered by the doc-string

- A list or array of labels, e.g. ['a', 'b', 'c'].

but we almost always accept array-likes (which include tuples). The reason this is slightly confusing is that a non-nested tuple is also valid as a single indexer.

jreback on 15 Jul 2017

I came across a slightly related issue: using a multi-index dataframe, why can I only use a tuple as an indexer and not a list (i.e. why do they give different results)?

Using the example data, if I want to pull out rows where ID1=1 and ID2=1001, I can only use a tuple inside loc:

df.loc[(1, 1001)]

This returns the desired slice:

ID3       
1        1
2        2

I can't use a list:

df.loc[[1, 1001]]

This seems to imply that I want values 1 and 1001 for the first level of the index only:

ID1 ID2  ID3       
1   1001 1        1
         2        2
    1002 1        9

It took me quite some time to figure this out. Is this intended behavior? If yes, is this documented (I thought it should be mentioned here but didn't find anything)?

cbrnr on 1 Feb 2018

@cbrnr Yes, that is intended behaviour. For single "labels" of a MultiIndex (so one value for each level), we always use tuples and not a list, because it would otherwise be difficult to distinguish. I think for this case we are quite consistent within pandas.
It is the other way around (in a case where we want a list-like, do we accept a tuple?) that there can be more discussion. Typically we allow tuples as list-likes, but exactly for the reason above (tuples are used to indicate labels of a MI) we might not want to do that in the case of the original issue here.

So your assessment is correct: it tries to look for those values of the list in the first index level. You could interpret the list as "give me the combination of indexing the dataframe with each element of the list", so df.loc[1] and df.loc[1001] -> in both cases you select rows based on the first index level.

jorisvandenbossche on 1 Feb 2018
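
The distinction described above can be sketched on the example data from the issue. This is an illustrative sketch, not output copied from the thread; the list uses labels 1 and 2 (both present in ID1) rather than 1 and 1001, because how missing labels in a list are handled may differ between pandas versions.

```python
import pandas as pd

data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}
df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])

# A tuple is ONE (partial) key: ID1 == 1 and ID2 == 1001.
by_tuple = df.loc[(1, 1001), :]

# A list is SEVERAL labels, all looked up on the first level (ID1).
by_list = df.loc[[1, 2], :]

print(by_tuple)
print(by_list)
```

Here `by_tuple` holds the two rows under (1, 1001), while `by_list` holds every row whose ID1 is 1 or 2, which is the whole frame in this example.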

For the original issue: given the possible confusion between the two, I think it might be better in this case to not interpret the tuple as a list-like.
But, in that case, shouldn't it raise an error? As if we interpret ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2)) as a single label, it should not find it?

cc @toobaz interesting case :-)

jorisvandenbossche on 1 Feb 2018

Thanks @jorisvandenbossche, this makes sense! I usually don't distinguish between lists and tuples in plain Python since they are both list-like objects. So this Pandas behavior tripped me up a bit - is this documented clearly somewhere?

cbrnr on 1 Feb 2018

But, in that case, shouldn't it raise an error? As if we interpret ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2)) as a single label, it should not find it?

Ah, no, I suppose this is wrong. It seems that it does interpret ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2)) as a list, but not as a list of labels, but as a list of lists (a list of indexers into one level).

So it is indexing as:

In [21]: df.loc[pd.IndexSlice[[1, 1001, 1], [1, 1001, 2], [2, 1002, 2]], :]
Out[21]: 
              Value
ID1 ID2  ID3       
1   1001 2        2

To make it a bit more confusing: the actual list of lists (df.loc[[[1, 1001, 1], [1, 1001, 2], [2, 1002, 2]], :]) does not work in this case but raises an error that "'[1, 1001, 1]' is an invalid key". So the list of lists is interpreted as a list of tuples (a list of labels).

jorisvandenbossche on 1 Feb 2018

I usually don't distinguish between lists and tuples in plain Python since they are both list-like objects. So this Pandas behavior tripped me up a bit - is this documented clearly somewhere?

Yes, this is one of the gotchas due to the complexity of MultiIndexing: we somehow need to distinguish between the two.
And the documentation can certainly be better about those things. But in general this is also an area where we would need more extensive testing of the different cases, and then better documentation of those cases (e.g. see my comment above; even for me it is difficult to really predict how something will be interpreted in certain cases).

jorisvandenbossche on 1 Feb 2018

I could add a statement that tuples are needed in the case of multiple indexers on a multiindex: http://pandas.pydata.org/pandas-docs/stable/advanced.html#using-slicers. Not a warning box, but maybe there's a note or an info box? Or is there a better place to put such a note? Let me know and I can take care of that in a PR.

cbrnr on 1 Feb 2018

Ah, so we are actually using tuples there in the docs :-) So I just might have had the wrong assumption that a list would work (regarding the last part of my comment above, https://github.com/pandas-dev/pandas/issues/16943#issuecomment-362250780).
But yes, adding a note there that those multiple indexers need to be contained in a tuple is a good idea (and using IndexSlice makes it even more explicit).

jorisvandenbossche on 1 Feb 2018
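
For illustration, per-level indexers wrapped with pd.IndexSlice might look like the sketch below on the example data from the issue. The particular selection (ID1 in {1, 2}, ID2 equal to 1002, any ID3) is chosen only for this example.

```python
import pandas as pd

data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}
df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])

idx = pd.IndexSlice

# One indexer per level, held together in a tuple built by IndexSlice:
# ID1 in [1, 2], ID2 in [1002], every ID3.
subset = df.loc[idx[[1, 2], [1002], :], :]
print(subset)
```

`idx[[1, 2], [1002], :]` is simply the tuple `([1, 2], [1002], slice(None))`, so the same selection can be written with a plain tuple; IndexSlice only makes the per-level intent easier to read.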

See #19507

cbrnr on 2 Feb 2018

I think this is fixed by #19507. Anyone feel free to reopen if you disagree.

toobaz on 3 Aug 2018

@toobaz

It is almost bizarre how much you helped me with getting a dataframe by multiindex. Thank you!