Pandas: BUG: KeyError on MultiIndex when filtering by only two indexes

Created on 7 Jul 2020 · 7 Comments · Source: pandas-dev/pandas

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [ ] (optional) I have confirmed this bug exists on the master branch of pandas.

Code

```python
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 1, 2], 'col2': ['a', 'a', 'b', 'b'],
                   'col3': ['US', 'US', 'US', 'BR'], 'values': [1.0, 2.0, 3.0, 4.5]})
df.set_index(['col1', 'col2', 'col3'], inplace=True)
df.loc[([1], ['a'])]
```

```python-traceback
Traceback (most recent call last):
File "", line 1, in
File "/home/fabio.rangel/projects/audience_topic_viewership/.direnv/python-3.7.7/lib/python3.7/site-packages/pandas/core/indexing.py", line 1762, in __getitem__
return self._getitem_tuple(key)
File "/home/fabio.rangel/projects/audience_topic_viewership/.direnv/python-3.7.7/lib/python3.7/site-packages/pandas/core/indexing.py", line 1272, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "/home/fabio.rangel/projects/audience_topic_viewership/.direnv/python-3.7.7/lib/python3.7/site-packages/pandas/core/indexing.py", line 1373, in _getitem_lowerdim
return self._getitem_nested_tuple(tup)
File "/home/fabio.rangel/projects/audience_topic_viewership/.direnv/python-3.7.7/lib/python3.7/site-packages/pandas/core/indexing.py", line 1453, in _getitem_nested_tuple
obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
File "/home/fabio.rangel/projects/audience_topic_viewership/.direnv/python-3.7.7/lib/python3.7/site-packages/pandas/core/indexing.py", line 1954, in _getitem_axis
return self._getitem_iterable(key, axis=axis)
File "/home/fabio.rangel/projects/audience_topic_viewership/.direnv/python-3.7.7/lib/python3.7/site-packages/pandas/core/indexing.py", line 1595, in _getitem_iterable
keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
File "/home/fabio.rangel/projects/audience_topic_viewership/.direnv/python-3.7.7/lib/python3.7/site-packages/pandas/core/indexing.py", line 1553, in _get_listlike_indexer
keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
File "/home/fabio.rangel/projects/audience_topic_viewership/.direnv/python-3.7.7/lib/python3.7/site-packages/pandas/core/indexing.py", line 1640, in _validate_read_indexer
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['a'], dtype='object')] are in the [columns]"
```

#### Problem description

The problem occurs only when filtering a MultiIndex with a tuple of exactly two lists. With one, three, or more lists it does not occur. I tested with up to four index levels.

#### Expected Output
```sh
                values
col1 col2 col3        
1    a    US       1.0
```

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.7.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-174-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : pt_BR.UTF-8
LOCALE : pt_BR.UTF-8

pandas : 1.0.5
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.16.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

MultiIndex Usage Question

All 7 comments

Does df.loc[(1, 'a')] work for you?

I'm shocked to find that df.loc[([1], ['a'], ['US'])] successfully gets the first row.

> Does df.loc[(1, 'a')] work for you?
>
> I'm shocked to find that df.loc[([1], ['a'], ['US'])] successfully gets the first row.

When using `loc` with lists, the returned DataFrame keeps its full MultiIndex structure, and I can filter by several values per level at once by putting them in the same list.

Same structure in the return:

```python
>>> df.loc[(1, 'a')]
      values
col3        
US       1.0
```

```sh
                values
col1 col2 col3        
1    a    US       1.0
```

```sh
                values
col1 col2 col3        
1    a    US       1.0
2    a    US       2.0
```

The most interesting thing about this bug is that it works with 1 list, 3 lists, or 4 lists, but with 2 lists it raises a KeyError. So `df.loc[([1,2], ['a','b'], ['US'])]` works, but `df.loc[([1,2], ['a','b'])]` does not.

I'm not sure this is a bug. To be explicit when partial indexing with list-likes, use `slice` or `pd.IndexSlice`; see https://pandas.pydata.org/docs/user_guide/advanced.html#using-slicers

From the docs:

> You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be mis-interpreted as indexing both axes, rather than into say the MultiIndex for the rows.

```python
>>> # use the built-in slice
>>> df.loc[([1], ["a"], slice(None))]
                values
col1 col2 col3
1    a    US       1.0
>>>
>>> # use the IndexSlice class
>>> idx = pd.IndexSlice
>>> df.loc[(idx[[1], ["a"]],)]
                values
col1 col2 col3
1    a    US       1.0
>>>
>>> # or
>>> df.loc[idx[[1], ["a"], :]]
                values
col1 col2 col3
1    a    US       1.0
>>>
>>> # or even just in this case
>>> df.loc[([1], ["a"]), :]
                values
col1 col2 col3
1    a    US       1.0
```
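A side note on why the `slice(None)` and `pd.IndexSlice` spellings above are interchangeable: indexing `pd.IndexSlice` simply hands back the key it was given, so both forms build the same tuple before `.loc` ever sees it. A minimal sketch (the variable names are only illustrative):

```python
import pandas as pd

idx = pd.IndexSlice

# Indexing IndexSlice returns the key unchanged; ":" becomes slice(None).
key_from_idx = idx[[1], ["a"], :]

# The same key written out by hand with the built-in slice.
key_by_hand = ([1], ["a"], slice(None))

print(key_from_idx == key_by_hand)  # True
```

Because the tuple has three elements, it cannot be mistaken for a (rows, columns) pair on a two-dimensional DataFrame.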

> I'm not sure this is a bug. To be explicit when partial indexing with list-likes, use `slice` or `pd.IndexSlice`; see https://pandas.pydata.org/docs/user_guide/advanced.html#using-slicers

That is a clever solution. However, assuming it is not a bug, the same behavior should appear in the following code:

```python
import pandas as pd
df = pd.DataFrame({
    'col1': [1, 2, 1, 2],
    'col2': ['a', 'a', 'b', 'b'],
    'col3': ['US', 'US', 'US', 'BR'],
    'col4': ['3', '4', '3', '3'],
    'values': [1.0, 2.0, 3.0, 4.5],
})
df.set_index(['col1', 'col2', 'col3', 'col4'], inplace=True)
df.loc[([1, 2], ['a'], ['US', 'BR'])]
```

But this code works.
I still think there is something wrong when applying `loc` with 2 lists in the tuple.

> I still think there is something wrong when applying loc with 2 lists in the tuple.

Because of the way the `__getitem__` operator (i.e. square brackets) works in Python, the key is always a tuple when comma-separated values are passed.

```python
>>> class MyClass:
...     def __getitem__(self, key):
...         print(key)
...
>>> MyClass()[1, 1]
(1, 1)
>>>
>>> MyClass()[(1, 1)]
(1, 1)
```

So `df.loc[([1,2], ['a'])]` is equivalent to `df.loc[[1,2], ['a']]`:

```python
>>> MyClass()[([1, 2], ["a"])]
([1, 2], ['a'])
>>>
>>> MyClass()[[1, 2], ["a"]]
([1, 2], ['a'])
```

So the indexing becomes ambiguous: should the second value be the column indexer, or an indexer for the second level of the MultiIndex?

Hence the need to be explicit in the ambiguous cases.
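To make the ambiguity concrete, here is a small sketch using the DataFrame from the original report. It assumes the behaviour described in this thread, namely that a two-element tuple of lists is read as (row indexer, column indexer); the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [1, 2, 1, 2],
        "col2": ["a", "a", "b", "b"],
        "col3": ["US", "US", "US", "BR"],
        "values": [1.0, 2.0, 3.0, 4.5],
    }
).set_index(["col1", "col2", "col3"])

# With a real column label in the second slot the lookup succeeds,
# which shows the second list is being treated as a column indexer.
rows_and_cols = df.loc[([1], ["values"])]

# With a level-1 label in the second slot, "a" is searched among the columns.
try:
    df.loc[([1], ["a"])]
    two_list_error = None
except KeyError as exc:
    two_list_error = exc

# Spelling out the column indexer removes the ambiguity.
explicit = df.loc[([1], ["a"]), :]

print(rows_and_cols)
print(repr(two_list_error))
print(explicit)
```

The first lookup returns both rows with `col1 == 1`, the second is the KeyError from the report, and the third returns the single expected row with the full MultiIndex intact.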

```python
>>> # The other alternative is to specify the axis
>>> df.loc(axis=0)[([1], ["a"])]
                values
col1 col2 col3
1    a    US       1.0
>>>
>>> # or add a comma after the tuple
>>> df.loc[([1], ["a"]),]
                values
col1 col2 col3
1    a    US       1.0
```
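The trailing-comma form works for a plain Python reason, which can be checked without pandas. In the sketch below, `KeyRecorder` is a hypothetical helper (not part of any library) that returns whatever key Python passes to `__getitem__`:

```python
class KeyRecorder:
    """Hypothetical helper: return the key exactly as __getitem__ receives it."""

    def __getitem__(self, key):
        return key


rec = KeyRecorder()

# With or without parentheses, the key is the same 2-tuple of lists.
bare = rec[[1], ["a"]]
parenthesised = rec[([1], ["a"])]

# A trailing comma wraps that tuple in an outer 1-tuple,
# so there is only one top-level element: the row indexer.
trailing_comma = rec[([1], ["a"]),]

print(bare == parenthesised)   # True
print(len(trailing_comma))     # 1
print(trailing_comma[0])       # ([1], ['a'])
```

With only one top-level element there is nothing left to read as a column indexer, so the inner tuple is applied to the levels of the row MultiIndex.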

Thank you @simonjayhawkins! That solves the question for me.
