I just upgraded to the latest Pandas (0.16), and a new error, which did not occur in Pandas 0.15, has hit me when slicing with multiple values (df.ix[[list_of_values]]). I actually think it is more valid to return an empty DataFrame than to throw an error.
The best I've been able to come up with to reproduce the previous behaviour (fail silently, return empty DataFrame) is:
# I want to filter a DataFrame by the first level of the index
df.ix[df.reset_index().IndexValue.isin(list_of_values).values]
Not saying I'm right on the error-versus-empty argument, but is the above the most elegant solution?
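For what it's worth, here is a sketch of an alternative that avoids the reset_index round-trip by building a boolean mask directly from the index (the frame and values below are hypothetical stand-ins):

import pandas as pd

# hypothetical two-level frame standing in for df
idx = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']], names=['first', 'second'])
df = pd.DataFrame({'x': range(4)}, index=idx)
list_of_values = ['bar', 'not-there']

# mask rows by the first index level; an all-False mask just yields
# an empty DataFrame rather than raising
df[df.index.get_level_values('first').isin(list_of_values)]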
Perhaps we should consider three distinct slicing operations:
I would think anyone indexing would be very aware of what they are expecting from the above?
Thanks for the report. Can you attach a script which includes the sample data preparation?
this is discussed tangentially in #10549
the discussion is whether .loc should behave like .reindex, or not, or land somewhere in between (where we are now). If at least one element matches, you get .reindex-like behaviour; if NO elements match, you get a KeyError.
Some want to always make it a KeyError if not ALL elements match. This I think is too restrictive and makes bugs very hard to find.
The other way is also a problem: if you have no matches it would silently skip a fairly common error condition, IMHO.
Note that you are discussing .ix, but it is very unlikely we will change this behavior AT ALL. If anything gets changed it would be .loc, which is much more strict and deterministic regardless of the type of the index data (IOW, .ix has lots and lots of edge cases)
@sinhrks: yes of course (apologies), how about this:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randint(10, size=(8, 8)), index=pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'], ['one', 'two']], names=['first', 'second']))
>>> df
0 1 2 3 4 5 6 7
first second
bar one 0 5 5 5 6 2 6 8
two 2 6 9 0 3 6 7 9
baz one 9 0 9 9 2 5 7 4
two 4 8 1 2 9 2 8 1
foo one 2 7 3 6 5 5 5 2
two 3 4 6 2 7 7 1 2
qux one 0 8 5 9 5 5 7 3
two 7 4 0 7 3 6 8 6
# works great:
>>> df.loc[['bar']]
0 1 2 3 4 5 6 7
first second
bar one 0 5 5 5 6 2 6 8
two 2 6 9 0 3 6 7 9
# not so much:
>>> df.loc[['ba']]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/fil/anaconda/lib/python3.4/site-packages/pandas/core/indexing.py", line 1189, in __getitem__
return self._getitem_axis(key, axis=0)
File "/Users/fil/anaconda/lib/python3.4/site-packages/pandas/core/indexing.py", line 1323, in _getitem_axis
return self._getitem_iterable(key, axis=axis)
File "/Users/fil/anaconda/lib/python3.4/site-packages/pandas/core/indexing.py", line 972, in _getitem_iterable
raise KeyError("cannot index a multi-index axis with these keys")
KeyError: 'cannot index a multi-index axis with these keys'
# my workaround: (but this is not "type safe")
>>> df.loc[df.reset_index()['first'].isin(['ba']).values]
Empty DataFrame
Columns: [0, 1, 2, 3, 4, 5, 6, 7]
Index: []
# still works for valid values:
>>> df.loc[df.reset_index()['first'].isin(['bar']).values]
0 1 2 3 4 5 6 7
first second
bar one 0 5 5 5 6 2 6 8
two 2 6 9 0 3 6 7 9
@jreback: apologies, I should have used .loc. Should I treat .ix as deprecated, and avoid using it?
I guess we could have three different indexers, depending on how you want missing values to be handled (.xx_union[['bar', 'ba']] / .xx_intersect[['bar', 'ba']] / .reindex(['bar', 'ba'])), or a parameter that specifies how to handle it (e.g. .xx(missing='error/intersect/union')[['bar', 'ba']]).
Perhaps .reindex could accept an error/intersect/union parameter? Aren't .reindex and .loc fundamentally the same thing, or am I missing a key distinction? One distinction would be that .loc accepts a boolean Series, but here I mean the case of indexing by values.
@filmackay adding ANOTHER indexer is a non-starter, too complicated already.
however, adding a missing='error'|'intersect'|'union' to .loc/.ix/.reindex (in the function call) might be worthwhile
.ix is not deprecated as it has a small set of cases where it's useful, but for more consistent behavior (in the presence of integer indexes), .loc is recommended
Is it crazy to think that .loc, .ix and .reindex could be merged into a single indexer? Is it difficult for legacy reasons, or do you think it's fundamentally flawed to try to encompass all use cases?
.ix needs to stay like it is as there are a few edge cases which .loc won't ever support
e.g. simultaneous indexing by position in one dim AND by label in another
df.ix[[1, 3, 5], ['A', 'B']] where the index is, say, of object dtype (otherwise in this example you could use df.loc[df.index[[1, 3, 5]], ['A', 'B']])
.reindex and .loc are pretty equivalent. The discussion is basically whether to go back to where they are different, e.g. .reindex will give you all values that you asked for (but NaN them if they don't exist), while .loc can raise a KeyError. BUT the issue is that setting is different from getting.
E.g. since you cannot use .reindex on the LHS of an expression (df.reindex(...) = ... is a Python error), then doing:
df.loc[...] = ... is the only method. So therefore you don't want a situation where you can get something in a different way than you can set it. So imagine you are setting an index value that doesn't exist: you simply set it, done (this is enlargement). But now should you raise when you get something non-existent? So these are tied.
Currently I think we are in a consistent state that is predictable: .reindex is the same as .loc, and getting and setting are mirror images. The downside is that it's very tricky for .loc to figure out when to raise (currently only when NOTHING matches).
So a proposal to add errors='strict|missing' might make sense.
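To make that concrete, here is a minimal sketch of what the proposed keyword could mean, written as a plain wrapper around .loc (the errors keyword itself is hypothetical, not pandas API):

import pandas as pd

def loc_get(obj, key, errors='strict'):
    # 'strict' re-raises, as .loc does today; 'missing' swallows the
    # KeyError and returns an empty object of the same type instead
    try:
        return obj.loc[key]
    except KeyError:
        if errors == 'strict':
            raise
        return obj.iloc[0:0]

s = pd.Series([1, 2], index=['a', 'b'])
loc_get(s, ['a'])                    # normal lookup
loc_get(s, ['z'], errors='missing')  # empty Series instead of KeyError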
I have a similar issue with respect to the .xs method. I have a DataFrame df1 indexed with 3 levels (A, B, C) and a column 'flow', and another DataFrame df2 that is indexed by 2 levels (A, B) with a column 'balance'. Note that there are values (a, b) in df2 that are not present in df1. For each pair (a, b) in df2, I want to create a new column in df2 with the values that correspond to df1.xs((a, b)).flow.sum() + df2.loc[(a, b)].balance. If (a, b) is not in df1, I want the sum to be zero, but you still get the balance value. Right now df1.xs((a, b)).sum() gives a KeyError, so I end up with some funky code to test if (a, b) is in df1 when computing the sums. If .xs would just return an empty DataFrame when (a, b) is not there, I'd have very clean code.
I can create a new issue for this. I can also provide an example. Let me know if I need to do either.
I would support the idea of adding errors='strict|missing' to .xs to solve the problem.
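A small sketch of the workaround being described, assuming df1 is indexed by levels named 'A', 'B', 'C' with a 'flow' column (names are illustrative): wrap .xs and treat a missing (a, b) as a zero sum.

def flow_sum(df1, a, b):
    # .xs raises KeyError when (a, b) is absent; fall back to 0
    try:
        return df1.xs((a, b), level=['A', 'B'])['flow'].sum()
    except KeyError:
        return 0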
I have an example that illustrates what I'd like to do, and a proposal. Here's the example:
import pandas as pd

citypairs = [('Miami', 'Boston'), ('Miami', 'New York'), ('New York', 'San Francisco'),
             ('Boston', 'New York'), ('Boston', 'San Francisco')]
index = pd.MultiIndex.from_tuples(citypairs, names=['origin', 'dest'])
s = pd.Series([i*10 + 10 for i in range(5)], index=index)
# Compute all of the cities that appear as an origin or a destination
cities = set(p[0] for p in citypairs).union(set(p[1] for p in citypairs))
osums = {c: s.loc[c, :].sum() for c in cities}
dsums = {c: s.loc[:, c].sum() for c in cities}
The idea in this example is that I have data from a data source that has pairs of cities, and need to compute sums of the series for each city that appears by origin and by destination. In the above code, the computation of osums fails with a KeyError (due to the empty slice for 'San Francisco') and the computation of dsums fails with an IndexingError (due to the empty slice for 'Miami').
My proposal is as follows. Introduce a new indexer called .sloc (slicing loc) that has the exact same behavior as .loc, except that if an error occurs, an empty Series, DataFrame, or Panel is returned (depending on what was being sliced). It is documented this way, with the warning that the user has to recognize that this could cause errors to get missed. Caveat emptor.
@jreback I know you said above that you didn't want to introduce a new indexer, but this solution seems rather clean, and I'd be willing to try to implement it as it would really help my applications.
this adds unneeded complexity
this is exactly what Categoricals get you
@jreback I tried with categoricals in a MultiIndex and still get an indexing problem if something is missing. It's because the categories aren't pushed down to the MultiIndex. Here's the example. Am I doing something wrong?
import pandas as pd

citypairs = [('Miami', 'Boston'), ('Miami', 'New York'), ('New York', 'San Francisco'),
             ('Boston', 'New York'), ('Boston', 'San Francisco')]
vals = [i*10 + 10 for i in range(5)]
df = pd.DataFrame({'orig': [p[0] for p in citypairs],
                   'dest': [p[1] for p in citypairs],
                   'vals': vals})
df['orig'] = df['orig'].astype("category")
df['dest'] = df['dest'].astype("category")
cities = set(p[0] for p in citypairs).union(set(p[1] for p in citypairs))
# set_categories returns a new Series, so assign the result back
df['orig'] = df['orig'].cat.set_categories(cities)
df['dest'] = df['dest'].cat.set_categories(cities)
df.set_index(['orig', 'dest'], inplace=True)
osums = { c : df.loc[c,:].sum() for c in cities}
dsums = { c : df.loc[:,c].sum() for c in cities}
Now the error is KeyError: 'the label [San Francisco] is not in the [index]'.
I don't see how adding the extra .sloc adds complexity, since it will work just like .loc but not fail. It makes code like the example I initially wrote above quite clean.
this is what .reindex does
@jreback I apologize for not understanding you, and it's likely I'm missing something, but I can't see how to make .reindex work and have "clean" code. In my original example, I used a MultiIndex consisting of two columns that were strings. You suggested to use Categoricals. I tried that, and I can't get that to work either.
My proposal for a .sloc would be implemented by subclassing the current _LocationIndexer and just catching the exception when __getitem__ is called to return an empty object. So I don't see the complexity you refer to. In my original example, if .sloc were implemented, then just using .sloc rather than .loc would produce nice looking code.
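For illustration, here is a rough sketch of those semantics that stays outside pandas internals, using the public accessor-registration hook rather than subclassing _LocationIndexer (the sloc name is the proposal's; this mechanism is just one way to emulate it for getting):

import pandas as pd

@pd.api.extensions.register_dataframe_accessor('sloc')
class SilentLoc:
    # behaves like .loc for getting, but a missing key yields an
    # empty DataFrame instead of raising
    def __init__(self, df):
        self._df = df
    def __getitem__(self, key):
        try:
            return self._df.loc[key]
        except KeyError:
            return self._df.iloc[0:0]

df = pd.DataFrame({'vals': [1, 2]},
                  index=pd.MultiIndex.from_tuples([('a', 'x'), ('b', 'y')]))
df.sloc['a']  # same result as df.loc['a']
df.sloc['z']  # empty DataFrame, no KeyError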
Incidentally, here is what I want to do using dicts rather than pandas; this solution doesn't scale well when the index has lots of elements in the tuples, and the if tests are not nice looking.
citypairs = [('Miami', 'Boston'), ('Miami', 'New York'), ('New York', 'San Francisco'),
             ('Boston', 'New York'), ('Boston', 'San Francisco')]
vals = [i*10 + 10 for i in range(5)]
cities = set(p[0] for p in citypairs).union(set(p[1] for p in citypairs))
adict = {z[0]: z[1] for z in zip(citypairs, vals)}
dictosums = {c: sum(adict[(c2, i)] for (c2, i) in adict.keys() if c == c2) for c in cities}
dictdsums = {c: sum(adict[(i, c2)] for (i, c2) in adict.keys() if c == c2) for c in cities}
@Dr-Irv Maybe not fully related to this discussion, but the summing for each of the levels that you are trying to do can also be achieved using sum and specifying the level to sum over:
In [21]: df
Out[21]:
vals
orig dest
Miami Boston 10
New York 20
New York San Francisco 30
Boston New York 40
San Francisco 50
In [26]: df.sum(level=0)
Out[26]:
vals
orig
Boston 90
Miami 30
New York 30
In [27]: df.sum(level=1)
Out[27]:
vals
dest
Boston 10
New York 60
San Francisco 80
@jorisvandenbossche Thanks, but the issue is that the result of that sum does not include the zero values for the missing cities. In the code that I wrote above using dictionaries, the results are:
{'Boston': 90, 'New York': 30, 'San Francisco': 0, 'Miami': 30}
{'Boston': 10, 'New York': 60, 'San Francisco': 80, 'Miami': 0}
Note the zero values for San Francisco and Miami in the two respective sums.
The reason the zero values are needed is because there is other code that needs the sums for all cities.
@Dr-Irv you can simply fully expand the index levels via .reindex if you want. You must specify the full scheme, as this is not automatic and it is not obvious how you would want to do this.
Then you can do whatever you want, including using .fillna(0) if you want to preserve values.
In [61]: df.reindex(index=pd.MultiIndex.from_product([list(cities),list(cities)]))
Out[61]:
vals
New York New York NaN
San Francisco 30
Miami NaN
Boston NaN
San Francisco New York NaN
San Francisco NaN
Miami NaN
Boston NaN
Miami New York 20
San Francisco NaN
Miami NaN
Boston 10
Boston New York 40
San Francisco 50
Miami NaN
Boston NaN
creating another indexer is a complete non-starter as it would make indexing even MORE confusing (we already have [], .ix, .iloc, .loc, .at, .iat). This is WAY more than enough.
@jreback The problem with your solution is when there are 1000 different cities, but the original data has 10,000 city pairs. (The example data comes from the representation of a graph). Your solution above creates a DataFrame with 1,000,000 entries, most of which are unnecessary.
I understand the potential confusion of adding another indexer, but that then brings us back to the possibility of adding the errors=strict|missing optional argument to .loc. Or, how about doing this via a pandas option that determines the behavior of .loc in terms of whether it does the things the current way (reporting KeyError) or just returns empty objects when the key is missing. Having an option might be more elegant, otherwise, you end up with errors=missing code in lots of places. I'd be willing to give this a try in terms of implementing it.
@Dr-Irv you can't have it both ways, either you have a sparse repr which is what a MultiIndex is, or you don't. Try to solve your problem with existing machinery, rather than resorting to looping and creating new APIs.
To be honest pandas is not very good at representing graphs. Trying to shove things in like this is a non-starter.
Not to mention your code above is non-performant.
@jreback But this brings us back to the original problem (which started the discussion above by @filmackay), which is that I have a sparse representation using MultiIndex, but if I slice with a key that is not there, I get an error, as opposed to an empty DataFrame (which I can then apply .sum to).
I think this might be related to the discussion in #4036, as I have use cases where there is a MultiIndex with lots of levels, where I need to do sums for different combinations of the keys (i.e., fixing the keys for some levels, and summing across another level), and there may be combinations that don't exist, so if the .sum over the slice would return 0, the code would be easy to understand and explain.
@Dr-Irv again specific for your problem: you can also do the reindex _after_ the summing for each level:
In [54]: df.sum(level=0).reindex(list(cities))
Out[54]:
vals
orig
New York 30
San Francisco NaN
Miami 30
Boston 90
I think in this specific case, you will find this a better and more performant solution than iterating through the dataframe. But nonetheless, the original question in this issue can still be relevant of course.
@jreback I agree that adding yet another indexer is not the best way forward, but I was wondering if there would be room for a method (so not an indexer, and thus only usable for getting values, not setting) to do this? Some kind of general 'getitem' method; as it is a method, it would be easier to add keyword arguments to, e.g. to specify what should happen if the label is not found.
Actually there is a DataFrame.get, but this is restricted to the info axis.
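To illustrate that restriction: DataFrame.get behaves like dict.get, but only for the info axis (columns), with nothing equivalent for row labels.

import pandas as pd

df = pd.DataFrame({'vals': [1, 2]}, index=['a', 'b'])
df.get('vals')            # column lookup, like dict.get
df.get('missing', 'n/a')  # returns the default rather than raising
# no row-axis analogue: df.loc['missing'] raises KeyError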
building on @jorisvandenbossche's soln, I think this is what you want.
In [4]: df.sum(level=0).reindex(index=list(cities)).fillna(0)
Out[4]:
vals
orig
New York 30
San Francisco 0
Miami 30
Boston 90
@jreback @jorisvandenbossche Thank you very much for your response. While I appreciate your solution, it isn't as elegant as the one I propose. I'm trying to create something for teaching purposes that looks easy.
Regarding the comment about DataFrame.get, what would really be nice is to have something like that, but for the main (axis=0) axis, with all the same slicing that exists for .loc. In other words, I'd like to be able to do something like NDFrame.rows[a,b,:,d,e] that would return just the rows corresponding to that slice. This would work the same way for Series, DataFrame and Panel. It slices along the first dimension, where you're likely to have a MultiIndex. If it did that, and returned empty objects when the slices were empty, then I get back to clean code for my examples, and I think we address the main point of this issue that was originally raised. Put another way, we create a method that is a shorthand for .loc(axis=0) and document that if a key is specified and is missing, then an empty object is returned.
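As a sketch of those semantics (the rows name is the proposal's; this helper is hypothetical, not pandas API):

import pandas as pd

def rows(obj, key):
    # slice along axis 0 with .loc semantics, but return an empty
    # object instead of raising when the key is missing
    try:
        return obj.loc(axis=0)[key]
    except KeyError:
        return obj.iloc[0:0]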
@Dr-Irv iterating over values is not idiomatic at all
and completely non performant
you have a very nice soln above - this is a very pandonic soln
and esp if you are teaching this is the only way to go
individually indexing is NOT a soln
Respectfully, can I chime in and agree with Dr Irv?
His .sloc suggestion seems reasonable.
The other way is also a problem: if you have no matches it would silently skip a fairly common error condition, IMHO.
The accuracy of this assumption depends on context. It seems reasonable that there are some programming contexts where returning the empty frame is the expected behavior. (For example, if the result is married with an aggregator like sum, which is the specific need here).
dicts have both get and [] for that reason. When I see one, I know a missing entry is not an error, and when I see the other, I know it is. I'm not sure why this issue is different.
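The analogy in miniature:

d = {'a': 1}
d.get('b', 0)  # 0: a missing entry is expected and handled
# d['b']       # would raise KeyError: here a missing entry is a bug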
@pjcpjc you can certainly comment!
We cannot extend this already way way too complicated API any more
we have: [], .ix, .iloc, .loc, .iat, .at.
so you want another one?
as I said above, a keyword argument would possibly be ok, but certainly not another indexer.
We have to have a default. This certainly may not work for everyone (and that is why we have options).
Or a keyword argument to .loc that does what Dr. Irv's .sloc does, with the default value for the keyword being consistent with the current behavior.
Whatever you prefer.
We cannot extend this already way way too complicated API any more
I think the spelling in this case is MOAR!! ;)
But seriously, I am familiar with the domain here (arguably I am an expert and Dr Irv is a guru) and he is right on the money in terms of identifying a context in which the natural result is an empty frame and not an error.
I don't think he is being non-performant. He can get what he wants (I think) with a helper function, but that will look awkward when trying to convince other people to jump over from their current legacy language.
We're not trying to make trouble - we're trying to bring optimization programming into the 21st century and use Python + pandas.
this was closed by #15747. .loc and .reindex are distinct and have separate purposes.
If anyone cares, I now regret some of the strong language I used above. I think pandas, while awesome, isn't the right vehicle for the sort of idioms that optimization people are accustomed to. Optimization people should either write pandas code pandonically or use different data structures. There is a small mountain of examples over at https://github.com/opalytics/opalytics-ticdat/tree/master/examples if anyone is interested, including a pandonic example.