I just upgraded to the latest Pandas (0.16), and a new error, which did not occur in Pandas 0.15, has hit me when slicing with multiple values (df.ix[[list_of_values]]). I actually think it is more valid to return an empty DataFrame than to throw an error.
The best I've been able to come up with to reproduce the previous behaviour (fail silently, return empty DataFrame) is:
# I want to filter a DataFrame by the first level of the index
df.ix[df.reset_index().IndexValue.isin(list_of_values).values]
Not saying I'm right on the error-versus-empty argument, but is the above the most elegant solution?
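For what it's worth, here is a sketch of an alternative that avoids the reset_index round-trip by building a boolean mask directly from the index (the frame and values below are hypothetical stand-ins):

import pandas as pd

# hypothetical two-level frame standing in for df
idx = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']], names=['first', 'second'])
df = pd.DataFrame({'x': range(4)}, index=idx)
list_of_values = ['bar', 'not-there']

# mask rows by the first index level; an all-False mask just yields
# an empty DataFrame rather than raising
df[df.index.get_level_values('first').isin(list_of_values)]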
Perhaps we should consider three distinct slicing operations:
I would think anyone indexing would be very aware of what they are expecting from the above?
Thanks for the report. Can you attach a script which includes the sample data preparation?
this is discussed tangentially in #10549
the discussion is whether .loc should behave like .reindex, or not, or land somewhere in between (where we are now). If at least one element matches, you get .reindex-like behaviour; if NO elements match, you get a KeyError.
Some want to always make it a KeyError if not ALL elements match. This I think is too restrictive and makes bugs very hard to find.
The other way is also a problem: if you have no matches it would silently skip a fairly common error condition, IMHO.
Note that you are discussing .ix, but it is very unlikely we will change this behavior AT ALL. If anything gets changed it would be .loc, which is much more strict and deterministic regardless of the type of the index data (IOW, .ix has lots and lots of edge cases)
@sinhrks: yes of course (apologies), how about this:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randint(10, size=(8, 8)), index=pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'], ['one', 'two']], names=['first', 'second']))
>>> df
0 1 2 3 4 5 6 7
first second
bar one 0 5 5 5 6 2 6 8
two 2 6 9 0 3 6 7 9
baz one 9 0 9 9 2 5 7 4
two 4 8 1 2 9 2 8 1
foo one 2 7 3 6 5 5 5 2
two 3 4 6 2 7 7 1 2
qux one 0 8 5 9 5 5 7 3
two 7 4 0 7 3 6 8 6
# works great:
>>> df.loc[['bar']]
0 1 2 3 4 5 6 7
first second
bar one 0 5 5 5 6 2 6 8
two 2 6 9 0 3 6 7 9
# not so much:
>>> df.loc[['ba']]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/fil/anaconda/lib/python3.4/site-packages/pandas/core/indexing.py", line 1189, in __getitem__
return self._getitem_axis(key, axis=0)
File "/Users/fil/anaconda/lib/python3.4/site-packages/pandas/core/indexing.py", line 1323, in _getitem_axis
return self._getitem_iterable(key, axis=axis)
File "/Users/fil/anaconda/lib/python3.4/site-packages/pandas/core/indexing.py", line 972, in _getitem_iterable
raise KeyError("cannot index a multi-index axis with these keys")
KeyError: 'cannot index a multi-index axis with these keys'
# my workaround: (but this is not "type safe")
>>> df.loc[df.reset_index()['first'].isin(['ba']).values]
Empty DataFrame
Columns: [0, 1, 2, 3, 4, 5, 6, 7]
Index: []
# still works for valid values:
>>> df.loc[df.reset_index()['first'].isin(['bar']).values]
0 1 2 3 4 5 6 7
first second
bar one 0 5 5 5 6 2 6 8
two 2 6 9 0 3 6 7 9
@jreback: apologies, I should have used .loc. Should I treat .ix as deprecated, and avoid using it?
I guess we could have three different indexers, depending on how you want missing values to be handled (.xx_union[['bar', 'ba']] / .xx_intersect[['bar', 'ba']] / .reindex(['bar', 'ba'])), or a parameter that specifies how to handle it (e.g. .xx(missing='error/intersect/union')[['bar', 'ba']]).
Perhaps .reindex could accept an error/intersect/union parameter? Aren't .reindex and .loc fundamentally the same thing, or am I missing a key distinction? One distinction would be that .loc accepts a boolean Series, but here I mean the case of indexing by values.
@filmackay adding ANOTHER indexer is a non-starter, too complicated already.
however, adding a missing='error'|'intersect'|'union' to .loc/.ix/.reindex (in the function call) might be worthwhile
.ix is not deprecated as it has a small set of cases where it's useful, but for more consistent behavior (in the presence of integer indexes), .loc is recommended
Is it crazy to think that .loc, .ix and .reindex could be merged into a single indexer? Is it difficult for legacy reasons, or do you think it's fundamentally flawed to try to encompass all use cases?
.ix needs to stay like it is as there are a few edge cases which .loc won't ever support
e.g. simultaneous indexing by position in one dim AND by label in another
df.ix[[1, 3, 5], ['A', 'B']] where the index is, say, of object dtype (otherwise in this example you could use df.loc[df.index[[1, 3, 5]], ['A', 'B']])
.reindex and .loc are pretty equivalent. The discussion is basically whether to go back to where they are different, e.g. .reindex will give you all values that you asked for (but NaN them if they don't exist), while .loc can raise a KeyError. BUT the issue is that setting is different from getting.
E.g. since you cannot use .reindex on the LHS of an expression (df.reindex(...) = ... is a Python error), then doing:
df.loc[...] = ... is the only method. So therefore you don't want a situation where you can get something in a different way than you can set it. So imagine you are setting an index value that doesn't exist: you simply set it, done (this is enlargement). But now should you raise when you get something non-existent? So these are tied.
Currently I think we are in a consistent state that is predictable: .reindex is the same as .loc, and getting and setting are mirror images. The downside is that it's very tricky for .loc to figure out when to raise (currently only when NOTHING matches).
So a proposal to add errors='strict|missing' might make sense.
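To make that concrete, here is a minimal sketch of what the proposed keyword could mean, written as a plain wrapper around .loc (the errors keyword itself is hypothetical, not pandas API):

import pandas as pd

def loc_get(obj, key, errors='strict'):
    # 'strict' re-raises, as .loc does today; 'missing' swallows the
    # KeyError and returns an empty object of the same type instead
    try:
        return obj.loc[key]
    except KeyError:
        if errors == 'strict':
            raise
        return obj.iloc[0:0]

s = pd.Series([1, 2], index=['a', 'b'])
loc_get(s, ['a'])                    # normal lookup
loc_get(s, ['z'], errors='missing')  # empty Series instead of KeyError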
I have a similar issue with respect to the .xs method. I have a DataFrame df1 indexed with 3 levels (A, B, C) and a column 'flow', and another DataFrame df2 that is indexed by 2 levels (A, B) with a column 'balance'. Note that there are values (a, b) in df2 that are not present in df1. For each pair (a, b) in df2, I want to create a new column in df2 with the values that correspond to df1.xs((a, b)).flow.sum() + df2.loc[(a, b)].balance. If (a, b) is not in df1, I want the sum to be zero, but you still get the balance value. Right now df1.xs((a, b)).sum() gives a KeyError, so I end up with some funky code to test if (a, b) is in df1 when computing the sums. If .xs would just return an empty DataFrame when (a, b) is not there, I'd have very clean code.
I can create a new issue for this. I can also provide an example. Let me know if I need to do either.
I would support the idea of adding errors='strict|missing' to .xs to solve the problem.
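A small sketch of the workaround being described, assuming df1 is indexed by levels named 'A', 'B', 'C' with a 'flow' column (names are illustrative): wrap .xs and treat a missing (a, b) as a zero sum.

def flow_sum(df1, a, b):
    # .xs raises KeyError when (a, b) is absent; fall back to 0
    try:
        return df1.xs((a, b), level=['A', 'B'])['flow'].sum()
    except KeyError:
        return 0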
I have an example that illustrates what I'd like to do, and a proposal. Here's the example:
import pandas as pd

citypairs = [('Miami', 'Boston'), ('Miami', 'New York'), ('New York', 'San Francisco'),
             ('Boston', 'New York'), ('Boston', 'San Francisco')]
index = pd.MultiIndex.from_tuples(citypairs, names=['origin', 'dest'])
s = pd.Series([i*10 + 10 for i in range(5)], index=index)
# Compute all of the cities that appear as an origin or a destination
cities = set(p[0] for p in citypairs).union(set(p[1] for p in citypairs))
osums = {c: s.loc[c, :].sum() for c in cities}
dsums = {c: s.loc[:, c].sum() for c in cities}
The idea in this example is that I have data from a data source that has pairs of cities, and need to compute sums of the series for each city that appears by origin and by destination. In the above code, the computation of osums fails with a KeyError (due to the empty slice for 'San Francisco') and the computation of dsums fails with an IndexingError (due to the empty slice for 'Miami').
My proposal is as follows. Introduce a new indexer called .sloc (slicing loc) that has the exact same behavior as .loc, except that if an error occurs, an empty Series, DataFrame, or Panel is returned (depending on what was being sliced). It is documented this way, with the warning that the user has to recognize that this could cause errors to get missed. Caveat emptor.
@jreback I know you said above that you didn't want to introduce a new indexer, but this solution seems rather clean, and I'd be willing to try to implement it as it would really help my applications.
this adds unneeded complexity
this is exactly what Categoricals get you
@jreback I tried with categoricals in a MultiIndex and still get an indexing problem if something is missing. It's because the categories aren't pushed down to the MultiIndex. Here's the example. Am I doing something wrong?
import pandas as pd

citypairs = [('Miami', 'Boston'), ('Miami', 'New York'), ('New York', 'San Francisco'),
             ('Boston', 'New York'), ('Boston', 'San Francisco')]
vals = [i*10 + 10 for i in range(5)]
df = pd.DataFrame({'orig': [p[0] for p in citypairs],
                   'dest': [p[1] for p in citypairs],
                   'vals': vals})
df['orig'] = df['orig'].astype("category")
df['dest'] = df['dest'].astype("category")
cities = set(p[0] for p in citypairs).union(set(p[1] for p in citypairs))
# set_categories returns a new Series, so assign the result back
df['orig'] = df['orig'].cat.set_categories(cities)
df['dest'] = df['dest'].cat.set_categories(cities)
df.set_index(['orig', 'dest'], inplace=True)
osums = { c : df.loc[c,:].sum() for c in cities}
dsums = { c : df.loc[:,c].sum() for c in cities}
Now the error is KeyError: 'the label [San Francisco] is not in the [index]'.
I don't see how adding the extra .sloc adds complexity, since it will work just like .loc but not fail. It makes code like the example I initially wrote above quite clean.
this is what .reindex does
@jreback I apologize for not understanding you, and it's likely I'm missing something, but I can't see how to make .reindex work and have "clean" code. In my original example, I used a MultiIndex consisting of two columns that were strings. You suggested to use Categoricals. I tried that, and I can't get that to work either.
My proposal for a .sloc would be implemented by subclassing the current _LocationIndexer and just catching the exception when __getitem__ is called to return an empty object. So I don't see the complexity you refer to. In my original example, if .sloc were implemented, then just using .sloc rather than .loc would produce nice looking code.
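For illustration, here is a rough sketch of those semantics that stays outside pandas internals, using the public accessor-registration hook rather than subclassing _LocationIndexer (the sloc name is the proposal's; this mechanism is just one way to emulate it for getting):

import pandas as pd

@pd.api.extensions.register_dataframe_accessor('sloc')
class SilentLoc:
    # behaves like .loc for getting, but a missing key yields an
    # empty DataFrame instead of raising
    def __init__(self, df):
        self._df = df
    def __getitem__(self, key):
        try:
            return self._df.loc[key]
        except KeyError:
            return self._df.iloc[0:0]

df = pd.DataFrame({'vals': [1, 2]},
                  index=pd.MultiIndex.from_tuples([('a', 'x'), ('b', 'y')]))
df.sloc['a']  # same result as df.loc['a']
df.sloc['z']  # empty DataFrame, no KeyError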
Incidentally, here is what I want to do using dicts rather than pandas; this solution doesn't scale well when the index has lots of elements in the tuples, and the if tests are not nice looking.
citypairs = [('Miami', 'Boston'), ('Miami', 'New York'), ('New York', 'San Francisco'),
             ('Boston', 'New York'), ('Boston', 'San Francisco')]
vals = [i*10 + 10 for i in range(5)]
cities = set(p[0] for p in citypairs).union(set(p[1] for p in citypairs))
adict = {z[0]: z[1] for z in zip(citypairs, vals)}
dictosums = {c: sum(adict[(c2, i)] for (c2, i) in adict.keys() if c == c2) for c in cities}
dictdsums = {c: sum(adict[(i, c2)] for (i, c2) in adict.keys() if c == c2) for c in cities}
@Dr-Irv Maybe not fully related to this discussion, but the summing for each of the levels that you are trying to do can also be achieved using sum and specifying the level to sum over:
In [21]: df
Out[21]:
vals
orig dest
Miami Boston 10
New York 20
New York San Francisco 30
Boston New York 40
San Francisco 50
In [26]: df.sum(level=0)
Out[26]:
vals
orig
Boston 90
Miami 30
New York 30
In [27]: df.sum(level=1)
Out[27]:
vals
dest
Boston 10
New York 60
San Francisco 80
@jorisvandenbossche Thanks, but the issue is that the result of that sum does not include the zero values for the missing cities. In the code that I wrote above using dictionaries, the results are:
{'Boston': 90, 'New York': 30, 'San Francisco': 0, 'Miami': 30}
{'Boston': 10, 'New York': 60, 'San Francisco': 80, 'Miami': 0}
Note the zero values for San Francisco and Miami in the two respective sums.
The reason the zero values are needed is because there is other code that needs the sums for all cities.
@Dr-Irv you can simply fully expand the index levels via .reindex if you want. You must specify the full scheme, as this is not automatic and it is not obvious how you would want to do this.
Then you can do whatever you want, including using .fillna(0) if you want to preserve values.
In [61]: df.reindex(index=pd.MultiIndex.from_product([list(cities),list(cities)]))
Out[61]:
vals
New York New York NaN
San Francisco 30
Miami NaN
Boston NaN
San Francisco New York NaN
San Francisco NaN
Miami NaN
Boston NaN
Miami New York 20
San Francisco NaN
Miami NaN
Boston 10
Boston New York 40
San Francisco 50
Miami NaN
Boston NaN
creating another indexer is a complete non-starter as it would make indexing even MORE confusing (we already have [], .ix, .iloc, .loc, .at, .iat). This is WAY more than enough.
@jreback The problem with your solution is when there are 1000 different cities, but the original data has 10,000 city pairs. (The example data comes from the representation of a graph). Your solution above creates a DataFrame with 1,000,000 entries, most of which are unnecessary.
I understand the potential confusion of adding another indexer, but that then brings us back to the possibility of adding the errors=strict|missing optional argument to .loc. Or, how about doing this via a pandas option that determines the behavior of .loc in terms of whether it does the things the current way (reporting KeyError) or just returns empty objects when the key is missing. Having an option might be more elegant, otherwise, you end up with errors=missing code in lots of places. I'd be willing to give this a try in terms of implementing it.
@Dr-Irv you can't have it both ways, either you have a sparse repr which is what a MultiIndex is, or you don't. Try to solve your problem with existing machinery, rather than resorting to looping and creating new APIs.
To be honest pandas is not very good at representing graphs. Trying to shove things in like this is a non-starter.
Not to mention your code above is non-performant.
@jreback But this brings us back to the original problem (which started the discussion above by @filmackay), which is that I have a sparse representation using MultiIndex, but if I slice with a key that is not there, I get an error, as opposed to an empty DataFrame (which I can then apply .sum to).
I think this might be related to the discussion in #4036, as I have use cases where there is a MultiIndex with lots of levels, where I need to do sums for different combinations of the keys (i.e., fixing the keys for some levels, and summing across another level), and there may be combinations that don't exist, so if the .sum over the slice would return 0, the code would be easy to understand and explain.
@Dr-Irv again specific for your problem: you can also do the reindex _after_ the summing for each level:
In [54]: df.sum(level=0).reindex(list(cities))
Out[54]:
vals
orig
New York 30
San Francisco NaN
Miami 30
Boston 90
I think in this specific case, you will find this a better and more performant solution than iterating through the dataframe. But nonetheless, the original question in this issue can still be relevant of course.
@jreback I agree that adding yet another indexer is not the best way forward, but I was wondering if there would be room for a method (so not an indexer, and thus only usable for getting values, not setting) to do this? Some kind of general 'getitem' method; as it is a method, it would be easier to add keyword arguments to, e.g. to specify what should happen if the label is not found.
Actually there is a DataFrame.get, but this is restricted to the info axis.
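To illustrate that restriction: DataFrame.get behaves like dict.get, but only for the info axis (columns), with nothing equivalent for row labels.

import pandas as pd

df = pd.DataFrame({'vals': [1, 2]}, index=['a', 'b'])
df.get('vals')            # column lookup, like dict.get
df.get('missing', 'n/a')  # returns the default rather than raising
# no row-axis analogue: df.loc['missing'] raises KeyError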
building on @jorisvandenbossche's soln, I think this is what you want.
In [4]: df.sum(level=0).reindex(index=list(cities)).fillna(0)
Out[4]:
vals
orig
New York 30
San Francisco 0
Miami 30
Boston 90
@jreback @jorisvandenbossche Thank you very much for your response. While I appreciate your solution, it isn't as elegant as the one I propose. I'm trying to create something for teaching purposes that looks easy.
Regarding the comment about DataFrame.get, what would really be nice is to have something like that, but for the main (axis=0) axis, with all the same slicing that exists for .loc. In other words, I'd like to be able to do something like NDFrame.rows[a,b,:,d,e] that would return just the rows corresponding to that slice. This would work the same way for Series, DataFrame and Panel. It slices along the first dimension, where you're likely to have a MultiIndex. If it did that, and returned empty objects when the slices were empty, then I get back to clean code for my examples, and I think we address the main point of this issue that was originally raised. Put another way, we create a method that is a shorthand for .loc(axis=0) and document that if a key is specified and is missing, then an empty object is returned.
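As a sketch of those semantics (the rows name is the proposal's; this helper is hypothetical, not pandas API):

import pandas as pd

def rows(obj, key):
    # slice along axis 0 with .loc semantics, but return an empty
    # object instead of raising when the key is missing
    try:
        return obj.loc(axis=0)[key]
    except KeyError:
        return obj.iloc[0:0]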
@Dr-Irv iterating over values is not idiomatic at all
and completely non performant
you have a very nice soln above - this is a very pandonic soln
and esp if you are teaching this is the only way to go
individually indexing is NOT a soln
Respectfully, can I chime in and agree with Dr Irv?
His .sloc suggestion seems reasonable.
The other way is also a problem: if you have no matches it would silently skip a fairly common error condition, IMHO.
The accuracy of this assumption depends on context. It seems reasonable that there are some programming contexts where returning the empty frame is the expected behavior. (For example, if the result is married with an aggregator like sum, which is the specific need here).
dicts have both get and [] for that reason. When I see one, I know a missing entry is not an error, and when I see the other, I know it is. I'm not sure why this issue is different.
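The analogy in miniature:

d = {'a': 1}
d.get('b', 0)  # 0: a missing entry is expected and handled
# d['b']       # would raise KeyError: here a missing entry is a bug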
@pjcpjc you can certainly comment!
We cannot extend this already way way too complicated API any more
we have: [], .ix, .iloc, .loc, .iat, .at.
so you want another one?
as I said above, a keyword argument would possibly be ok, but certainly not another indexer.
We have to have a default. This certainly may not work for everyone (and that is why we have options).
Or a keyword argument to .loc that does what Dr. Irv's .sloc does, with the default value for the keyword being consistent with the current behavior.
Whatever you prefer.
We cannot extend this already way way too complicated API any more
I think the spelling in this case is MOAR!! ;)
But seriously, I am familiar with the domain here (arguably I am an expert and Dr Irv is a guru) and he is right on the money in terms of identifying a context in which the natural result is an empty frame and not an error.
I don't think he is being non-performant. He can get what he wants (I think) with a helper function, but that will look awkward when trying to convince other people to jump over from their current legacy language.
We're not trying to make trouble - we're trying to bring optimization programming into the 21st century and use Python + pandas.
this was closed by #15747. .loc and .reindex are distinct and have separate purposes.
If anyone cares, I now regret some of the strong language I used above. I think pandas, while awesome, isn't the right vehicle for the sort of idioms that optimization people are accustomed to. Optimization people should either write pandas code pandonically or use different data structures. There is a small mountain of examples over at https://github.com/opalytics/opalytics-ticdat/tree/master/examples if anyone is interested, including a pandonic example.