Pandas: best practice for combining partial string indexing + regular mask?

Created on 18 Jul 2017  路  15Comments  路  Source: pandas-dev/pandas

Consider the simple example as follows

import pandas as pd

dataframe = pd.DataFrame({'time' : [pd.to_datetime('2016-06-06'),
                                    pd.to_datetime('2016-06-06'),
                                    pd.to_datetime('2016-06-07'),
                                    pd.to_datetime('2016-06-08')],
                            'value' : [1,2,3,4],
                            'group' : ['A','B','A','B']})

dataframe.set_index('time', inplace = True)

dataframe
Out[13]: 
           group  value
time                   
2016-06-06     A      1
2016-06-06     B      2
2016-06-07     A      3
2016-06-08     B      4

Now I want to use all the cool partial string indexing functions, but also be able to filter on other variables. The solution I come up with is the syntax

dataframe.loc[dataframe['group'] == 'A'].loc['2016-06-06']

which looks really horrible with the two loc combined. Is that the correct pandonic way to do so (while keeping the index?) I have read and read again the documentation but I cannot find the answer.

Thanks!

Enhancement Indexing

Most helpful comment

Right, but you get a Series returned from that boolean test. The indices in that Series won't match those of the dataframe.

All 15 comments

@randomgambit : Thanks for the question! I think you might be better served emailing this question out to the pandas-dev mailing list or reaching out on our Gitter channel.

I will close this issue for now unless there turns out to be a real issue (or enhancement) with the pandas codebase that can be well addressed with the example you have provided.

thanks @gfyoung but can you at least tell me if that's the correct syntax here? Its not easy to send en example via email, and there is nobody on gitter

We don't directly have an exposed partial string indexer. Instead .get_loc will do this, but it can return various things (e.g. a single index, array of booleans, or slice), so it is not directly user facing.

In [15]: dfi = dataframe.index

In [16]: dataframe.loc[(dataframe['group'] == 'A') & dfi.isin(dfi[dfi.get_loc('2016-06-06')])]
Out[16]: 
           group  value
time                   
2016-06-06     A      1

@jreback thanks a lot. I think this would be a really nice feature to have. partial string indexing is so useful, but 90% of the time I need to combine if with some other logical condition so I end up doing some df.reset_index().loc[] which is super annoying.

When you say its not user facing, you mean I should not use it?

i mean there is no direct way to do this aside from the verbose code that i posted

i suppose u could make an index function to wrap this as a convenience

if u want to post a signature / doc string can take a look

Given where the discussion has gone, I'll reopen this.

but can you at least tell me if that's the correct syntax here? Its not easy to send en example via email, and there is nobody on gitter

@randomgambit : Sorry about that! I was responding from my phone, so I was hoping email or Gitter could provide you with a more comprehensive response.

@gfyoung @jreback I tried to find more conventional solutions and actually I found this funny thing:

dataframe.loc[(dataframe.reset_index()['time'] == pd.to_datetime('2016-06-06')) & (dataframe['group'] == 'A')]

FAILS

dataframe.reset_index(inplace = True)
dataframe.loc[(dataframe['time'] == pd.to_datetime('2016-06-06')) & (dataframe['group'] == 'A')]

WORKS

in purely logical terms, I dont see why 1. does not work here

@randomgambit : What indices do you get from each call (i.e. compare what is being passed in-between the outermost brackets)?

@gfyoung

dataframe.set_index('time', inplace = True)

dataframe.reset_index()['time']
Out[35]: 
0   2016-06-06
1   2016-06-06
2   2016-06-07
3   2016-06-08
Name: time, dtype: datetime64[ns]

dataframe.reset_index(inplace = True)

dataframe['time']
Out[37]: 
0   2016-06-06
1   2016-06-06
2   2016-06-07
3   2016-06-08
Name: time, dtype: datetime64[ns]

And in your first example, what does the index look like? The DataFrame in your two examples are not the same because one had its index reset (second example) but not the other (first example).

@gfyoung I am sorry I dont understand what you want me to do here. I get that they are not the same, but I dont see why dataframe.loc[(dataframe.reset_index()['time'] == pd.to_datetime('2016-06-06')) & (dataframe['group'] == 'A')] does not work

(dataframe.reset_index()['time'] == pd.to_datetime('2016-06-06')) is simply a boolean test

Right, but you get a Series returned from that boolean test. The indices in that Series won't match those of the dataframe.

@gfyoung got it, that works

dataframe.loc[(dataframe.reset_index()['time'] == pd.to_datetime('2016-06-06')).tolist() & (dataframe['group'] == 'A')]

Awesome! Glad we could resolve that. Feel free to post a PR to wrap the functionality you shared above so that we don't have to write this out again :smile:

Was this page helpful?
0 / 5 - 0 ratings