Pandas: GroupBy using TimeGrouper does not work

Created on 7 Jun 2013 · 19Comments · Source: pandas-dev/pandas

BUG: TimeGrouper not too friendly with other groups, e.g.

df.set_index('Date').groupby([pd.TimeGrouper('6M'),'Branch']).sum()
should work

Hi everybody,

I found two issues with TimeGrouper:

1) TimeGrouper does not work at all:

Let's take the following example:

df = pd.DataFrame({
'Branch' : 'A A A A A B'.split(),
'Buyer': 'Carl Mark Carl Joe Joe Carl'.split(),
'Quantity': [1,3,5,8,9,3],
'Date' : [
DT.datetime(2013,1,1,13,0),
DT.datetime(2013,1,1,13,5),
DT.datetime(2013,10,1,20,0),
DT.datetime(2013,10,3,10,0),
DT.datetime(2013,12,2,12,0),
DT.datetime(2013,12,2,14,0),
]})

gr = df.groupby(pd.TimeGrouper(freq='6M'))

def testgr(df):
print df

gr.apply(testgr)

This will raise the Exception: "Exception: All objects passed were None"

2) With previous Panda's version it was not possible to combine TimeGrouper with another criteria such as "Branch" in my case.

Thank you very much

Andy

Bug Groupby

Source

aschilling

Most helpful comment

You need to set_index as TimeGrouper operates on the index

In [15]: df
Out[15]: 
  Branch Buyer                Date  Quantity
0      A  Carl 2013-01-01 13:00:00         1
1      A  Mark 2013-01-01 13:05:00         3
2      A  Carl 2013-10-01 20:00:00         5
3      A   Joe 2013-10-03 10:00:00         8
4      A   Joe 2013-12-02 12:00:00         9
5      B  Carl 2013-12-02 14:00:00         3

In [16]: df.set_index('Date').groupby(pd.TimeGrouper('6M')).sum()
Out[16]: 
            Quantity
2013-01-31         4
2013-07-31       NaN
2014-01-31        25

jreback on 7 Jun 2013

👍3

All 19 comments

You need to set_index as TimeGrouper operates on the index

In [15]: df
Out[15]: 
  Branch Buyer                Date  Quantity
0      A  Carl 2013-01-01 13:00:00         1
1      A  Mark 2013-01-01 13:05:00         3
2      A  Carl 2013-10-01 20:00:00         5
3      A   Joe 2013-10-03 10:00:00         8
4      A   Joe 2013-12-02 12:00:00         9
5      B  Carl 2013-12-02 14:00:00         3

In [16]: df.set_index('Date').groupby(pd.TimeGrouper('6M')).sum()
Out[16]: 
            Quantity
2013-01-31         4
2013-07-31       NaN
2014-01-31        25

jreback on 7 Jun 2013

👍3

Thanks and is it possible to combine TimeGrouper with another criteria?

aschilling on 7 Jun 2013

In [28]: df.set_index('Date').groupby(pd.TimeGrouper('6M'))['Quantity'].apply(lambda x: x.count())
Out[28]: 
2013-01-31    2
2013-07-31    0
2014-01-31    4
dtype: int64

jreback on 7 Jun 2013

Hi jreback, thanks for your reply. Sorry but I do not understand your solution, how can I use it to groupby the TimeGrouper criteria and for example by 'Branch' ?

aschilling on 7 Jun 2013

This is actuallly a bit tricky

In [38]: df.set_index('Date').groupby(pd.TimeGrouper('6M')).apply(lambda x: x.groupby('Branch').sum())
Out[38]: 
                   Quantity
           Branch          
2013-01-31 A              4
2014-01-31 A             22
           B              3

jreback on 7 Jun 2013

@hayd is there a better way to do this you think?

jreback on 7 Jun 2013

I'm afraid I have _no_ idea!

hayd on 7 Jun 2013

I woudl have thought:

df.set_index('Date').groupby([pd.TimeGrouper('6M'),'Branch']).sum()

should work....

jreback on 7 Jun 2013

I tried that but already but it raises the exception: "TypeError: 'TimeGrouper' object is not callable"

aschilling on 7 Jun 2013

did you see my prior comment?

jreback on 7 Jun 2013

I would have thought that would work too....

I had tried a few variations of your solution... None of which I could get working (hence the other issue I posted) :)

hayd on 7 Jun 2013

@jreback Yes, but that does not work for me either, because I need to apply a self defined function to the formed GroupBy Object. If I already use the simple function above with your solution:

df.groupby(pd.TimeGrouper('6M')).apply(lambda x: x.groupby('Branch').apply(testgr))

It raises: "AttributeError: 'DataFrame' object has no attribute 'name'"

aschilling on 7 Jun 2013

This is basically a composition operation, you group by time, then apply a function

which happens to group by branch then operates, so u need to operate on the inner function

This is quite tricky in that your function should return a scalar value on the single passed series

In [44]: def testf(s):
   ....:     return s.sum()
   ....: 

In [46]: df.set_index('Date').groupby(pd.TimeGrouper('6M')).apply(lambda x: x.groupby('Branch')['Quantity'].apply(testf))
Out[46]: 
            Branch
2013-01-31  A          4
2014-01-31  A         22
            B          3
dtype: int64

jreback on 7 Jun 2013

@jreback Thanks! Unfortunately, it does not solve my problem, as I need to operate on various columns in my function. Is there any possibility to pass the Buyer column to the function?

df.set_index('Date').groupby(pd.TimeGrouper('6M')).apply(lambda x: x.groupby('Branch')[['Buyer', 'Quantity']].apply(testgr)) # doubles the buyer names

aschilling on 7 Jun 2013

If you return a custom function then you need to handle the string cases, but you can return pretty much anything you want (make it a Series) to get this kind of functionaility, you function is passed a slice of the original frame

In [55]: def testf(df):
   ....:     if (df['Buyer'] == 'Mark').sum() > 0:
   ....:         return Series(dict(quantity = df['Quantity'].sum(), buyer = 'mark'))
   ....:     return Series(dict(quantity = df['Quantity'].sum()*100, buyer = 'other'))
   ....: 

In [56]: df.set_index('Date').groupby(pd.TimeGrouper('6M')).apply(lambda x: x.groupby('Branch').apply(testf))
Out[56]: 
                   buyer quantity
           Branch                
2013-01-31 A        mark        4
2014-01-31 A       other     2200
           B       other      300

jreback on 7 Jun 2013

👍1

Great solution ! Thanks a lot

aschilling on 7 Jun 2013

great...glad it worked out (and I am going to open an issue about a defect in that:

df.set_index('Date').groupby([pd.TimeGrouper('6M'),'Branch']).sum()

should work...

thanks for bringing it up!

jreback on 7 Jun 2013

It appears other methods that work on normal groups fail when using the TimeGrouper. For instance

 data.groupby('Col1').groups

works fine, but:

data.groupby(pd.TimeGrouper(freq='H')).groups

gives error:

'DataFrameGroupBy' object has no attribute 'groups'

Am I misunderstanding something?

Edit: Looks like this has become an open issue: https://github.com/pydata/pandas/issues/3881

JamesPHoughton on 19 Jun 2013

see #3881, already noted, thank you

jreback on 19 Jun 2013

Was this page helpful?

0 / 5 - 0 ratings

Related issues

read_csv(filename_with_asian_locale) failed in python 3.6 for windows

mfmain · 3Comments

df.duplicated and drop_duplicates raise TypeError with set and list values.

Abrosimov-a-a · 3Comments

DataFrame.describe can't return percentiles when data set contain nan

tade0726 · 3Comments

AttributeError: Cannot use pandas from a script file

songololo · 3Comments

ValueError plotting bar plot from DataFrame with existing Axes

swails · 3Comments