BUG: TimeGrouper not too friendly with other groups, e.g.
df.set_index('Date').groupby([pd.TimeGrouper('6M'),'Branch']).sum()
should work
Hi everybody,
I found two issues with TimeGrouper:
1) TimeGrouper does not work at all:
Let's take the following example:
df = pd.DataFrame({
'Branch' : 'A A A A A B'.split(),
'Buyer': 'Carl Mark Carl Joe Joe Carl'.split(),
'Quantity': [1,3,5,8,9,3],
'Date' : [
DT.datetime(2013,1,1,13,0),
DT.datetime(2013,1,1,13,5),
DT.datetime(2013,10,1,20,0),
DT.datetime(2013,10,3,10,0),
DT.datetime(2013,12,2,12,0),
DT.datetime(2013,12,2,14,0),
]})
gr = df.groupby(pd.TimeGrouper(freq='6M'))
def testgr(df):
print df
gr.apply(testgr)
This will raise the Exception: "Exception: All objects passed were None"
2) With previous Panda's version it was not possible to combine TimeGrouper with another criteria such as "Branch" in my case.
Thank you very much
Andy
You need to set_index as TimeGrouper operates on the index
In [15]: df
Out[15]:
Branch Buyer Date Quantity
0 A Carl 2013-01-01 13:00:00 1
1 A Mark 2013-01-01 13:05:00 3
2 A Carl 2013-10-01 20:00:00 5
3 A Joe 2013-10-03 10:00:00 8
4 A Joe 2013-12-02 12:00:00 9
5 B Carl 2013-12-02 14:00:00 3
In [16]: df.set_index('Date').groupby(pd.TimeGrouper('6M')).sum()
Out[16]:
Quantity
2013-01-31 4
2013-07-31 NaN
2014-01-31 25
Thanks and is it possible to combine TimeGrouper with another criteria?
In [28]: df.set_index('Date').groupby(pd.TimeGrouper('6M'))['Quantity'].apply(lambda x: x.count())
Out[28]:
2013-01-31 2
2013-07-31 0
2014-01-31 4
dtype: int64
Hi jreback, thanks for your reply. Sorry but I do not understand your solution, how can I use it to groupby the TimeGrouper criteria and for example by 'Branch' ?
This is actuallly a bit tricky
In [38]: df.set_index('Date').groupby(pd.TimeGrouper('6M')).apply(lambda x: x.groupby('Branch').sum())
Out[38]:
Quantity
Branch
2013-01-31 A 4
2014-01-31 A 22
B 3
@hayd is there a better way to do this you think?
I'm afraid I have _no_ idea!
I woudl have thought:
df.set_index('Date').groupby([pd.TimeGrouper('6M'),'Branch']).sum()
should work....
I tried that but already but it raises the exception: "TypeError: 'TimeGrouper' object is not callable"
did you see my prior comment?
I would have thought that would work too....
I had tried a few variations of your solution... None of which I could get working (hence the other issue I posted) :)
@jreback Yes, but that does not work for me either, because I need to apply a self defined function to the formed GroupBy Object. If I already use the simple function above with your solution:
df.groupby(pd.TimeGrouper('6M')).apply(lambda x: x.groupby('Branch').apply(testgr))
It raises: "AttributeError: 'DataFrame' object has no attribute 'name'"
This is basically a composition operation, you group by time, then apply a function
which happens to group by branch then operates, so u need to operate on the inner function
This is quite tricky in that your function should return a scalar value on the single passed series
In [44]: def testf(s):
....: return s.sum()
....:
In [46]: df.set_index('Date').groupby(pd.TimeGrouper('6M')).apply(lambda x: x.groupby('Branch')['Quantity'].apply(testf))
Out[46]:
Branch
2013-01-31 A 4
2014-01-31 A 22
B 3
dtype: int64
@jreback Thanks! Unfortunately, it does not solve my problem, as I need to operate on various columns in my function. Is there any possibility to pass the Buyer column to the function?
df.set_index('Date').groupby(pd.TimeGrouper('6M')).apply(lambda x: x.groupby('Branch')[['Buyer', 'Quantity']].apply(testgr)) # doubles the buyer names
If you return a custom function then you need to handle the string cases, but you can return pretty much anything you want (make it a Series) to get this kind of functionaility, you function is passed a slice of the original frame
In [55]: def testf(df):
....: if (df['Buyer'] == 'Mark').sum() > 0:
....: return Series(dict(quantity = df['Quantity'].sum(), buyer = 'mark'))
....: return Series(dict(quantity = df['Quantity'].sum()*100, buyer = 'other'))
....:
In [56]: df.set_index('Date').groupby(pd.TimeGrouper('6M')).apply(lambda x: x.groupby('Branch').apply(testf))
Out[56]:
buyer quantity
Branch
2013-01-31 A mark 4
2014-01-31 A other 2200
B other 300
Great solution ! Thanks a lot
great...glad it worked out (and I am going to open an issue about a defect in that:
df.set_index('Date').groupby([pd.TimeGrouper('6M'),'Branch']).sum()
should work...
thanks for bringing it up!
It appears other methods that work on normal groups fail when using the TimeGrouper. For instance
data.groupby('Col1').groups
works fine, but:
data.groupby(pd.TimeGrouper(freq='H')).groups
gives error:
'DataFrameGroupBy' object has no attribute 'groups'
Am I misunderstanding something?
Edit: Looks like this has become an open issue: https://github.com/pydata/pandas/issues/3881
see #3881, already noted, thank you
Most helpful comment
You need to set_index as TimeGrouper operates on the index