xref https://github.com/pandas-dev/pandas/pull/21573#discussion_r197425013
In [2]: df = pd.DataFrame({'group': [1, 1, 2],
'category_string': pd.Series(list('abc')).astype('category'),
'datetimetz': pd.date_range('20130101', periods=3, tz='US/Eastern')})
In [3]: df.groupby('group').first()
Out[3]:
category_string datetimetz
group
1 a 2013-01-01 05:00:00
2 c 2013-01-03 05:00:00
The example above passes data through the first/last compat method which strips timezone information. PR #15885 (now closed) should fix this issue (and offer a performance boost to Categorial data as mentioned in #19026)
Hey i would like to take up this issue. Also i am new to open source..
Could you elaborate a bit more on what this issue is about ?
@mroeschke haven't looked in detail yet but do you know if there's a reason why we even have a compat function here? Under the impression the Cythonized version should work here, no?
@WillAyd Just quickly looking at the code, like a general first method was created and is invoked via apply for DataFrame groups and just directly on Series groups. I'm not too familiar with the groupby routines, but wouldn't bringing this into cython strip the current extension dtypes (i.e Categorical -> Object and Datetimetz -> Datetime)?
@aggarwalvinayak the datetimetz column above has lost timezone information when it was in the original DataFrame. The solution would be to impliment the changes in #15885
It may in it鈥檚 current form, but at least in theory I don鈥檛 think the first and last operations should have to behave differently based on the type being referenced. There are indexing operations for things like shift that can operate on any type arbitrarily, so seems logical that could apply to accessing the first and last elements as well.
That鈥檚 a larger change I can look into and perhaps deserves a separate issue. @aggarealvinayak feel free to continue diagnosing this as prescribed above
this needs to be fixed in cython i think
@mroeschke
The solution would be to impliment the changes in #15885
Do u mean to revert back the changes that were introduced in #15885
And @WillAyd Which procedure of diagnosing are you mentioning exactly?
Fair point @WillAyd, the indexing should be agnostic to the data types.
Alternatively, I was thinking; why isn't first/last just implemented in terms of the nth method? It looks to handle this issue correctly correctly and to be more performant:
In [4]: df = pd.DataFrame({'group': [1, 1, 2],
...: 'category_string': pd.Series(list('abc')).astype('category'),
...: 'datetimetz': pd.date_range('20130101', periods=3, tz='US/Eastern')})
...:
In [5]: df
Out[5]:
group category_string datetimetz
0 1 a 2013-01-01 00:00:00-05:00
1 1 b 2013-01-02 00:00:00-05:00
2 2 c 2013-01-03 00:00:00-05:00
In [6]: df.groupby('group').first() #wrong
Out[6]:
category_string datetimetz
group
1 a 2013-01-01 05:00:00
2 c 2013-01-03 05:00:00
In [7]: df.groupby('group').nth(0) # correct
Out[7]:
category_string datetimetz
group
1 a 2013-01-01 00:00:00-05:00
2 c 2013-01-03 00:00:00-05:00
In [8]: %timeit df.groupby('group').nth(0)
2.52 ms 卤 30.2 碌s per loop (mean 卤 std. dev. of 7 runs, 100 loops each)
In [9]: %timeit df.groupby('group').first()
14.8 ms 卤 1.3 ms per loop (mean 卤 std. dev. of 7 runs, 100 loops each)
like how head/tail is just a wrapper around iloc. This may be a win/win unless first/last has a feature that nth doesn't have?
first/last i think could be in terms of nth (which came later) - nan handling is the same i think (that鈥檚 the defining issue and how they r different from head/tail)
If you take the compat out of the equation first is really just nth passing 1 as n:
last is a separate implementation in Cython. It's a little different than first because with first n will always be 1, but with last n could vary across each group. Perhaps an intelligent consolidation could still occur back in Cython. Here's the implementation of that (nth is in the same module):
Not sure if this is the right test, but it looks like nth supports negative indexing so shouldn't last always be -1?
In [3]: df.groupby('group').nth(-1)
Out[3]:
category_string datetimetz
group
1 b 2013-01-02 00:00:00-05:00
2 c 2013-01-03 00:00:00-05:00
@aggarwalvinayak what you may want to do then is start from @WillAyd comments above and try to investigate if its possible to implement first/last in terms of the nth method.
@aggarwalvinayak welcome to still work on this but just as a heads up I removed the "good first issue" tag as this is a little more complicated touching on Cython code
@WillAyd I am not at all experienced with cython.. Will try to explore about that first.. Because this is my second issue that the good first issue tag was removed because of cython thing.
Most helpful comment
Not sure if this is the right test, but it looks like
nthsupports negative indexing so shouldn't last always be -1?@aggarwalvinayak what you may want to do then is start from @WillAyd comments above and try to investigate if its possible to implement first/last in terms of the nth method.