Pandas: BUG: groupby.first/last loses timezone information followup

Created on 23 Jun 2018 · 12Comments · Source: pandas-dev/pandas

xref https://github.com/pandas-dev/pandas/pull/21573#discussion_r197425013

In [2]: df = pd.DataFrame({'group': [1, 1, 2],
                           'category_string': pd.Series(list('abc')).astype('category'),
                           'datetimetz': pd.date_range('20130101', periods=3, tz='US/Eastern')})
In [3]: df.groupby('group').first()
Out[3]: 
      category_string          datetimetz                                                
group                                                                                                                                                     
1                   a 2013-01-01 05:00:00                                                                      
2                   c 2013-01-03 05:00:00

The example above passes data through the first/last compat method which strips timezone information. PR #15885 (now closed) should fix this issue (and offer a performance boost to Categorial data as mentioned in #19026)

Bug Timezones

Source

mroeschke

👍1

Most helpful comment

Not sure if this is the right test, but it looks like nth supports negative indexing so shouldn't last always be -1?

In [3]: df.groupby('group').nth(-1)
Out[3]:
      category_string                datetimetz
group
1                   b 2013-01-02 00:00:00-05:00
2                   c 2013-01-03 00:00:00-05:00

@aggarwalvinayak what you may want to do then is start from @WillAyd comments above and try to investigate if its possible to implement first/last in terms of the nth method.

mroeschke on 23 Jun 2018

👍2

All 12 comments

Hey i would like to take up this issue. Also i am new to open source..
Could you elaborate a bit more on what this issue is about ?

aggarwalvinayak on 23 Jun 2018

@mroeschke haven't looked in detail yet but do you know if there's a reason why we even have a compat function here? Under the impression the Cythonized version should work here, no?

WillAyd on 23 Jun 2018

@WillAyd Just quickly looking at the code, like a general first method was created and is invoked via apply for DataFrame groups and just directly on Series groups. I'm not too familiar with the groupby routines, but wouldn't bringing this into cython strip the current extension dtypes (i.e Categorical -> Object and Datetimetz -> Datetime)?

https://github.com/pandas-dev/pandas/blob/f1aa08c4631aea435d66c161557e44f7c72a618e/pandas/core/groupby/groupby.py#L1436

@aggarwalvinayak the datetimetz column above has lost timezone information when it was in the original DataFrame. The solution would be to impliment the changes in #15885

mroeschke on 23 Jun 2018

It may in it’s current form, but at least in theory I don’t think the first and last operations should have to behave differently based on the type being referenced. There are indexing operations for things like shift that can operate on any type arbitrarily, so seems logical that could apply to accessing the first and last elements as well.

That’s a larger change I can look into and perhaps deserves a separate issue. @aggarealvinayak feel free to continue diagnosing this as prescribed above

WillAyd on 23 Jun 2018

this needs to be fixed in cython i think

jreback on 23 Jun 2018

@mroeschke

The solution would be to impliment the changes in #15885

Do u mean to revert back the changes that were introduced in #15885
And @WillAyd Which procedure of diagnosing are you mentioning exactly?

aggarwalvinayak on 23 Jun 2018

Fair point @WillAyd, the indexing should be agnostic to the data types.

Alternatively, I was thinking; why isn't first/last just implemented in terms of the nth method? It looks to handle this issue correctly correctly and to be more performant:

In [4]: df = pd.DataFrame({'group': [1, 1, 2],
   ...:                            'category_string': pd.Series(list('abc')).astype('category'),
   ...:                            'datetimetz': pd.date_range('20130101', periods=3, tz='US/Eastern')})
   ...:

In [5]: df
Out[5]:
   group category_string                datetimetz
0      1               a 2013-01-01 00:00:00-05:00
1      1               b 2013-01-02 00:00:00-05:00
2      2               c 2013-01-03 00:00:00-05:00

In [6]: df.groupby('group').first() #wrong
Out[6]:
      category_string          datetimetz
group
1                   a 2013-01-01 05:00:00
2                   c 2013-01-03 05:00:00

In [7]: df.groupby('group').nth(0) # correct
Out[7]:
      category_string                datetimetz
group
1                   a 2013-01-01 00:00:00-05:00
2                   c 2013-01-03 00:00:00-05:00

In [8]: %timeit df.groupby('group').nth(0)
2.52 ms ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit df.groupby('group').first()
14.8 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

like how head/tail is just a wrapper around iloc. This may be a win/win unless first/last has a feature that nth doesn't have?

mroeschke on 23 Jun 2018

first/last i think could be in terms of nth (which came later) - nan handling is the same i think (that’s the defining issue and how they r different from head/tail)

jreback on 23 Jun 2018

If you take the compat out of the equation first is really just nth passing 1 as n:

https://github.com/pandas-dev/pandas/blob/f1aa08c4631aea435d66c161557e44f7c72a618e/pandas/core/groupby/groupby.py#L2411

last is a separate implementation in Cython. It's a little different than first because with first n will always be 1, but with last n could vary across each group. Perhaps an intelligent consolidation could still occur back in Cython. Here's the implementation of that (nth is in the same module):

https://github.com/pandas-dev/pandas/blob/f1aa08c4631aea435d66c161557e44f7c72a618e/pandas/_libs/groupby_helper.pxi.in#L300

WillAyd on 23 Jun 2018

Not sure if this is the right test, but it looks like nth supports negative indexing so shouldn't last always be -1?

In [3]: df.groupby('group').nth(-1)
Out[3]:
      category_string                datetimetz
group
1                   b 2013-01-02 00:00:00-05:00
2                   c 2013-01-03 00:00:00-05:00

@aggarwalvinayak what you may want to do then is start from @WillAyd comments above and try to investigate if its possible to implement first/last in terms of the nth method.

mroeschke on 23 Jun 2018

👍2

@aggarwalvinayak welcome to still work on this but just as a heads up I removed the "good first issue" tag as this is a little more complicated touching on Cython code

WillAyd on 24 Jun 2018

@WillAyd I am not at all experienced with cython.. Will try to explore about that first.. Because this is my second issue that the good first issue tag was removed because of cython thing.

aggarwalvinayak on 24 Jun 2018

Was this page helpful?

0 / 5 - 0 ratings