Pandas: groupby.mean, etc, doesn't recognize timedelta64

Created on 17 Dec 2013 · 11 Comments · Source: pandas-dev/pandas

See http://stackoverflow.com/questions/20625982/split-apply-combine-on-pandas-timedelta-column
related as well: http://stackoverflow.com/questions/20789976/python-pandas-dataframe-1st-line-issue-with-datetime-timedelta/20802902#20802902

I have a DataFrame with a column of timedeltas (upon inspection, the dtype is timedelta64[ns], i.e. '<m8[ns]'), and I'd like to do a split-apply-combine, but the timedelta column is being dropped:

import pandas as pd

import numpy as np

pd.__version__
Out[3]: '0.13.0rc1'

np.__version__
Out[4]: '1.8.0'

data = pd.DataFrame(np.random.rand(10, 3), columns=['f1', 'f2', 'td'])

data['td'] *= 10000000

data['td'] = pd.Series(data['td'], dtype='<m8[ns]')

data
Out[8]: 
         f1        f2              td
0  0.990140  0.948313 00:00:00.003066
1  0.277125  0.993549 00:00:00.001443
2  0.016427  0.581129 00:00:00.009257
3  0.048662  0.512215 00:00:00.000702
4  0.846301  0.179160 00:00:00.000396
5  0.568323  0.419887 00:00:00.000266
6  0.328182  0.919897 00:00:00.006138
7  0.292882  0.213219 00:00:00.008876
8  0.623332  0.003409 00:00:00.000322
9  0.650436  0.844180 00:00:00.006873

[10 rows x 3 columns]

data.groupby(data.index < 5).mean()
Out[9]: 
             f1        f2
False  0.492631  0.480118
True   0.435731  0.642873

[2 rows x 2 columns]

Or, forcing pandas to try the operation on the 'td' column:

data.groupby(data.index < 5)['td'].mean()
---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
<ipython-input-12-88cc94e534b7> in <module>()
----> 1 data.groupby(data.index < 5)['td'].mean()

/path/to/lib/python3.3/site-packages/pandas-0.13.0rc1-py3.3-linux-x86_64.egg/pandas/core/groupby.py in mean(self)
    417         """
    418         try:
--> 419             return self._cython_agg_general('mean')
    420         except GroupByError:
    421             raise

/path/to/lib/python3.3/site-packages/pandas-0.13.0rc1-py3.3-linux-x86_64.egg/pandas/core/groupby.py in _cython_agg_general(self, how, numeric_only)
    669 
    670         if len(output) == 0:
--> 671             raise DataError('No numeric types to aggregate')
    672 
    673         return self._wrap_aggregated_output(output, names)

DataError: No numeric types to aggregate

However, taking the mean of the column works fine, so numeric operations should be possible:

data['td'].mean()
Out[11]: 
0   00:00:00.003734
dtype: timedelta64[ns]
Labels: Enhancement, Groupby, Timedelta

All 11 comments

Worth mentioning that DataFrame's mean doesn't handle this either; I think it should (converting columns that were originally datelike back to a datelike result):

In [11]: data.mean()
Out[11]: 
f1          0.528609
f2          0.583264
td    5975598.700000
dtype: float64

Compare with data.mean(1), which ignores date columns (correctly, IMO, since you're going across dtypes).
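A minimal illustration of that axis=1 behaviour (not from the original thread): on the 0.13-era pandas discussed here the timedelta column is silently skipped, while recent pandas may need an explicit numeric_only=True to drop it.

# hedged sketch, reusing `data` from the report above
row_means = data.mean(axis=1, numeric_only=True)  # averages f1 and f2 per row

# equivalent, with the skipped column spelled out
row_means_explicit = data[['f1', 'f2']].mean(axis=1)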

Interestingly, mean works on a Series but gives a nonsense result on a DataFrame (this is using the new timedelta formatting code, #5701):

In [10]: pd.to_timedelta(list(range(5)), unit='D')
Out[10]: 
0   0 days
1   1 days
2   2 days
3   3 days
4   4 days
dtype: timedelta64[ns]
In [9]: pd.to_timedelta(list(range(5)), unit='D').mean()
Out[9]: 
0   2 days
dtype: timedelta64[ns]
In [6]: pd.DataFrame(pd.to_timedelta(list(range(5)), unit='D')).mean()
Out[6]: 
0   55785 days, 14:53:21.659060
dtype: timedelta64[ns]

@hayd actually the results of data.mean() are correct. The result for 'td' is in nanoseconds. It IS, however, possible to return an object array that is correct, e.g.

In [34]: Series([0.1,0.2,timedelta(10)])
Out[34]: 
0                 0.1
1                 0.2
2    10 days, 0:00:00
dtype: object

so @hayd maybe create a separate issue for this type of inference (it's just a bit of inference detection in nanops.py/_reduce)

@hsharrison as far as the groupby goes, this is just not implemented ATM in groupby.py; it's not that difficult, it basically just needs to follow what the datetime64 code does

See #6884.

Workaround: Use .describe() (which includes the mean) rather than .mean()

See https://gist.github.com/tomfitzhenry/d36ebba697a1f3eeefcb for demo.
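A minimal sketch of that workaround, reusing the dataframe from the bug report (the shape of the grouped describe() output differs across pandas versions, so the 'mean' lookup below is an assumption rather than guaranteed API):

# .describe() includes the mean and does not drop the timedelta column
stats = data.groupby(data.index < 5)['td'].describe()

# where describe() returns a wide frame, the per-group mean sits in the 'mean'
# column; older versions return a MultiIndexed Series instead
td_means = stats['mean']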

Wow. Any insight into why that works? I would not have expected a convenience method to take a different code path.

Also having the same problem here. @tomfitzhenry's solution works.

Seeing as this keeps getting kicked to the next release: as a workaround, sum and count do work on the grouped timedelta column, so you can compute the mean from them. Referencing @hsharrison's original dataframe from the bug report:

data = pd.DataFrame(np.random.rand(10, 3), columns=['f1', 'f2', 'td'])

data['td'] *= 10000000

data['td'] = pd.Series(data['td'], dtype='<m8[ns]')

grouped_df = data.groupby(data.index < 5)

mean_df = grouped_df.mean()

In [14]: mean_df
Out[14]: 
             f1        f2
False  0.271488  0.614299
True   0.535522  0.476918

mean_df['td'] = grouped_df['td'].sum() / grouped_df['td'].count()

In [16]: mean_df
Out[16]: 
             f1        f2              td
False  0.271488  0.614299 00:00:00.004187
True   0.535522  0.476918 00:00:00.003278

This is still an issue in version 0.24.2, with both datetime64- and timedelta-type columns. If I should truncate any part of the very long post below, please let me know.
Calling np.mean on the timedelta column via apply works, but calling mean through the aggregation function fails:

In [11]: pd.__version__                                                                                
Out[11]: '0.24.2'

In [12]: np.__version__                                                                                
Out[12]: '1.16.2'
In [1]: import numpy as np 
   ...: import pandas as pd 
   ...: import datetime                                                                                

In [2]: start_dates = pd.to_datetime([datetime.datetime(2019,1,i) for i in range(1,5)]) 
   ...:                                                                                                

In [3]: end_dates = pd.to_datetime([datetime.datetime(2019,1,i) for i in np.random.randint(4,17,4)]) 
   ...:                                                                                                

In [4]: df = pd.DataFrame({'start': start_dates,'end':end_dates,'ID':[1,1,2,2]}) 
   ...:                                                                                                

In [5]: df['delta'] = df['end'] - df['start'] 
   ...:                                                                                                

In [6]: df                                                                                             
Out[6]: 
       start        end  ID   delta
0 2019-01-01 2019-01-15   1 14 days
1 2019-01-02 2019-01-04   1  2 days
2 2019-01-03 2019-01-04   2  1 days
3 2019-01-04 2019-01-06   2  2 days

In [7]: df.groupby('ID').agg({'delta':'mean'}) 
   ...:                                                                                                
---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
<ipython-input-7-4b906de9b470> in <module>
----> 1 df.groupby('ID').agg({'delta':'mean'})

/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
   1313     @Appender(_shared_docs['aggregate'])
   1314     def aggregate(self, arg, *args, **kwargs):
-> 1315         return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
   1316 
   1317     agg = aggregate

/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
    184 
    185         _level = kwargs.pop('_level', None)
--> 186         result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
    187         if how is None:
    188             return result

/anaconda3/lib/python3.7/site-packages/pandas/core/base.py in _aggregate(self, arg, *args, **kwargs)
    496 
    497                 try:
--> 498                     result = _agg(arg, _agg_1dim)
    499                 except SpecificationError:
    500 

/anaconda3/lib/python3.7/site-packages/pandas/core/base.py in _agg(arg, func)
    447                 result = compat.OrderedDict()
    448                 for fname, agg_how in compat.iteritems(arg):
--> 449                     result[fname] = func(fname, agg_how)
    450                 return result
    451 

/anaconda3/lib/python3.7/site-packages/pandas/core/base.py in _agg_1dim(name, how, subset)
    430                     raise SpecificationError("nested dictionary is ambiguous "
    431                                              "in aggregation")
--> 432                 return colg.aggregate(how, _level=(_level or 0) + 1)
    433 
    434             def _agg_2dim(name, how):

/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func_or_funcs, *args, **kwargs)
    758         _level = kwargs.pop('_level', None)
    759         if isinstance(func_or_funcs, compat.string_types):
--> 760             return getattr(self, func_or_funcs)(*args, **kwargs)
    761 
    762         if isinstance(func_or_funcs, compat.Iterable):

/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in mean(self, *args, **kwargs)
   1130         nv.validate_groupby_func('mean', args, kwargs, ['numeric_only'])
   1131         try:
-> 1132             return self._cython_agg_general('mean', **kwargs)
   1133         except GroupByError:
   1134             raise

/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
    836 
    837         if len(output) == 0:
--> 838             raise DataError('No numeric types to aggregate')
    839 
    840         return self._wrap_aggregated_output(output, names)

DataError: No numeric types to aggregate

And with datetime64 series:

In [8]: df.groupby('ID').agg({'start':'mean'})                                                         
---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
<ipython-input-8-81a3957f4666> in <module>
----> 1 df.groupby('ID').agg({'start':'mean'})

/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
   1313     @Appender(_shared_docs['aggregate'])
   1314     def aggregate(self, arg, *args, **kwargs):
-> 1315         return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
   1316 
   1317     agg = aggregate

/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
    184 
    185         _level = kwargs.pop('_level', None)
--> 186         result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
    187         if how is None:
    188             return result

/anaconda3/lib/python3.7/site-packages/pandas/core/base.py in _aggregate(self, arg, *args, **kwargs)
    496 
    497                 try:
--> 498                     result = _agg(arg, _agg_1dim)
    499                 except SpecificationError:
    500 

/anaconda3/lib/python3.7/site-packages/pandas/core/base.py in _agg(arg, func)
    447                 result = compat.OrderedDict()
    448                 for fname, agg_how in compat.iteritems(arg):
--> 449                     result[fname] = func(fname, agg_how)
    450                 return result
    451 

/anaconda3/lib/python3.7/site-packages/pandas/core/base.py in _agg_1dim(name, how, subset)
    430                     raise SpecificationError("nested dictionary is ambiguous "
    431                                              "in aggregation")
--> 432                 return colg.aggregate(how, _level=(_level or 0) + 1)
    433 
    434             def _agg_2dim(name, how):

/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func_or_funcs, *args, **kwargs)
    758         _level = kwargs.pop('_level', None)
    759         if isinstance(func_or_funcs, compat.string_types):
--> 760             return getattr(self, func_or_funcs)(*args, **kwargs)
    761 
    762         if isinstance(func_or_funcs, compat.Iterable):

/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in mean(self, *args, **kwargs)
   1130         nv.validate_groupby_func('mean', args, kwargs, ['numeric_only'])
   1131         try:
-> 1132             return self._cython_agg_general('mean', **kwargs)
   1133         except GroupByError:
   1134             raise

/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
    836 
    837         if len(output) == 0:
--> 838             raise DataError('No numeric types to aggregate')
    839 
    840         return self._wrap_aggregated_output(output, names)

DataError: No numeric types to aggregate

Dtypes:

In [14]: df.dtypes                                                                                     
Out[14]: 
start     datetime64[ns]
end       datetime64[ns]
ID                 int64
delta    timedelta64[ns]
dtype: object

As mentioned above, chaining with .apply and np.mean() works for timedelta only:

In [17]: df.groupby('ID')['delta'].apply(np.mean)                                                      
Out[17]: 
ID
1   8 days 00:00:00
2   1 days 12:00:00
Name: delta, dtype: timedelta64[ns]

This same method fails for my 'start' and 'end' columns, with the error:

In [simulated]: df.groupby('ID')['start'].apply(np.mean)    
Out[truncated]: DatetimeIndex cannot perform the operation mean

(the error trace for the above simulated function call is extremely long and is therefore omitted)
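Not from the thread, but one hedged workaround for the datetime case is to average the underlying epoch nanoseconds and convert back; the int64 round trip below is an assumption about the datetime64[ns] representation, not a dedicated API.

# hypothetical sketch, reusing `df` from above
start_mean = df.groupby('ID')['start'].apply(
    lambda s: pd.Timestamp(int(s.astype('int64').mean()))
)
# start_mean is a Series of Timestamps indexed by ID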

@sylvanosullivan You should be able to do the aggregation by passing numeric_only=False to the groupby mean:

df = pd.DataFrame({
    'td': pd.Series([pd.Timedelta(days=i) for i in range(5)]),
    'group': ['a', 'a', 'a', 'b', 'b']
})

(
    df
    .groupby('group')
    .mean(numeric_only=False)
)