See http://stackoverflow.com/questions/20625982/split-apply-combine-on-pandas-timedelta-column
related as well: http://stackoverflow.com/questions/20789976/python-pandas-dataframe-1st-line-issue-with-datetime-timedelta/20802902#20802902
I have a DataFrame with a column of timedeltas (upon inspection the dtype is timedelta64[ns], a.k.a. '<m8[ns]'), and I'd like to do a split-apply-combine, but the timedelta column is being dropped:
import pandas as pd
import numpy as np
pd.__version__
Out[3]: '0.13.0rc1'
np.__version__
Out[4]: '1.8.0'
data = pd.DataFrame(np.random.rand(10, 3), columns=['f1', 'f2', 'td'])
data['td'] *= 10000000
data['td'] = pd.Series(data['td'], dtype='<m8[ns]')
data
Out[8]:
f1 f2 td
0 0.990140 0.948313 00:00:00.003066
1 0.277125 0.993549 00:00:00.001443
2 0.016427 0.581129 00:00:00.009257
3 0.048662 0.512215 00:00:00.000702
4 0.846301 0.179160 00:00:00.000396
5 0.568323 0.419887 00:00:00.000266
6 0.328182 0.919897 00:00:00.006138
7 0.292882 0.213219 00:00:00.008876
8 0.623332 0.003409 00:00:00.000322
9 0.650436 0.844180 00:00:00.006873
[10 rows x 3 columns]
data.groupby(data.index < 5).mean()
Out[9]:
f1 f2
False 0.492631 0.480118
True 0.435731 0.642873
[2 rows x 2 columns]
Or, forcing pandas to try the operation on the 'td' column:
data.groupby(data.index < 5)['td'].mean()
---------------------------------------------------------------------------
DataError Traceback (most recent call last)
<ipython-input-12-88cc94e534b7> in <module>()
----> 1 data.groupby(data.index < 5)['td'].mean()
/path/to/lib/python3.3/site-packages/pandas-0.13.0rc1-py3.3-linux-x86_64.egg/pandas/core/groupby.py in mean(self)
417 """
418 try:
--> 419 return self._cython_agg_general('mean')
420 except GroupByError:
421 raise
/path/to/lib/python3.3/site-packages/pandas-0.13.0rc1-py3.3-linux-x86_64.egg/pandas/core/groupby.py in _cython_agg_general(self, how, numeric_only)
669
670 if len(output) == 0:
--> 671 raise DataError('No numeric types to aggregate')
672
673 return self._wrap_aggregated_output(output, names)
DataError: No numeric types to aggregate
However, taking the mean of the column works fine, so numeric operations should be possible:
data['td'].mean()
Out[11]:
0 00:00:00.003734
dtype: timedelta64[ns]
Worth mentioning that DataFrame's mean doesn't do this either; I think it should (i.e. convert the originally datelike columns back to datelike results):
In [11]: data.mean()
Out[11]:
f1 0.528609
f2 0.583264
td 5975598.700000
dtype: float64
Compare to data.mean(1), which ignores date columns (correctly imo, as you're going across dtypes).
Interestingly, mean works on Series but not DataFrame (this is using the new timedelta formatting code #5701):
In [10]: pd.to_timedelta(list(range(5)), unit='D')
Out[10]:
0 0 days
1 1 days
2 2 days
3 3 days
4 4 days
dtype: timedelta64[ns]
In [9]: pd.to_timedelta(list(range(5)), unit='D').mean()
Out[9]:
0 2 days
dtype: timedelta64[ns]
In [6]: pd.DataFrame(pd.to_timedelta(list(range(5)), unit='D')).mean()
Out[6]:
0 55785 days, 14:53:21.659060
dtype: timedelta64[ns]
@hayd actually the results of data.mean() are correct. The result for _td_ is in nanoseconds. It IS, however, possible to return an object array that is correct, e.g.
In [34]: Series([0.1,0.2,timedelta(10)])
Out[34]:
0 0.1
1 0.2
2 10 days, 0:00:00
dtype: object
so @hayd maybe create a separate issue for this type of inference (it's just a bit of inference detection in nanops.py/_reduce)
@hsharrison as far as the groupby goes: this just isn't implemented ATM in groupby.py; it's not that difficult, it just needs to follow basically what the datetime64 support does.
See #6884.
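Until that's implemented, one manual workaround (a sketch of my own, not from this thread, assuming .astype('int64') exposes the underlying nanoseconds) is to aggregate the integer nanoseconds and convert back:
import numpy as np
import pandas as pd

# Rebuild a frame like the one in the report above.
data = pd.DataFrame(np.random.rand(10, 3), columns=['f1', 'f2', 'td'])
data['td'] = pd.to_timedelta((data['td'] * 1e7).astype('int64'), unit='ns')

# Aggregate the underlying int64 nanoseconds, then convert back to timedelta.
mean_ns = data['td'].astype('int64').groupby(data.index < 5).mean()
print(pd.to_timedelta(mean_ns.round().astype('int64'), unit='ns'))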
Workaround: Use .describe() (which includes the mean) rather than .mean(). See https://gist.github.com/tomfitzhenry/d36ebba697a1f3eeefcb for a demo.
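For reference, a minimal sketch of that workaround (the exact shape of the describe output varies across pandas versions, so treat this as illustrative):
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(10, 3), columns=['f1', 'f2', 'td'])
data['td'] = pd.to_timedelta((data['td'] * 1e7).astype('int64'), unit='ns')

# .describe() aggregates the timedelta column even though .mean() raises;
# its per-group output includes a 'mean' statistic.
desc = data.groupby(data.index < 5)['td'].describe()
print(desc)
# Depending on the version, the mean lives at desc['mean'] (DataFrame output)
# or desc.xs('mean', level=-1) (Series output).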
Wow. Any insight into why that works? I would not have expected a convenience method to take a different code path.
Also hitting the same problem here. @tomfitzhenry's solution works.
Seeing as this keeps getting kicked to the next release, you can also perform sum and count. Referencing @hsharrison's original dataframe from the bug report:
data = pd.DataFrame(np.random.rand(10, 3), columns=['f1', 'f2', 'td'])
data['td'] *= 10000000
data['td'] = pd.Series(data['td'], dtype='<m8[ns]')
grouped_df = data.groupby(data.index < 5)
mean_df = grouped_df.mean()
In [14]: mean_df
Out[14]:
f1 f2
False 0.271488 0.614299
True 0.535522 0.476918
mean_df['td'] = grouped_df['td'].sum() / grouped_df['td'].count()
In [16]: mean_df
Out[16]:
f1 f2 td
False 0.271488 0.614299 00:00:00.004187
True 0.535522 0.476918 00:00:00.003278
This is still an issue in version 0.24.2 with both datetime64- and timedelta-type columns. If I should truncate any part of the very long post below, please let me know.
Calling np.mean on the timedelta column via chained .apply works, but passing 'mean' to the aggregation function fails:
In [11]: pd.__version__
Out[11]: '0.24.2'
In [12]: np.__version__
Out[12]: '1.16.2'
In [1]: import numpy as np
...: import pandas as pd
...: import datetime
In [2]: start_dates = pd.to_datetime([datetime.datetime(2019,1,i) for i in range(1,5)])
...:
In [3]: end_dates = pd.to_datetime([datetime.datetime(2019,1,i) for i in np.random.randint(4,17,4)])
...:
In [4]: df = pd.DataFrame({'start': start_dates,'end':end_dates,'ID':[1,1,2,2]})
...:
In [5]: df['delta'] = df['end'] - df['start']
...:
In [6]: df
Out[6]:
start end ID delta
0 2019-01-01 2019-01-15 1 14 days
1 2019-01-02 2019-01-04 1 2 days
2 2019-01-03 2019-01-04 2 1 days
3 2019-01-04 2019-01-06 2 2 days
In [7]: df.groupby('ID').agg({'delta':'mean'})
...:
---------------------------------------------------------------------------
DataError Traceback (most recent call last)
<ipython-input-7-4b906de9b470> in <module>
----> 1 df.groupby('ID').agg({'delta':'mean'})
/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
1313 @Appender(_shared_docs['aggregate'])
1314 def aggregate(self, arg, *args, **kwargs):
-> 1315 return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
1316
1317 agg = aggregate
/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
184
185 _level = kwargs.pop('_level', None)
--> 186 result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
187 if how is None:
188 return result
/anaconda3/lib/python3.7/site-packages/pandas/core/base.py in _aggregate(self, arg, *args, **kwargs)
496
497 try:
--> 498 result = _agg(arg, _agg_1dim)
499 except SpecificationError:
500
/anaconda3/lib/python3.7/site-packages/pandas/core/base.py in _agg(arg, func)
447 result = compat.OrderedDict()
448 for fname, agg_how in compat.iteritems(arg):
--> 449 result[fname] = func(fname, agg_how)
450 return result
451
/anaconda3/lib/python3.7/site-packages/pandas/core/base.py in _agg_1dim(name, how, subset)
430 raise SpecificationError("nested dictionary is ambiguous "
431 "in aggregation")
--> 432 return colg.aggregate(how, _level=(_level or 0) + 1)
433
434 def _agg_2dim(name, how):
/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func_or_funcs, *args, **kwargs)
758 _level = kwargs.pop('_level', None)
759 if isinstance(func_or_funcs, compat.string_types):
--> 760 return getattr(self, func_or_funcs)(*args, **kwargs)
761
762 if isinstance(func_or_funcs, compat.Iterable):
/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in mean(self, *args, **kwargs)
1130 nv.validate_groupby_func('mean', args, kwargs, ['numeric_only'])
1131 try:
-> 1132 return self._cython_agg_general('mean', **kwargs)
1133 except GroupByError:
1134 raise
/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
836
837 if len(output) == 0:
--> 838 raise DataError('No numeric types to aggregate')
839
840 return self._wrap_aggregated_output(output, names)
DataError: No numeric types to aggregate
And with datetime64 series:
In [8]: df.groupby('ID').agg({'start':'mean'})
(traceback omitted here: identical to the timedelta traceback above)
DataError: No numeric types to aggregate
Dtypes:
In [14]: df.dtypes
Out[14]:
start datetime64[ns]
end datetime64[ns]
ID int64
delta timedelta64[ns]
dtype: object
As mentioned above, chaining with .apply and np.mean() works for timedelta only:
In [17]: df.groupby('ID')['delta'].apply(np.mean)
Out[17]:
ID
1 8 days 00:00:00
2 1 days 12:00:00
Name: delta, dtype: timedelta64[ns]
This same method fails with my 'start' and 'end' columns, with the error:
In [simulated]: df.groupby('ID')['start'].apply(np.mean)
Out[truncated]: DatetimeIndex cannot perform the operation mean
(the error trace for the above simulated call is extremely long, therefore omitted)
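For the datetime columns, a workaround along the same lines (my own sketch, not from this thread; the float mean loses a bit of sub-microsecond precision) is to average the int64 nanoseconds-since-epoch and convert back:
import datetime
import pandas as pd

start = pd.to_datetime([datetime.datetime(2019, 1, i) for i in range(1, 5)])
df = pd.DataFrame({'start': start, 'ID': [1, 1, 2, 2]})

# Average the underlying nanoseconds-since-epoch per group, then convert back.
mean_ns = df['start'].astype('int64').groupby(df['ID']).mean()
print(pd.to_datetime(mean_ns.round().astype('int64')))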
@sylvanosullivan You should be able to do the aggregation by setting numeric_only=False in the groupby mean:
df = pd.DataFrame({
'td': pd.Series([pd.Timedelta(days=i) for i in range(5)]),
'group': ['a', 'a', 'a', 'b', 'b']
})
(
df
.groupby('group')
.mean(numeric_only=False)
)