In [1]: import pandas as pd
...: import numpy as np
...: print(pd.__version__)
...:
...: time = pd.to_timedelta(np.random.randint(100,1000, size=10), unit='s')
...: cat = np.random.choice(['A','B','C'], size=10)
...:
...: df = pd.DataFrame(dict(time=time, cat=cat))
...:
...: print(df['time'].mean())
...:
...: df.groupby('cat')['time'].mean()
...:
0.20.3
0 days 00:07:15.800000
---------------------------------------------------------------------------
DataError Traceback (most recent call last)
<ipython-input-1-b649ebd78333> in <module>()
10 print(df['time'].mean())
11
---> 12 df.groupby('cat')['time'].mean()
/Users/adefusco/Applications/anaconda3/4.3/envs/module-pandas/lib/python3.6/site-packages/pandas/core/groupby.py in mean(self, *args, **kwargs)
1035 nv.validate_groupby_func('mean', args, kwargs, ['numeric_only'])
1036 try:
-> 1037 return self._cython_agg_general('mean', **kwargs)
1038 except GroupByError:
1039 raise
/Users/adefusco/Applications/anaconda3/4.3/envs/module-pandas/lib/python3.6/site-packages/pandas/core/groupby.py in _cython_agg_general(self, how, alt, numeric_only)
834
835 if len(output) == 0:
--> 836 raise DataError('No numeric types to aggregate')
837
838 return self._wrap_aggregated_output(output, names)
DataError: No numeric types to aggregate
The mean of TimeDelta
can be computed, but it is not working in GroupBy.
This is the output I would expect.
In [3]: def average_time(x):
...: s = x.dt.seconds
...: return pd.Timedelta(s.mean(), unit='s')
...:
...: df.groupby('cat')['time'].apply(average_time)
...:
Out[3]:
cat
A 00:09:27.666667
B 00:06:18.666667
C 00:06:19.750000
Name: time, dtype: timedelta64[ns]
pd.show_versions()
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.0.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: 0.2.1
I think this is just because we don't have a cython implementation of group_mean
in https://github.com/pandas-dev/pandas/blob/ad7d6fc0248edaf098537e5674dcc0c9dd059491/pandas/_libs/groupby_helper.pxi.in#L31 if you're interested in taking a look
FYI, a slightly smaller workaround is df.groupby('cat')['time'].agg(lambda x: x.mean())
Oh, it might be a bit more involved than that, since it looks like we call ._get_numeric_data
, which excludes timedeltas.
The cython path will handle timedeltas as below
In [47]: df.groupby('cat')['time'].mean(numeric_only=False)
Out[47]:
cat
A 00:09:46.666666
B 00:10:02.750000
Name: time, dtype: timedelta64[ns]
Arguably not a great API - we automatically exclude "non-numeric" columns on groupby ops. This makes sense with an entire DataFrame, but perhaps on a Series selection this should default to true?
@chris-b1 Just tried this on my dataframe, and it does not give me correct results, I think it's because it handles NaT
incorrectly (it gives me negative Timedelta from a dataframe containing only positive Timedelta and NaT
).
This same issue is occurring when using .resample()
. I'm guessing that uses GroupBy in some fashion, but in case it does not, I thought I'd mention it here.
Also, @chris-b1 's workaround worked just fine for me.
It seems to make sense to convert timestamp column to either days or seconds by using the inbuilt functions to get the mean.
FWIW (another workaround), you can use ✅ ...agg({'duration': pd.Series.mean})
instead of ❌ ...agg({'duration': 'mean'})
or ❌ ...agg({'duration': np.mean})
(where duration
is a timedelta64
-typed column).
this is still an issue for me too, left my head spinning as I was able to get the mean | median values using describe etc. but then when pivoting and group_by started to get numeric values not found
error. Though solutions above are good workarounds
Most helpful comment
I think this is just because we don't have a cython implementation of
group_mean
in https://github.com/pandas-dev/pandas/blob/ad7d6fc0248edaf098537e5674dcc0c9dd059491/pandas/_libs/groupby_helper.pxi.in#L31 if you're interested in taking a lookFYI, a slightly smaller workaround is
df.groupby('cat')['time'].agg(lambda x: x.mean())