Pandas: numeric_only inconsistency with pandas Series

Created on 1 Jul 2015 · 11 Comments · Source: pandas-dev/pandas

In [1]: import pandas as pd

In [2]: pd.Series([1,2,3]).sum(numeric_only=False)
Out[2]: 6

In [3]: pd.Series([1,2,3]).sum(numeric_only=True)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-3-2c46bd289e26> in <module>()
----> 1 pd.Series([1,2,3]).sum(numeric_only=True)

/users/is/whughes/pyenvs/research/lib/python2.7/site-packages/pandas-0.16.2_ahl1-py2.7-linux-x86_64.egg/pandas/core/generic.pyc in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
   4253                                               skipna=skipna)
   4254                 return self._reduce(f, name, axis=axis,
-> 4255                                     skipna=skipna, numeric_only=numeric_only)
   4256             stat_func.__name__ = name
   4257             return stat_func

/users/is/whughes/pyenvs/research/lib/python2.7/site-packages/pandas-0.16.2_ahl1-py2.7-linux-x86_64.egg/pandas/core/series.pyc in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   2081             if numeric_only:
   2082                 raise NotImplementedError(
-> 2083                     'Series.{0} does not implement numeric_only.'.format(name))
   2084             return op(delegate, skipna=skipna, **kwds)
   2085 

NotImplementedError: Series.sum does not implement numeric_only.

Labels: API Design, Compat, Docs

All 11 comments

The docstring suggests this is a legitimate argument:

Return the sum of the values for the requested axis

Parameters
----------
axis : {index (0)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA
level : int or level name, default None
        If the axis is a MultiIndex (hierarchical), count along a
        particular level, collapsing into a scalar
numeric_only : boolean, default None
    Include only float, int, boolean data. If None, will attempt to use
    everything, then use only numeric data

Returns
-------
sum : scalar or Series (if level specified)

However, strangely, there's an explicit test that this throws an exception: https://github.com/pydata/pandas/blob/054821dc90ded4263edf7c8d5b333c1d65ff53a4/pandas/tests/test_series.py#L2724
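
To check how a given pandas installation treats this call, a small probe along the following lines may help. It is only a sketch: the traceback above shows `NotImplementedError` on 0.16.2, other releases may return a value or raise something else, and catching `TypeError` as well is an assumption rather than documented behaviour. The names `s` and `outcome` are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# The plain reduction works regardless of the numeric_only question.
print("sum():", s.sum())

# Probe numeric_only=True without assuming which way this version behaves.
try:
    result = s.sum(numeric_only=True)
    outcome = "returned {0}".format(result)
except (NotImplementedError, TypeError) as exc:
    # NotImplementedError is what the report above shows; TypeError is a guess
    # at what other versions might raise.
    outcome = "raised {0}: {1}".format(type(exc).__name__, exc)

print("pandas", pd.__version__, "->", outcome)
```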

This is just for compat, as it's a general parameter that matters for DataFrames (and the function is auto-generated). If you can find a way to not expose it without jumping through hoops, that would be OK.

OK, so numeric_only is accepted by Series.sum simply for compatibility with DataFrame.sum. You're proposing we find a way to hide this specific parameter in the docstring.

Have I understood correctly?

Ok

I'll freely admit I'm a pandas novice, but I ran headlong into what I think was this bug just now. I wanted numeric_only with Series.mean rather than sum; I assume that falls under this issue as well. The documentation says this option exists but the code says it doesn't. pandas version 0.18.1, documentation from a matching-version manual (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html) (although obviously that link may age out).

@smlewis - can you show an example of some data where you needed this and what you expected to happen? Note that the implemented use case is for selecting numeric _columns_, like

df = pd.DataFrame({'a': [2,3,4], 'b': pd.timedelta_range('1s', periods=3)})

df
Out[63]: 
   a               b
0  2 0 days 00:00:01
1  3 1 days 00:00:01
2  4 2 days 00:00:01

df.mean()
Out[65]: 
a                  3
b    1 days 00:00:01
dtype: object

df.mean(numeric_only=True)
Out[64]: 
a    3.0
dtype: float64
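
The same session as a standalone script, for anyone who wants to run it outside IPython. This is a sketch built from the example above; the variable name `numeric_means` is illustrative, and the script only exercises `numeric_only=True`, since the result of a plain `df.mean()` on a mixed frame may vary between pandas versions.

```python
import pandas as pd

# One integer column and one timedelta column, as in the example above.
df = pd.DataFrame({
    "a": [2, 3, 4],
    "b": pd.timedelta_range("1s", periods=3),
})

# Each column has its own dtype; numeric_only works at this column level.
print(df.dtypes)

# numeric_only=True drops the non-numeric column "b" before averaging.
numeric_means = df.mean(numeric_only=True)
print(numeric_means)
```

The filtering happens per column, not per element, which is why the option has nothing to select from on a single Series.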

The input file for my dataframe was constructed in a stupid way (by me...): several similar data sources were concatenated so I could process their averages all at once instead of running the script N times. The concatenation meant that each group had its header repeated (except the first, which I'd edited manually to properly name the column; that column was a mangling of the source filename inserted at concatenation time). So you get a data set like this:

source  score 
alpha   2 
alpha   3 
alpha   2 
beta    score 
beta    9 
beta    8 
beta    7 
gamma   score 
gamma   4 
gamma   4 
gamma   1 

This snippet:

import pandas as pd

all_scores = pd.read_csv("scores_for_averaging.csv", delim_whitespace=True)

experiments = all_scores['source'].unique()

for each in experiments:
    exp_slice = all_scores.loc[all_scores['source'] == each]
    #print each, exp_slice['score'].mean(numeric_only=True) #fails: NotImplementedError: Series.mean does not implement numeric_only.
    #print each, exp_slice['score'].mean() #fails: TypeError: Could not convert score987 to numeric

failed because mean() couldn't accept numeric_only to throw out the spurious extra header line for beta, gamma, etc. I just reprocessed my input to not have the header line repeated and then it worked fine. I guess the problem is that the documentation and the code don't match?

Thanks, just curious what the expected use was. Yes, the documentation/method should be updated to match, just tricky to actually do in this case (PR welcome!).

FYI, for a conversion like this (assuming you actually do have a valid mixed type object array), the function you likely want is to_numeric

pd.to_numeric(exp_slice['score'], errors='coerce').mean()

I suppose this could be better documented, but the arg is there for consistency with DataFrame. It really doesn't do anything as a Series is a single dtyped object. Either you get all elements or None (even if mixed). We don't deeply introspect mixed (or object) things.
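
Applied to the sample data posted earlier in the thread, the `to_numeric` suggestion could be sketched roughly as below. The data is inlined through `io.StringIO` instead of the original `scores_for_averaging.csv` so the example is self-contained, and the `groupby` step is a substitute for the original loop, not something proposed in the thread.

```python
import io

import pandas as pd

# Sample data from the thread, including the repeated "score" header rows.
raw = """source score
alpha 2
alpha 3
alpha 2
beta score
beta 9
beta 8
beta 7
gamma score
gamma 4
gamma 4
gamma 1
"""

all_scores = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# The stray header strings prevent the column from being parsed as numbers.
# errors="coerce" turns anything unparseable into NaN.
all_scores["score"] = pd.to_numeric(all_scores["score"], errors="coerce")

# mean() skips NaN by default, so the coerced header rows drop out.
means = all_scores.groupby("source")["score"].mean()
print(means)
```

With this data the per-group means come out as 7/3 for alpha, 8.0 for beta and 3.0 for gamma.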

Thank you!

@jreback Why did you close this pull request? This is still not in documentation.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html
