Pandas: numeric_only inconsistency with pandas Series

Created on 1 Jul 2015 · 11 Comments · Source: pandas-dev/pandas

In [1]: import pandas as pd

In [2]: pd.Series([1,2,3]).sum(numeric_only=False)
Out[2]: 6

In [3]: pd.Series([1,2,3]).sum(numeric_only=True)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-3-2c46bd289e26> in <module>()
----> 1 pd.Series([1,2,3]).sum(numeric_only=True)

/users/is/whughes/pyenvs/research/lib/python2.7/site-packages/pandas-0.16.2_ahl1-py2.7-linux-x86_64.egg/pandas/core/generic.pyc in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
   4253                                               skipna=skipna)
   4254                 return self._reduce(f, name, axis=axis,
-> 4255                                     skipna=skipna, numeric_only=numeric_only)
   4256             stat_func.__name__ = name
   4257             return stat_func

/users/is/whughes/pyenvs/research/lib/python2.7/site-packages/pandas-0.16.2_ahl1-py2.7-linux-x86_64.egg/pandas/core/series.pyc in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   2081             if numeric_only:
   2082                 raise NotImplementedError(
-> 2083                     'Series.{0} does not implement numeric_only.'.format(name))
   2084             return op(delegate, skipna=skipna, **kwds)
   2085 

NotImplementedError: Series.sum does not implement numeric_only.

API Design Compat Docs

Wilfred

All 11 comments

The docstring suggests this is a legitimate argument:

Return the sum of the values for the requested axis

Parameters
----------
axis : {index (0)}
skipna : boolean, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA
level : int or level name, default None
        If the axis is a MultiIndex (hierarchical), count along a
        particular level, collapsing into a scalar
numeric_only : boolean, default None
    Include only float, int, boolean data. If None, will attempt to use
    everything, then use only numeric data

Returns
-------
sum : scalar or Series (if level specified)

However, strangely, there's an explicit test that this throws an exception: https://github.com/pydata/pandas/blob/054821dc90ded4263edf7c8d5b333c1d65ff53a4/pandas/tests/test_series.py#L2724

Wilfred on 1 Jul 2015

This is just for compat, as it's a general parameter that matters for DataFrames (and the function is auto-generated). If you can find a way to not expose it without jumping through hoops, that would be OK.

jreback on 1 Jul 2015

OK, so numeric_only is accepted by Series.sum simply for compatibility with DataFrame.sum. You're proposing we find a way to hide this specific parameter in the docstring.

Have I understood correctly?
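
As a rough illustration of what "not exposing it" could mean for an auto-generated method, here is a toy sketch. Everything in it (`make_stat_func`, `ToySeries`, the docstring template) is hypothetical and is not how pandas generates its reductions; it only shows a factory that leaves `numeric_only` out of the docstring for a Series-like class while still rejecting it at call time.

```python
# Toy sketch only: these names are hypothetical, not pandas internals.
_DOC_TEMPLATE = """Return the {desc} of the values.

Parameters
----------
skipna : bool, default True
    Exclude NA/null values.
{extra_params}"""

_NUMERIC_ONLY_DOC = """numeric_only : bool, default None
    Include only float, int, boolean data.
"""


def make_stat_func(name, desc, reducer, document_numeric_only):
    """Build a reduction method whose docstring may omit numeric_only."""

    def stat_func(self, skipna=True, numeric_only=None):
        # skipna is accepted but ignored in this toy version.
        if numeric_only and not document_numeric_only:
            raise NotImplementedError(
                "{0} does not implement numeric_only.".format(name))
        return reducer(self.values)

    stat_func.__name__ = name
    stat_func.__doc__ = _DOC_TEMPLATE.format(
        desc=desc,
        extra_params=_NUMERIC_ONLY_DOC if document_numeric_only else "",
    )
    return stat_func


class ToySeries:
    """Stand-in for a single-dtype, one-dimensional container."""

    def __init__(self, values):
        self.values = list(values)


# The signature still accepts numeric_only for compatibility,
# but the generated docstring no longer advertises it.
ToySeries.sum = make_stat_func("sum", "sum", sum, document_numeric_only=False)

total = ToySeries([1, 2, 3]).sum()
print(total)
print(ToySeries.sum.__doc__)
```

The trade-off in this sketch is that the keyword remains in the signature, so only the documentation changes.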

Wilfred on 1 Jul 2015

jreback on 1 Jul 2015

I'll freely admit I'm a pandas novice, but I ran headlong into what I think was this bug just now. I wanted numeric_only with Series.mean rather than sum; I assume that falls under this issue as well. The documentation says this option exists but the code says it doesn't. pandas version 0.18.1, documentation from a matching-version manual (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html) (although obviously that link may age out).

smlewis on 27 Sep 2016

@smlewis - can you show an example of some data where you needed this and what you expected to happen? Note that the implemented use case is for selecting numeric columns, like

df = pd.DataFrame({'a': [2,3,4], 'b': pd.timedelta_range('1s', periods=3)})

df
Out[63]: 
   a               b
0  2 0 days 00:00:01
1  3 1 days 00:00:01
2  4 2 days 00:00:01

df.mean()
Out[65]: 
a                  3
b    1 days 00:00:01
dtype: object

df.mean(numeric_only=True)
Out[64]: 
a    3.0
dtype: float64

chris-b1 on 27 Sep 2016

The input file for my dataframe was constructed in a stupid way (by me...): several similar data sources were concatenated so I could process their averages all at once instead of running the script N times. The concatenation meant that each group had its header repeated (except the first, which I'd edited manually to properly name the column; that column was a mangling of the source filename inserted at concatenation time). So you get a data set like this:

source  score 
alpha   2 
alpha   3 
alpha   2 
beta    score 
beta    9 
beta    8 
beta    7 
gamma   score 
gamma   4 
gamma   4 
gamma   1

This snippet:

import pandas as pd

all_scores = pd.read_csv("scores_for_averaging.csv", delim_whitespace=True)

experiments = all_scores['source'].unique()

for each in experiments:
    exp_slice = all_scores.loc[all_scores['source'] == each]
    #print each, exp_slice['score'].mean(numeric_only=True) #fails: NotImplementedError: Series.mean does not implement numeric_only.
    #print each, exp_slice['score'].mean() #fails: TypeError: Could not convert score987 to numeric

failed because mean() couldn't accept numeric_only to throw out the spurious extra header line for beta, gamma, etc. I just reprocessed my input to not have the header line repeated and then it worked fine. I guess the problem is that the documentation and the code don't match?

smlewis on 27 Sep 2016

Thanks, just curious what the expected use was. Yes, the documentation/method should be updated to match, just tricky to actually do in this case (PR welcome!).

FYI, for a conversion like this (assuming you actually do have a valid mixed type object array), the function you likely want is to_numeric:

pd.to_numeric(exp_slice['score'], errors='coerce').mean()
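
A minimal sketch of that suggestion applied to the whole file at once, assuming the sample data posted above (inlined here as a string instead of being read from `scores_for_averaging.csv`). The repeated `score` header rows are coerced to NaN, which `mean()` then skips by default:

```python
import io

import pandas as pd

# Sample data from the comment above, inlined for a self-contained example.
raw = """source score
alpha 2
alpha 3
alpha 2
beta score
beta 9
beta 8
beta 7
gamma score
gamma 4
gamma 4
gamma 1
"""

all_scores = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# The stray "score" strings cannot be parsed, so errors='coerce' turns them into NaN.
all_scores["score"] = pd.to_numeric(all_scores["score"], errors="coerce")

# mean() skips NaN by default, so the repeated header rows drop out of each average.
means = all_scores.groupby("source")["score"].mean()
print(means)
```

Grouping by `source` replaces the explicit loop over `experiments`; the per-slice call shown above works equally well inside that loop.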

chris-b1 on 27 Sep 2016

👍2

I suppose this could be better documented, but the arg is there for consistency with DataFrame. It really doesn't do anything as a Series is a single dtyped object. Either you get all elements or None (even if mixed). We don't deeply introspect mixed (or object) things.
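
To make the dtype point concrete, here is a small sketch with made-up data (not from the thread). It assumes that picking numeric columns with `select_dtypes` is a fair stand-in for what `numeric_only=True` does on a DataFrame. A mixed Series, by contrast, carries one `object` dtype for all of its elements, so there is no column to filter out:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# DataFrame: each column has its own dtype, so whole columns can be dropped.
df = pd.DataFrame({"a": [2, 3, 4], "b": ["x", "y", "z"]})
numeric_cols = df.select_dtypes(include="number")
col_means = numeric_cols.mean()
print(col_means)

# Series: one dtype for every element. Mixed content is stored as object,
# and the individual values are not inspected one by one.
mixed = pd.Series([1, "score", 3])
print(mixed.dtype)
print(is_numeric_dtype(mixed), is_numeric_dtype(df["a"]))
```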

jreback on 28 Sep 2016

Thank you!

smlewis on 28 Sep 2016

@jreback Why did you close this pull request? This is still not in documentation.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html

sergei-bondarenko on 1 Jan 2018

👍5
