Pandas: Sum of all NA and empty should be 0

Created on 7 Dec 2017 · 39 comments · Source: pandas-dev/pandas

And the prod of those should be 1.

xref: https://github.com/pandas-dev/pandas/issues/9422, https://github.com/pandas-dev/pandas/pull/17630, https://mail.python.org/pipermail/pandas-dev/2017-November/000657.html

We need to implement that and design and implement the alternative (either a new method, or a keyword-argument to sum and prod).

Labels: API Design, Numeric

All 39 comments

I've been following this conversation a bit and remain confused on one point. Options 1 and 2 both appear to rely on the premise that Series([nan, nan]).sum() _should_ equal Series([]).sum().
Without an intuition for why this is desirable behavior, the merits of those two options are hard to evaluate. Is it just consistency with other software? LMK if this question belongs elsewhere.

Here's fine. I think the shortest answer is that it's nice to have s.sum(skipna=True) equal s.dropna().sum(skipna=True/False).
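
A concrete version of that identity (illustrative only; the values shown assume option 1, where the empty sum is 0):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

# The identity under discussion: skipping the NAs should give the same
# answer as dropping them first, i.e. the sum of an empty Series.
s.sum(skipna=True)   # should equal ...
s.dropna().sum()     # ... this, the empty sum,
pd.Series([]).sum()  # which is 0 under option 1.
```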

I would keep the more general discussion on the mailing list, and keep the details of the changes we then want to make here.

this is not happening for 0.21.1

> this is not happening for 0.21.1

I think the question is rather: do we want to release this quickly, or only in a few months? (and then once answered, we can still discuss whether we call that 0.21.1, 0.21.2 or 0.22.0). But let's keep that discussion in https://github.com/pandas-dev/pandas/issues/18244#issuecomment-350000655

Just to add a data point: I tried to update to 0.21 and a large number of our unit tests failed, so I came looking to see whether this had been reported as a bug yet. A lot of our code assumes pd.Series([]).sum() == 0 for simple things like dollar-weighted sums.

We won't be upgrading until this is fixed, and the fact that a breaking change like this was made has led us to seriously consider prohibiting pandas use within critical sections of our code base.

@ryanbressler that discussion is happening on the mailing list: https://mail.python.org/pipermail/pandas-dev/2017-November/000657.html

We're using the issue to design the fix.

I've started a branch for this at https://github.com/pandas-dev/pandas/compare/master...TomAugspurger:na-sum. I'll have a PR tomorrow, but briefly my approach was to add a parameter to sum and prod: empty_is_na. Here's the docstring for sum:

```
Signature: df.sum(axis=None, skipna=True, level=None, numeric_only=None, empty_is_na=False, **kwargs)
Docstring:
Return the sum of the values for the requested axis

Parameters
----------
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values before computing the result.
level : int or level name, default None
    If the axis is a MultiIndex (hierarchical), count along a
    particular level, collapsing into a Series
numeric_only : boolean, default None
    Include only float, int, boolean columns. If None, will attempt to use
    everything, then use only numeric data. Not implemented for
    Series.
empty_is_na : bool, default False
    The result of operating on an empty array should be NA. The default
    behavior is for the sum of an empty array to be 0, and the product
    of an empty array to be 1.

Returns
-------
sum : Series or DataFrame (if level specified)
File:      ~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/generic.py
Type:      method
```

```python
In [3]: pd.Series().sum()
Out[3]: 0.0

In [4]: pd.Series().sum(empty_is_na=True)
Out[4]: nan
```

@ryanbressler would you like to voice this experience on the mailing list?

@TomAugspurger and @jorisvandenbossche it looks like I need to subscribe to that list to post? I'll do that after some meetings I have this morning. Feel free to delete my comments if they aren't appropriate on this thread, but I must say expecting users to join the dev discussion list to report issues feels a bit unfriendly.

> I must say expecting users to join the dev discussion list to report issues feels a bit unfriendly.

Not to report issues, that's just already been done -- this issue :)

We're trying to keep the discussion from fragmenting: the mailing list for discussing what behavior we want, this issue for designing that behavior. Usually most discussion happens on GitHub; this happens to be on the mailing list since it's of broad interest, and not everybody can keep up with the traffic on the GitHub repository.


@TomAugspurger regarding your proposal above, that would only deal with the empty case? So that would not yet provide an option for returning NaN on an all NaN series?

Yep, so far just handling the empty case. My preference is for additional keywords to prod and sum, rather than a new method.

For the all-NA case,

```
Signature: df.sum(axis=None, skipna=True, level=None, numeric_only=None, empty_is_na=False, all_na_is_na=False, **kwargs)
Docstring:
Return the sum of the values for the requested axis

Parameters
----------
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values before computing the result.
level : int or level name, default None
    If the axis is a MultiIndex (hierarchical), count along a
    particular level, collapsing into a Series
numeric_only : boolean, default None
    Include only float, int, boolean columns. If None, will attempt to use
    everything, then use only numeric data. Not implemented for
    Series.
empty_is_na : bool, default False
    The result of operating on an empty array should be NA. The default
    behavior is for the sum of an empty array to be 0, and the product
    of an empty array to be 1.
all_na_is_na : bool, default False
    The result of operating on an all-NA array should be NA. The default
    behavior is for the sum of an all-NA array to be 0, and the product
    of an all-NA array to be 1.

Returns
-------
sum : Series or DataFrame (if level specified)
File:      ~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/generic.py
Type:      method
```

Are there any problematic interactions between these new keywords and skipna? Does empty_is_na apply before or after skipping the NA values? I'd propose that it applies before skipping, so that empty_is_na only has an effect when the original series is length-0.

So for an all-NA series, s = pd.Series([np.nan]):

  1. s.sum(skipna=True, empty_is_na=False, all_na_is_na=False) -> 0
  2. s.sum(skipna=True, empty_is_na=True, all_na_is_na=True) -> NaN
  3. s.sum(skipna=True, empty_is_na=True, all_na_is_na=False) -> 0
  4. s.sum(skipna=True, empty_is_na=False, all_na_is_na=True) -> NaN

Now skipna=False. How should we handle all_na_is_na?

  1. s.sum(skipna=False, empty_is_na=False, all_na_is_na=False) -> 0 (this differs from np.sum)
  2. s.sum(skipna=False, empty_is_na=True, all_na_is_na=True) -> NaN
  3. s.sum(skipna=False, empty_is_na=True, all_na_is_na=False) -> 0 (this differs from np.sum)
  4. s.sum(skipna=False, empty_is_na=False, all_na_is_na=True) -> NaN

Cases 1 and 3 concern me a bit. I'll think about it some more.

For an empty series, the only thing that matters is empty_is_na. With empty_is_na=False, the return value is 0. With empty_is_na=True, the return value is NA, regardless of skipna and all_na_is_na.
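
To make those rules concrete, here is a rough model of the semantics enumerated above (a sketch only, not the actual pandas implementation):

```python
import numpy as np

def model_sum(values, skipna=True, empty_is_na=False, all_na_is_na=False):
    """Rough model of the proposed two-keyword semantics (illustration only)."""
    arr = np.asarray(values, dtype=float)
    if len(arr) == 0:
        # Empty before any NA-skipping: governed solely by empty_is_na.
        return np.nan if empty_is_na else 0.0
    mask = np.isnan(arr)
    if mask.all():
        # All-NA: governed by all_na_is_na (cases 1-4 in both lists above).
        return np.nan if all_na_is_na else 0.0
    if not skipna and mask.any():
        # A mix of real values and NAs with skipna=False propagates NA.
        return np.nan
    return float(np.nansum(arr))

model_sum([np.nan], skipna=True)                     # 0.0 (case 1)
model_sum([np.nan], skipna=True, all_na_is_na=True)  # nan (case 4)
```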

> My preference is for additional keywords to prod and sum, rather than a new method.

I am personally also in favor of additional keyword(s) instead of a new method. I don't really like the empty_is_na and all_na_is_na names, and the fact that there are two additional keywords, but at least they are explicit and allow for controlling both cases separately (which might be difficult with a single keyword like min_count?)
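
For example, a hypothetical min_count-style keyword (assuming the semantics "NaN when fewer than min_count non-NA values remain", which is one plausible reading) could not tell an empty series apart from an all-NA one:

```python
import numpy as np

def sum_min_count(values, min_count=0):
    # Hypothetical min_count semantics (not actual pandas behavior):
    # NaN when fewer than `min_count` non-NA values remain after skipping.
    arr = np.asarray(values, dtype=float)
    kept = arr[~np.isnan(arr)]
    if len(kept) < min_count:
        return np.nan
    return float(kept.sum())

sum_min_count([], min_count=1)        # nan
sum_min_count([np.nan], min_count=1)  # nan -- indistinguishable from empty
```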

> s.sum(skipna=False, empty_is_na=False, all_na_is_na=False) -> 0 (this differs from np.sum)

This one concerns me as well. I think all_na_is_na only makes sense if you are skipping NAs. I can't really imagine where you would want Series([1, NA]).sum() to return NA but Series([NA, NA]).sum() to return 0.
If skipna=False, any occurrence of an NA should turn the result into NA. The deviation from numpy would also be unfortunate (as now skipna=False basically gives you numpy's sum).

> allow for controlling both cases separately (which might be difficult with a single keyword like min_count?)

I think I talked myself out of the need for two keywords. My empty_is_na keyword should only have an effect if len(s) is 0, in which case there aren't even any NaNs.

I'm not fond of empty_is_na / all_na_is_na either, but I find them easier to explain than min_count. Though making pd.Series([np.nan]).sum(skipna=False) return NaN (which we want) will complicate the explanation of all_na_is_na, so perhaps they're not easier to explain after all.

Indeed, do we really need two new keywords?

I think empty_is_na should suffice, to satisfy both:

  1. pd.Series([np.nan]).sum(skipna=True, empty_is_na=True) -> NaN
  2. pd.Series([]).sum(skipna=True, empty_is_na=True) -> NaN

For skipna=False, which we only need for completeness:

  1. pd.Series([np.nan]).sum(skipna=False, empty_is_na=True) -> NaN
  2. pd.Series([]).sum(skipna=False, empty_is_na=True) -> NaN

(empty_is_na=False would always have an empty or all-NaN sum be 0.)

The potentially compelling argument in favor of two different keyword arguments is that it gives you a way to get the legacy sum from pandas < 0.21 (without bottleneck). It would be an easy switch when migrating code, but even then I suspect only a very small number of applications need different behavior between an empty and an all-NaN sum. If need be, I would prefer empty_is_na='legacy' to adding a separate all_na_is_na argument.

I would just add fill_value=0 (if we really want option 1), or fill_value=None for option 2; then this would all work and be quite simple.

> (empty_is_na=False would always have an empty or all-NaN sum be 0.)

This seems strange, right? And that differs from pandas 0.20.3 with or without bottleneck, which always returned 0 for pd.Series([]).sum(skipna=False)

EDIT: sorry I misread. Ignore this.

@jreback would your fill_value only apply when skipna=True? Otherwise, we'd break pd.Series([np.nan]).sum(skipna=False) -> NaN.

For reference, here's the 0.20.3 behavior on two series, pd.Series([]) and pd.Series([np.nan]):

| case | use_bn | skipna | length | result |
| ---- | ------ | ------ | ------ | ------ |
| 0 | True | True | 0 | 0.0 |
| 1 | True | True | 1 | 0.0 |
| 2 | True | False | 0 | 0.0 |
| 3 | True | False | 1 | NaN |
| 4 | False | True | 0 | 0.0 |
| 5 | False | True | 1 | NaN |
| 6 | False | False | 0 | 0.0 |
| 7 | False | False | 1 | NaN |

```python
import itertools

import numpy as np
import pandas as pd

use_bn = [True, False]
skipna = [True, False]
series = [pd.Series(), pd.Series([np.nan])]

values = []
for bn, skip, s in itertools.product(use_bn, skipna, series):
    with pd.option_context("compute.use_bottleneck", bn):
        result = s.sum(skipna=skip)
        values.append((bn, skip, len(s), result))

r = pd.DataFrame(values, columns=['use_bn', 'skipna', 'length', 'result'])
r.index.name = 'case'
```

Could people give thoughts on this table for the desired behavior?

| case | skipna | length | empty_is_na | result |
| -----| ------ | ------ | ----------- | ------ |
| A | True | empty | False | 0 |
| B | True | empty | True | NaN |
| C | True | all-na | False | 0 |
| D | True | all-na | True | NaN? |
| E | False | empty | False | 0? |
| F | False | empty | True | NaN |
| G | False | all-na | False | NaN |
| H | False | all-na | True | NaN |

I think if we're OK with D being NaN, then we only need the one keyword (whatever it's called). The semantics for empty_is_na=True with skipna=True are then "Is the array empty after removing NaNs? If so, the result is NaN". If we want D to be 0, then we'll need another keyword, or a 'legacy' option.
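
A sketch of that one-keyword rule (again just a model, not an implementation):

```python
import numpy as np

def model_sum(values, skipna=True, empty_is_na=False):
    """Model of the single-keyword rule; illustration only."""
    arr = np.asarray(values, dtype=float)
    mask = np.isnan(arr)
    if not skipna and mask.any():
        return np.nan                          # cases G and H
    kept = arr[~mask]
    if len(kept) == 0:
        # "Empty after removing NaNs" covers both empty and all-NA input.
        return np.nan if empty_is_na else 0.0  # cases A-F
    return float(kept.sum())
```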

With case D=NaN, I don't think we achieve our goal of supporting users who want "NaN if everything is NA" though.

@TomAugspurger I agree with the rule "Is the array empty after removing NaNs? If so, the result is NaN"

To me, this implies D=NaN, exactly as you reasoned.

The other open entry in your table is case E, which should be 0. This matches the sum of an empty array in NumPy (np.sum([]) is 0.0).

> With case D=NaN, I don't think we achieve our goal of supporting users who want "NaN if everything is NA" though.

I'm confused. Isn't that exactly what D=NaN ensures? empty_is_na achieves the SQL-like behavior that (some) users pine for.

empty_is_na='legacy' would be for users who can't make up their mind between whether they're a DBA or a mathematician, and want a rule that makes sense "most" of the time. (Of course, that's the worst sort of consistency for APIs, because it suggests a behavior that doesn't hold in all edge cases.)

> I'm confused. Isn't that exactly what D=NaN ensures? empty_is_na achieves the SQL-like behavior that (some) users pine for.

Sorry, I think I was staring at that table for too long and started going crazy :)

> I'm not fond of empty_is_na / all_na_is_na either, but I find them easier to explain than min_count

+1

@jreback In the end, empty_is_na=True/False or fill_value=np.nan/0 would be equivalent in behaviour, I think, so it is a question of which name we like most.
But I think the "empty_is_na is easier to explain" argument also holds against fill_value, IMO: fill_value would actually be the value to return in the case of empty data (after skipping the NAs or not), so it would more accurately be something like "empty_result_value".

> @TomAugspurger I agree with the rule "Is the array empty after removing NaNs? If so, the result is NaN"
>
> To me, this implies D=NaN, exactly as you reasoned.

I agree too.

> empty_is_na='legacy' would be for users who can't make up their mind between whether they're a DBA or a mathematician

So that would provide the option of all-NA giving NA (so you are signaled that you have an all-NA column), but keep the empty sum 0.
Do we think this is a valid option to want? Because if so, we shouldn't call it "legacy" but something like "mixed" (although that certainly doesn't 'tell' you what it does either).

Something else, regarding "extra keyword vs new method": the advantage of a separate method is that for people who consistently want the non-default option, it could be much easier / less verbose to have an alternative method than to add empty_is_na=True all over their code base (of course they can write such an alternative function themselves, but then it is not a method on the dataframe).
(One problem with this, however, is that a possible name like .total() would be very confusing if we choose option 1 as the default, as then total() would be option 2, which is the other way around from SQL.)

> So that would provide the option of all-NA giving NA (so you are signaled that you have an all-NA column), but keep the empty sum 0.
> Do we think this is a valid option to want?

In all the user feedback on this issue, I do not recall any requests for this mixed behavior. So I think we can safely leave it out.

> The advantage of a separate method is that for people who consistently want the non-default option, it could be much easier / less verbose to have an alternative method than to add empty_is_na=True all over their code base

Yes, I was thinking that this morning as well...

Turning to groupby with categoricals, if we implement option 1, we'll want to have categoricals with no observations return 0 for sum.

```python
In [32]: df = pd.DataFrame({"A": pd.Categorical(['a', 'b', 'a'],
    ...:                                         categories=['a', 'b', 'c']),
    ...:                    'B': [1, 2, 1]})
    ...: result = df.groupby("A").B.sum()
    ...:

In [33]: result
Out[33]:
A
a    2
b    2
c    0
Name: B, dtype: int64
```

Currently, aggregations like np.nansum don't behave as I would expect:

```python
In [3]: pd.__version__
Out[3]: '0.20.3'

In [8]: df.groupby("A").B.agg(lambda x: np.nansum(x))
Out[8]:
A
a    2.0
b    2.0
c    NaN
Name: B, dtype: float64
```

Perhaps custom aggregation functions aren't called on an empty array to see what the return type is?

Another (large) place this affects is upsampling:

```python
In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.20.3'

In [3]: pd.Series(range(4), index=pd.date_range(start='2017', periods=4, freq='1H')).resample('30T').sum()
Out[3]:
2017-01-01 00:00:00    0.0
2017-01-01 00:30:00    NaN
2017-01-01 01:00:00    1.0
2017-01-01 01:30:00    NaN
2017-01-01 02:00:00    2.0
2017-01-01 02:30:00    NaN
2017-01-01 03:00:00    3.0
Freq: 30T, dtype: float64
```

On my branch:

```python
In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.22.0.dev0+364.g70e9bf903.dirty'

In [3]: pd.Series(range(4), index=pd.date_range(start='2017', periods=4, freq='1H')).resample('30T').sum()
Out[3]:
2017-01-01 00:00:00    0
2017-01-01 00:30:00    0
2017-01-01 01:00:00    1
2017-01-01 01:30:00    0
2017-01-01 02:00:00    2
2017-01-01 02:30:00    0
2017-01-01 03:00:00    3
Freq: 30T, dtype: int64
```

which is, I suspect, what we want, but another thing I'll make sure to document.

> which is, I suspect, what we want

Thoughts on this? Do we want .resample(higher_freq).sum() to introduce 0s or NaNs?

i am -1 on changing anything in groupby / resample

sum of NaN is NaN; anything else is completely confusing

A sum of empty/all-NaN values in resample should be no different than a sum of empty/all-NaN values elsewhere. So that means if we make sum of empty/all-NaN be 0, then it should be 0 for resample, too.

The fact that it's not entirely clear whether resampling is over all-NaN or empty values is another indication that it is important to keep the empty sum and all-NaN sum consistent.

It would be extremely confusing to use different rules for resample/groupby than elsewhere.

well sum of all-NaN should be NaN as discussed many times
sure sum of empty can be 0

> well sum of all-NaN should be NaN as discussed many times

I think the majority opinion in https://mail.python.org/pipermail/pandas-dev/2017-November/000657.html is that the sum of all-NaN should be 0.

I agree we should be consistent, so if we choose the option of empty/all-NaN being 0 for the normal sum, it should also be the case for resample/groupby.
Although I kind of liked the NaNs in the upsampling, and this could be one of the cases that actually pops up quite often in user code, with a possibly larger impact on users when we change it. We could maybe consider raising a warning first that the behavior will change (and already accept the keyword to silence it).

> I agree we should be consistent, so if we choose the option of empty/all-NaN being 0 for the normal sum, it should also be the case for resample/groupby. Although I kind of liked the NaNs in the upsampling

I am similarly conflicted on this one, which is why I re-raised it before the call tomorrow.


Dipping my toes into this: I personally would be in favor of a separate function along similar lines to what @jorisvandenbossche said above.

In addition, the difference between the two sum / prod functions would be the handling of the zero-vs-NaN cases discussed in this issue, while the "core" (private) sum / prod functionality would handle all other cases (that's my current thinking).

One might consider calling these functions np_sum and np_prod because the results would mirror those of np.array([]).sum() and np.array([np.nan]).sum(), depending on whether you want to skip over NaN or not.

For example, if you want all-NaN to sum to NaN, pass in skipna=False. Otherwise, the sum is zero, as the element set that you're summing over would technically be empty.
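
For reference, the NumPy behavior those names would mirror:

```python
import numpy as np

np.array([]).sum()        # 0.0 -- the empty sum is zero
np.array([np.nan]).sum()  # nan -- without skipping, NaN propagates
np.nansum([np.nan])       # 0.0 -- skipping the NaNs leaves an empty sum (recent NumPy)
```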

@TomAugspurger to avoid confusion, should rename v0.22.0 in whatsnew to v0.23.0 and add a new v0.22.0

> Thoughts on this? Do we want .resample(higher_freq).sum() to introduce 0s or NaNs?

I just came across this behaviour today, and I was surprised/annoyed I didn't get 0s: partly because (as I have maybe repeated too much already) that's what I expected, but also because I needed to build a histogram, so I had to manually fillna() (and astype(int), as I was working with integers). If I had wanted NaNs, it would have been more natural to reindex.

@toobaz Yes, I think we decided anyway that it should be consistent, so once we change to have 0 the default, it will also be for resample.

