Pandas: Sum of all NA and empty should be 0

Created on 7 Dec 2017 · 39 comments · Source: pandas-dev/pandas

And the prod of those should be 1.

xref: https://github.com/pandas-dev/pandas/issues/9422, https://github.com/pandas-dev/pandas/pull/17630, https://mail.python.org/pipermail/pandas-dev/2017-November/000657.html

We need to implement that and design and implement the alternative (either a new method, or a keyword-argument to sum and prod).

Labels: API Design, Numeric

All 39 comments

I've been following this conversation a bit and remain confused on one point. Options 1 and 2 both appear to rely on the premise that Series([nan, nan]).sum() _should_ equal Series([]).sum().
Without an intuition for why this is desirable behavior, the merits of those two options are hard to evaluate. Is it just consistency with other software? LMK if this question belongs elsewhere.

Here's fine. I think the shortest answer is that it's nice to have s.sum(skipna=True) equal s.dropna().sum(skipna=True/False).
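
A concrete version of that identity (illustrative only; the values shown assume option 1, where the empty sum is 0):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

# The identity under discussion: skipping the NAs should give the same
# answer as dropping them first, i.e. the sum of an empty Series.
s.sum(skipna=True)   # should equal ...
s.dropna().sum()     # ... this, the empty sum,
pd.Series([]).sum()  # which is 0 under option 1.
```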

I would keep the more general discussion on the mailing list, and keep the details of the changes we then want to make here.

this is not happening for 0.21.1

> this is not happening for 0.21.1

I think the question is rather: do we want to release this quickly, or only in a few months? (and then once answered, we can still discuss whether we call that 0.21.1, 0.21.2 or 0.22.0). But let's keep that discussion in https://github.com/pandas-dev/pandas/issues/18244#issuecomment-350000655

Just to add a data point: I tried to update to 0.21 and a large number of our unit tests failed, so I came looking to see whether this had been reported as a bug yet. A lot of our code assumes pd.Series([]).sum() == 0 for simple things like dollar-weighted sums.

We won't be upgrading until this is fixed, and the fact that a breaking change like this was made has led us to seriously consider prohibiting pandas use within critical sections of our code base.

@ryanbressler that discussion is happening on the mailing list: https://mail.python.org/pipermail/pandas-dev/2017-November/000657.html

We're using the issue to design the fix.

I've started a branch for this at https://github.com/pandas-dev/pandas/compare/master...TomAugspurger:na-sum. I'll have a PR tomorrow, but briefly my approach was to add a parameter to sum and prod: empty_is_na. Here's the docstring for sum:

```
Signature: df.sum(axis=None, skipna=True, level=None, numeric_only=None, empty_is_na=False, **kwargs)
Docstring:
Return the sum of the values for the requested axis

Parameters
----------
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values before computing the result.
level : int or level name, default None
    If the axis is a MultiIndex (hierarchical), count along a
    particular level, collapsing into a Series
numeric_only : boolean, default None
    Include only float, int, boolean columns. If None, will attempt to use
    everything, then use only numeric data. Not implemented for
    Series.
empty_is_na : bool, default False
    The result of operating on an empty array should be NA. The default
    behavior is for the sum of an empty array to be 0, and the product
    of an empty array to be 1.

Returns
-------
sum : Series or DataFrame (if level specified)
File:      ~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/generic.py
Type:      method
```

```python
In [3]: pd.Series().sum()
Out[3]: 0.0

In [4]: pd.Series().sum(empty_is_na=True)
Out[4]: nan
```

@ryanbressler would you like to voice this experience on the mailing list?

@TomAugspurger and @jorisvandenbossche it looks like I need to subscribe to that list to post? I'll do that after some meetings I have this morning. Feel free to delete my comments if they aren't appropriate on this thread, but I must say expecting users to join the dev discussion list to report issues feels a bit unfriendly.

> I must say expecting users to join the dev discussion list to report issues feels a bit unfriendly.

Not to report issues, that's just already been done -- this issue :)

We're trying to keep the discussion from fragmenting: the mailing list for discussing what behavior we want, this issue for designing that behavior. Usually most discussion happens on GitHub; this happens to be on the mailing list since it's of broad interest, and not everybody can keep up with the traffic on the GitHub repository.


@TomAugspurger regarding your proposal above, that would only deal with the empty case? So that would not yet provide an option for returning NaN on an all NaN series?

Yep, so far just handling the empty case. My preference is for additional keywords to prod and sum, rather than a new method.

For the all-NA case,

```
Signature: df.sum(axis=None, skipna=True, level=None, numeric_only=None, empty_is_na=False, all_na_is_na=False, **kwargs)
Docstring:
Return the sum of the values for the requested axis

Parameters
----------
axis : {index (0), columns (1)}
skipna : boolean, default True
    Exclude NA/null values before computing the result.
level : int or level name, default None
    If the axis is a MultiIndex (hierarchical), count along a
    particular level, collapsing into a Series
numeric_only : boolean, default None
    Include only float, int, boolean columns. If None, will attempt to use
    everything, then use only numeric data. Not implemented for
    Series.
empty_is_na : bool, default False
    The result of operating on an empty array should be NA. The default
    behavior is for the sum of an empty array to be 0, and the product
    of an empty array to be 1.
all_na_is_na : bool, default False
    The result of operating on an all-NA array should be NA. The default
    behavior is for the sum of an all-NA array to be 0, and the product
    of an all-NA array to be 1.

Returns
-------
sum : Series or DataFrame (if level specified)
File:      ~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/generic.py
Type:      method
```

Are there any problematic interactions between these new keywords and skipna? Does empty_is_na apply before or after skipping the NA values? I'd propose that it applies before skipping, so that empty_is_na only has an effect when the original series is length-0.

So for an all-NA series, s = pd.Series([np.nan]):

  1. s.sum(skipna=True, empty_is_na=False, all_na_is_na=False) -> 0
  2. s.sum(skipna=True, empty_is_na=True, all_na_is_na=True) -> NaN
  3. s.sum(skipna=True, empty_is_na=True, all_na_is_na=False) -> 0
  4. s.sum(skipna=True, empty_is_na=False, all_na_is_na=True) -> NaN

Now skipna=False. How should we handle all_na_is_na?

  1. s.sum(skipna=False, empty_is_na=False, all_na_is_na=False) -> 0 (this differs from np.sum)
  2. s.sum(skipna=False, empty_is_na=True, all_na_is_na=True) -> NaN
  3. s.sum(skipna=False, empty_is_na=True, all_na_is_na=False) -> 0 (this differs from np.sum)
  4. s.sum(skipna=False, empty_is_na=False, all_na_is_na=True) -> NaN

Cases 1 and 3 concern me a bit. I'll think about it some more.

For an empty series, the only thing that matters is empty_is_na. With empty_is_na=False, the return value is 0. With empty_is_na=True, the return value is NA, regardless of skipna and all_na_is_na.
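
To make those rules concrete, here is a rough model of the semantics enumerated above (a sketch only, not the actual pandas implementation):

```python
import numpy as np

def model_sum(values, skipna=True, empty_is_na=False, all_na_is_na=False):
    """Rough model of the proposed two-keyword semantics (illustration only)."""
    arr = np.asarray(values, dtype=float)
    if len(arr) == 0:
        # Empty before any NA-skipping: governed solely by empty_is_na.
        return np.nan if empty_is_na else 0.0
    mask = np.isnan(arr)
    if mask.all():
        # All-NA: governed by all_na_is_na (cases 1-4 in both lists above).
        return np.nan if all_na_is_na else 0.0
    if not skipna and mask.any():
        # A mix of real values and NAs with skipna=False propagates NA.
        return np.nan
    return float(np.nansum(arr))

model_sum([np.nan], skipna=True)                     # 0.0 (case 1)
model_sum([np.nan], skipna=True, all_na_is_na=True)  # nan (case 4)
```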

> My preference is for additional keywords to prod and sum, rather than a new method.

I am personally also in favor of additional keyword(s) instead of a new method. I don't really like the empty_is_na and all_na_is_na names, and the fact that there are two additional keywords, but at least they are explicit and allow for controlling both cases separately (which might be difficult with a single keyword like min_count?)
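
For example, a hypothetical min_count-style keyword (assuming the semantics "NaN when fewer than min_count non-NA values remain", which is one plausible reading) could not tell an empty series apart from an all-NA one:

```python
import numpy as np

def sum_min_count(values, min_count=0):
    # Hypothetical min_count semantics (not actual pandas behavior):
    # NaN when fewer than `min_count` non-NA values remain after skipping.
    arr = np.asarray(values, dtype=float)
    kept = arr[~np.isnan(arr)]
    if len(kept) < min_count:
        return np.nan
    return float(kept.sum())

sum_min_count([], min_count=1)        # nan
sum_min_count([np.nan], min_count=1)  # nan -- indistinguishable from empty
```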

> s.sum(skipna=False, empty_is_na=False, all_na_is_na=False) -> 0 (this differs from np.sum)

This one concerns me as well. I think all_na_is_na only makes sense if you are skipping NAs. I can't really imagine where you would want Series([1, NA]).sum() to return NA but Series([NA, NA]).sum() to return 0.
If skipna=False, any occurrence of an NA should turn the result into NA. The deviation from numpy would also be unfortunate (as now skipna=False basically gives you numpy's sum).

> allow for controlling both cases separately (which might be difficult with a single keyword like min_count?)

I think I talked myself out of the need for two keywords. My empty_is_na keyword should only have an effect if len(s) is 0, in which case there aren't even any NaNs.

I'm not fond of empty_is_na / all_na_is_na either, but I find them easier to explain than min_count. Though making pd.Series([np.nan]).sum(skipna=False) return NaN (which we want) will complicate the explanation of all_na_is_na, so perhaps they're not easier to explain after all.

Indeed, do we really need two new keywords?

I think empty_is_na should suffice, to satisfy both:

  1. pd.Series([np.nan]).sum(skipna=True, empty_is_na=True) -> NaN
  2. pd.Series([]).sum(skipna=True, empty_is_na=True) -> NaN

For skipna=False, which we only need for completeness:

  1. pd.Series([np.nan]).sum(skipna=False, empty_is_na=True) -> NaN
  2. pd.Series([]).sum(skipna=False, empty_is_na=True) -> NaN

(empty_is_na=False would always have an empty or all-NaN sum be 0.)

The potentially compelling argument in favor of two different keyword arguments is that it gives you a way to get the legacy sum from pandas < 0.21 (without bottleneck). It would be an easy switch when migrating code, but even then I suspect only a very small number of applications need different behavior between an empty and an all-NaN sum. If need be, I would prefer empty_is_na='legacy' to adding a separate all_na_is_na argument.

I would just add fill_value=0 (if we really want option 1), or fill_value=None for option 2; then this would all work and be quite simple.

> (empty_is_na=False would always have an empty or all-NaN sum be 0.)

This seems strange, right? And that differs from pandas 0.20.3 with or without bottleneck, which always returned 0 for pd.Series([]).sum(skipna=False)

EDIT: sorry I misread. Ignore this.

@jreback would your fill_value only apply when skipna=True? Otherwise, we'd break pd.Series([np.nan]).sum(skipna=False) -> NaN.

For reference, here's the 0.20.3 behavior on two series, pd.Series([]) and pd.Series([np.nan]):

| case | use_bn | skipna | length | result |
| ---- | ------ | ------ | ------ | ------ |
| 0 | True | True | 0 | 0.0 |
| 1 | True | True | 1 | 0.0 |
| 2 | True | False | 0 | 0.0 |
| 3 | True | False | 1 | NaN |
| 4 | False | True | 0 | 0.0 |
| 5 | False | True | 1 | NaN |
| 6 | False | False | 0 | 0.0 |
| 7 | False | False | 1 | NaN |

```python
import itertools

import numpy as np
import pandas as pd

use_bn = [True, False]
skipna = [True, False]
series = [pd.Series(), pd.Series([np.nan])]

values = []
for bn, skip, s in itertools.product(use_bn, skipna, series):
    with pd.option_context("compute.use_bottleneck", bn):
        result = s.sum(skipna=skip)
        values.append((bn, skip, len(s), result))

r = pd.DataFrame(values, columns=['use_bn', 'skipna', 'length', 'result'])
r.index.name = 'case'
```

Could people give thoughts on this table for the desired behavior?

| case | skipna | length | empty_is_na | result |
| -----| ------ | ------ | ----------- | ------ |
| A | True | empty | False | 0 |
| B | True | empty | True | NaN |
| C | True | all-na | False | 0 |
| D | True | all-na | True | NaN? |
| E | False | empty | False | 0? |
| F | False | empty | True | NaN |
| G | False | all-na | False | NaN |
| H | False | all-na | True | NaN |

I think if we're OK with D being NaN, then we only need the one keyword (whatever it's called). The semantics for empty_is_na=True with skipna=True are then "Is the array empty after removing NaNs? If so, the result is NaN". If we want D to be 0, then we'll need another keyword, or a 'legacy' option.
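
A sketch of that one-keyword rule (again just a model, not an implementation):

```python
import numpy as np

def model_sum(values, skipna=True, empty_is_na=False):
    """Model of the single-keyword rule; illustration only."""
    arr = np.asarray(values, dtype=float)
    mask = np.isnan(arr)
    if not skipna and mask.any():
        return np.nan                          # cases G and H
    kept = arr[~mask]
    if len(kept) == 0:
        # "Empty after removing NaNs" covers both empty and all-NA input.
        return np.nan if empty_is_na else 0.0  # cases A-F
    return float(kept.sum())
```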

With case D=NaN, I don't think we achieve our goal of supporting users who want "NaN if everything is NA" though.

@TomAugspurger I agree with the rule "Is the array empty after removing NaNs? If so, the result is NaN"

To me, this implies D=NaN, exactly as you reasoned.

The other open entry in your table is case E, which should be 0. This matches the sum of an empty array in NumPy (np.sum([]) is 0.0).

> With case D=NaN, I don't think we achieve our goal of supporting users who want "NaN if everything is NA" though.

I'm confused. Isn't that exactly what D=NaN ensures? empty_is_na achieves the SQL-like behavior that (some) users pine for.

empty_is_na='legacy' would be for users who can't make up their mind between whether they're a DBA or a mathematician, and want a rule that makes sense "most" of the time. (Of course, that's the worst sort of consistency for APIs, because it suggests a behavior that doesn't hold in all edge cases.)

> I'm confused. Isn't that exactly what D=NaN ensures? empty_is_na achieves the SQL-like behavior that (some) users pine for.

Sorry, I think I was staring at that table for too long and started going crazy :)

> I'm not fond of empty_is_na / all_na_is_na either, but I find them easier to explain than min_count

+1

@jreback In the end, empty_is_na=True/False or fill_value=np.nan/0 would be equivalent in behaviour, I think, so it is a question of which name we like most.
But I think the "empty_is_na is easier to explain" argument also holds against fill_value, IMO: fill_value would actually be the value to return in the case of empty data (after skipping the NAs or not), so it would more accurately be something like "empty_result_value".

> @TomAugspurger I agree with the rule "Is the array empty after removing NaNs? If so, the result is NaN"
>
> To me, this implies D=NaN, exactly as you reasoned.

I agree too.

> empty_is_na='legacy' would be for users who can't make up their mind between whether they're a DBA or a mathematician

So that would provide the option of all-NA giving NA (so you are signaled that you have an all-NA column), but keep the empty sum 0.
Do we think this is a valid option to want? Because if so, we shouldn't call it "legacy" but something like "mixed" (although that certainly doesn't 'tell' you what it does either).

Something else, regarding "extra keyword vs new method": the advantage of a separate method is that for people who consistently want the non-default option, it could be much easier / less verbose to have an alternative method than to add empty_is_na=True all over their code base (of course they can write such an alternative function themselves, but then it is not a method on the dataframe).
(One problem with this, however, is that a possible name like .total() would be very confusing if we choose option 1 as the default, as then total() would be option 2, which is the other way around from SQL.)

> So that would provide the option of all-NA giving NA (so you are signaled that you have an all-NA column), but keep the empty sum 0.
> Do we think this is a valid option to want?

In all the user feedback on this issue, I do not recall any requests for this mixed behavior. So I think we can safely leave it out.

> The advantage of a separate method is that for people who consistently want the non-default option, it could be much easier / less verbose to have an alternative method than to add empty_is_na=True all over their code base

Yes, I was thinking that this morning as well...

Turning to groupby with categoricals, if we implement option 1, we'll want to have categoricals with no observations return 0 for sum.

```python
In [32]: df = pd.DataFrame({"A": pd.Categorical(['a', 'b', 'a'],
    ...:                                         categories=['a', 'b', 'c']),
    ...:                    'B': [1, 2, 1]})
    ...: result = df.groupby("A").B.sum()
    ...:

In [33]: result
Out[33]:
A
a    2
b    2
c    0
Name: B, dtype: int64
```

Currently, aggregations like np.nansum don't behave as I would expect:

```python
In [3]: pd.__version__
Out[3]: '0.20.3'

In [8]: df.groupby("A").B.agg(lambda x: np.nansum(x))
Out[8]:
A
a    2.0
b    2.0
c    NaN
Name: B, dtype: float64
```

Perhaps custom aggregation functions aren't called on an empty array to see what the return type is?

Another (large) place this affects is upsampling:

```python
In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.20.3'

In [3]: pd.Series(range(4), index=pd.date_range(start='2017', periods=4, freq='1H')).resample('30T').sum()
Out[3]:
2017-01-01 00:00:00    0.0
2017-01-01 00:30:00    NaN
2017-01-01 01:00:00    1.0
2017-01-01 01:30:00    NaN
2017-01-01 02:00:00    2.0
2017-01-01 02:30:00    NaN
2017-01-01 03:00:00    3.0
Freq: 30T, dtype: float64
```

On my branch:

```python
In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.22.0.dev0+364.g70e9bf903.dirty'

In [3]: pd.Series(range(4), index=pd.date_range(start='2017', periods=4, freq='1H')).resample('30T').sum()
Out[3]:
2017-01-01 00:00:00    0
2017-01-01 00:30:00    0
2017-01-01 01:00:00    1
2017-01-01 01:30:00    0
2017-01-01 02:00:00    2
2017-01-01 02:30:00    0
2017-01-01 03:00:00    3
Freq: 30T, dtype: int64
```

which is, I suspect, what we want, but another thing I'll make sure to document.

> which is, I suspect, what we want

Thoughts on this? Do we want .resample(higher_freq).sum() to introduce 0s or NaNs?

i am -1 on changing anything in groupby / resample

sum of NaN is NaN; anything else is completely confusing

A sum of empty/all-NaN values in resample should be no different than a sum of empty/all-NaN values elsewhere. So that means if we make sum of empty/all-NaN be 0, then it should be 0 for resample, too.

The fact that it's not entirely clear whether resampling is over all-NaN or empty values is another indication that it is important to keep the empty sum and all-NaN sum consistent.

It would be extremely confusing to use different rules for resample/groupby than elsewhere.

well sum of all-NaN should be NaN as discussed many times
sure sum of empty can be 0

> well sum of all-NaN should be NaN as discussed many times

I think the majority opinion in https://mail.python.org/pipermail/pandas-dev/2017-November/000657.html is that the sum of all-NaN should be 0.

I agree we should be consistent, so if we choose the option of empty/all-NaN being 0 for the normal sum, it should also be the case for resample/groupby.
Although I kind of liked the NaNs in the upsampling, and this could be one of the cases that actually pops up quite often in user code, with a possibly larger impact on users when we change it. We could maybe consider raising a warning first that the behavior will change (and already accept the keyword to silence it).

> I agree we should be consistent, so if we choose the option of empty/all-NaN being 0 for the normal sum, it should also be the case for resample/groupby. Although I kind of liked the NaNs in the upsampling

I am similarly conflicted on this one, which is why I re-raised it before the call tomorrow.


Dipping my toes into this: I personally would be in favor of a separate function along similar lines to what @jorisvandenbossche said above.

In addition, the difference between the two sum / prod functions would be the handling of the zero-vs-NaN cases discussed in this issue, while the "core" (private) sum / prod functionality would handle all other cases (that's my current thinking).

One might consider calling these functions np_sum and np_prod because the results would mirror those of np.array([]).sum() and np.array([np.nan]).sum(), depending on whether you want to skip over NaN or not.

For example, if you want all-NaN to sum to NaN, pass in skipna=False. Otherwise, the sum is zero, as the element set that you're summing over would technically be empty.
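
For reference, the NumPy behavior those names would mirror:

```python
import numpy as np

np.array([]).sum()        # 0.0 -- the empty sum is zero
np.array([np.nan]).sum()  # nan -- without skipping, NaN propagates
np.nansum([np.nan])       # 0.0 -- skipping the NaNs leaves an empty sum (recent NumPy)
```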

@TomAugspurger to avoid confusion, should rename v0.22.0 in whatsnew to v0.23.0 and add a new v0.22.0

> Thoughts on this? Do we want .resample(higher_freq).sum() to introduce 0s or NaNs?

I just came across this behaviour today, and I was surprised/annoyed I didn't get 0s: partly because (as I have maybe repeated too much already) that's what I expected, but also because I needed to build a histogram, so I had to manually fillna() (and astype(int), as I was working with integers). If I had wanted NaNs, it would have been more natural to reindex.

@toobaz Yes, I think we decided anyway that it should be consistent, so once we change to have 0 the default, it will also be for resample.

