Pandas: PERF: groupby with many empty groups memory blowup

Created on 29 Dec 2019 · 22 comments · Source: pandas-dev/pandas

Suppose we have a Categorical with many unused categories:

cat = pd.Categorical(range(24), categories=range(10**5))

df = pd.DataFrame({"A": cat, "B": range(24), "C": range(24), "D": 1})

gb = df.groupby(["A", "B", "C"])

>>> gb.size()  # memory balloons to 9+ GB before I kill it

There are only 24 rows in this DataFrame, so we shouldn't be creating millions of groups.
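A rough back-of-envelope count shows why (illustrative arithmetic, not from the report): with the default observed=False, the grouper crosses all 10**5 categories of A with the 24 distinct values each of B and C.

>>> 10**5 * 24 * 24  # groups materialized by the default observed=False
57600000

That is roughly 5.8e7 groups of bookkeeping for 24 rows of data, which is consistent with memory climbing into the gigabytes.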

Without the Categorical, a grouping whose large cross-product implies many empty groups works fine:

df = pd.DataFrame({n: range(12) for n in range(8)})

gb = df.groupby(list(range(7)))
gb.size() # <-- works fine
Labels: Categorical, Groupby, Performance

Most helpful comment

I think it’s a good default for the original design semantics of Categorical. It’s a bad default for the memory-saving aspect of categorical.

All 22 comments

This is the point of the observed keyword.

So yeah, to second Jeff's comment above: is this a problem with observed=True?

Passing observed=True solves the problem. I'd like to add a warning or something for users who find themselves about to hit a MemoryError.
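For reference, reusing cat and df from the report above, the fix looks like:

gb = df.groupby(["A", "B", "C"], observed=True)

>>> gb.size()  # only the 24 observed groups; no blowup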

Did we ever discuss changing the default for the observed keyword? @TomAugspurger


I don't recall a discussion about that.

I vaguely recall that this will be somewhat solved by having a DictEncodedArray that has a similar data model to Categorical, without the unobserved / fixed categories semantics.

Hmm, I think that is orthogonal. IIUC the memory blowup is because by default we are generating Cartesian products. I'd rather just deprecate that and switch the default value for observed.

I think it’s a good default for the original design semantics of Categorical. It’s a bad default for the memory-saving aspect of categorical.

Sounds like we can close this as no-action?


I suppose we would show a PerformanceWarning if we detect this is about to happen, which is pretty cheap to do.

Might be worthwhile.
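A minimal sketch of what such a check might look like (check_group_count and its threshold are hypothetical, not pandas API; the estimate mirrors how the default observed=False crosses all categories of categorical keys with the observed values of the other keys):

import math
import warnings

import pandas as pd

def check_group_count(df, keys, threshold=10**6):
    # Hypothetical pre-flight check: estimate how many groups the
    # default observed=False would materialize and warn if it is large.
    sizes = []
    for key in keys:
        col = df[key]
        if isinstance(col.dtype, pd.CategoricalDtype):
            # Categorical keys contribute all categories, observed or not.
            sizes.append(len(col.cat.categories))
        else:
            # Other keys contribute only their observed values.
            sizes.append(col.nunique())
    n_groups = math.prod(sizes)
    if n_groups > threshold:
        warnings.warn(
            f"groupby would materialize ~{n_groups} groups for "
            f"{len(df)} rows; consider passing observed=True",
            pd.errors.PerformanceWarning,
        )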

I cannot reproduce the issue above with cat = pd.Categorical(range(24), categories=range(10**5)) on pandas 0.25.3.

When upgrading from pandas 0.24.2 to 0.25.3, I had a memory issue with a groupby().agg() on a DataFrame with 100,000 rows and 11 columns. I used 8 grouping variables with a mix of categorical and character variables, and the grouping operation was using over 8 GB of memory.

Setting the argument observed=True:

df.groupby(index, observed=True)

fixed the memory issue.

Related Stack Overflow question: Pandas v 0.25 groupby with many columns gives memory error.

Maybe observed=True should be the default? At least beyond a certain ratio of observed to all possible combinations. When the number of observed combinations of categorical values is far lower than the number of all possible combinations, it clearly doesn't make sense to use observed=False. Is there a discussion of why the default was set to observed=False?

What actions should be taken in this issue? Changing the default, deprecating the argument, documenting, or something else? It is not clear from the current discussion.
Could anyone provide a use case for observed=False?

We're not changing the behavior. The only proposal that's on the table is detecting and warning when we have / are about to allocate too much memory.

Could anyone provide a use case for observed=False?

When you have a fixed set of categories that should persist across operations (e.g. survey results).
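A small illustration of that (toy data, assuming a fixed 1-5 survey scale):

import pandas as pd

ratings = pd.Categorical([1, 2, 2, 5], categories=[1, 2, 3, 4, 5])
df = pd.DataFrame({"rating": ratings, "respondent": range(4)})

# observed=False keeps the empty categories in the result, so every
# point on the scale appears, including the unchosen 3 and 4.
df.groupby("rating", observed=False).size()
# rating
# 1    1
# 2    2
# 3    0
# 4    0
# 5    1
# dtype: int64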

Thanks. It is a pretty significant regression; not a very new one at this point, but still.

And how is that useful for grouping? Are empty groups returned?
Usually this is addressed by a right outer join against such a categorical dictionary table.

What's the regression? All categories are present in the output of a groupby by design, including unobserved categories.

The regression is that code that worked fine on 0.24.2 is now hitting a MemoryError.

Did it have the correct output in 0.24.2? In pandas 0.24.2 I have

In [8]: cat = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])

In [9]: df = pd.DataFrame({"A": cat, "B": [1, 2], "C": [3, 4]})

In [10]: df.groupby(["A", "B"]).size()
Out[10]:
A  B
a  1    1
b  2    1
dtype: int64

which is incorrect. So it sounds like you were relying on buggy behavior.
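For reference, the intended output (assuming the default observed=False on a version where the bug is fixed) crosses every category of A with the observed values of B:

df.groupby(["A", "B"]).size()
# A  B
# a  1    1
#    2    0
# b  1    0
#    2    1
# c  1    0
#    2    0
# dtype: int64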

@jangorecki
Could anyone provide a use case for a observed=False?

@TomAugspurger
When you have a fixed set of categories that should persist across operations (e.g. survey results).

This was in the context of multiple categorical variables used as the groupby index. Persisting a fixed set of categories (inside each categorical variable) is a different issue, i.e. maybe the observed argument should be disentangled into two arguments: (1) one argument to simply keep unobserved categories; for example, observed=False would tell pandas to keep unobserved categories inside the categorical variable (i.e. not change the categories); and (2) another argument to tell pandas whether to materialize groups for all possible combinations of the categorical variables; for example, preallocate=True would do so and preallocate=False would not.
But I might have misunderstood what @TomAugspurger meant.

Sorry @paulrougieux, I don't follow your proposal. Could you add an example of what the two keywords would do?

observed=True/False doesn't affect the dtype of the index. That will always be the same dtype as the grouper.

In [7]: pd.Series([1, 2, 3]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=True).sum().index.dtype
Out[7]: CategoricalDtype(categories=['a', 'b'], ordered=False)

In [8]: pd.Series([1, 2, 3]).groupby(pd.Categorical(['a', 'a', 'a'], categories=['a', 'b']), observed=False).sum().index.dtype
Out[8]: CategoricalDtype(categories=['a', 'b'], ordered=False)

It only controls whether the aggregation function is applied to the unobserved categories' groups as well.
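A quick contrast with the same grouper (toy data; the empty group's sum is 0):

import pandas as pd

s = pd.Series([1, 2, 3])
grouper = pd.Categorical(['a', 'a', 'a'], categories=['a', 'b'])

# Only the observed category 'a' gets a row:
s.groupby(grouper, observed=True).sum()
# a    6
# dtype: int64

# The unobserved category 'b' also gets a row, filled with the empty sum:
s.groupby(grouper, observed=False).sum()
# a    6
# b    0
# dtype: int64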

@TomAugspurger I misunderstood your previous comment about the "fixed set of categories that should persist across operations". Thanks for clarifying.

I'm interested in working on this, and the related issue (#34162)!

I'll dig into it and post my progress on the thread.

@arw2019 the recently discussed #32918 looks to be related as well.

Trying to think through the examples here and whether my usual grouping columns make SQL GROUP BY semantics (observed combinations only) more or less desirable as the default. E.g., for the case of grouping by cities and states: they're both factors and they're not independent. I guess you could say that they should be a single city-state factor?

When you have a fixed set of categories that should persist across operations (e.g. survey results).

FWIW, I think this is what I use unstack for but am not sure. Could you post an example workflow to compare?

Edit: Well, I just ran into a case where the factors are independent, and I was relying on the current default behavior. You want the Cartesian product in (almost?) all of these cases. E.g., something like counts of species at field sites. Just because you don't observe a deer doesn't mean that you couldn't have observed a deer. But, yeah, I think typically I use an unstack and/or an outer join to solve this. I'll have to pay attention to this keyword in almost every case now that I know about it.
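For comparison, a sketch of that unstack-based workflow (toy field-survey data; 'fox' is a valid species that was never sighted):

import pandas as pd

df = pd.DataFrame({
    "site": ["north", "north", "south"],
    "species": pd.Categorical(["deer", "owl", "owl"],
                              categories=["deer", "owl", "fox"]),
})

# Group over observed pairs only, avoiding the blowup...
counts = df.groupby(["site", "species"], observed=True).size()

# ...then recover the full site x species grid: unstacking the
# categorical level expands it to every category.
counts.unstack("species", fill_value=0)
#         deer  owl  fox
# site
# north      1    1    0
# south      0    1    0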
