Pandas: pivot / unstack on a large data frame fails with int32 overflow

Created on 8 May 2019 · 42 comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

predictor_purchases_p = predictor_purchases.groupby(["ag", "artikelnr"])["som"].max().unstack().fillna(0)

or

predictor_purchases_p = predictor_purchases.pivot(index="ag", columns="artikelnr", values="som")

Problem description

I'm working on rather large data (>100 GB in memory) on a beefy server (3 TB RAM).
When refactoring my code from pandas 0.21 to the latest version, the pivot / unstack now raises an exception.

Unstacked DataFrame is too big, causing int32 overflow

I was able to eliminate the problem by modifying reshape.py: on line 121, change the dtype from np.int32 to np.int64:
num_cells = np.multiply(num_rows, num_columns, dtype=np.int64)
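To see concretely why the dtype matters: with the row and column counts reported later in this thread, the int32 product silently wraps around to a negative value, which is exactly what the guard in reshape.py detects (and, if such a wrapped size ever reached an array allocation, NumPy would complain about negative dimensions). A minimal sketch:

```python
import numpy as np

# Row/column counts reported later in this thread
num_rows, num_columns = 2_584_251, 4_539

# The int32 product silently wraps around to a negative value...
cells_i32 = np.multiply(num_rows, num_columns, dtype=np.int32)

# ...while the int64 product is correct
cells_i64 = np.multiply(num_rows, num_columns, dtype=np.int64)

print(cells_i32 < 0)   # True: this wrapped value is what trips the guard
print(cells_i64)       # 11729915289
```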

Expected Output

Not being limited by int32 dimensions when reshaping a data frame.
This feels like a restriction that should not be there.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: 0.12.1
IPython: 7.5.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.2
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: 0.3.0
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Numeric Reshaping

Most helpful comment

The point I'm trying to raise is: why is the number of cells limited to the max value of np.int32?
num_cells = np.multiply(num_rows, num_columns, dtype=np.int32)

This creates constraints when working with large data frames.
Basically, I'm proposing to change this to np.int64.

All 42 comments

Try on master; this was recently patched.

Corporate server, so long story short, I can only work with conda releases.
I'm not able to pull anything off GitHub and make a build.

https://github.com/pandas-dev/pandas/pull/23512/files

If you are trying to do this, approach your problem differently.

The point I'm trying to raise is: why is the number of cells limited to the max value of np.int32?
num_cells = np.multiply(num_rows, num_columns, dtype=np.int32)

This creates constraints when working with large data frames.
Basically, I'm proposing to change this to np.int64.

So you have 2B columns?

My current dataset has
RangeIndex: 2584251 entries
Columns: 4539 entries

num_cells = 2584251 * 4539 = 11,729,915,289 cells

So I have 11.7B cells.

Putting an int32 constraint on the number of cells is way too small for my datasets.
I'm quite sure this causes problems for other users.

By the way, in pandas 0.21 I could execute these unstacks without problems; size was not an issue.
Upgrading to pandas 0.24.2 removed this ability.

Yeah, I'm bumping into this as well. I'm trying to make 2.87 billion cells in my unstack. I saw in issue #20601 that throwing a more informative error was judged to make more sense than increasing from int32 to int64. I kind of agree with that assessment. It would be nice to have an option in unstack that lets you tell it you expect a very large number of cells and switches to an int64 index.

I'm not sure how difficult this would be or if it would be worth it to satisfy people with long lists of product-user pairs like me.

@Rblivingstone certainly would take a PR to allow int64 here; it’s not tested and that’s why we raise

I'm getting this error at 6 GB of memory use when pivoting the large MovieLens ratings.csv. So I agree that in this day and age we need int64.

Turns out I get the same error even after changing the reshape.py line to num_cells = np.multiply(num_rows, num_columns, dtype=np.int64), although the error definitely looks like it occurs there.

I get "ValueError: negative dimensions are not allowed" after changing the reshape.py line to num_cells = np.multiply(num_rows, num_columns, dtype=np.int64).

Any chance we have a different workaround? I only have 6000 columns...

I have the same issue. My data uses 9.9 GB of memory.

df.pivot_table(index='uno', columns=['programid'], values='avg_time')

ValueError: Unstacked DataFrame is too big, causing int32 overflow

pandas 0.25.0
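Until a fix lands, pivots like the one above can often sidestep the dense reshape entirely by building a scipy sparse matrix instead of a DataFrame. A sketch (`sparse_pivot` is a hypothetical helper, not a pandas API; it assumes scipy is available, and note that duplicate (index, column) pairs are summed by the COO format, so aggregate first if that is not what you want):

```python
import pandas as pd
from scipy import sparse

def sparse_pivot(df, index, columns, values):
    # Encode the row/column labels as integer codes, then place the
    # values into a COO sparse matrix instead of a dense frame.
    row_codes, row_labels = pd.factorize(df[index], sort=True)
    col_codes, col_labels = pd.factorize(df[columns], sort=True)
    mat = sparse.coo_matrix(
        (df[values].to_numpy(), (row_codes, col_codes)),
        shape=(len(row_labels), len(col_labels)),
    )
    return mat, row_labels, col_labels
```

mat.tocsr() then gives efficient row slicing, and row_labels / col_labels map matrix positions back to the original keys.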

@buddingPop @Code4SAFrankie @meganson are any of you interested in working on this?

We also need a reproducible example, if anyone can provide that.

You can download the ratings.csv file from
https://www.kaggle.com/grouplens/movielens-20m-dataset

Then create a pivot table as below to reproduce the problem:
pivotTable = ratings.pivot_table(index=['userId'], columns=['movieId'], values='rating')

We won't be able to include downloading that dataset in a unit test. I assume this can be reproduced with the right style of random data? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

Same issue here.
df.set_index(['Element', 'usix']).unstack().reset_index(col_level=1)
ValueError: Unstacked DataFrame is too big, causing int32 overflow
pandas: 0.23.4

@TomAugspurger Let me try to generate similar data with random data.

@TomAugspurger

This is a very naive test, but it seems to reproduce the error for me.

import random
import pandas as pd

row_cnt = 4000000
c1_unique_val_cnt = 1500000
c2_unique_val_cnt = 1600

c1_set = [i for i in range(c1_unique_val_cnt)]
c1 = [random.choice(c1_set) for i in range(row_cnt)]
c2_set = [i for i in range(c2_unique_val_cnt)]
c2 = [random.choice(c2_set) for i in range(row_cnt)]

df_test = pd.DataFrame({'c1': c1, 'c2': c2})
t = df_test.set_index(['c1', 'c2']).unstack()

Produces the error: "ValueError: Unstacked DataFrame is too big, causing int32 overflow"

Thanks @vengertsev. Here's a similar example that's a bit faster, since it vectorizes the column generation:

In [62]: df = pd.DataFrame(np.random.randint(low=0, high=1500000, size=(4000000, 2)), columns=['a', 'b'])

In [63]: df.set_index(['a', 'b']).unstack()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-63-2ee2ef4b1279> in <module>
----> 1 df.set_index(['a', 'b']).unstack()

~/sandbox/pandas/pandas/core/frame.py in unstack(self, level, fill_value)
   6311         from pandas.core.reshape.reshape import unstack
   6312
-> 6313         return unstack(self, level, fill_value)
   6314
   6315     _shared_docs[

~/sandbox/pandas/pandas/core/reshape/reshape.py in unstack(obj, level, fill_value)
    408     if isinstance(obj, DataFrame):
    409         if isinstance(obj.index, MultiIndex):
--> 410             return _unstack_frame(obj, level, fill_value=fill_value)
    411         else:
    412             return obj.T.stack(dropna=False)

~/sandbox/pandas/pandas/core/reshape/reshape.py in _unstack_frame(obj, level, fill_value)
    438             value_columns=obj.columns,
    439             fill_value=fill_value,
--> 440             constructor=obj._constructor,
    441         )
    442         return unstacker.get_result()

~/sandbox/pandas/pandas/core/reshape/reshape.py in __init__(self, values, index, level, value_columns, fill_value, constructor)
    135
    136         if num_rows > 0 and num_columns > 0 and num_cells <= 0:
--> 137             raise ValueError("Unstacked DataFrame is too big, causing int32 overflow")
    138
    139         self._make_sorted_values_labels()

ValueError: Unstacked DataFrame is too big, causing int32 overflow

@TomAugspurger, Nice, thanks!

Anyone interested in working on this now that we have a reproducible example?

@TomAugspurger


ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
----> 1 ratings = pd.read_csv('/home/ml/rating_stg.csv').groupby(['uno', 'program_title'])['view_percent'].mean().unstack()
      2 ratings.head()

~/anaconda3/lib/python3.7/site-packages/pandas/core/series.py in unstack(self, level, fill_value)
   3299         """
   3300         from pandas.core.reshape.reshape import unstack
-> 3301         return unstack(self, level, fill_value)
   3302
   3303     # ----------------------------------------------------------------------

~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py in unstack(obj, level, fill_value)
    394     unstacker = _Unstacker(obj.values, obj.index, level=level,
    395                            fill_value=fill_value,
--> 396                            constructor=obj._constructor_expanddim)
    397     return unstacker.get_result()
    398

~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py in __init__(self, values, index, level, value_columns, fill_value, constructor)
    122
    123         if num_rows > 0 and num_columns > 0 and num_cells <= 0:
--> 124             raise ValueError('Unstacked DataFrame is too big, '
    125                              'causing int32 overflow')
    126

ValueError: Unstacked DataFrame is too big, causing int32 overflow

Hello,

I am facing the same issue trying to create a pivot table. Can someone help?

Thank you

We need a PR to fix it in pandas.


Is this the only solution, you think? Maybe there is an alternative solution?
I am working on my final project and I have a large dataset; it's too late to change it.
Thank you

You could attempt to downgrade your pandas to 0.21 (see https://github.com/pandas-dev/pandas/issues/26314#issuecomment-490502770). I ran into a similar problem recently, downgraded, and it worked on one of my datasets. It isn't the best solution, but it might get you moving.

thank you

Can I have an idea of how you downgraded your version? I tried to install 0.21 but I get a lot of dependency errors with other libraries.
Thank you

pip install pandas==0.21

This is exactly the command I used.

try with Python 3.6

Any news on this? I'm failing to roll back to 0.21. My next course of action is to rewrite the vectorised function as a loop, but I'd rather not.

I got this error in version 0.25.3; any news on it?

Still open, still looking for a volunteer to fix it.

Any update on this?
Does downgrading pandas help?

pandas is an all-volunteer project; PRs from the community would be accepted for this or any other issue.

I have been stuck with this for some days and finally fixed it by changing int32 to int64. Would it be all right to send a pull request with it?

@KaonToPion please do.

@KaonToPion Please let us know once you check in... I would like to have this fix ASAP. Thanks a lot!

I am sorry, I am having issues installing the pandas development environment; that's why it's taking long.

@KaonToPion Also waiting on this fix. If you could please give it another try, that would be great. Thank you!

