Pandas: pivot / unstack on a large data frame fails with int32 overflow

Created on 8 May 2019 · 42 comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

predictor_purchases_p = predictor_purchases.groupby(["ag", "artikelnr"])["som"].max().unstack().fillna(0)

or

predictor_purchases_p = predictor_purchases.pivot(index="ag", columns="artikelnr", values="som")

Problem description

I'm working on rather large data (>100 GB in memory) on a beefy server (3 TB RAM).
When refactoring my code from pandas 0.21 to the latest version, the pivot / unstack now raises an exception.

Unstacked DataFrame is too big, causing int32 overflow

I was able to eliminate the problem by modifying reshape.py: on line 121, change the dtype from np.int32 to np.int64:
num_cells = np.multiply(num_rows, num_columns, dtype=np.int64)
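To see concretely why the dtype matters: with the row and column counts reported later in this thread, the int32 product silently wraps around to a negative value, which is exactly what the guard in reshape.py detects (and, if such a wrapped size ever reached an array allocation, NumPy would complain about negative dimensions). A minimal sketch:

```python
import numpy as np

# Row/column counts reported later in this thread
num_rows, num_columns = 2_584_251, 4_539

# The int32 product silently wraps around to a negative value...
cells_i32 = np.multiply(num_rows, num_columns, dtype=np.int32)

# ...while the int64 product is correct
cells_i64 = np.multiply(num_rows, num_columns, dtype=np.int64)

print(cells_i32 < 0)   # True: this wrapped value is what trips the guard
print(cells_i64)       # 11729915289
```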

Expected Output

Not being limited by int32 dimensions when reshaping a data frame.
This feels like a restriction that should not be there.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: 0.12.1
IPython: 7.5.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.2
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: 0.3.0
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Numeric Reshaping

Most helpful comment

The point I'm trying to raise is: why is the number of cells limited to the max value of np.int32?
num_cells = np.multiply(num_rows, num_columns, dtype=np.int32)

This creates constraints when working with large data frames.
Basically, I'm proposing to change this to np.int64.

All 42 comments

Try on master; this was recently patched.

Corporate server, so long story short, I can only work with conda releases.
I'm not able to pull anything off GitHub and make a build.

https://github.com/pandas-dev/pandas/pull/23512/files

If you are trying to do this, approach your problem differently.

The point I'm trying to raise is: why is the number of cells limited to the max value of np.int32?
num_cells = np.multiply(num_rows, num_columns, dtype=np.int32)

This creates constraints when working with large data frames.
Basically, I'm proposing to change this to np.int64.

So you have 2B columns?

My current dataset has
RangeIndex: 2584251 entries
Columns: 4539 entries

num_cells = 2584251 * 4539 = 11,729,915,289 cells

So I have 11.7B cells.

Putting an int32 constraint on the number of cells is way too small for my datasets.
I'm quite sure this causes problems for other users.

By the way, in pandas 0.21 I could execute these unstacks without problems; size was not an issue.
Upgrading to pandas 0.24.2 removed this ability.

Yeah, I'm bumping into this as well. I'm trying to make 2.87 billion cells in my unstack. I saw in issue #20601 that throwing a more informative error was judged to make more sense than increasing from int32 to int64. I kind of agree with that assessment. It would be nice to have an option in unstack that lets you tell it you expect a very large number of cells and switches to an int64 index.

I'm not sure how difficult this would be or if it would be worth it to satisfy people with long lists of product-user pairs like me.

@Rblivingstone certainly would take a PR to allow int64 here; it’s not tested and that’s why we raise

I'm getting this error at 6 GB of memory use when pivoting the large MovieLens ratings.csv. So I agree that in this day and age we need int64.

Turns out I get the same error even after changing the reshape.py line to num_cells = np.multiply(num_rows, num_columns, dtype=np.int64), although the error definitely looks like it occurs there.

I get "ValueError: negative dimensions are not allowed" after changing the reshape.py line to num_cells = np.multiply(num_rows, num_columns, dtype=np.int64).

Any chance we have a different workaround? I only have 6000 columns...

I have the same issue. My data uses 9.9 GB of memory.

df.pivot_table(index='uno', columns=['programid'], values='avg_time')

ValueError: Unstacked DataFrame is too big, causing int32 overflow

pandas 0.25.0
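Until a fix lands, pivots like the one above can often sidestep the dense reshape entirely by building a scipy sparse matrix instead of a DataFrame. A sketch (`sparse_pivot` is a hypothetical helper, not a pandas API; it assumes scipy is available, and note that duplicate (index, column) pairs are summed by the COO format, so aggregate first if that is not what you want):

```python
import pandas as pd
from scipy import sparse

def sparse_pivot(df, index, columns, values):
    # Encode the row/column labels as integer codes, then place the
    # values into a COO sparse matrix instead of a dense frame.
    row_codes, row_labels = pd.factorize(df[index], sort=True)
    col_codes, col_labels = pd.factorize(df[columns], sort=True)
    mat = sparse.coo_matrix(
        (df[values].to_numpy(), (row_codes, col_codes)),
        shape=(len(row_labels), len(col_labels)),
    )
    return mat, row_labels, col_labels
```

mat.tocsr() then gives efficient row slicing, and row_labels / col_labels map matrix positions back to the original keys.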

@buddingPop @Code4SAFrankie @meganson are any of you interested in working on this?

We also need a reproducible example, if anyone can provide that.

You can download the ratings.csv file from
https://www.kaggle.com/grouplens/movielens-20m-dataset

Then create a pivot table as below to reproduce the problem:
pivotTable = ratings.pivot_table(index=['userId'], columns=['movieId'], values='rating')

We won't be able to include downloading that dataset in a unit test. I assume this can be reproduced with the right style of random data? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

Same issue here.
df.set_index(['Element', 'usix']).unstack().reset_index(col_level=1)
ValueError: Unstacked DataFrame is too big, causing int32 overflow
pandas: 0.23.4

@TomAugspurger Let me try to generate similar data with random data.

@TomAugspurger

This is a very naive test, but it seems to reproduce the error for me.

import random
import pandas as pd

row_cnt = 4000000
c1_unique_val_cnt = 1500000
c2_unique_val_cnt = 1600

c1_set = [i for i in range(c1_unique_val_cnt)]
c1 = [random.choice(c1_set) for i in range(row_cnt)]
c2_set = [i for i in range(c2_unique_val_cnt)]
c2 = [random.choice(c2_set) for i in range(row_cnt)]

df_test = pd.DataFrame({'c1': c1, 'c2': c2})
t = df_test.set_index(['c1', 'c2']).unstack()

Produces the error: "ValueError: Unstacked DataFrame is too big, causing int32 overflow"

Thanks @vengertsev. Here's a similar example that's a bit faster, since it vectorizes the column generation:

In [62]: df = pd.DataFrame(np.random.randint(low=0, high=1500000, size=(4000000, 2)), columns=['a', 'b'])

In [63]: df.set_index(['a', 'b']).unstack()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-63-2ee2ef4b1279> in <module>
----> 1 df.set_index(['a', 'b']).unstack()

~/sandbox/pandas/pandas/core/frame.py in unstack(self, level, fill_value)
   6311         from pandas.core.reshape.reshape import unstack
   6312
-> 6313         return unstack(self, level, fill_value)
   6314
   6315     _shared_docs[

~/sandbox/pandas/pandas/core/reshape/reshape.py in unstack(obj, level, fill_value)
    408     if isinstance(obj, DataFrame):
    409         if isinstance(obj.index, MultiIndex):
--> 410             return _unstack_frame(obj, level, fill_value=fill_value)
    411         else:
    412             return obj.T.stack(dropna=False)

~/sandbox/pandas/pandas/core/reshape/reshape.py in _unstack_frame(obj, level, fill_value)
    438             value_columns=obj.columns,
    439             fill_value=fill_value,
--> 440             constructor=obj._constructor,
    441         )
    442         return unstacker.get_result()

~/sandbox/pandas/pandas/core/reshape/reshape.py in __init__(self, values, index, level, value_columns, fill_value, constructor)
    135
    136         if num_rows > 0 and num_columns > 0 and num_cells <= 0:
--> 137             raise ValueError("Unstacked DataFrame is too big, causing int32 overflow")
    138
    139         self._make_sorted_values_labels()

ValueError: Unstacked DataFrame is too big, causing int32 overflow

@TomAugspurger, Nice, thanks!

Anyone interested in working on this now that we have a reproducible example?

@TomAugspurger


ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
----> 1 ratings = pd.read_csv('/home/ml/rating_stg.csv').groupby(['uno', 'program_title'])['view_percent'].mean().unstack()
      2 ratings.head()

~/anaconda3/lib/python3.7/site-packages/pandas/core/series.py in unstack(self, level, fill_value)
   3299         """
   3300         from pandas.core.reshape.reshape import unstack
-> 3301         return unstack(self, level, fill_value)
   3302
   3303     # ----------------------------------------------------------------------

~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py in unstack(obj, level, fill_value)
    394     unstacker = _Unstacker(obj.values, obj.index, level=level,
    395                            fill_value=fill_value,
--> 396                            constructor=obj._constructor_expanddim)
    397     return unstacker.get_result()
    398

~/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/reshape.py in __init__(self, values, index, level, value_columns, fill_value, constructor)
    122
    123         if num_rows > 0 and num_columns > 0 and num_cells <= 0:
--> 124             raise ValueError('Unstacked DataFrame is too big, '
    125                              'causing int32 overflow')
    126

ValueError: Unstacked DataFrame is too big, causing int32 overflow

Hello,

I am facing the same issue trying to create a pivot table. Can someone help?

Thank you

We need a PR to fix it in pandas.


Is this the only solution, you think? Maybe there is an alternative solution?
I am working on my final project and I have a large dataset; it's too late to change it.
Thank you

You could attempt to downgrade your pandas to 0.21 (see https://github.com/pandas-dev/pandas/issues/26314#issuecomment-490502770). I ran into a similar problem recently, downgraded, and it worked on one of my datasets. It isn't the best solution, but it might get you moving.

thank you

Can I have an idea of how you downgraded your version? I tried to install 0.21 but I get a lot of dependency errors with other libraries.
Thank you

pip install pandas==0.21

This is exactly the command I used.

try with Python 3.6

Any news on this? I'm failing to roll back to 0.21. My next course of action is to rewrite the vectorised function as a loop, but I'd rather not.

I got this error in version 0.25.3; any news on it?

Still open, still looking for a volunteer to fix it.

Any update on this?
Does downgrading pandas help?

pandas is an all-volunteer project; PRs from the community would be accepted for this or any other issue.

I have been stuck with this for some days and finally fixed it by changing int32 to int64. Would it be all right to send a pull request with it?

@KaonToPion please do.

@KaonToPion Please let us know once you check in... I would like to have this fix ASAP. Thanks a lot!

I am sorry, I am having issues installing the pandas development environment; that's why it's taking long.

@KaonToPion Also waiting on this fix. If you could please give it another try, that would be great. Thank you!

