When applying multiple aggregations to columns and setting margins=True I receive a KeyError. I believe that because multiple aggregations are applied the columns become a MultiIndex, which is unexpected when computing margins.
>>> import pandas as pd
>>> import numpy as np
>>> import random
>>> df = pd.DataFrame({'random1': [random.random() for i in range(10)],
...                    'random2': [random.random() for i in range(10)],
...                    'type': ['duck', 'bird']*5},
...                   index=range(10,20))
>>> df.pivot_table(index='type',
...                aggfunc={'random1': [np.median, np.mean],
...                         'random2': np.sum},
...                margins=True)
KeyError Traceback (most recent call last)
<ipython-input> in <module>()
3 aggfunc={'random1': [np.median, np.mean],
4 'random2': np.sum},
----> 5 margins=True)
/pandas/util/decorators.pyc in wrapper(*args, **kwargs)
86 else:
87 kwargs[new_arg_name] = new_arg_value
---> 88 return func(*args, **kwargs)
89 return wrapper
90 return _deprecate_kwarg
/pandas/util/decorators.pyc in wrapper(*args, **kwargs)
86 else:
87 kwargs[new_arg_name] = new_arg_value
---> 88 return func(*args, **kwargs)
89 return wrapper
90 return _deprecate_kwarg
/pandas/tools/pivot.pyc in pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna)
145 if margins:
146 table = _add_margins(table, data, values, rows=index,
--> 147 cols=columns, aggfunc=aggfunc)
148
149 # discard the top level
/pandas/tools/pivot.pyc in _add_margins(table, data, values, rows, cols, aggfunc)
189 row_margin[k] = grand_margin[k]
190 else:
--> 191 row_margin[k] = grand_margin[k[0]]
192
193 margin_dummy = DataFrame(row_margin, columns=[key]).T
KeyError: 'random1'
The _exact_ same error occurs with
aggfunc={'random1': {'median': np.median, 'mean': np.mean},
'random2': np.sum}
These errors do not occur when margins=False
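Since the error does not occur with margins=False, one possible workaround is to build the table without margins and append the "All" row by hand. The following is only a sketch under stated assumptions, not how pandas computes margins internally: the sample values are made up, the aggregations are given as string names in lists, and the margin for each column is assumed to be the same aggregation applied to the whole column.

```python
import pandas as pd

# Made-up sample data with the same shape as the report above.
df = pd.DataFrame(
    {
        "random1": [0.1, 0.4, 0.3, 0.8, 0.5, 0.9, 0.2, 0.7, 0.6, 1.0],
        "random2": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
        "type": ["duck", "bird"] * 5,
    },
    index=range(10, 20),
)

aggfunc = {"random1": ["median", "mean"], "random2": ["sum"]}

# Per-group aggregation; margins is left at its default of False.
table = df.pivot_table(index="type", aggfunc=aggfunc)

# Assumption: the margin of each (column, function) pair is that
# function applied to the entire column, ignoring the grouping.
totals = [df[col].agg(func) for col, func in table.columns]

margin = pd.DataFrame(
    [totals],
    index=pd.Index(["All"], name=table.index.name),
    columns=table.columns,
)

# Append the hand-built margin row below the grouped rows.
result = pd.concat([table, margin])
print(result)
```

Note that the "All" value for the median is the median of all rows, not the median of the per-group medians.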
This is using pandas version 0.15.2. I have not seen any changes which would lead me to believe it has been fixed in master.
Happy to create a PR, although I haven't done one before.
xref to here: https://github.com/pydata/pandas/issues/10582
love to have a PR!
Is there any plan to resolve this issue in a future release?
@lukaszkiszka you could submit a PR to fix!
I'd like to have a look into this, is anyone working on it already? @lukaszkiszka?
@tmo go ahead. I was planning to look at this in a few weeks.
If you find a solution, I would be glad to be added as a reviewer.
Hi @tmo,
If you could check into this, it would be really great.
I would like to do it, but it is above my experience level to fix this in the pandas codebase.
Hello,
It has been quite some time since this issue saw any activity. Do you know if it will be resolved? I ran into it tonight while trying to use pandas for a very large pivot table (25M observations). I am receiving the error: IndexError: index 993158425 is out of bounds for axis 0 with size 993157686
This causes problems on a daily basis,
because pivot_table very frequently returns a MultiIndex result,
which fails with an error when margins=True is passed.
Four years is quite a long time, and it would be
really excellent if someone could help fix this.
anyone is welcome to submit a PR
pandas is all volunteer and there are 3000+ issues open
I will try to start (at least), and hopefully with some help we can fix this.
I hope it is not too hard ;)
Just a quick check that I am on the right path to a solution:
Is _add_margins the best place to start looking for the problem and fixing this issue,
or is there something else I should be aware of?
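One way to get oriented before reading _add_margins is to look at the column keys of the table that the margin code receives. The sketch below uses made-up data and string aggregation names (both are assumptions for illustration) and only shows that each key is a tuple whose first element is the original column name, which matches the k[0] lookup in the traceback. It does not reproduce the pandas internals.

```python
import pandas as pd

# Made-up sample data; only the shape of the result matters here.
df = pd.DataFrame(
    {
        "random1": [0.1, 0.4, 0.3, 0.8],
        "random2": [1.0, 2.0, 3.0, 4.0],
        "type": ["duck", "bird"] * 2,
    }
)

# Several aggregations per column, without margins.
table = df.pivot_table(
    index="type",
    aggfunc={"random1": ["median", "mean"], "random2": ["sum"]},
)

# The columns have two levels: (original column, aggregation name).
print(table.columns.nlevels)

column_keys = list(table.columns)
for key in column_keys:
    # key[0] is the original column name, as in grand_margin[k[0]].
    print(key, "->", key[0])
```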
Hi Stefan,
I think you are on the right page. I think the issue is a scaling issue when massive amounts of data are applied.
I think taking a test set i.e. here:
Using the MovieLens 25M dataset, it seems that at some point between the 1M and 25M sets pandas hits a limit on what the matrix can hold in memory.
Warm Regards,
Alfred
From: Stefan Simik notifications@github.com
Sent: Sunday, March 15, 2020 7:51 PM
To: pandas-dev/pandas pandas@noreply.github.com
Cc: Alfred Hull alfred.d.hull@gmail.com; Comment comment@noreply.github.com
Subject: Re: [pandas-dev/pandas] pivot_table throws exception with multiple aggregations per column and margins=True (#12210)
Stefan,
Thank you again for attempting this. :)
Warm Regards,
Alfred
Hello, I have the exact same issue with this bug. Does anyone have a solution for this?
Probably not, because there are not enough advanced pandas users who feel
the pain of this specific issue (even though it is a basic bug).
I wanted to check, but it is a little above my skill set to modify pandas internals.
But I still take it as a challenge to look into it at some point, because this issue has gone unfixed for a few years already...