Pandas: BUG: pd.crosstab fails when passed multiple columns, margins True and normalize True

Created on 6 Jul 2020 · 6 Comments · Source: pandas-dev/pandas

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] (optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                         "bar", "bar", "bar", "bar"],
                   "B": ["one", "one", "one", "two", "two",
                         "one", "one", "two", "two"],
                   "C": ["small", "large", "large", "small",
                         "small", "large", "small", "small",
                         "large"],
                   "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                   "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})

pd.crosstab(index=df.C,
            columns=[df.A, df.B],
            margins=True,
            margins_name='Sub-Total',
            normalize='all')

# raises ValueError: Sub-Total not in pivoted DataFrame

Problem description

#27663 found that `pd.crosstab` failed when `margins` and `normalize` were both set. This continues to be the case when more than one column is passed.

This is not the case when a single column and multiple rows are passed.
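
A possible workaround on affected versions, sketched here rather than taken from the thread: build the count table with margins but without `normalize`, then divide by the grand total. This assumes the non-normalized call succeeds on the affected version, which the report does not explicitly confirm. It only reproduces `normalize='all'`, and columns D and E are left out because the crosstab does not use them.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo"] * 5 + ["bar"] * 4,
    "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
    "C": ["small", "large", "large", "small", "small",
          "large", "small", "small", "large"],
})

# Step 1: plain counts with margins, leaving out the normalize argument.
counts = pd.crosstab(index=df.C,
                     columns=[df.A, df.B],
                     margins=True,
                     margins_name="Sub-Total")

# Step 2: the bottom-right cell of the margins table is the grand total.
grand_total = counts.iloc[-1, -1]

# Step 3: dividing every cell by it mimics normalize='all' with margins.
normalized = counts / grand_total
print(normalized)
```

Row-wise or column-wise normalization would need a different divisor (the margin column or the margin row), so this sketch should not be read as a general replacement for `normalize`.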

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.3
numpy : 1.18.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 47.1.1.post20200604
Cython : None
pytest : None
hypothesis : None
sphinx : 3.1.1
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : 1.2.9
numba : None

Bug Reshaping

HughKelley

All 6 comments

This still exists on the latest version.

CloseChoice on 6 Jul 2020

👍1

take

CloseChoice on 6 Jul 2020

I am seeing the same issue given the version below. Same error as og:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
   ...:                          "bar", "bar", "bar", "bar"],
   ...:                    "B": ["one", "one", "one", "two", "two",
   ...:                          "one", "one", "two", "two"],
   ...:                    "C": ["small", "large", "large", "small",
   ...:                          "small", "large", "small", "small",
   ...:                          "large"],
   ...:                    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
   ...:                    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})

In [3]: pd.crosstab(index=df.C,
   ...:             columns=[df.A,df.B],
   ...:             margins=True,
   ...:             margins_name='Sub-Total',
   ...:             normalize='all')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-8e93d08650b4> in <module>
      3             margins=True,
      4             margins_name='Sub-Total',
----> 5             normalize='all')

...

ValueError: Sub-Total not in pivoted DataFrame

In [4]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : db08276bc116c438d3fdee492026f8223584c477
python           : 3.7.9.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 18.7.0
Version          : Darwin Kernel Version 18.7.0: Mon Feb 10 21:08:45 PST 2020; root:xnu-4903.278.28~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.3
numpy            : 1.19.2
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.2.4
setuptools       : 50.3.1.post20201107
Cython           : None
pytest           : 6.1.1
hypothesis       : 5.37.4
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.18.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.5.2
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : 0.51.2

bdatko on 13 Nov 2020

Hi @bdatko, it looks like you have pandas 1.1.3 installed. The PR CloseChoice made to fix this issue was merged into 1.2. I believe if you install from GitHub instead of PyPI or a conda source you can include that code. For instance, `pip install git@github.com:pandas-dev/pandas.git` will do the trick, I think.

HughKelley on 13 Nov 2020

@HughKelley Thank you for the information. Reviewing the 1.2.0 release notes under Reshaping, I can see the bug was indeed fixed. I installed pandas 1.2.0 from GitHub using

pip install git+https://github.com/pandas-dev/pandas

and the og's example works as expected. =D
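
For anyone checking whether their environment already includes the fix, here is a minimal sketch that compares the installed pandas version against 1.2, the release named in this thread. The helper name `has_crosstab_fix` is hypothetical and not part of pandas. It only looks at the major and minor version numbers.

```python
import re

import pandas as pd


def has_crosstab_fix(version: str) -> bool:
    """Hypothetical helper: True if `version` is pandas 1.2 or newer."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (1, 2)


# Report the installed version and whether it should contain the fix.
print(pd.__version__, has_crosstab_fix(pd.__version__))
```

A development build installed from GitHub may report a version string with a suffix after the numbers. The regular expression ignores everything past the minor version, so such builds are classified by their leading `major.minor` only.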

A side note:
I couldn't install using your suggested pip command. When I run `pip install git@github.com:pandas-dev/pandas.git` I just get

$ pip install git@github.com:pandas-dev/pandas.git
ERROR: Invalid requirement: 'git@github.com:pandas-dev/pandas.git'
Hint: It looks like a path. File 'git@github.com:pandas-dev/pandas.git' does not exist.

bdatko on 16 Nov 2020

I couldn't install

Sorry, I use an SSH key to communicate with GitHub. You probably would have succeeded with https://github.com/pandas-dev/pandas.git instead, as it relies on HTTPS instead of SSH.

Glad you got it working.

HughKelley on 16 Nov 2020

👍1
