Pandas: Crosstab Not Working with Duplicate Column Labels

Created on 28 Aug 2018  路  5Comments  路  Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

1. Example with 2 lambdas applied on same column

df = pd.DataFrame([[0, 1], [0, 1], [1, 0]], columns=['a', 'b'])
df
a b
0 0 1
1 0 1
2 1 0

pd.crosstab(df['a'].apply(lambda x: x), df['a'].apply(lambda x: x+1))
a 1 2
a
1 2 0
2 0 1

2. Example by manually renaming a column

df = pd.DataFrame([[0, 1], [0, 1], [1, 0]], columns=['a', 'b'])
df
a b
0 0 1
1 0 1
2 1 0

pd.crosstab(df['a'], df['b'])
b 0 1
a
0 0 2
1 1 0

s = df['b']
s.name = 'a'
pd.crosstab(df['a'], s)
a 0 1
a
0 1 0
1 0 2

In both cases, the output is not the one expected. The crosstab applies one Series to itself.

Problem description

I encountered the problem when using a crosstab on the same column with 2 different lambdas. The crosstab output a crosstab applying the second column to itself. It seems that crosstab is confused by the same name of the 2 series.

The problem is the same when the 2nd column is manually renamed with the name of the 1rst one, before applying the crosstab. See examples.

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 3.3.2
pip: 10.0.0
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Bug Reshaping

Most helpful comment

To be clear, this has nothing to do with lambdas, right? In the following, Out[26] is the expected output, but with the name a substituted for b?

In [25]: pd.crosstab(pd.Series([0, 0, 1, 1], name='a'), pd.Series([0, 1, 0, 1], name='a'))
Out[25]:
a  0  1
a
0  2  0
1  0  2

In [26]: pd.crosstab(pd.Series([0, 0, 1, 1], name='a'), pd.Series([0, 1, 0, 1], name='b'))
Out[26]:
b  0  1
a
0  1  1
1  1  1

All 5 comments

To be clear, this has nothing to do with lambdas, right? In the following, Out[26] is the expected output, but with the name a substituted for b?

In [25]: pd.crosstab(pd.Series([0, 0, 1, 1], name='a'), pd.Series([0, 1, 0, 1], name='a'))
Out[25]:
a  0  1
a
0  2  0
1  0  2

In [26]: pd.crosstab(pd.Series([0, 0, 1, 1], name='a'), pd.Series([0, 1, 0, 1], name='b'))
Out[26]:
b  0  1
a
0  1  1
1  1  1

I suppose we overwrite the key of https://github.com/pandas-dev/pandas/blob/95bc7d1038194ed10c8a86fd503e8f7a086396d1/pandas/core/reshape/pivot.py#L506

I'm not sure whether we should raise if there's conflicting names, or whether we should try to support this.

As I explained, I discovered the problem with lambda and then I tried to understand the problem. But you are right, the issue should be renamed: crosstab does not work with 2 different series having the same name.
What about suffixing the 2 series names by '_x' and '_y' if they are the same?

@TomAugspurger : Duplicate names have generally been quite problematic for pandas in general, but I'm generally not a fan of mangling columns unless we find that truly acceptable for sake of the support.

I made a PR that raises an error when trying to use duplicated column names. Does it make sense to you?

Was this page helpful?
0 / 5 - 0 ratings