Pandas: pd.get_dummies different default return type than .str.get_dummies

Created on 3 Jul 2019 · 6 Comments · Source: pandas-dev/pandas

Let's consider this DataFrame with one integer column and one string column:

import pandas as pd
df = pd.DataFrame({'col1': [0,1,1,0,1], 'col2':list('aabbc')})

When I want to use str.get_dummies on one column and want -1 instead of 1, I can do:

print (-df.col2.str.get_dummies())
   a  b  c
0 -1  0  0
1 -1  0  0
2  0 -1  0
3  0 -1  0
4  0  0 -1

which is correct, but at first I tried with pd.get_dummies and I got a value of 255 instead of -1:

print (-pd.get_dummies(df.col2)) #or print (-pd.get_dummies(df[['col2']])) same result
     a    b    c
0  255    0    0
1  255    0    0
2    0  255    0
3    0  255    0
4    0    0  255

If I use only the integer column col1, I get the same behavior, but if I use both columns at once, I get the right result:

print (-pd.get_dummies(df))
   col1  col2_a  col2_b  col2_c
0     0      -1       0       0
1    -1      -1       0       0
2    -1       0      -1       0
3     0       0      -1       0
4    -1       0       0      -1

As explained in the doc, if you don't specify the columns parameter, then _all the columns with object or category dtype will be converted_, so col1 is left untouched, as expected. But if I pass both columns in the parameter, I get 255 instead of -1 again:

print (-pd.get_dummies(df, columns = ['col1','col2']))
   col1_0  col1_1  col2_a  col2_b  col2_c
0     255       0     255       0       0
1       0     255     255       0       0
2       0     255       0     255       0
3     255       0       0     255       0
4       0     255       0       0     255

I found this with pd.__version__ == "0.24.2".
Note: if you multiply by -1 instead of using the unary minus sign, then print (-1*pd.get_dummies(df.col2)) gives the expected result.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: 4.0.2
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: None
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml.etree: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Dtypes


BenPwc

All 6 comments

The default data type for pd.get_dummies is uint8, so the 255 is the result of overflow when subtracting 1 from 0.

There is definitely some inconsistency here that needs to be addressed, but I'm curious what use case you have for assigning -1 instead of 1 here. Memory savings from the top-level pd.get_dummies will be significant compared to the one called via the .str accessor (8 bits vs 64 bits per entry), so the smaller dtype may be preferable, overflow issue aside.

WillAyd on 5 Jul 2019
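
The memory comparison above can be checked with a minimal sketch. It assumes a pandas version in which pd.get_dummies accepts a dtype argument; the int64 frame is built explicitly here to stand in for the 64-bit result the .str accessor is described as returning, and the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(list('aabbc'))

# 8 bits per entry: the uint8 dtype discussed in this thread.
small = pd.get_dummies(s, dtype='uint8')

# 64 bits per entry: stand-in for the int64 result of .str.get_dummies().
large = pd.get_dummies(s, dtype='int64')

# Sum the per-column byte counts, ignoring the index.
small_bytes = int(small.memory_usage(index=False).sum())
large_bytes = int(large.memory_usage(index=False).sum())

print(small_bytes, large_bytes)  # 15 120 (5 rows x 3 columns x 1 or 8 bytes)
```

On a frame this small the difference is negligible; it only starts to matter with many rows or many categories.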


Thanks for the explanation. I just wanted to subtract two columns from each other once I had obtained their dummy columns. I did not think of str.get_dummies right away. I'm not at a level where memory savings are my concern; this was more a randomly found result that I tried to understand. But the fact that uint8 is the default data type makes it all make sense now. Thank you.

BenPwc on 5 Jul 2019

Yeah, so I think there are two issues here: one where the dtype is inconsistent between these calls, and another where the unary minus doesn't handle overflow. Cc @jbrockmendel for any thoughts on the latter.

WillAyd on 5 Jul 2019

This is consistent with numpy's behavior, so I don't think we'd want to change how we handle it:

In [2]: a = np.array([0, 1], dtype='uint8')

In [3]: a
Out[3]: array([0, 1], dtype=uint8)

In [4]: -a
Out[4]: array([  0, 255], dtype=uint8)

Could we maybe have pd.get_dummies default to int8 instead of uint8? It seems like that behavior would be more consistent with what users would expect while maintaining memory savings:

In [5]: a2 = np.array([0, 1], dtype='int8')

In [6]: a2
Out[6]: array([0, 1], dtype=int8)

In [7]: -a2
Out[7]: array([ 0, -1], dtype=int8)

Not sure if we have any special reliance on the unsigned nature of uint8 that would prevent us from using int8 though, so would need to investigate and double check that.

jschendel on 5 Jul 2019
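
Independently of what the default ends up being, a signed dtype can be requested explicitly. The following is a minimal sketch, assuming a pandas version in which pd.get_dummies accepts a dtype argument; it reuses the DataFrame from the original report.

```python
import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 1, 0, 1], 'col2': list('aabbc')})

# Ask for a signed 8-bit dtype up front so unary minus cannot wrap around.
signed = pd.get_dummies(df.col2, dtype='int8')
negated = -signed

print(negated.dtypes)
print(negated)
```

This keeps the one-byte-per-entry footprint while giving -1 instead of 255.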


> Could we maybe have pd.get_dummies default to int8 instead of uint8?

I think uint8 is typically used for bool in the NumPy space (or at least internally in Cythonized code as well) so I think this makes sense.

I'm going to repurpose this issue for the variance between the two ways of calling get_dummies as I think that is the main thing to address here

WillAyd on 8 Jul 2019

We're not as consistent about it as we could/should be, but in some places we raise instead of silently overflowing. Casting instead of overflowing would also make sense in this context
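
As a user-side illustration of "casting instead of overflowing", here is a minimal sketch that casts an existing uint8 result to a signed dtype before negating it. It is not a proposal for how pandas itself should implement this; the variable names are illustrative, and dtype='uint8' is passed explicitly so the example does not depend on the library default.

```python
import pandas as pd

s = pd.Series(list('aabbc'))
dummies = pd.get_dummies(s, dtype='uint8')

# Negating the raw uint8 array wraps around, as in the NumPy example above.
wrapped = -dummies.to_numpy()

# Casting to a signed type first gives the mathematically expected values.
safe = -dummies.astype('int8')

print(wrapped.max())          # 255
print(safe.to_numpy().min())  # -1
```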