Let's consider this DataFrame with an integer column and a string column:
import pandas as pd
df = pd.DataFrame({'col1': [0,1,1,0,1], 'col2':list('aabbc')})
When I want to use str.get_dummies on one column and want -1 instead of 1, I can do:
print (-df.col2.str.get_dummies())
a b c
0 -1 0 0
1 -1 0 0
2 0 -1 0
3 0 -1 0
4 0 0 -1
which is correct, but at first I tried with pd.get_dummies and I got a value of 255 instead of -1:
print (-pd.get_dummies(df.col2)) #or print (-pd.get_dummies(df[['col2']])) same result
a b c
0 255 0 0
1 255 0 0
2 0 255 0
3 0 255 0
4 0 0 255
If I use only the integer column col1, I get the same behavior, but if I use both columns at once, then I get the right result:
print (-pd.get_dummies(df))
col1 col2_a col2_b col2_c
0 0 -1 0 0
1 -1 -1 0 0
2 -1 0 -1 0
3 0 0 -1 0
4 -1 0 0 -1
As explained in the doc, if you don't specify the columns parameter, then _all the columns with object or category dtype will be converted_, so col1 is ignored, as expected. But if I pass both columns in the parameter, I get 255 instead of -1 again:
print (-pd.get_dummies(df, columns = ['col1','col2']))
col1_0 col1_1 col2_a col2_b col2_c
0 255 0 255 0 0
1 0 255 255 0 0
2 0 255 0 255 0
3 255 0 0 255 0
4 0 255 0 0 255
I found this with pd.__version__ == "0.24.2".
Note: if you multiply by -1 instead of using the unary minus sign, then print (-1*pd.get_dummies(df.col2)) gives the expected result.
pd.show_versions()
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.24.2
pytest: 4.0.2
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: None
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml.etree: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
The default data type for pd.get_dummies is uint8, so the 255 is the result of unsigned wraparound: negating 1 amounts to computing 0 - 1, which is not representable in uint8 and wraps to 255.
There is definitely some inconsistency here that needs to be addressed, but I'm curious what use case you have for assigning -1 instead of 1 here. The memory savings from the top-level pd.get_dummies are significant compared to the one called via the .str accessor (8 bits vs 64 bits per entry), so the smaller dtype may be preferable, apart from the overflow issue.
Thanks for the explanation. I just wanted to subtract the dummy columns of one column from those of another. I did not think of str.get_dummies right away. I'm not at a level where memory savings are my concern; this was more a result I stumbled on and tried to understand. But the fact that uint8 is the default data type makes perfect sense now. Thank you
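For reference, here is a minimal sketch of the wraparound described above. The dtype is passed explicitly (dtype=np.uint8); that is an assumption made so the example does not depend on the default of whichever pandas version is installed:

```python
import numpy as np
import pandas as pd

s = pd.Series(list("aabbc"))

# Request uint8 explicitly; the default dtype may differ between pandas versions.
dummies = pd.get_dummies(s, dtype=np.uint8)

# Unary minus keeps the uint8 dtype, so -1 is not representable and wraps around.
negated = -dummies

print(negated.dtypes.unique())
print(negated)

# Value in row 0, column "a" after negation.
wrapped_value = int(negated.loc[0, "a"])
```

Each entry takes one byte here, which is where the memory savings mentioned above come from.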
Yea so I think there's two issues here. One where the dtype between these calls is inconsistent and the other where the unary subtraction doesn't handle overflow. Cc @jbrockmendel for any thoughts on the latter
This is consistent with numpy's behavior, so I don't think we'd want to change how we handle it:
In [2]: a = np.array([0, 1], dtype='uint8')
In [3]: a
Out[3]: array([0, 1], dtype=uint8)
In [4]: -a
Out[4]: array([ 0, 255], dtype=uint8)
Could we maybe have pd.get_dummies default to int8 instead of uint8? It seems like that behavior would be more consistent with what users would expect while maintaining memory savings:
In [5]: a2 = np.array([0, 1], dtype='int8')
In [6]: a2
Out[6]: array([0, 1], dtype=int8)
In [7]: -a2
Out[7]: array([ 0, -1], dtype=int8)
Not sure if we have any special reliance on the unsigned nature of uint8 that would prevent us from using int8 though, so would need to investigate and double check that.
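Until the default is settled, a signed dtype can be requested per call through the existing dtype parameter of pd.get_dummies. The sketch below illustrates that for the subtraction use case mentioned earlier in the thread; the second series (right) is hypothetical data invented for the example, chosen so that both sides produce the same dummy columns:

```python
import numpy as np
import pandas as pd

left = pd.Series(list("aabbc"))
right = pd.Series(list("abcab"))  # hypothetical second column with the same categories

# int8 still uses one byte per entry but can represent -1.
left_d = pd.get_dummies(left, dtype=np.int8)
right_d = pd.get_dummies(right, dtype=np.int8)

# Negation and subtraction now stay within the signed range.
negated = -left_d
diff = left_d - right_d

print(negated)
print(diff)
```

With uint8 inputs, the same subtraction would wrap to 255 wherever the left entry is 0 and the right entry is 1.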
"Could we maybe have pd.get_dummies default to int8 instead of uint8?"
I think uint8 is typically used for bool in the NumPy space (or at least internally in Cythonized code as well) so I think this makes sense.
I'm going to repurpose this issue for the variance between the two ways of calling get_dummies as I think that is the main thing to address here
We're not as consistent about it as we could/should be, but in some places we raise instead of silently overflowing. Casting instead of overflowing would also make sense in this context
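To make the two alternatives concrete, here is a rough sketch of what "raise" versus "cast" could look like for unary negation of an unsigned array. The helper name negate_unsigned and its on_overflow argument are hypothetical and are not part of pandas or NumPy; this only illustrates the idea, not how pandas implements or would implement it:

```python
import numpy as np

def negate_unsigned(values, on_overflow="cast"):
    """Hypothetical helper: negate an array without silent wraparound."""
    arr = np.asarray(values)
    if arr.dtype.kind != "u":
        # Signed and float dtypes negate without wrapping for these small values.
        return -arr
    if on_overflow == "raise":
        if arr.any():
            raise OverflowError(f"cannot negate nonzero values in dtype {arr.dtype}")
        # All zeros: the negation is representable, so return an unchanged copy.
        return arr.copy()
    # "cast": promote to a dtype that can hold the negated values.
    # uint8 -> int16, uint16 -> int32, uint32 -> int64 (uint64 promotes to float64).
    target = np.result_type(arr.dtype, np.int8)
    return -arr.astype(target)

a = np.array([0, 1], dtype="uint8")
cast_result = negate_unsigned(a)
print(cast_result, cast_result.dtype)

try:
    negate_unsigned(a, on_overflow="raise")
    raised = False
except OverflowError:
    raised = True
print(raised)
```

The casting variant trades the one-byte storage for a wider dtype, which is the cost of keeping the arithmetic result correct.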