Pandas: append / concat category with different categories fails in HDF5

Created on 8 Nov 2016 · 9 Comments · Source: pandas-dev/pandas

Hi,

Just a remark: I guess it must be fairly complicated to replicate the pandas behaviour (appending categoricals with different categories) in HDF5.

Currently this fails:

pd1.to_hdf(store_file,"/ISE_nombre_de_tabla",format="table",append=True)
pd2.to_hdf(store_file,"/ISE_nombre_de_tabla",format="table",append=True)

.
.
.

1698             if new_metadata is not None and cur_metadata is not None \
1699                     and not array_equivalent(new_metadata, cur_metadata):
-> 1700                 raise ValueError("cannot append a categorical with "
1701                                  "different categories to the existing")
1702 

ValueError: cannot append a categorical with different categories to the existing
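
The check in the traceback compares the categories of the incoming data with those already stored. A rough in-memory approximation of that comparison is sketched below. The helper name `same_categories` and the sample Series are hypothetical, and `Index.equals` stands in for the internal `array_equivalent` call, so treat this as an illustration rather than what pandas does internally.

```python
import pandas as pd


def same_categories(existing: pd.Series, incoming: pd.Series) -> bool:
    """Return True when both categorical Series share the same categories in the same order."""
    return existing.cat.categories.equals(incoming.cat.categories)


# Hypothetical sample data, not taken from the original report.
first = pd.Series(["a", "b"], dtype="category")        # categories: a, b
second = pd.Series(["b", "c"], dtype="category")       # categories: b, c
third = pd.Series(["b", "a", "a"], dtype="category")   # categories: a, b

mismatch = same_categories(first, second)  # differing categories: an append like the one above would be rejected
match = same_categories(first, third)      # identical categories, even though the values differ
```

Running such a check before calling to_hdf with append=True would at least show in advance whether the append is going to raise.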

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

Categorical Enhancement IO HDF5

Source

littlegreenbean33

All 9 comments

What exactly is your question, or the point you want to make? Do you want this not to fail?

jorisvandenbossche on 8 Nov 2016

Yes, I would expect HDF5 append to reflect the same behaviour that was implemented in pandas 0.19 with regard to concat/append of categoricals.

This should not fail. Maybe a feature for a future version.

Cheers

littlegreenbean33 on 8 Nov 2016

👍4
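
For reference, the in-memory behaviour referred to above can be sketched as follows. The sample Series are hypothetical. Concatenating categoricals with different categories does not raise but falls back to a non-categorical dtype, while `union_categoricals` keeps the result categorical by merging the category sets.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Hypothetical sample data with overlapping but different categories.
s1 = pd.Series(["a", "b"], dtype="category")
s2 = pd.Series(["b", "c"], dtype="category")

# concat succeeds, but the result is no longer categorical.
combined = pd.concat([s1, s2], ignore_index=True)
still_categorical = isinstance(combined.dtype, pd.CategoricalDtype)

# union_categoricals merges the category sets and stays categorical.
merged = union_categoricals([s1, s2])
merged_categories = list(merged.categories)
merged_values = list(merged)
```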

This is not so easy to do and would require someone to step up and implement it.

jreback on 8 Nov 2016

As of today, does anyone have any workaround or possible solution?

I tried to create a dummy table that would contain all my categories, to later append the actual data, and finally drop the first _n_ rows that have the dummy data, but it didn't work. I get the same error:

ValueError: cannot append a categorical with different categories to the existing

iipr on 19 May 2017

@iipr issue is still open, if you're interested in submitting a fix. I don't believe anyone is working on this at the moment.

TomAugspurger on 19 May 2017

union_categoricals does this, so it's a matter of using it internally when appending. The key thing is that you cannot change existing code->label mappings, as these are already written/coded.

jreback on 19 May 2017
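
The constraint on code->label mappings can be illustrated in memory. In this sketch (sample data is hypothetical), putting the already-stored categorical first in the union keeps its codes unchanged and only the incoming data is recoded. Reversing the order yields a different category order and therefore different codes for the stored values.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Hypothetical stand-ins for data already on disk and data to be appended.
stored = pd.Categorical(["x", "y", "x"])   # categories: x, y
incoming = pd.Categorical(["z", "y"])      # categories: y, z

# Stored data first: its categories keep their positions, new ones are appended.
merged = union_categoricals([stored, incoming])
stored_part_codes = list(merged.codes[: len(stored)])

# The incoming data is recoded against the merged categories.
recoded = incoming.set_categories(merged.categories)

# Reversed order: 'x' moves to the end, so the stored codes would no longer be valid.
reversed_merge = union_categoricals([incoming, stored])
```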

Hi all,

I think I discovered a bug related to this but not exactly the same issue.

The problem is that using set_categories causes trouble when storing into HDF. Here's a minimal working example:

import pandas as pd

df = pd.DataFrame({'idx':[1,2],'val':['A','B']})
df['val'] = df['val'].astype('category')

df['val'].cat.set_categories(['C','B','A'],inplace=True)
df.to_hdf('test.h5','test',format='table',data_columns=True)

temp_cats = ['A','B']
query_string = "val in %s" % temp_cats
stored_df = pd.read_hdf('test.h5','test',where=query_string)

print(len(stored_df)) # Prints 0, i.e. no rows in stored_df

I would expect stored_df to equal the original dataframe but it's empty.

If you remove the line df['val'].cat.set_categories(['C','B','A'],inplace=True), things work as intended. Similarly, if you change that line to df['val'].cat.set_categories(['A','B','C'],inplace=True), it works as intended.

Interestingly, it also works if I do df['val'].cat.add_categories(['C'],inplace=True), which gives the same effect (I think) and is my current workaround.
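
A minimal in-memory sketch of how the two calls differ, using the non-inplace form of both methods: set_categories(['C','B','A']) reorders the categories and so changes the integer codes, whereas add_categories(['C']) appends the new category and leaves the existing codes alone. This only shows that the two calls are not equivalent; it does not establish the cause of the empty query result.

```python
import pandas as pd

val = pd.Series(["A", "B"], dtype="category")   # categories: A, B -> codes 0, 1

# Reordering: 'A' and 'B' now sit at positions 2 and 1.
reordered = val.cat.set_categories(["C", "B", "A"])

# Appending: 'A' and 'B' keep positions 0 and 1, 'C' is added at the end.
extended = val.cat.add_categories(["C"])

reordered_codes = list(reordered.cat.codes)
extended_codes = list(extended.cat.codes)
```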

My versions:

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.16.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: None
pandas_gbq: None
pandas_datareader: None

stoddardg on 7 Jun 2017

@stoddardg can you create a new issue for your example? It is different from this one and looks like a bug.

jreback on 11 Jun 2017

I had the same issue and came up with a workaround:

I create a new category dtype which retains the old codes for values already present in the hdf5 file and adds possible new categories.
I then cast my current data to this new dtype with astype() (even though the documentation states that this is not possible, it works as intended) which adjusts the codes.
Finally I manually update the metadata such that the write operation doesn't throw an exception.

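import pandas as pd
from pandas.api.types import union_categoricals

# Assumed context: `store` is an open pd.HDFStore, `key` is the table's key, and
# `df` is the frame to append, whose MultiIndex level 1 is the categorical 'isin'.
table_storer = store.get_storer(key)
isin_meta = table_storer.read_metadata('isin') # returns Series of categories (maps code -> category_value)

cat_idx = df.index.levels[1]

# this might have a performance impact
# Pass the categories explicitly so the stored order (and thus the stored codes) is kept, not re-sorted
categorical = pd.Categorical(isin_meta, categories=isin_meta)
merged_dtype = union_categoricals([categorical, cat_idx]).dtype # order is important so that old codes stay the same

# Cast 'isin' to the newly merged dtype
df.index = df.index.set_levels(cat_idx.astype(merged_dtype), level=1)

# Write new metadata, so that the subsequent write won't fail due to mismatching dtypes
table_storer.write_metadata('isin', merged_dtype.categories)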
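
The index-recoding step of this workaround can be tried without an HDF5 file. In the sketch below, a plain Series stands in for the stored metadata, and the level name 'date' and all sample values are assumptions made for illustration. Only the in-memory part is shown, not the metadata write.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Stand-in for table_storer.read_metadata('isin'): categories already on disk.
stored_categories = pd.Series(["a", "b"])

# Hypothetical new data whose second index level is categorical with other categories.
idx = pd.MultiIndex.from_arrays(
    [[1, 2], pd.Categorical(["b", "c"])], names=["date", "isin"]
)
cat_idx = idx.levels[1]

# Keep the stored order explicitly, then append any new categories after it.
stored = pd.Categorical(stored_categories, categories=stored_categories)
merged_dtype = union_categoricals([stored, cat_idx]).dtype

# Cast the level to the merged dtype; the index values themselves are unchanged.
new_idx = idx.set_levels(cat_idx.astype(merged_dtype), level=1)
isin_values = new_idx.get_level_values("isin")
```

After the cast, 'b' is encoded as 1, matching its stored code, and the new category 'c' gets the next free code.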