Pandas: append / concat category with different categories fails in HDF5

Created on 8 Nov 2016 · 9 Comments · Source: pandas-dev/pandas

Hi,

Just a remark: I guess it must be fairly complicated to replicate the pandas behaviour (appending categoricals with different categories) in HDF5.

Currently this fails:

pd1.to_hdf(store_file,"/ISE_nombre_de_tabla",format="table",append=True)
pd2.to_hdf(store_file,"/ISE_nombre_de_tabla",format="table",append=True)

.
.
.

1698             if new_metadata is not None and cur_metadata is not None \
1699                     and not array_equivalent(new_metadata, cur_metadata):
-> 1700                 raise ValueError("cannot append a categorical with "
1701                                  "different categories to the existing")
1702 

ValueError: cannot append a categorical with different categories to the existing
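
The report does not show how pd1 and pd2 were built, so the following is only a minimal sketch of a reproduction under assumptions: the two frames, their column name, and the temporary file are hypothetical stand-ins, and integer categories are used purely to keep the example small. It requires PyTables; if that is not installed, the sketch just records the fact.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frames standing in for pd1 and pd2 from the report:
# the same categorical column, but with different category sets.
pd1 = pd.DataFrame({"val": pd.Categorical([1, 2])})
pd2 = pd.DataFrame({"val": pd.Categorical([2, 3])})

error_message = None
pytables_missing = False

with tempfile.TemporaryDirectory() as tmp_dir:
    store_file = os.path.join(tmp_dir, "store.h5")
    try:
        # The first append creates the table and stores the categories [1, 2].
        pd1.to_hdf(store_file, key="/ISE_nombre_de_tabla", format="table", append=True)
        # The second append brings categories [2, 3], which do not match.
        pd2.to_hdf(store_file, key="/ISE_nombre_de_tabla", format="table", append=True)
    except ValueError as exc:
        error_message = str(exc)
    except ImportError:
        # PyTables is an optional dependency of pandas.
        pytables_missing = True

print(error_message)
```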

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

Categorical Enhancement IO HDF5

All 9 comments

What is exactly your question or the point you want to make? Do you want this not to fail?

Yes, I would expect HDF5 append to reflect the same behaviour that pandas 0.19 implements for concat/append of categoricals.

This should not fail. Maybe a feature for a future version.

Cheers

JC
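
For reference, here is a small sketch of the in-memory behaviour being referred to. The data is illustrative and not taken from the report. When the categories differ, pd.concat succeeds but does not keep the categorical dtype, while union_categoricals combines the categories and returns a categorical.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Illustrative data, not taken from the report.
left = pd.Series(["a", "b"], dtype="category")
right = pd.Series(["b", "c"], dtype="category")

# The categories differ, so concat falls back to a non-categorical dtype
# instead of raising.
concatenated = pd.concat([left, right], ignore_index=True)
print(concatenated.dtype)

# union_categoricals keeps a categorical result with the combined categories.
combined = union_categoricals([left, right])
print(list(combined.categories))
print(list(combined))
```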

This is not so easy to do and would require someone to step up and implement it.

As of today, does anyone have any workaround or possible solution?

I tried to create a dummy table that would contain all my categories, to later append the actual data, and finally drop the first _n_ rows that have the dummy data, but it didn't work. I get the same error:

ValueError: cannot append a categorical with different categories to the existing

@iipr the issue is still open, if you're interested in submitting a fix. I don't believe anyone is working on this at the moment.

union_categoricals does this, so it's a matter of using it internally when appending. The key thing is that you cannot change existing code->label mappings, as these are already written/coded.
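
The constraint on code->label mappings can be sketched in memory. The names stored and incoming are hypothetical: stored stands in for data already written to the file, incoming for the data to be appended. Passing the stored categorical first keeps its codes unchanged; passing it second does not.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Hypothetical stand-ins: `stored` for data already on disk,
# `incoming` for the data to be appended.
stored = pd.Categorical(["x", "y", "x"])   # categories: x, y
incoming = pd.Categorical(["z", "x"])      # categories: x, z

# Stored data first: new categories are added after the existing ones,
# so the codes of the stored values do not change.
merged = union_categoricals([stored, incoming])
print(list(merged.categories))

# Recode the incoming data against the merged categories.
recoded = incoming.set_categories(merged.categories)
print(list(recoded.codes))

# Incoming data first: 'y' ends up at a different position, which would
# invalidate codes that were already written.
reversed_merge = union_categoricals([incoming, stored])
print(list(reversed_merge.categories))
```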

Hi all,

I think I discovered a bug related to this but not exactly the same issue.

The issue is that using set_categories causes a problem when storing into HDF. Here's a minimal working example:

import pandas as pd

df = pd.DataFrame({'idx':[1,2],'val':['A','B']})
df['val'] = df['val'].astype('category')

df['val'].cat.set_categories(['C','B','A'],inplace=True)
df.to_hdf('test.h5','test',format='table',data_columns=True)

temp_cats = ['A','B']
query_string = "val in %s" % temp_cats
stored_df = pd.read_hdf('test.h5','test',where=query_string)

print(len(stored_df)) # Prints 0, i.e. no rows in stored_df

I would expect stored_df to equal the original dataframe but it's empty.

If you remove this line df['val'].cat.set_categories(['C','B','A'],inplace=True), then things work as intended. Similarly, if you change the line to df['val'].cat.set_categories(['A','B','C'],inplace=True), it works as intended.

Interestingly, it also works if I do df['val'].cat.add_categories(['C'],inplace=True), which gives the same effect (I think) and is my current workaround.
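
As a side note, the two calls do not produce quite the same result in memory. The sketch below uses the non-inplace form of both methods and only shows how the integer codes differ. Whether that difference is what causes the empty query result is an assumption that the thread does not confirm.

```python
import pandas as pd

values = pd.Series(["A", "B"], dtype="category")

# set_categories with a new order: 'A' moves to position 2 and 'B' stays at 1.
reordered = values.cat.set_categories(["C", "B", "A"])
print(list(reordered.cat.categories), list(reordered.cat.codes))

# add_categories puts 'C' after the existing categories: 'A' stays 0, 'B' stays 1.
extended = values.cat.add_categories(["C"])
print(list(extended.cat.categories), list(extended.cat.codes))
```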

My versions:

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.16.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: None
pandas_gbq: None
pandas_datareader: None

@stoddardg can you create a new issue for your example? It is different from this one and looks like a bug.

I had the same issue and came up with a workaround:

I create a new category dtype which retains the old codes for values already present in the hdf5 file and adds possible new categories.
I then cast my current data to this new dtype with astype() (even though the documentation states that this is not possible, it works as intended) which adjusts the codes.
Finally I manually update the metadata such that the write operation doesn't throw an exception.

import pandas as pd
from pandas.api.types import union_categoricals

# Assumes `store` is an open pd.HDFStore, `key` is the key of the existing table,
# and `df` has a MultiIndex whose level 1 ('isin') is categorical.
table_storer = store.get_storer(key)
isin_meta = table_storer.read_metadata('isin')  # returns Series of categories (maps code -> category_value)

cat_idx = df.index.levels[1]

# this might have a performance impact
categorical = pd.Categorical(isin_meta)
# order is important so that old codes stay the same
merged_dtype = union_categoricals([categorical, cat_idx]).dtype

# Cast 'isin' to the newly merged dtype
df.index = df.index.set_levels(cat_idx.astype(merged_dtype), level=1)

# Write new metadata, so that the subsequent write won't fail due to mismatching dtypes
table_storer.write_metadata('isin', merged_dtype.categories)