Hi,
just a remark: I guess it must be fairly complicated to replicate the pandas behaviour (appending categoricals with different categories) in HDF5.
Currently this fails:
pd1.to_hdf(store_file, "/ISE_nombre_de_tabla", format="table", append=True)
pd2.to_hdf(store_file, "/ISE_nombre_de_tabla", format="table", append=True)
...
1698 if new_metadata is not None and cur_metadata is not None \
1699 and not array_equivalent(new_metadata, cur_metadata):
-> 1700 raise ValueError("cannot append a categorical with "
1701 "different categories to the existing")
1702
ValueError: cannot append a categorical with different categories to the existing
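For reference, here is a minimal, self-contained sketch of the failing case (hypothetical data and file path; pd1 and pd2 just need a categorical column whose category sets differ):

import pandas as pd

store_file = "store.h5"  # hypothetical path

# Two frames whose categorical column carries different category sets
pd1 = pd.DataFrame({"val": pd.Categorical(["a", "b"], categories=["a", "b"])})
pd2 = pd.DataFrame({"val": pd.Categorical(["c"], categories=["a", "b", "c"])})

pd1.to_hdf(store_file, "/ISE_nombre_de_tabla", format="table", append=True)
# The second append raises:
# ValueError: cannot append a categorical with different categories to the existing
pd2.to_hdf(store_file, "/ISE_nombre_de_tabla", format="table", append=True)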
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.19.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None
What exactly is your question, or the point you want to make? Do you want this not to fail?
Yes, I would expect HDF5 append to reflect the same behaviour that was implemented in pandas 0.19 with regard to categorical concat/append.
This should not fail. Maybe a feature for a future version.
Cheers
JC
This is not so easy to do and would require someone to step up and implement it.
As of today, does anyone have any workaround or possible solution?
I tried to create a dummy table that would contain all my categories, to later append the actual data, and finally drop the first _n_ rows that have the dummy data, but it didn't work. I get the same error:
ValueError: cannot append a categorical with different categories to the existing
@iipr the issue is still open, if you're interested in submitting a fix. I don't believe anyone is working on this at the moment.
union_categoricals does this, so it's a matter of using it internally when appending. The key thing is that you cannot change existing code->label mappings, as these are already written/encoded.
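To illustrate the constraint, here is a small sketch with pandas.api.types.union_categoricals (hypothetical data): passing the existing categorical first preserves its code->label mapping, and any new categories are appended at the end.

import pandas as pd
from pandas.api.types import union_categoricals

# Categories already stored in the file: code 0 -> 'a', code 1 -> 'b'
existing = pd.Categorical(["a", "b"], categories=["a", "b"])
# Incoming data introduces a new category 'c'
incoming = pd.Categorical(["c", "a"], categories=["a", "c"])

# The existing categorical goes first so its codes stay the same
merged = union_categoricals([existing, incoming])
print(list(merged.categories))  # ['a', 'b', 'c'] -- 'c' is appended as code 2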
Hi all,
I think I discovered a bug related to this, though not exactly the same issue.
The problem is that using set_categories causes incorrect query results when storing to HDF. Here's a minimal working example:
import pandas as pd

df = pd.DataFrame({'idx': [1, 2], 'val': ['A', 'B']})
df['val'] = df['val'].astype('category')
df['val'].cat.set_categories(['C', 'B', 'A'], inplace=True)
df.to_hdf('test.h5', 'test', format='table', data_columns=True)

temp_cats = ['A', 'B']
query_string = "val in %s" % temp_cats
stored_df = pd.read_hdf('test.h5', 'test', where=query_string)
print(len(stored_df))  # Prints 0, i.e. no rows in stored_df
I would expect stored_df to equal the original dataframe but it's empty.
If you remove the line df['val'].cat.set_categories(['C','B','A'], inplace=True), things work as intended. Similarly, if you change it to df['val'].cat.set_categories(['A','B','C'], inplace=True), it also works as intended.
Interestingly, it also works if I do df['val'].cat.add_categories(['C'], inplace=True), which gives the same effect (I think) and is my current workaround.
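As a concrete sketch, the add_categories workaround applied to the example above would look roughly like this (same hypothetical file and data as in the snippet):

import pandas as pd

df = pd.DataFrame({'idx': [1, 2], 'val': ['A', 'B']})
df['val'] = df['val'].astype('category')
# Append the extra category instead of re-setting all categories in a new order
df['val'] = df['val'].cat.add_categories(['C'])
df.to_hdf('test.h5', 'test', format='table', data_columns=True)

stored_df = pd.read_hdf('test.h5', 'test', where="val in ['A', 'B']")
print(len(stored_df))  # 2, as expected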
My versions:
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.16.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: None
pandas_gbq: None
pandas_datareader: None
@stoddardg can you create a new issue for your example? It is different from this one and looks like a bug.
I had the same issue and came up with a workaround:
I create a new category dtype which retains the old codes for values already present in the HDF5 file and adds any new categories.
I then cast my current data to this new dtype with astype() (even though the documentation states that this is not possible, it works as intended), which adjusts the codes.
Finally, I manually update the stored metadata so that the write operation doesn't throw an exception.
import pandas as pd
from pandas.api.types import union_categoricals

# Assumed to already exist: an open HDFStore `store`, the table's `key`, and a
# DataFrame `df` whose MultiIndex level 1 ('isin') is categorical.
table_storer = store.get_storer(key)
isin_meta = table_storer.read_metadata('isin')  # returns a Series of categories (maps code -> category value)
cat_idx = df.index.levels[1]
# this might have a performance impact
categorical = pd.Categorical(isin_meta)
merged_dtype = union_categoricals([categorical, cat_idx]).dtype  # order is important so that the old codes stay the same
# Cast 'isin' to the newly merged dtype
df.index = df.index.set_levels(cat_idx.astype(merged_dtype), level=1)
# Write the new metadata so that the subsequent write won't fail due to mismatching dtypes
table_storer.write_metadata('isin', merged_dtype.categories)
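With the metadata updated, the subsequent append (for example store.append(key, df)) should then go through without raising the ValueError above.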