I get several PerformanceWarnings when I store my DataFrame in an HDFStore:
C:\portabel\Python27\lib\site-packages\pandas\io\pytables.py:1788: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot map
directly to c-types [inferred_type->mixed,key->axis0]
warnings.warn(ws, PerformanceWarning)
C:\portabel\Python27\lib\site-packages\pandas\io\pytables.py:1788: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot map
directly to c-types [inferred_type->mixed,key->block0_values]
warnings.warn(ws, PerformanceWarning)
C:\portabel\Python27\lib\site-packages\pandas\io\pytables.py:1788: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot map
directly to c-types [inferred_type->unicode,key->block0_items]
What I can't tell from this is which column causes these problems; at least I don't have any "block0" columns :-) It would be nice if these warnings gave some indication of what I can actually do about them.
You are storing this as a Store (meaning not a Table), which means that PyTables is pickling some of the data. There are several options: split the data out into separate nodes (the node holding the object data will still produce the warning, but the rest will be faster), or save it as a Table (which should support it a little better). Can you show me a sample of the data and df.dtypes?
Also, update to master: I just added #3623, which should make the warnings slightly more informative.
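The block0_values / block2_items keys in the warning refer to pandas' internal dtype blocks, not to column names, so one way to narrow things down is to list the object-dtype columns and what pandas infers for each of them. A minimal sketch, assuming a recent pandas where pandas.api.types.infer_dtype is available; the helper name describe_object_columns and the sample frame are made up for illustration and are not part of pandas:

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype


def describe_object_columns(df):
    """Map each object-dtype column to the type pandas infers for its values."""
    return {
        col: infer_dtype(df[col], skipna=False)
        for col in df.columns
        if df[col].dtype == object
    }


# Hypothetical sample data; dtype=object is forced so the example does not
# depend on the default string dtype of the installed pandas version.
sample = pd.DataFrame(
    {
        "title": pd.Series(["Journal A", np.nan, "Journal C"], dtype=object),
        "issn": pd.Series(["1234-5678", "2345-6789", "3456-7890"], dtype=object),
        "articles_count": [10, 20, 30],
    }
)

report = describe_object_columns(sample)
print(report)
```

In this sample the title column comes back as "mixed" because it holds a NaN next to the strings, which is one common way an object column ends up with that label.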
Here is some code which produces these warnings:
import numpy as np
import pandas
from data_names import (hdf_store_name, hdf_aaa, csv_aaa)
aaa = pandas.read_csv(csv_aaa, encoding="iso-8859-15", skiprows=0, sep=";", dtype={"zz id": np.int32})
[... some data cleaning...]
# Open and close once first, because there were errors when the HDF store was initially
# created and immediately written to. Not sure if that is still necessary.
store = pandas.HDFStore(hdf_store_name)
store.close()
store = pandas.HDFStore(hdf_store_name)
store[hdf_aaa] = aaa
store.close()
C:\portabel\Python27\lib\site-packages\pandas\io\pytables.py:1788: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot map
directly to c-types [inferred_type->mixed,key->axis0]
warnings.warn(ws, PerformanceWarning)
C:\portabel\Python27\lib\site-packages\pandas\io\pytables.py:1788: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot map
directly to c-types [inferred_type->unicode,key->block0_items]
warnings.warn(ws, PerformanceWarning)
C:\portabel\Python27\lib\site-packages\pandas\io\pytables.py:1788: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot map
directly to c-types [inferred_type->mixed,key->block2_values]
warnings.warn(ws, PerformanceWarning)
C:\portabel\Python27\lib\site-packages\pandas\io\pytables.py:1788: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot map
directly to c-types [inferred_type->unicode,key->block2_items]
warnings.warn(ws, PerformanceWarning)
aaa.dtypes
title object
a object
b float64
c float64
d float64
e float64
f float64
g float64
h object
i object
j int32
k int32
l int32
m int32
n int32
o int32
p int32
dtype: object
The objects are strings of variable length (some are paragraph length).
Performance is not a problem (a few seconds at most, often less than a second, even for my biggest data file, which has ~300k rows), so I don't mind the time it takes. It's just the warnings, which make my IPython notebook longer and the important parts harder to read.
The open/close twice should not be necessary.
Can you post df._data.blocks?
Not sure if you can, but it would help if you posted your data file (a link on, say, Dropbox). You can do it privately if you want.
Are some of your object columns actually unicode? That could definitely trigger this.
print journals._data.blocks
[FloatBlock: [SNIP2_2009, SJR2_2009, SNIP2_2010, SJR2_2010, SNIP2_2011, SJR2_2011], 6 x 32059, dtype float64, IntBlock: [sjr2_2011_top10_overall, sjr2_2011_top10_nano, sjr2_2011_top10_business, sjr2_2011_top10_BusinessManagementAccounting, sjr2_2011_top10_MaterialsScience, articles_count, sjr2_2011_top10], 7 x 32059, dtype int32, ObjectBlock: [title, ISSN, BusinessManagementAccounting, MaterialsScience], 4 x 32059, dtype object]
type(journals.iloc[0,0]) # This is the "title" column
unicode
Try getting rid of the unicode
In [27]: x = 'foo'
In [28]: type(x)
Out[28]: str
In [29]: type(x.decode('utf-8'))
Out[29]: unicode
decode() turns a str into unicode, so to go the other way you may need something like
df['column_with_unicode'] = df['column_with_unicode'].apply(lambda x: x.encode('utf-8'))
FYI very soon (with the release of PyTables 3.0) I think we will be able to support unicode
Then I will simply wait until that happens. Right now the performance is no problem, just the annoying warnings :-)
The warning is just to alert the user that you are basically pickling those fields rather than storing them in a c-type.
You can filter the warnings as well:
import warnings
import pandas
warnings.filterwarnings('ignore', category=pandas.io.pytables.PerformanceWarning)
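If silencing the warning for the whole session feels too broad, the filter can be scoped to just the block that writes to the store. A minimal sketch using only the standard warnings module and pandas.errors.PerformanceWarning; the context manager name quiet_performance_warnings is made up for illustration:

```python
import warnings
from contextlib import contextmanager

import pandas as pd


@contextmanager
def quiet_performance_warnings():
    """Ignore pandas PerformanceWarning inside the with-block only."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=pd.errors.PerformanceWarning)
        yield


# Demonstration with hand-raised warnings (no HDF file needed):
# the PerformanceWarning is dropped, other categories still get through.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    with quiet_performance_warnings():
        warnings.warn("pickling object types", pd.errors.PerformanceWarning)
        warnings.warn("something unrelated", UserWarning)

print([w.category.__name__ for w in caught])
```

In a notebook, the store[...] = df assignment would go inside the with quiet_performance_warnings(): block, and the previous filters are restored when the block exits.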
Closing for now. @JanSchulz, reopen or open a new issue if you have questions/concerns.
Hi @jreback, I'm on PyTables 3 (tables==3.2.0) and am still facing the same issue as @JanSchulz: warnings when I try to save my 'df' as 'h5'. My data frame does contain unicode. Is there anything I can do to avoid them?
Make sure you are storing with format='table'.
py3 handles the Unicode.
Please show code and versions if this doesn't work.
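For reference, storing with format='table' looks roughly like this. A minimal sketch, assuming Python 3 and that PyTables (the optional tables package) is installed; the sample frame, the key, and the file name are made up for illustration:

```python
import os
import tempfile

import pandas as pd

try:
    import tables  # noqa: F401  (PyTables is an optional pandas dependency)
    HAVE_PYTABLES = True
except ImportError:
    HAVE_PYTABLES = False

# Hypothetical sample frame with a non-ASCII string column.
journals = pd.DataFrame(
    {
        "title": pd.Series(["Zeitschrift für Physik", "Nano Letters"], dtype=object),
        "articles_count": [120, 340],
    }
)

roundtrip = None
if HAVE_PYTABLES:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "journals.h5")
        # format="table" writes a PyTables Table instead of the default fixed format.
        journals.to_hdf(path, key="journals", format="table")
        roundtrip = pd.read_hdf(path, key="journals")
    print(roundtrip)
```

The same keyword is accepted by HDFStore.put and HDFStore.append; plain store[key] = df uses the fixed format.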
I found a weird case: when I ran the same command a second time, the warning disappeared:
PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot map
directly to c-types [inferred_type->mixed,key->block0_values]
f.to_hdf("dataset_test.h5", key="test")
P.S. I ran it in interactive mode, version: python==3.6.7, pandas==0.23.4
P.P.S Hmm I guess this is its behavior. Not sure though.