In [1]: import io
In [2]: xx = io.BytesIO()
In [4]: from pandas import DataFrame
In [5]: from numpy.random import randn
In [6]: df = DataFrame(randn(1000000,2),columns=list('AB'))
In [7]: df.to_hdf(xx,'xx')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
----> 1 df.to_hdf(xx,'xx')
/root/pyenvs/aft_env/lib/python3.4/site-packages/pandas/core/generic.py in to_hdf(self, path_or_buf, key, **kwargs)
900
901 from pandas.io import pytables
--> 902 return pytables.to_hdf(path_or_buf, key, self, **kwargs)
903
904 def to_msgpack(self, path_or_buf=None, **kwargs):
/root/pyenvs/aft_env/lib/python3.4/site-packages/pandas/io/pytables.py in to_hdf(path_or_buf, key, value, mode, complevel, complib, append, **kwargs)
267 f(store)
268 else:
--> 269 f(path_or_buf)
270
271
/root/pyenvs/aft_env/lib/python3.4/site-packages/pandas/io/pytables.py in <lambda>(store)
260 f = lambda store: store.append(key, value, **kwargs)
261 else:
--> 262 f = lambda store: store.put(key, value, **kwargs)
263
264 if isinstance(path_or_buf, string_types):
AttributeError: '_io.BytesIO' object has no attribute 'put'
You cannot do this with HDF5. Instead, as shown below, you can create an in-memory store. I thought this was documented somewhere, but I guess it isn't.
In [15]: df = DataFrame(1,columns=list('ab'),index=range(3))
In [17]: store = pd.HDFStore('foo.h5',driver='H5FD_CORE',driver_core_backing_store=0)
In [18]: !ls -ltr *.h5
-rw-rw-r-- 1 jreback staff 27472602 Dec 2 06:54 test2.h5
-rw-rw-r-- 1 jreback staff 88416100 Jan 4 14:58 test.h5
In [19]: store.append('df',df)
In [20]: store.select('df')
Out[20]:
a b
0 1 1
1 1 1
2 1 1
In [21]: store.close()
In [22]: !ls -ltr *.h5
-rw-rw-r-- 1 jreback staff 27472602 Dec 2 06:54 test2.h5
-rw-rw-r-- 1 jreback staff 88416100 Jan 4 14:58 test.h5
Note that in general these in-memory stores are a fair bit slower than just keeping the data in memory, so there is not a whole lot of reason to use them (except perhaps to get selection semantics identical to an on-disk store).
Is there a way to get the contents of that file as bytes? I would like to send it over the network to another machine, without writing it to disk.
If not HDF5, do you have any other recommendations that could compress (ideally using blosc) the DataFrame and give us the bytes to send over the wire? I guess writing to a ramdisk is an option, but it seems like overkill. msgpack doesn't seem to support compression on the read side yet.
You can use msgpack, then compress the bytes.
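A minimal sketch of that approach, assuming an older pandas (pre-1.0) where to_msgpack/read_msgpack still exist; zlib is used here, but blosc compress/decompress could be swapped in the same way:

import zlib
import pandas as pd

# to_msgpack() with no path returns the serialized frame as bytes;
# compress those bytes before sending (blosc could be substituted for zlib)
payload = zlib.compress(df.to_msgpack())

# ... send `payload` over the wire ...

# receiving side: decompress, then rebuild the DataFrame
df_roundtrip = pd.read_msgpack(zlib.decompress(payload))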
Yes, that kind of works -- though not ideal. I found out that you can just do:
bindata = store._handle.get_file_image()
and that might be a better workaround. Can we expose that as part of an official API somehow, so that it doesn't break the next time I upgrade pandas?
Let's start off with a cookbook recipe for this. Would you like to do a PR for that?
Sure
There is a (not merged) pull request that implemented just that: https://github.com/pydata/pandas/pull/6519
I've been using the following functions in production for quite a while:
def read_hdf_from_buffer(buffer, key="/data"):
    from pandas import get_store
    # open a read-only in-memory (H5FD_CORE) store backed by the image bytes
    with get_store(
        "data.h5",
        mode="r",
        driver="H5FD_CORE",
        driver_core_backing_store=0,
        driver_core_image=buffer.read()
    ) as store:
        return store[key]
def write_hdf_to_buffer(df):
    from pandas import get_store
    # write into an in-memory (H5FD_CORE) store with no on-disk backing,
    # then return the raw HDF5 file image as bytes
    with get_store(
        "data.h5", mode="a", driver="H5FD_CORE",
        driver_core_backing_store=0
    ) as out:
        out["/data"] = df
        return out._handle.get_file_image()
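For what it's worth, a round trip with these helpers might look like the following sketch (io.BytesIO stands in for whatever network or file buffer is actually used):

import io
from pandas import DataFrame

df = DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# serialize the frame to an in-memory HDF5 file image ...
image = write_hdf_to_buffer(df)

# ... send `image` over the network, then rebuild the frame on the other end
df_roundtrip = read_hdf_from_buffer(io.BytesIO(image), key="/data")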
@filmor I've implemented this and it appears to work well, thanks! One issue that I've noticed is that returning the get_file_image() works for a single key, but you can't simply concatenate the binary file data together from multiple calls to write_hdf_to_buffer(). How have you accounted for this, or are you simply writing a single key and associated data to individual files?
@argoneus I'm not sure I get what you mean. The particular functions given are implemented for just a single key, but nothing prevents you from getting/storing more than one key as long as the store object exists. What exactly are you trying to do?
@filmor I'm writing an HDF5 file in Azure Data Lake (ADL), and am using their Python APIs to open a buffer. I'd like to write multiple keys and associated data to the same file. Using your write_hdf_to_buffer(), it appears that the binary data is being written, but the keys aren't getting updated in the file. In my test, I have 4 pandas data frames with 4 unique keys, and want to write these to an .h5 file in ADL. The write()s seemingly work properly, but the resulting file only has one key when I do a hdf.keys() instead of 4. When I run the test dataset and write to local disk (using standard df.to_hdf()), the keys are all there of course. The file sizes of the ADL file and the local file match, which leads me to believe the df contents are written, but the keys aren't being updated, since I'm seeing 1 instead of 4 for the ADL .h5 file. I'll keep experimenting with it; was just curious if you had to do anything extra to support multiple keys in the same file with your approach.
Just return the store directly in the read function and write multiple frames in the write:
def write_hdf_to_buffer(frames):
    from pandas import get_store
    # `frames` is a mapping of key -> DataFrame; all of them end up
    # in a single in-memory HDF5 file image
    with get_store(
        "data.h5", mode="a", driver="H5FD_CORE",
        driver_core_backing_store=0
    ) as out:
        for key, df in frames.items():
            out[key] = df
        return out._handle.get_file_image()

def read_hdf_from_buffer(buffer):
    from pandas import get_store
    # return the open in-memory store so the caller can select any key;
    # the caller is responsible for closing it
    return get_store(
        "data.h5",
        mode="r",
        driver="H5FD_CORE",
        driver_core_backing_store=0,
        driver_core_image=buffer.read()
    )
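A sketch of how those multi-key variants could be used together (again with io.BytesIO standing in for the real ADL buffer):

import io
from pandas import DataFrame

frames = {
    "/first": DataFrame({"a": [1, 2]}),
    "/second": DataFrame({"b": [3, 4]}),
}

# one HDF5 image containing both keys
image = write_hdf_to_buffer(frames)

# the read helper returns the open in-memory store, so close it when done
store = read_hdf_from_buffer(io.BytesIO(image))
try:
    print(store.keys())        # ['/first', '/second']
    first = store["/first"]
finally:
    store.close()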
Yes, that works -- reading the store's file contents each time and feeding them back to write_hdf_to_buffer() before adding the new dataframe. However, I was trying to do this without having to read the entire contents (from ADL) each time -- that is inefficient when processing a few hundred keys. I'm using Python's multiprocessing library and handling one key per process (reading data from input, processing, writing to output HDF5), so I can't easily concatenate the dataframes and write all the content at once.
I was envisioning something akin to:
Write the first group's dataframe to the buffer, creating the HDF5 file.
Write the second group's dataframe to the buffer, appending to the HDF5 file (in 'ab' mode) and updating the keys (the HDF5 root group). This is the part I'm not sure is possible in append mode, if the keys are stored in the file header.
...
This would avoid the re-read each time and simply append the new groups content to the file, updating the keys as appropriate. Not sure if it's possible though.
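One hedged sketch of getting close to that without re-reading the image: keep a single in-memory store open per writer, add each group's frame as it is produced, and take the file image only once at the end. This batches keys within one process but does not solve appending across separate processes; write_frames_incrementally is a hypothetical helper name building on the get_store-based helpers above.

from pandas import get_store

def write_frames_incrementally(frame_iter):
    # frame_iter yields (key, DataFrame) pairs; all keys land in one
    # in-memory HDF5 image that is materialized only once, at the end
    with get_store(
        "frames.h5", mode="w", driver="H5FD_CORE",
        driver_core_backing_store=0
    ) as out:
        for key, df in frame_iter:
            out[key] = df
        return out._handle.get_file_image()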