In [1]: import io
In [2]: xx = io.BytesIO()
In [4]: from pandas import DataFrame
In [5]: from numpy.random import randn
In [6]: df = DataFrame(randn(1000000,2),columns=list('AB'))
In [7]: df.to_hdf(xx,'xx')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
----> 1 df.to_hdf(xx,'xx')
/root/pyenvs/aft_env/lib/python3.4/site-packages/pandas/core/generic.py in to_hdf(self, path_or_buf, key, **kwargs)
900
901 from pandas.io import pytables
--> 902 return pytables.to_hdf(path_or_buf, key, self, **kwargs)
903
904 def to_msgpack(self, path_or_buf=None, **kwargs):
/root/pyenvs/aft_env/lib/python3.4/site-packages/pandas/io/pytables.py in to_hdf(path_or_buf, key, value, mode, complevel, complib, append, **kwargs)
267 f(store)
268 else:
--> 269 f(path_or_buf)
270
271
/root/pyenvs/aft_env/lib/python3.4/site-packages/pandas/io/pytables.py in <lambda>(store)
260 f = lambda store: store.append(key, value, **kwargs)
261 else:
--> 262 f = lambda store: store.put(key, value, **kwargs)
263
264 if isinstance(path_or_buf, string_types):
AttributeError: '_io.BytesIO' object has no attribute 'put'
You cannot do this with HDF5. Instead, as shown below, you can create an in-memory store. I thought this was documented somewhere, but I guess it isn't.
In [15]: df = DataFrame(1,columns=list('ab'),index=range(3))
In [17]: store = pd.HDFStore('foo.h5',driver='H5FD_CORE',driver_core_backing_store=0)
In [18]: !ls -ltr *.h5
-rw-rw-r-- 1 jreback staff 27472602 Dec 2 06:54 test2.h5
-rw-rw-r-- 1 jreback staff 88416100 Jan 4 14:58 test.h5
In [19]: store.append('df',df)
In [20]: store.select('df')
Out[20]:
a b
0 1 1
1 1 1
2 1 1
In [21]: store.close()
In [22]: !ls -ltr *.h5
-rw-rw-r-- 1 jreback staff 27472602 Dec 2 06:54 test2.h5
-rw-rw-r-- 1 jreback staff 88416100 Jan 4 14:58 test.h5
Note that in general these in-memory stores are a fair bit slower than just keeping the data in memory, so there is not a whole lot of reason to use them (except perhaps to get selection semantics identical to an on-disk store).
Is there a way to get the contents of that file as bytes? I would like to send it over the network to another machine, without writing it to disk.
If not HDF5, do you have any other recommendations that could compress (ideally using blosc) the DataFrame and give us the bytes to send over the wire? I guess writing to a ramdisk is an option, but it seems like overkill. msgpack doesn't seem to support compression on the read side yet.
You can use msgpack, then compress the bytes.
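A minimal sketch of that approach, assuming an older pandas (pre-1.0) where to_msgpack/read_msgpack still exist; zlib is used here, but blosc compress/decompress could be swapped in the same way:

import zlib
import pandas as pd

# to_msgpack() with no path returns the serialized frame as bytes;
# compress those bytes before sending (blosc could be substituted for zlib)
payload = zlib.compress(df.to_msgpack())

# ... send `payload` over the wire ...

# receiving side: decompress, then rebuild the DataFrame
df_roundtrip = pd.read_msgpack(zlib.decompress(payload))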
Yes, that kind of works -- though not ideal. I found out that you can just do:
bindata = store._handle.get_file_image()
and that might be a better workaround. Can we expose that as part of an official API somehow, so that it doesn't break the next time I upgrade pandas?
Let's start off with a cookbook recipe for this. Would you like to do a PR for that?
Sure
There is a (not merged) pull request that implemented just that: https://github.com/pydata/pandas/pull/6519
I've been using the following functions in production for quite a while:
def read_hdf_from_buffer(buffer, key="/data"):
    from pandas import get_store
    # open a read-only in-memory (H5FD_CORE) store backed by the image bytes
    with get_store(
        "data.h5",
        mode="r",
        driver="H5FD_CORE",
        driver_core_backing_store=0,
        driver_core_image=buffer.read()
    ) as store:
        return store[key]
def write_hdf_to_buffer(df):
    from pandas import get_store
    # write into an in-memory (H5FD_CORE) store with no on-disk backing,
    # then return the raw HDF5 file image as bytes
    with get_store(
        "data.h5", mode="a", driver="H5FD_CORE",
        driver_core_backing_store=0
    ) as out:
        out["/data"] = df
        return out._handle.get_file_image()
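For what it's worth, a round trip with these helpers might look like the following sketch (io.BytesIO stands in for whatever network or file buffer is actually used):

import io
from pandas import DataFrame

df = DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# serialize the frame to an in-memory HDF5 file image ...
image = write_hdf_to_buffer(df)

# ... send `image` over the network, then rebuild the frame on the other end
df_roundtrip = read_hdf_from_buffer(io.BytesIO(image), key="/data")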
@filmor I've implemented this and it appears to work well, thanks! One issue that I've noticed is that returning the get_file_image() works for a single key, but you can't simply concatenate the binary file data together from multiple calls to write_hdf_to_buffer(). How have you accounted for this, or are you simply writing a single key and associated data to individual files?
@argoneus I'm not sure I get what you mean. The particular functions given are implemented for just a single key, but nothing prevents you from getting/storing more than one key as long as the store object exists. What exactly are you trying to do?
@filmor I'm writing an HDF5 file in Azure Data Lake (ADL), and am using their Python APIs to open a buffer. I'd like to write multiple keys and associated data to the same file. Using your write_hdf_to_buffer(), it appears that the binary data is being written, but the keys aren't getting updated in the file. In my test, I have 4 pandas data frames with 4 unique keys, and want to write these to an .h5 file in ADL. The write()s seemingly work properly, but the resulting file only has one key when I do a hdf.keys() instead of 4. When I run the test dataset and write to local disk (using standard df.to_hdf()), the keys are all there of course. The file sizes of the ADL file and the local file match, which leads me to believe the df contents are written, but the keys aren't being updated, since I'm seeing 1 instead of 4 for the ADL .h5 file. I'll keep experimenting with it; was just curious if you had to do anything extra to support multiple keys in the same file with your approach.
Just return the store directly in the read function and write multiple frames in the write:
def write_hdf_to_buffer(frames):
    from pandas import get_store
    # `frames` is a mapping of key -> DataFrame; all of them end up
    # in a single in-memory HDF5 file image
    with get_store(
        "data.h5", mode="a", driver="H5FD_CORE",
        driver_core_backing_store=0
    ) as out:
        for key, df in frames.items():
            out[key] = df
        return out._handle.get_file_image()

def read_hdf_from_buffer(buffer):
    from pandas import get_store
    # return the open in-memory store so the caller can select any key;
    # the caller is responsible for closing it
    return get_store(
        "data.h5",
        mode="r",
        driver="H5FD_CORE",
        driver_core_backing_store=0,
        driver_core_image=buffer.read()
    )
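A sketch of how those multi-key variants could be used together (again with io.BytesIO standing in for the real ADL buffer):

import io
from pandas import DataFrame

frames = {
    "/first": DataFrame({"a": [1, 2]}),
    "/second": DataFrame({"b": [3, 4]}),
}

# one HDF5 image containing both keys
image = write_hdf_to_buffer(frames)

# the read helper returns the open in-memory store, so close it when done
store = read_hdf_from_buffer(io.BytesIO(image))
try:
    print(store.keys())        # ['/first', '/second']
    first = store["/first"]
finally:
    store.close()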
Yes, that works -- reading the store's file contents each time and feeding them back to write_hdf_to_buffer() before adding the new dataframe. However, I was trying to do this without having to read the entire contents (from ADL) each time -- that is inefficient when processing a few hundred keys. I'm using Python's multiprocessing library and handling one key per process (reading data from input, processing, writing to output HDF5), so I can't easily concatenate the dataframes and write all the content at once.
I was envisioning something akin to:
Write the first group's dataframe to the buffer, creating the HDF5 file.
Write the second group's dataframe to the buffer, appending to the HDF5 file (in 'ab' mode) and updating the keys (the HDF5 root group). This is the part I'm not sure is possible in append mode, if the keys are stored in the file header.
...
This would avoid the re-read each time and simply append the new groups content to the file, updating the keys as appropriate. Not sure if it's possible though.
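One hedged sketch of getting close to that without re-reading the image: keep a single in-memory store open per writer, add each group's frame as it is produced, and take the file image only once at the end. This batches keys within one process but does not solve appending across separate processes; write_frames_incrementally is a hypothetical helper name building on the get_store-based helpers above.

from pandas import get_store

def write_frames_incrementally(frame_iter):
    # frame_iter yields (key, DataFrame) pairs; all keys land in one
    # in-memory HDF5 image that is materialized only once, at the end
    with get_store(
        "frames.h5", mode="w", driver="H5FD_CORE",
        driver_core_backing_store=0
    ) as out:
        for key, df in frame_iter:
            out[key] = df
        return out._handle.get_file_image()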