Keras: h5py for training on large datasets

Created on 9 May 2016 · 10 comments · Source: keras-team/keras

I've pieced together from @edersantana and @jfsantos's commit history that there's HDF5 support for training Keras models on large datasets that cannot fit into RAM, but I've been unable to find instructions. In fact, the feature doesn't seem to be documented anywhere.

I wonder if it's still actively supported, and if there are usage instructions somewhere. Just an example of how the HDF5 should be formatted, at least. Does it support multiple inputs/outputs, and so on?

Sorry if I'm missing something obvious!


All 10 comments

Here is the feature:
https://github.com/fchollet/keras/blob/90aafca585ca92d7ff558d811ff8eb8d60d7c3d4/keras/utils/io_utils.py#L7-L52

@carlthome and @jfsantos, we kinda need to unit test that so it can serve as an example of how to use it. Does anybody have time for that?

I'm really sorry, as I ended up never writing documentation for the feature. You basically have to call HDF5Matrix and pass the path to an HDF5 file and the name of the dataset inside that file that you want HDF5Matrix to point to. start and end enable you to make HDF5Matrix represent a slice of the dataset instead of the whole dataset (useful if you want to split the data somehow). You can also pass a function (or Python lambda) as a normalizer, which will be called on each slice of data you try to get from the matrix before returning the data to you.
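
For example (a minimal sketch, assuming a hypothetical file data.h5 that holds datasets named 'X' and 'y' with 10,000 samples each; the file and dataset names are just placeholders):

```python
from keras.utils.io_utils import HDF5Matrix

def normalize(x):
    # Called on each slice read from the file, before it is returned to you.
    return x.astype('float32') / 255.

# First 8000 samples for training, with the normalizer applied to the inputs...
x_train = HDF5Matrix('data.h5', 'X', start=0, end=8000, normalizer=normalize)
y_train = HDF5Matrix('data.h5', 'y', start=0, end=8000)

# ...and the remaining 2000 as a held-out validation split.
x_val = HDF5Matrix('data.h5', 'X', start=8000, end=10000, normalizer=normalize)
y_val = HDF5Matrix('data.h5', 'y', start=8000, end=10000)
```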

After creating an instance of HDF5Matrix, you can pass it to model.fit as if it were a Numpy array. Note that allowing Keras to randomize minibatches will make it super-slow, as HDF5 is not well-suited for this kind of access. A workaround is to randomize your samples before saving them as HDF5 and then pass shuffle='batch' to model.fit.
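
In code, assuming a compiled model whose input matches the datasets above (Keras 2 argument names), it would look something like this:

```python
model.fit(x_train, y_train,
          batch_size=32,
          epochs=10,
          shuffle='batch',   # shuffle whole batches, an access pattern HDF5 handles well
          validation_data=(x_val, y_val))
```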

If anyone wants to turn this into actual documentation, that would be great :)

I could absolutely type up some documentation and add tests, but it would be very helpful to see a basic code example with dummy data (just np.random or something).
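
Something minimal like this is what I have in mind (h5py is the underlying library; the file and dataset names here are arbitrary):

```python
import h5py
import numpy as np

# Dummy data: 1000 random samples with 10 features each, plus binary labels.
X = np.random.random((1000, 10)).astype('float32')
y = np.random.randint(0, 2, size=(1000, 1)).astype('float32')

with h5py.File('data.h5', 'w') as f:
    f.create_dataset('X', data=X)
    f.create_dataset('y', data=y)
```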

Am I right in assuming that data is read from disk, not RAM, when passing an HDF5Matrix to model.fit?

Here is a commented example of how to use HDF5Matrix: https://gist.github.com/jfsantos/e2ef822c744357a4ed16ec0c885100a3.

@carlthome Yes, the data is always read from disk, not from RAM. HDF5Matrix is just a thin interface over h5py, so check the h5py docs for more info.
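
To illustrate the lazy, from-disk reads (plain h5py, no Keras involved):

```python
import h5py

f = h5py.File('data.h5', 'r')
dset = f['X']        # just a handle; nothing is read yet
batch = dset[0:32]   # only these 32 rows are read from disk,
                     # and come back as a regular Numpy array
f.close()
```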

@jfsantos I could take a stab at writing documentation for this. There are a couple of things I'm not sure about, though. Shouldn't start and end be optional? And shouldn't there be a way to close the open files?

Actually, never mind about closing the files; I realized it shouldn't matter, since we're only reading.

Great job @mezzode!

Could you add an example of using normalizer? What kinds of transformations are possible? Can I, for example, resize images?

Hi @michaelschleiss,

I've found that this works just fine. I imagine anything else would work similarly:

```python
import numpy as np
from keras.utils.io_utils import HDF5Matrix

def normboth(x):
    x = x.astype("float32")
    rgb = np.load(rgb_mean_path)  # placeholder: path to precomputed RGB mean values
    x -= rgb                      # subtract the per-channel mean
    x /= 255.                     # scale to [0, 1]
    return x

trainImages = HDF5Matrix(inDir + hdf5Name, 'train_img', normalizer=normboth)
```

Hi guys, I found a nice way to combine HDF5Matrix, a plain Python generator, and ImageDataGenerator.
That means you can randomly select batches of augmented data from an HDF5 file via the generator.
Here is the link.
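
The idea is roughly the following sketch (not the exact code from the link; the file name 'train.h5' and the dataset names 'train_img' and 'train_labels' are placeholders, and model is assumed to be a compiled Keras model):

```python
import h5py
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rotation_range=10, horizontal_flip=True)

def hdf5_generator(path, batch_size=32):
    # Open the file once and keep reading random batches from disk.
    with h5py.File(path, 'r') as f:
        images, labels = f['train_img'], f['train_labels']
        n = images.shape[0]
        while True:
            # h5py fancy indexing requires indices in increasing order.
            idx = np.sort(np.random.choice(n, batch_size, replace=False))
            x, y = images[idx], labels[idx]
            # Apply a random augmentation to each image in the batch.
            x = np.stack([datagen.random_transform(img) for img in x])
            yield x, y

# Keras 2 API:
model.fit_generator(hdf5_generator('train.h5'), steps_per_epoch=100, epochs=10)
```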
