Keras: How to batch train with fit_generator() with HDF5?

Created on 4 Mar 2018 · 2 comments · Source: keras-team/keras

Apologies if this is the wrong place to raise my issue (please help me out with where best to raise it if that's the case).

I'm trying to train a CNN model that takes images as input. It's a fairly large dataset (5 GB of images), so I created a custom generator to use with fit_generator(). I've tried storing the dataset both in an HDF5 file and as individual files in a directory on disk. Unlike fit(), fit_generator() doesn't support shuffle="batch", which is intended for working with HDF5. With shuffle=True, my understanding is that the HDF5 file is accessed randomly, which eliminates any performance benefit over reading raw images straight from disk. With both HDF5 and raw JPEGs, each step takes about 550 ms and each epoch takes around 20 minutes. Is there a more efficient way to train from HDF5 (with shuffling)?
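For reference, here's a minimal Sequence-based sketch of the kind of generator I mean (the class name, dataset keys, and file path are placeholders, not my exact code). It shuffles the order of batches each epoch, so every read is still a contiguous slice of the HDF5 dataset:

```python
import numpy as np
import h5py
from keras.utils import Sequence

class HDF5Sequence(Sequence):
    """Yields (images, labels) batches from an HDF5 file.

    Shuffles the *order of batches* each epoch (similar in spirit to
    fit()'s shuffle="batch"), so each read is a contiguous slice.
    """

    def __init__(self, path, batch_size=32):
        self.file = h5py.File(path, "r")      # keep the file open
        self.images = self.file["images"]     # placeholder dataset names
        self.labels = self.file["labels"]
        self.batch_size = batch_size
        self.batch_ids = np.arange(len(self))
        np.random.shuffle(self.batch_ids)

    def __len__(self):
        return int(np.ceil(self.images.shape[0] / self.batch_size))

    def __getitem__(self, idx):
        b = self.batch_ids[idx]
        start = b * self.batch_size
        stop = start + self.batch_size
        return self.images[start:stop], self.labels[start:stop]

    def on_epoch_end(self):
        np.random.shuffle(self.batch_ids)

# model.fit_generator(HDF5Sequence("train.h5"), epochs=10)
```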

Any thoughts appreciated.

All 2 comments

Were you able to figure this out?

Unfortunately, I think this is a problem for random traversal of any large dataset, regardless of data source. Unless you can cache a significant fraction of the data, many (most?) accesses will be a cache miss and you are going to have to go to disk, with resulting poor performance.

HDF5 is probably still better than a pile of files, though, since you can hold the file open instead of opening and closing individual image files, and the chunk cache may still help you if it's large enough.

If you have control over how the HDF5 file is structured, make sure you are using a layout that matches your access pattern: I'd use the latest HDF5 file format and put all the images into one dataset, with a chunk size equal to the size of a single image (smaller chunks just increase the amount of metadata and the number of file accesses; larger chunks will probably result in larger I/O operations with little benefit). You might also try increasing the size of the chunk cache and seeing if that helps, but it sounds like you'd have to make it quite large to have a good chance of finding a random image in the cache.
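To make that concrete, here's a rough h5py sketch of the layout described above (image shape, image count, dataset name, and cache sizes are illustrative, not tuned values):

```python
import h5py

H, W, C = 224, 224, 3        # illustrative image shape
N = 50000                    # illustrative image count

# Writing: one dataset for all images, latest file format,
# chunk size = exactly one image so each read touches one chunk.
with h5py.File("images.h5", "w", libver="latest") as f:
    dset = f.create_dataset(
        "images",
        shape=(N, H, W, C),
        dtype="uint8",
        chunks=(1, H, W, C),
    )
    # dset[i] = ...          # fill with your images here

# Reading: enlarge the raw-data chunk cache (the default is only 1 MiB).
# rdcc_nslots should be a prime, roughly 100x the number of chunks
# that fit in the cache.
f = h5py.File(
    "images.h5",
    "r",
    libver="latest",
    rdcc_nbytes=1024 ** 3,   # 1 GiB chunk cache
    rdcc_nslots=1_000_003,
)
img = f["images"][123]       # a random read still pulls in one whole chunk
```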
