Keras: H5Py dataset for input tensor?

Created on 2 Sep 2015 · 19 comments · Source: keras-team/keras

This isn't a bug, but seems a little technical for the user group.

I have an h5py dataset with shape (300000, 60, 50). My 8GB RAM laptop will not consider loading all of it into memory an acceptable transaction.

However, loading the h5py dataset straight into the model (i.e. without creating a numpy array from it) throws an error.

Am I doing something wrong, or is the h5py format not supported as an input tensor in either Theano or Keras?
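For concreteness, this is roughly what I am attempting (a sketch only; the file and dataset names are mine, and model is an already-compiled Sequential):

import h5py

f = h5py.File('data.h5', 'r')
X_train = f['X_train']    # h5py Dataset, shape (300000, 60, 50)
y_train = f['y_train']

# passing the h5py Dataset objects straight to fit() is what throws the error
model.fit(X_train, y_train, batch_size=200, nb_epoch=10)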

Labels: stale

Most helpful comment

Was this ever fixed? Is it possible to use model.fit() with huge HDF5 files that don't fit in RAM?

All 19 comments

I think there may be a bug. It is possible to input from h5py, but not if validation_split=0.0.

With validation_split=0.001 it works fine. I have just lost the error trace but will try to get it back - the error was coming from the validation_split handling in Sequential.fit().

This is very useful, thank you - not having to load everything into memory is a godsend, and it's working - but I think there is a bug. Without setting validation_split to a value above zero (even a very small one works, e.g. 0.001), the following error trace appears, just as when I passed the h5py dataset in directly:

Start training.
Epoch 0
Traceback (most recent call last):
File "D:/Final project/TimeNLP/train.py", line 90, in
train(xf, yf, model, batch_size=200, name=name)
File "D:/Final project/TimeNLP/train.py", line 44, in train
verbose=1)
File "C:\Anaconda\lib\site-packages\keras\models.py", line 413, in fit
validation_split=validation_split, val_f=val_f, val_ins=val_ins, shuffle=shuffle, metrics=metrics)
File "C:\Anaconda\lib\site-packages\keras\models.py", line 162, in _fit
ins_batch = slice_X(ins, batch_ids)
File "C:\Anaconda\lib\site-packages\keras\models.py", line 44, in slice_X
return [x[start] for x in X]
File "C:\Anaconda\lib\site-packages\keras\utils\io_utils.py", line 48, in __getitem__
return self.data[idx]
File "h5py_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (D:\Build\h5py\h5py-2.5.x\h5py_objects.c:2475)
File "h5py_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (D:\Build\h5py\h5py-2.5.x\h5py_objects.c:2432)
File "C:\Anaconda\lib\site-packages\h5py_hl\dataset.py", line 431, in __getitem__
selection = sel.select(self.shape, args, dsid=self.id)
File "C:\Anaconda\lib\site-packages\h5py_hl\selections.py", line 95, in select
sel[args]
File "C:\Anaconda\lib\site-packages\h5py_hl\selections.py", line 429, in getitem
raise TypeError("Indexing elements must be in increasing order")
TypeError: Indexing elements must be in increasing order

I think if you just add a case for when validation_split is zero and assign one data point as a throwaway validation tensor, it'll work - although that sounds a bit hacky.
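For reference, the call that currently works for me looks roughly like this (a sketch; the variable names are mine, and the only relevant change is the tiny throwaway split):

model.fit(X_train, y_train, batch_size=200, nb_epoch=10,
          validation_split=0.001, verbose=1)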

I should also mention, there is a 10-minute period when starting model.fit during which my machine completely hangs because of memory. After that it clears up. I also experienced this using h5py directly - are there optimal ways to build the h5py files to avoid these memory issues?
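On building the files: one thing that may help (a sketch based on the h5py chunking API, not something confirmed in this thread) is writing the dataset with batch-sized chunks, so reads during training touch small contiguous pieces of the file:

import h5py
import numpy as np

with h5py.File('data.h5', 'w') as f:
    # chunks of 200 samples match the training batch size
    dset = f.create_dataset('X_train', shape=(300000, 60, 50),
                            dtype='float32', chunks=(200, 60, 50))
    # write in pieces so the full array is never held in RAM
    for start in range(0, 300000, 10000):
        stop = start + 10000
        dset[start:stop] = np.random.rand(stop - start, 60, 50).astype('float32')  # placeholder data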

Just to add to that, it seems I spoke too soon - there is still an error, even with validation_split set. Oddly, the following error did not appear when I loaded the dataset from the h5py file before - I successfully trained the model with that file last night:

Traceback (most recent call last):
File "D:/Final project/TimeNLP/train.py", line 90, in
train(xf, yf, model, batch_size=200, name=name)
File "D:/Final project/TimeNLP/train.py", line 44, in train
verbose=1)
File "C:\Anaconda\lib\site-packages\keras\models.py", line 413, in fit
validation_split=validation_split, val_f=val_f, val_ins=val_ins, shuffle=shuffle, metrics=metrics)
File "C:\Anaconda\lib\site-packages\keras\models.py", line 128, in _fit
(ins, val_ins) = (slice_X(ins, 0, split_at), slice_X(ins, split_at))
File "C:\Anaconda\lib\site-packages\keras\models.py", line 46, in slice_X
return [x[start:stop] for x in X]
File "C:\Anaconda\lib\site-packages\keras\utils\io_utils.py", line 26, in __getitem__
if key.stop + self.start <= self.end:
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
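If I'm reading this right, the failure is reproducible in isolation (a throwaway sketch of mine): slice_X(X, split_at) evaluates x[split_at:], so HDF5Matrix.__getitem__ receives a slice whose stop is None and then tries to add an int to it:

key = slice(100, None)   # what x[100:] passes to __getitem__
start = 0
key.stop + start         # TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'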

Do you have a concrete usage example of HDF5Matrix for a large 3D tensor? Maybe from file creation up to training and prediction, so I can try to replicate a case where it works, and maybe we can suss out what the issue is?

This is not a bug, but simply a limitation of HDF5 input: indexing has to be in ascending order, which means that the data cannot be shuffled. By default Keras shuffles the data (shuffle=True argument in fit). By setting shuffle=False this issue should go away.
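The ordering constraint is easy to demonstrate directly in h5py (a throwaway sketch; the file and dataset names are assumed):

import h5py

with h5py.File('data.h5', 'r') as f:
    dset = f['X_train']
    batch = dset[[3, 7, 42]]   # indices in increasing order: works
    # dset[[42, 3, 7]]         # raises TypeError: Indexing elements must be in increasing order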


There's something still not right. I'm sure this is a mixture of my own decisions and an unclear interface between Keras and H5Py.

Firstly, HDF5Matrix outputs a matrix of shape (x, y) rather than the original (x, y, z) tensor in the dataset.

Secondly, with shuffle=False, the error still appears if validation_split=0.0. The only way I can input an HDF5 dataset (from the Keras util or my own h5py code) is by setting some value for validation_split, e.g. 0.001. However:

Thirdly, my computer hangs dramatically, for about an hour, when I feed the h5py dataset into model.fit (I'm writing this from a different machine).

My hunch is that somewhere, model.fit is creating a massive NumPy tensor when it gets started (maybe when splitting for validation?). It hangs for ages on this, then throws away the arrays, and things start running faster during training.

Once my machine has recovered I can try to investigate further.

One thing I notice is that in the model.fit source there is no clause for validation_split == 0.0:

if validation_data or validation_split:
    ...
if validation_data:
    ...
elif 0 < validation_split < 1:
    ...

And that's it.

Also, I doubt all these operations on X and y in model.fit can be handled from disk with h5py, e.g.

X, X_val = (slice_X(X, 0, split_at), slice_X(X, split_at))

Incidentally, when is it smart to use model.train_on_batch?

I'm feeling just now that something like this is the only way I'll get this working, in smaller batches of say 10,000 (correction: 50,000!) - roughly as sketched below.
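For what it's worth, the loop I have in mind looks roughly like this (a sketch only; it assumes a compiled model and my own file/dataset names):

import h5py

with h5py.File('data.h5', 'r') as f:
    X, y = f['X_train'], f['y_train']
    n = X.shape[0]
    for epoch in range(10):
        for start in range(0, n, 200):
            stop = min(start + 200, n)
            # contiguous slices keep the h5py indices in increasing order,
            # and each slice materialises only one small NumPy batch
            loss = model.train_on_batch(X[start:stop], y[start:stop])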

Just FYI, my current solution is to run batches like I suggested above, 50,000 at a time, through model.fit - these are then further batched into the training batch size (200 for me). My results seem to be improving over epochs training this way. I'm fairly sure the memory bottleneck is the creation of NumPy objects from the HDF5 datasets - when I have a little more free time I could try to work with you on a fix.
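Concretely, the chunking looks roughly like this (a sketch; names are mine - each 50,000-sample slice becomes a NumPy array of manageable size, which fit() then splits into mini-batches of 200):

import h5py

with h5py.File('data.h5', 'r') as f:
    X, y = f['X_train'], f['y_train']
    n = X.shape[0]
    for start in range(0, n, 50000):
        stop = min(start + 50000, n)
        model.fit(X[start:stop], y[start:stop], batch_size=200, nb_epoch=1)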

I'm also trying to use HDF5 as direct input (instead of a dictionary with numpy objects). Has anyone got HDF5 to work in Keras this way? I've set shuffle='batch' in model.fit.

However, the program crashes at this line:
https://github.com/fchollet/keras/blob/9807dcd69b8792510fc35cabd1ed746a1e59988c/keras/models.py#L201

batch_ids = index_array[batch_start:batch_end]
try:
    ins_batch = slice_X(ins, batch_ids)
except TypeError as err:
    raise Exception('TypeError while preparing batch. \
        If using HDF5 input data, pass shuffle="batch".\n')

TypeError: PointSelection __getitem__ only works with bool arrays

Any hints? I get the same error passing shuffle=False.

This is the stack trace:

/home/jonil/python/keras/keras/models.pyc in fit(self, data, batch_size, nb_epoch, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight)
    688                           verbose=verbose, callbacks=callbacks,
    689                           val_f=val_f, val_ins=val_ins,
--> 690                           shuffle=shuffle, metrics=metrics)
    691         return history
    692

/home/jonil/python/keras/keras/models.pyc in _fit(self, f, ins, out_labels, batch_size, nb_epoch, verbose, callbacks, val_f, val_ins, shuffle, metrics)
    199                 batch_ids = index_array[batch_start:batch_end]
    200                 try:
--> 201                     ins_batch = slice_X(ins, batch_ids)
    202                 except TypeError as err:
    203                     print('TypeError while preparing batch. \

/home/jonil/python/keras/keras/models.pyc in slice_X(X, start, stop)
     53     if type(X) == list:
     54         if hasattr(start, '__len__'):
---> 55             return [x[start] for x in X]
     56         else:
     57             return [x[start:stop] for x in X]

h5py/_objects.pyx in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2458)()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2415)()

/home/jonil/anaconda/lib/python2.7/site-packages/h5py/_hl/dataset.pyc in __getitem__(self, args)
    429
    430         # Perform the dataspace selection.
--> 431         selection = sel.select(self.shape, args, dsid=self.id)
    432
    433         if selection.nselect == 0:

/home/jonil/anaconda/lib/python2.7/site-packages/h5py/_hl/selections.pyc in select(shape, args, dsid)
     77         elif isinstance(arg, np.ndarray):
     78             sel = PointSelection(shape)
---> 79             sel[arg]
     80             return sel
     81

/home/jonil/anaconda/lib/python2.7/site-packages/h5py/_hl/selections.pyc in __getitem__(self, arg)
    215         """ Perform point-wise selection from a NumPy boolean array """
    216         if not (isinstance(arg, np.ndarray) and arg.dtype.kind == 'b'):
--> 217             raise TypeError("PointSelection __getitem__ only works with bool arrays")
    218         if not arg.shape == self.shape:
    219             raise TypeError("Boolean indexing array has incompatible shape")

TypeError: PointSelection __getitem__ only works with bool arrays
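Reading the selections.pyc frame above, the h5py version in that trace routes any integer ndarray to PointSelection, which only accepts boolean arrays, while an increasing plain list goes down the fancy-indexing path (that is the path that raised the "increasing order" error earlier in this thread). A throwaway sketch of the difference (dataset name assumed):

import h5py
import numpy as np

with h5py.File('data.h5', 'r') as f:
    dset = f['X_train']
    ok = dset[[0, 2, 4]]          # list of increasing indices: works
    # dset[np.array([0, 2, 4])]   # in the h5py traced above: TypeError,
    #                             # PointSelection __getitem__ only works with bool arrays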


I am experiencing similar trouble. I tried to use an .hdf5 file as the input for training, but failed with the same error as above.

my code:

model.fit(X_train, y_train, batch_size=16, nb_epoch=20, verbose=1,
          validation_split=0.001, validation_data=None, shuffle='batch',
          show_accuracy=True, callbacks=[], class_weight=None, sample_weight=None)

error:
Traceback (most recent call last):
File "mlp.py", line 26, in
validation_data=None, shuffle=batch, show_accuracy=True, callbacks=[], class_weight=None,sample_weight=None)
File "/Users/jinyoungchoi/anaconda/lib/python2.7/site-packages/keras/models.py", line 490, in fit
shuffle=shuffle, metrics=metrics)
File "/Users/jinyoungchoi/anaconda/lib/python2.7/site-packages/keras/models.py", line 205, in _fit
If using HDF5 input data, pass shuffle="batch".\n')
Exception: TypeError while preparing batch. If using HDF5 input data, pass shuffle="batch".

Neither shuffle='batch' nor shuffle=False worked; both threw the same error.
I've briefly debugged models.py, and could see that it's not a problem with shuffle's value (it certainly was 'batch'). It seems the error is happening somewhere in the slice_X() function. Does anyone have any further ideas? I will post if I get any further clues.

Was this ever fixed? Is it possible to use model.fit() with huge HDF5 files that don't fit in RAM?

Any updates here?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

In case anyone lands here with the above questions, see https://gist.github.com/jfsantos/e2ef822c744357a4ed16ec0c885100a3#gistcomment-2325616

Finally got what happened. The problem has nothing to do with Keras. The point is that whenever you slice an HDF5 file, you need to make sure the indices you pass are in increasing order. So after shuffling the index list and generating a batch of indices, sorting that small batch of indices may solve the problem.

Thank you @Darkhunter9! Sorting the indices worked for me. I just added:

def __getitem__(self, key):
    key = sorted(key)
    data = super().__getitem__(key)
    ...

in my HDF5Matrix subclass.
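For anyone landing here later, a fuller version of that idea might look like the sketch below (my own, not an official API; it assumes, as in the traces above, that your Keras version's HDF5Matrix accepts a list of indices, and it restores the originally requested order after the sorted read):

import numpy as np
from keras.utils.io_utils import HDF5Matrix

class SortedHDF5Matrix(HDF5Matrix):
    def __getitem__(self, key):
        # h5py only accepts fancy indices in increasing order
        if isinstance(key, (list, np.ndarray)):
            key = np.asarray(key)
            order = np.argsort(key)                        # permutation that sorts the batch
            data = super().__getitem__(key[order].tolist())
            return data[np.argsort(order)]                 # undo the sort for the caller
        return super().__getitem__(key)

Note that if X and y both go through the same sorting with the same batch indices, the sample/label pairs stay aligned either way; restoring the order only matters if something downstream depends on the exact within-batch ordering.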
