Keras: fit_generator broken in 0.3.2?

Created on 29 Feb 2016 · 12 comments · Source: keras-team/keras

After upgrading to 0.3.2 I noticed that model fitting with fit_generator is slower than it used to be. I tested this with a simple LSTM net (Theano backend) on a single-GPU AWS instance, and found that 0.3.2 is about 20 times slower than 0.3.1. Any ideas why that is? The difference in speed is roughly comparable to CPU vs. GPU, I think.
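
For reference, the kind of setup I am timing looks roughly like the sketch below (dummy data, made-up layer sizes, and the 0.3-era call signature as I remember it; my real generator reads batches from disk):

import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense
from keras.layers.recurrent import LSTM

timesteps, features, batch_size = 20, 10, 32

model = Sequential()
model.add(LSTM(64, input_shape=(timesteps, features)))
model.add(Dense(1))
model.compile(loss='mse', optimizer='rmsprop')

def batch_gen():
    # Placeholder generator: random batches instead of real disk reads.
    while True:
        X = np.random.rand(batch_size, timesteps, features).astype('float32')
        y = np.random.rand(batch_size, 1).astype('float32')
        yield X, y

# The identical call is timed under 0.3.1 and 0.3.2; only the Keras version changes.
model.fit_generator(batch_gen(), samples_per_epoch=3200, nb_epoch=1)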

Also, somewhat related, the new nb_val_samples that recently appeared in the docs is not actually implemented in 0.3.2 (but is in the git version).

stale

All 12 comments

If you can track that to a specific commit, I will look at it.


Not sure - whichever one is associated with the PyPI release version 0.3.2, committed a couple of weeks ago.

This one I guess: https://github.com/fchollet/keras/commit/d1a3842b3d476aaf8479ec4dc3a9eff6ad35e8b6#diff-98d8790a4331685d2a81334c398c4b58

@qdbp thoughts?

Not sure if it's related, but validation_split no longer works in 0.3.2 either.

Traceback (most recent call last):
  File "/home/keo7/.projects/neural-network-plant-trait-classification/neural_networks/flower_colour.py", line 152, in <module>
    model = train_model(model, data[train], labels[train], number_of_classes)
  File "/home/keo7/.projects/neural-network-plant-trait-classification/neural_networks/flower_colour.py", line 82, in train_model
    nb_worker=1)
TypeError: fit_generator() got an unexpected keyword argument 'validation_split'

This is likely related to some changes I had made, I am looking into it.

@ttal Are you using a generator for validation data as well, or just for the training?

@qdbp just for training - I load a subset into memory for validation so that I can compare side-by-side with memory-based training without validation being an issue.

Can anyone confirm if the new release works?

Not sure what your most recent commit does, but it's not working for me. The main issue (the huge slowdown when I use fit_generator) is still there. The secondary issue (nb_val_samples not implemented) is still there as far as I can tell.

@KeironO I don't recall fit_generator ever having a validation_split parameter, since IIRC that is for breaking off a validation set from a fixed dataset.
@ttal That's unfortunate. I checked that the create_gen_queue function operates correctly (at least now that it actually gets the right nb_worker). The fact that it's training that is slow rules out evaluate_generator as the culprit. verify_generator_output also seems safe, and is old code. Perhaps I should do some profiling.
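
To make the distinction concrete, roughly what I mean (a self-contained sketch with dummy data; argument names are from memory, not an exact reproduction of the current signatures):

import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense

X = np.random.rand(1000, 8).astype('float32')
y = np.random.rand(1000, 1).astype('float32')

model = Sequential()
model.add(Dense(1, input_dim=8))
model.compile(loss='mse', optimizer='sgd')

# validation_split belongs to fit(): it carves a validation set out of a
# fixed, in-memory dataset.
model.fit(X, y, nb_epoch=2, validation_split=0.1)

# With fit_generator there is no fixed dataset to split, so validation data
# is passed explicitly, e.g. as an in-memory tuple.
def batch_gen(batch_size=32):
    while True:
        idx = np.random.randint(0, 900, batch_size)
        yield X[idx], y[idx]

model.fit_generator(batch_gen(), samples_per_epoch=900, nb_epoch=2,
                    validation_data=(X[900:], y[900:]))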

The nb_val_samples argument was recently added in a commit (10-ish days ago). There has been no release since this was added.

@fchollet when is the next release planned?

@gpleiss it is in the docs nonetheless.

Any thoughts about why fit_generator is so much slower in the latest release than in previous ones?

I'm debugging this now: it _is_ generator_queue that's misbehaving. I put time.sleep() in a generator function and launched it with a large nb_worker. Successive samples do seem to be produced by different threads, but the sleep appears to be shared across all of them. I'm not sure why this is the case, since IIRC time.sleep should only sleep the calling thread; perhaps whatever is causing this is also causing the slowdown. This behaviour also occurs with the old, inline generator-queue code, which makes me wonder whether it is a pre-existing problem.
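
A stripped-down version of that experiment, in plain Python with no Keras at all (the 0.5 s sleep and the worker count are arbitrary), shows the same serialization:

import threading
import time

def slow_gen(n=8):
    # Stand-in for a data generator that does ~0.5 s of work per batch.
    for i in range(n):
        time.sleep(0.5)
        yield i

gen = slow_gen()
results = []

def worker():
    while True:
        try:
            results.append(next(gen))
        except ValueError:
            # "generator already executing": another thread is currently
            # inside the generator's frame, so retry (as the Keras loop does).
            continue
        except StopIteration:
            return

start = time.time()
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Takes ~4 s whether there is 1 worker or 4: only one thread can ever be
# inside the generator frame, even though time.sleep releases the GIL.
print('%d items in %.1f s' % (len(results), time.time() - start))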

@ttal However, I was not able to reproduce a fit_generator slowdown with a modification of the Keras MNIST example between the latest master and d1a3842, in either Python 2 or Python 3; both trained comparably fast, though slower than fit. In 0.3.1 the code actually crashes, because in

try:
    generator_output = next(generator)
except ValueError:
    continue

the except ValueError was not yet introduced. This sets _stop immediately, as the code tries to run the generator from multiple threads at once. @fchollet this suggests to me that, as currently implemented, the entire multithreading setup is flawed: nb_worker threads are fighting over who gets to run the generator, but it is only ever executed in one thread at a time. This appears to hold even for GIL-releasing code like time.sleep, which defeats the entire purpose of multithreading. Perhaps the code needs to accept a generator function instead, creating a generator object _per thread_.
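
What I have in mind is roughly the following (a sketch of the idea only, not a patch against the actual generator_queue code; the function and argument names here are made up):

import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

def generator_queue_per_thread(generator_fn, nb_worker=4, max_q_size=10):
    # generator_fn is a *callable* that returns a fresh generator, so each
    # worker owns its own generator object instead of fighting over one.
    data_q = queue.Queue(maxsize=max_q_size)
    stop = threading.Event()

    def produce():
        for sample in generator_fn():   # per-thread generator instance
            if stop.is_set():
                return
            data_q.put(sample)

    threads = [threading.Thread(target=produce) for _ in range(nb_worker)]
    for t in threads:
        t.daemon = True
        t.start()
    return data_q, stop

# Usage would be something like:
#   data_q, stop = generator_queue_per_thread(lambda: my_batch_generator(files), nb_worker=4)

Of course this only buys real parallelism when the per-thread generators spend their time in GIL-releasing work (disk or network I/O, NumPy, etc.), but at least the workers would no longer serialize on a single generator frame.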
