Callers of generator_queue(), such as fit_generator(), have a race condition when the generator exits after yielding a number of samples equal to or slightly greater than samples_per_epoch: the consumer will occasionally fetch None from the queue instead of the final elements.
According to the documentation, the generator is expected to loop indefinitely, so if by _"when the generator exits"_ you mean that it has finished and that another next() call will raise StopIteration, then this is not a necessary fix.
You might want to iterate over a finite data set exactly once; right now you have to add an unspecified number of padding elements to do that. Also, evaluate_generator() and predict_generator() have the same problem. Failing a fix, I should at least update the docs to say "generator must return at least (samples_per_epoch * nb_epoch) elements".
If you want to iterate over a finite data set just once, then you're expected to set nb_epoch=1 when calling fit_generator(). Doesn't that work?
The thing is that the generator runs on its own thread to maximize GPU throughput (and there even used to be multi-threading support!), which is why it should never stop. The thread fills a queue of samples that handles everything, and for that to work the generator must never finish.
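For context, here is a rough sketch of that pattern (not the actual Keras internals): a background thread keeps pulling batches from the never-ending generator into a bounded queue, so the training loop can grab the next batch without waiting on preprocessing.

```python
import queue
import threading

def start_generator_queue(generator, max_q_size=10):
    # A background thread pulls batches from the (never-ending) generator
    # into a bounded queue; put() blocks when the queue is full, so at most
    # max_q_size batches are buffered ahead of the training loop.
    q = queue.Queue(maxsize=max_q_size)
    stop = threading.Event()

    def fill():
        while not stop.is_set():
            q.put(next(generator))

    thread = threading.Thread(target=fill)
    thread.daemon = True
    thread.start()
    return q, stop
```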
> If you want to iterate over a finite data set just once, then you're expected to set nb_epoch=1 when calling fit_generator(). Doesn't that work?
Yes, that's what I'm doing, but because of the race condition the generator has to stay alive even after it has yielded all of its samples. The fix is simple: just make sure the queue is empty before you check the _stop event.
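To make that concrete, the consumer side would look roughly like this (a sketch of the idea, not the actual patch): keep fetching while there is anything in the queue and only honor the stop event once it has been drained.

```python
import queue

def drain_then_stop(q, stop_event, wait_time=0.05):
    # Only give up once the queue is empty AND the producer has signalled
    # that it is done, so the last few batches are never dropped.
    while True:
        try:
            yield q.get(timeout=wait_time)
        except queue.Empty:
            if stop_event.is_set():
                return
```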
Again, the generator should yield its samples indefinitely and should therefore never exit. You're not making much sense.
"generator must return at least (samples_per_epoch * nb_epoch) elements".
No, the generator should yield exactly samples_per_epoch distinct samples repeatedly, forever.
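In other words, something like this (a minimal sketch, assuming the data is already in memory as arrays X and y):

```python
def looping_generator(X, y, batch_size):
    # Yield the same samples_per_epoch samples over and over, forever;
    # fit_generator never sees a StopIteration.
    n = len(X)
    while True:
        for start in range(0, n, batch_size):
            yield X[start:start + batch_size], y[start:start + batch_size]
```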
My dataset doesn't fit into memory, so I load new files and create a new generator after every epoch, passing it to fit_generator() with nb_epoch=1.
I guess I could use train_on_batch() or make the generator loop over its data multiple times, but I don't see why finite generators are a problem when you know exactly how many samples will be consumed.
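For reference, my per-epoch workflow looks roughly like this (a sketch: load_epoch_files() is a hypothetical stand-in for the file loading and heavy preprocessing, and model is a compiled Keras model defined elsewhere):

```python
batch_size = 32
total_epochs = 10

for epoch in range(total_epochs):
    X, y = load_epoch_files(epoch)             # load the chunk that fits in memory
    gen = looping_generator(X, y, batch_size)  # as sketched above; today it must still yield forever
    model.fit_generator(gen, samples_per_epoch=len(X), nb_epoch=1)
```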
If you do that you lose the immense performance benefit of being able to buffer up samples in the queue at the end of an epoch for the start of the next one.
Unlike the first batches of the first epoch, where your model has to wait for the initial data to be processed, every successive epoch can start immediately because the generator queue keeps filling in the background.
You really shouldn't kill the data generator thread like you're doing, for this reason alone.
What do you even gain from initializing a data generator per epoch?
As part of the preprocessing I have to evaluate the input on a separate large model that is too big to coexist with the training model in RAM. So I can't easily do this inside the generator, since it would swap out the model being trained and they would both grind to a halt. Probably suboptimal, but my epochs are so long that it isn't a huge performance hit.
That seems like a legitimate use case (albeit exotic). It seems like an uphill battle to reload models continuously during training, and nothing was built with this in mind as far as I know.
When you say RAM, do you mean CPU memory or GPU VRAM? I'd just focus on having enough memory so that both models fit, though that could be difficult with VRAM until summer. RAM is cheap though! What is your memory usage? I'm curious.
You could use an on_epoch_end callback to communicate over a network; that's probably your safest bet. There's also an option to pass a generator to validation_data in fit_generator: there the amount of data to loop over is finite, and a StopIteration indicates the end.
More broadly, I'd like Keras to have better support for generators -- they're a handy way of working with large amounts of data.
Similar problem here. I understand that the generator should yield indefinitely, and that does "fix" the problem. But to me, requiring the generator to yield indefinitely IS the problem. Here is why:
I'm using predict_generator() to classify images. My generator code is similar to this (simplified):
def gen():
    for i in range(100):
        # image_paths is the full list of image file paths (simplified);
        # note the generator stops for good after 100 batches
        yield load_images(image_paths[i * batch_size:(i + 1) * batch_size])
Keras returns this error:
ValueError: output of generator should be a tuple (x, y, sample_weight) or (x, y). Found: None
After spending about an hour debugging my code to find out why my generator was returning None, I finally stumbled upon this thread and realized that Keras expects generators to yield indefinitely, even when they don't have anything else to yield.
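For anyone else who hits this: given the current behavior, the workaround is apparently to keep the generator alive after its last useful batch, e.g. by wrapping the loop (same placeholder names as the example above):

```python
def gen():
    while True:  # keep yielding so Keras never pulls None off the queue
        for i in range(100):
            yield load_images(image_paths[i * batch_size:(i + 1) * batch_size])
```

Keras stops pulling once it has the requested number of samples, so the extra iterations are never actually consumed.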
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs, but feel free to re-open it if needed.