I am combining the pybedtools library with the Keras Sequence class and then training with the fit_generator function. After 1 or 2 epochs of training, Keras gets stuck and refuses to move on to the next epoch. This error only appears in Keras 2.2.1 and 2.2.2; it is absent in 2.2.0.
Here is a piece of code that should reliably reproduce the bug (you may need to run it a few times to trigger it):
https://gist.github.com/daquang/f2ffcc8d95e4c092c58a5d3036144185
You can install pybedtools (https://github.com/daler/pybedtools) on Anaconda with
conda install -c bioconda pybedtools
The error goes away if line 30 is removed or if validation_data is set to None in line 64.
I am also using Keras 2.2.2, and I always use a validation_data generator, but I have never seen this error.
I think it may be related to the combination of use_multiprocessing=True and workers; as I recall, this has been described in another issue. You could try using only workers and see the result.
Yes, the problem can also be solved by setting use_multiprocessing=False, but that slows down training. Ideally I'd like to keep it set to True while also keeping a validation_data generator. It just seems strange to me that the problem goes away when I use Keras 2.2.0 instead.
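For anyone who just needs a stopgap, here is a minimal sketch of the thread-based workaround discussed above; train_seq, val_seq, and model are placeholder names, not objects from the gist:
# Sketch: keep parallel data loading via a thread pool instead of worker
# processes. use_multiprocessing=False avoids the fork-related deadlock,
# at the cost of running the Sequence code under the Python GIL.
model.fit_generator(
    train_seq,                    # a keras.utils.Sequence instance (placeholder)
    validation_data=val_seq,      # placeholder
    epochs=5,
    workers=4,                    # several loader threads
    use_multiprocessing=False)    # threads, not processes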
I'm also seeing this on Keras 2.2.2, but not on 2.2.0 - looks like a bug that's been introduced.
Setting use_multiprocessing=False doesn't help for me.
After banging my head for a while on this issue, I tried
c = pbt.BedTool(a.cat(b,postmerge=True)) and the deadlock doesn't seem to happen anymore? Could you confirm?
No, the deadlock still occurs even when I set postmerge to True. Actually, the default option for postmerge is True. Try running it a few times. The deadlock typically occurs for me only about 1 out of 3 times.
Oh, you're right, my bad.
Could you try adding the following at the start of your script?
import multiprocessing as mp
mp.set_start_method('spawn', force=True)
I think your library may not be fork-safe?
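To make the suggestion concrete, here is a minimal sketch of how the 'spawn' start method is usually applied; main() is a placeholder for the training code from the gist, and the __main__ guard is required because spawned workers re-import the script:
import multiprocessing as mp

def main():
    # placeholder: build the Sequences and the model, then call
    # model.fit_generator(..., use_multiprocessing=True, workers=...)
    pass

if __name__ == '__main__':
    # Must run before any pool/worker is created. 'spawn' starts clean
    # interpreter processes instead of fork()ing the parent, so non-fork-safe
    # state (open handles, C-level locks) is not inherited.
    mp.set_start_method('spawn', force=True)
    main()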
I have the same scenario, though without pybedtools, with Keras 2.2.1 through 2.2.4 and TensorFlow 1.9 through 1.10.1.
Not sure if it helps, but your code gives me the following with Keras 2.2.0 and TensorFlow 1.9:
Exception in thread Thread-6:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jaremenko/.local/lib/python3.5/site-packages/keras/utils/data_utils.py", line 548, in _run
    with closing(self.executor_fn(_SHARED_SEQUENCES)) as executor:
  File "/home/jaremenko/.local/lib/python3.5/site-packages/keras/utils/data_utils.py", line 522, in <lambda>
    initargs=(seqs,))
  File "/usr/lib/python3.5/multiprocessing/context.py", line 118, in Pool
    context=self.get_context())
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 168, in __init__
    self._repopulate_pool()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 233, in _repopulate_pool
    w.start()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.5/multiprocessing/context.py", line 274, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_spawn_posix.py", line 33, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_spawn_posix.py", line 48, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib/python3.5/multiprocessing/reduction.py", line 59, in dump
    ForkingPickler(file, protocol).dump(obj)
MemoryError
In the run that produced the traceback above, I also had this "error":
2018-10-04 20:51:53.203955: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.16GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Reducing the batch size so that no error occurred led to two successful runs with stored models.
Not sure where the connection is, just wanted to let you know; maybe this was just luck?
I have since removed the multiprocessing code and am running on Keras 2.2.0 and TensorFlow 1.9. I guess it might have just been the memory, with no connection to the Keras and TensorFlow versions?
@Dref360 Sorry for taking so long. When I add those lines at the start of the script, it reliably finishes. However, I now get the "Using TensorFlow backend." message several times each epoch, and each epoch is slower by a few seconds (which may not be bad if real training takes several hours per epoch). The constant messages are a little annoying, though.
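As a side note on the repeated banner: "Using TensorFlow backend." is printed to stderr when keras is imported, and with 'spawn' every worker re-imports the script. A commonly shared, unofficial workaround is to silence stderr around the import; this is a hack, not a Keras API:
import os
import sys

_stderr = sys.stderr
sys.stderr = open(os.devnull, 'w')   # temporarily discard stderr output
import keras                         # banner is printed here and swallowed
sys.stderr.close()
sys.stderr = _stderr                 # restore normal stderr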
I'm on an HPC cluster and I still use TensorFlow 1.10 with the integrated Keras 2.1.6. I'm training on V100 GPUs, and they are barely utilized without multiprocessing, as the GPU is just too fast and the data isn't loaded quickly enough. Multiprocessing works fine with one GPU, and with two V100 GPUs as well. With a K20 I can run one GPU with multiprocessing, but two run into a deadlock before the validation pass of the first epoch. I use the Keras ImageDataGenerator and flow_from_directory, no custom-written generator or similar.
from __future__ import print_function
import numpy as np
import os
import sys
import logging
if len(sys.argv) < 4:
    logging.error("Too few arguments! Usage: python3 " + sys.argv[0] + " <input-folder> <batch-size> <gpu-count>")
    sys.exit(1)
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import math
import time
from random import shuffle
# from tf.keras import models
# from tf.keras.preprocessing.image import ImageDataGenerator
# from tf.keras.models import Sequential
# from tf.keras.layers import Dense, Dropout, Flatten
# from tf.keras.utils import plot_model, to_categorical
# from tf.keras.layers import Conv2D, MaxPooling2D
# from tf.keras import backend as K
#from matplotlib import pyplot as plt
#from tf.keras.models import load_model
train_data_folder = str(sys.argv[1]) + "/train"
validation_data_folder = str(sys.argv[1]) + "/validation"
print("Train data folder: " + train_data_folder)
print("Validation data folder: " + validation_data_folder)
batch_size = int(sys.argv[2])
max_word_length = 16
epochs = 15
img_width, img_height = 265, 64
gpu_count = int(sys.argv[3])
# mirrored_strategy = tf.contrib.distribute.MirroredStrategy()
logging.debug("Creating model")
# with mirrored_strategy.scope():
model = tf.keras.Sequential()
model.add(tf.keras.layers.Conv2D(12, kernel_size=(3, 3),
                                 activation='relu',
                                 data_format='channels_last',
                                 input_shape=(img_height, img_width, 1)))
model.add(tf.keras.layers.Conv2D(24, (3, 3), data_format='channels_last', activation='relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Conv2D(12, (3, 3), data_format='channels_last', activation='relu'))
model.add(tf.keras.layers.Conv2D(24, (3, 3), data_format='channels_last', activation='relu'))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Dense(max_word_length, activation='softmax'))
model.summary()
logging.debug("Compiling model")
start = time.time()
if gpu_count <= 1:
    logging.debug("Training with 1 or fewer GPUs.")
    model.compile(loss='categorical_crossentropy',
                  optimizer=tf.keras.optimizers.Adadelta(),
                  metrics=['accuracy'])
else:
    logging.debug("Training with {} GPUs...".format(gpu_count))
    with tf.device("/cpu:0"):
        model = tf.keras.utils.multi_gpu_model(model, gpus=gpu_count)
    model.compile(loss='categorical_crossentropy',
                  optimizer=tf.keras.optimizers.Adadelta(),
                  metrics=['accuracy'])
end = time.time()
print("Time to compile model: ", end='')
print(end - start)
logging.debug("Fitting/Training model")
start = time.time()
train_datagen = ImageDataGenerator(
    rescale=1./255,
    data_format="channels_last")
    #shear_range=0.2,
    #zoom_range=0.2,
    #horizontal_flip=True)
test_datagen = ImageDataGenerator(
    rescale=1./255,
    data_format="channels_last")
    #shear_range=0.2,
    #zoom_range=0.2,
    #horizontal_flip=True)
train_generator = train_datagen.flow_from_directory(
    train_data_folder,
    target_size=(img_height, img_width),
    color_mode='grayscale',
    batch_size=batch_size,
    class_mode='categorical')
validation_generator = test_datagen.flow_from_directory(
    validation_data_folder,
    target_size=(img_height, img_width),
    color_mode='grayscale',
    batch_size=batch_size,
    class_mode='categorical')
end = time.time()
print("Time to prepare input data: ", end='')
print(end - start)
count_of_validation_files = 0
for root, dirs, files in os.walk(validation_data_folder):
    count_of_validation_files += len(files)
count_of_train_files = 0
for root, dirs, files in os.walk(train_data_folder):
    count_of_train_files += len(files)
start = time.time()
model.fit_generator(
    train_generator,
    steps_per_epoch=math.ceil(count_of_train_files / batch_size),  # Needs to be set for newer TensorFlow versions due to a bug: https://github.com/keras-team/keras/issues/11452 and MirroredStrategy
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=math.ceil(count_of_validation_files / batch_size),  # Needs to be set for newer TensorFlow versions due to a bug: https://github.com/keras-team/keras/issues/11452 and MirroredStrategy
    use_multiprocessing=True,
    workers=24
)
end = time.time()
print("Time to train model: ", end='')
print(end - start)
score = model.evaluate_generator(train_generator, steps=math.ceil(count_of_train_files / batch_size))
print('Test loss:', score[0])
print('Test accuracy:', score[1])
model.save('mnist_word_length-' + str(sys.argv[2]) + '.h5')
print("Batch size: " + str(batch_size))
print("Epoch amount: " + str(epochs))
print("GPU amount: " + str(gpu_count))
But it's not always possible to rely on the predefined generators, and the problem still persists with custom generators.
This might be pure conjecture, but I had the same issue when passing use_multiprocessing=True to fit_generator, because non-picklable items (functions) were included in my Sequence object's state. Once I got rid of those, the problem no longer reproduced (using TensorFlow 1.13.1).
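To illustrate what "picklable state" means here, a minimal sketch of a Sequence whose attributes are all plain data (file paths, arrays, integers), with hypothetical names and standalone Keras imports; storing lambdas, open file handles, or similar in self is what breaks pickling when workers are started:
import numpy as np
from keras.utils import Sequence

class PicklableSequence(Sequence):
    """Hypothetical example: every attribute is plain data, so the object can
    be pickled and shipped to worker processes."""

    def __init__(self, file_paths, labels, batch_size):
        self.file_paths = list(file_paths)  # paths (strings), not open handles
        self.labels = np.asarray(labels)    # arrays pickle fine
        self.batch_size = batch_size        # plain int; no lambdas/closures here

    def __len__(self):
        return int(np.ceil(len(self.file_paths) / self.batch_size))

    def __getitem__(self, idx):
        start = idx * self.batch_size
        batch_paths = self.file_paths[start:start + self.batch_size]
        x = np.stack([np.load(p) for p in batch_paths])  # load inside the worker
        y = self.labels[start:start + self.batch_size]
        return x, y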
Try checking out this:
https://stackoverflow.com/questions/59027150/keras-training-freezes-during-fit-generator
This fixed the problem for me.