I want to fine-tune ResNet-50 on my dataset.
But I face the problem that when one epoch ends and the validation run starts, it becomes really slow; the validation time is even longer than the training time, and I'm not sure what is happening.
Here is part of my code:
train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    featurewise_center=False,             # set input mean to 0 over the dataset
    samplewise_center=False,              # set each sample mean to 0
    featurewise_std_normalization=False,  # divide inputs by std of the dataset
    samplewise_std_normalization=False,   # divide each input by its std
    zca_whitening=False,                  # apply ZCA whitening
    rotation_range=20,                    # randomly rotate images in the range (degrees, 0 to 180)
    width_shift_range=0.1,                # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.1,               # randomly shift images vertically (fraction of total height)
    horizontal_flip=True,                 # randomly flip images
    vertical_flip=False,
    zoom_range=0.1,
    channel_shift_range=0.,
    fill_mode='nearest',
    cval=0.,
)
test_datagen = ImageDataGenerator(rescale=1. / 255)
train_generator = train_datagen.flow_from_directory(
    '/home/amanda/anaconda2/envs/tensorflow/lib/python2.7/site-packages/keras/datasets/nuclear/CRCHistoPhenotypes_2016_04_28/cropdetect/train',
    target_size=(224, 224),
    batch_size=batch_size,
    class_mode='categorical')
validation_generator = test_datagen.flow_from_directory(
    '/home/amanda/anaconda2/envs/tensorflow/lib/python2.7/site-packages/keras/datasets/nuclear/CRCHistoPhenotypes_2016_04_28/cropdetect/val',
    target_size=(224, 224),
    batch_size=batch_size,
    class_mode='categorical')
model.fit_generator(train_generator,
                    # steps_per_epoch=X_train.shape[0] // batch_size,
                    samples_per_epoch=35946,
                    epochs=epochs,
                    validation_data=validation_generator,
                    verbose=1,
                    nb_val_samples=8986,
                    callbacks=[earlyStopping, saveBestModel, tensorboard])
I am having a similar problem where using flow
is considerably faster than using flow_from_directory
for both training and validation and I can't find a good reason to explain why. Would be grateful to get an insight from an expert in Keras :)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.
I'm having the same issue. Using fit_generator, my validation step is significantly longer than my training step, even though it has fewer steps.
I'm facing the same problem too; the validation step is very, very slow.
I hope someone figures out how to handle this problem.
I believe I am having a related problem. When I run fit_generator in Spyder, the network goes through the training and finishes the first epoch fine (I'm only training one epoch so I can try to debug). But then the kernel dies when it tries to validate.
I'm facing the same issue. Running fit_generator, I found that the validation pass after each epoch is incredibly slow. I ran the same model to predict over the whole validation set directly, and it is a lot faster (like 20x faster). Any idea about what is happening would be nice @fchollet
I had the same issue. I fixed it by following the instructions in this issue:
https://github.com/fchollet/keras/issues/6406
You have to set the "steps_per_epoch" and "validation_steps" parameters correctly.
In the example from @SIAAAAAA, I think that uncommenting the line
steps_per_epoch=X_train.shape[0] // batch_size,
and setting validation_steps to X_val.shape[0] // batch_size
should be enough.
It considerably improved the training time for me.
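Concretely, the call would look something like this (a minimal sketch using the Keras 2 parameter names; X_train and X_val are the illustrative array names from the comments above, standing in for whatever holds your samples):

steps_per_epoch = X_train.shape[0] // batch_size      # batches per training epoch
validation_steps = X_val.shape[0] // batch_size       # batches drawn per validation pass
model.fit_generator(train_generator,
                    steps_per_epoch=steps_per_epoch,
                    epochs=epochs,
                    validation_data=validation_generator,
                    validation_steps=validation_steps,
                    verbose=1,
                    callbacks=[earlyStopping, saveBestModel, tensorboard])

With both steps set, Keras stops after the given number of batches instead of draining the generator.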
I'm having the same problem, but using a custom generator. Has anyone solved this yet?
I also faced the same problem, but by doing the following, the validation time improved considerably.
# training
hi = model.fit_generator(
    train_generator,
    # samples_per_epoch=n_iteration,   # old Keras 1 name, replaced below
    steps_per_epoch=nb_samples // nb_batch,
    epochs=nb_epoch,
    validation_data=validation_generator,
    # nb_val_samples=nb_val_samples,   # old Keras 1 name, replaced below
    validation_steps=nb_val_samples // nb_batch,
    callbacks=callbacks,
    verbose=1
)
Having the same problem; validation is very slow. My generator is correct for Keras 2 and I am using a GPU. Anyone have a solution?
My model is fairly simple:
Layer (type)          Output Shape    Param #
flatten_1 (Flatten)   (None, 64)      0
dense_1 (Dense)       (None, 64)      4160
dense_2 (Dense)       (None, 48)      3120
dense_3 (Dense)       (None, 32)      1568
dense_4 (Dense)       (None, 24)      792
dense_5 (Dense)       (None, 12)      300
dense_6 (Dense)       (None, 1)       13
model.compile(optimizer="sgd", loss='mean_squared_error', metrics=['mae'])
mysteps = max(len(X_train) // batch_size, 1)
history = model.fit_generator(train_generator,
                              steps_per_epoch=mysteps,
                              epochs=30,
                              validation_data=valid_generator,
                              validation_steps=len(X_validation),
                              verbose=1)
This is very slow when I run it on an AWS Ubuntu machine using a GPU. If I run the same code on a Windows machine it is much faster. Very strange.
I met the same problem today. When len(valid_generator) == 500, it took me almost five minutes to evaluate. When I changed len(valid_generator) to 20, it took me less than 20 seconds. validation_steps and batch_size don't matter; it's len(valid_generator) that matters. Kind of weird, I think, because the validation time should be proportional to validation_steps and batch_size.
I found the same thing, the __len__ function on the generator needs to return a small number for the validation data generator (much smaller than for the training data generator) and then it becomes manageable. If both generators return the same length then validation is impossibly long - more than several hours in my case (I don't know how long because I never waited long enough to see!)
[Also, this seems to be only a problem if workers>0 in the fit_generator method. If I set workers=0 then validation completes fine in a short time]
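For reference, this is the shape of the custom generator being discussed (a minimal illustrative sketch; the class and variable names are made up). The value returned by __len__ is what Keras treats as one full pass over the data, so whatever it returns bounds the validation loop:

import numpy as np
from keras.utils import Sequence

class ArraySequence(Sequence):
    # Illustrative Sequence over in-memory arrays.
    def __init__(self, x, y, batch_size):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        # Keras reads this to decide how many batches make one full pass,
        # so an oversized value here stretches validation accordingly.
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[sl], self.y[sl]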
I have the same problem: in every epoch, the validation phase is slower than the training phase.
The two phases take the same time on an 18-core Xeon, but validation takes 4 times the training time on the Intel Phi architecture (TensorFlow MKL binary).
I think/suspect that model evaluation for the validation phase does not take advantage of parallelization. This could be the core of the problem. Please check.
Regards
I'm still having this problem. It seems that the fit_generator method does not pay attention to the validation_steps parameter. I have set validation_steps to 15, but it is pulling len(data_generator) batches and ignoring the parameter value. As per the comment by @Neutrino3316 above, it is the len(data_generator) method that matters for keeping validation time down, and if this value is not extremely low, then validation takes forever.
Can we reopen this issue, as it is not fixed?
@keelinm How to reopen this issue?
@Neutrino3316 I am not sure how to reopen it. Maybe @fchollet or one of the team can advise on how to proceed...?
Does anyone know how to tell whether the validation process is still running or the program is just doing nothing (frozen)?
Please reopen this issue; I am still facing it on Keras 2.2.2.
I have the same problem now, and I think the cause is that ImageDataGenerator is too slow at loading data from disk. I tested loading 50 batches of 32 images each on my server, and it took nearly 50 seconds. Maybe you can run the same test on your server to confirm the root of the problem.
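A quick way to reproduce that measurement (a minimal sketch; the directory path and sizes are placeholders for your own setup):

import time
from keras.preprocessing.image import ImageDataGenerator

# Placeholder path and batch size; point these at your own data.
gen = ImageDataGenerator(rescale=1. / 255).flow_from_directory(
    'path/to/val', target_size=(224, 224), batch_size=32)

start = time.time()
for _ in range(50):   # pull 50 batches straight from disk
    next(gen)
print('50 batches took %.1f s' % (time.time() - start))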
@justicevita Agreed: my laptop (SSD) with a damn 940 gets validation steps done about 10 times faster than my workstation armed with a 1080 Ti.
SIGN
In my case the overall data used in validation was smaller than or equal in size to the training data, but the validation time was far greater, so in my case the "validation slowness" seems unrelated to storage speed. I keep thinking that Keras (with the TensorFlow backend) does not take advantage of parallelization or any acceleration during validation. My two cents.
I'm seeing this issue on a relatively small dataset of images:
train size: 52102, validation size: 6512, test_size: 6512
Using a tf.data dataset with a Keras model (no Estimator), training in both eager and non-eager mode results in one epoch taking ~5 minutes, but validation taking close to 25 minutes on Google Colab.
I set up my dataset like so:
# Split the final dataset into train / validation / test splits for our model.
DATASET_SIZE = len(all_image_paths)
train_size = int(0.8 * DATASET_SIZE)
val_size = int(0.1 * DATASET_SIZE)
test_size = int(0.1 * DATASET_SIZE)
print("train size: " + str(train_size) + ", validation size: " + str(val_size) + ", test_size: " + str(test_size))
# Take/skip before repeating so the splits don't overlap; only the
# training split needs to repeat across epochs.
train_dataset = ds.take(train_size).repeat()
val_dataset = ds.skip(train_size).take(val_size)
test_dataset = ds.skip(train_size + val_size).take(test_size)
And train like so:
steps_per_epoch = int(math.floor(train_size / BATCH_SIZE))
val_steps_per_epoch = int(math.floor(val_size / BATCH_SIZE))
epochs = 5
history = model.fit(train_dataset,
                    epochs=epochs,
                    steps_per_epoch=steps_per_epoch,
                    validation_data=val_dataset,
                    validation_steps=val_steps_per_epoch)
Curious if anyone has any pointers. Thank you in advance.
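One property of that pipeline worth flagging: ds.skip(train_size) still has to iterate past all 52102 skipped training examples every time the validation split is walked, which by itself can make validation far slower than training. A hedged mitigation, assuming the validation split fits in memory and that batching happens after this point in the pipeline, is to cache the split once it is carved out:

import tensorflow as tf

# Hypothetical mitigation: cache the small validation split so the
# skip/decode work only happens on the first pass over it.
val_dataset = (ds.skip(train_size)
                 .take(val_size)
                 .cache()   # materialized after the first pass
                 .batch(BATCH_SIZE)
                 .prefetch(tf.data.experimental.AUTOTUNE))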
I recently had the same issue, and it was because my validation_steps was 100 000. Decreasing it to an acceptable value (starting at 10, then 100, then 1000, ...) solved my problem.
Now my network's validation duration is acceptable.
Try to specify validation_steps correctly. If you don't, or set it randomly, you will face strange lag while an overwhelming amount of data is generated during validation, as in my snippet below:
concate_model.compile(loss='mean_squared_error',
                      metrics={'Steer': 'mse', 'Speed': 'mse'},
                      optimizer=Adam(learning_rate=args.learning_rate))
history = concate_model.fit_generator(
    batch_generator(args.data_dir, X_train_image, X_train_Sequence, Y_train_steer,
                    Y_train_speed, args.batch_size, True, args.samples_per_epoch),
    args.samples_per_epoch,
    args.nb_epoch,
    max_q_size=10,
    validation_data=batch_generator(args.data_dir, X_valid_image, X_valid_Sequence,
                                    Y_valid_steer, Y_valid_speed, args.batch_size,
                                    False, args.samples_per_epoch),
    callbacks=[checkpoint],
    verbose=1,
    # the mistake described above: multiplying instead of dividing makes
    # validation pull far too many batches
    validation_steps=args.samples_per_epoch * args.batch_size * args.test_size)