Hello, I run a slightly modified version of the Keras fine-tuning example which only fine-tunes the top layers (with Keras 2.0.3/TensorFlow on Ubuntu with GPU). It looks like the following:
```python
from keras import applications, optimizers
from keras.models import Sequential, Model
from keras.layers import Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator

img_width, img_height = 150, 150
train_data_dir = 'data/train_s'
validation_data_dir = 'data/val_s'
nb_train_samples = 2000
nb_validation_samples = 800
epochs = 10
batch_size = 16

base_model = applications.VGG16(weights='imagenet', include_top=False,
                                input_shape=(img_width, img_height, 3))
top_model = Sequential()
top_model.add(Flatten(input_shape=base_model.output_shape[1:]))
top_model.add(Dense(256, activation='relu'))
top_model.add(Dense(1, activation='sigmoid'))
model = Model(inputs=base_model.input, outputs=top_model(base_model.output))
model.compile(loss='binary_crossentropy',
              optimizer=optimizers.SGD(lr=1e-4, momentum=0.9),
              metrics=['accuracy'])

train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)
test_datagen = ImageDataGenerator(rescale=1. / 255)

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary')
validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary', shuffle=False)

model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=nb_validation_samples // batch_size,
    verbose=2, workers=12)

score = model.evaluate_generator(validation_generator, nb_validation_samples / batch_size, workers=12)
scores = model.predict_generator(validation_generator, nb_validation_samples / batch_size, workers=12)

correct = 0
for i, n in enumerate(validation_generator.filenames):
    if n.startswith("cats") and scores[i][0] <= 0.5:
        correct += 1
    if n.startswith("dogs") and scores[i][0] > 0.5:
        correct += 1

print("Correct:", correct, " Total: ", len(validation_generator.filenames))
print("Loss: ", score[0], "Accuracy: ", score[1])
```
With this, I get unreliable validation accuracy results. For example, predict_generator predicts 640 out of 800 classes correctly (80%), whereas evaluate_generator produces an accuracy score of 95%. Someone in #3477 suggests removing the rescale=1. / 255 parameter from the validation generator; with that change I get 365/800 = 45% from the prediction loop and 89% from evaluate_generator.
Is there something wrong with my evaluation or is this due to a bug? There are many similar issues (e.g. #3849, #6245) where the stated accuracy (during training and afterwards) doesn't match the actual predictions. Could someone experienced maybe shed some light on this problem? Thanks
Having looked into the backend, I believe this is due to the number of workers: if you have more than one worker, there is nothing to ensure consistency of file loading across them. Therefore it is possible that some files are shown multiple times in these methods, as the 12 generators are randomly initialised but don't actually share a file-list state.
Maybe @fchollet or @farizrahman4u can confirm?
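For illustration, this is the kind of locking wrapper (a generic sketch, not Keras code) that people use to make a plain Python generator safe to share across worker threads; without something like it, each worker just pulls whatever batch happens to come next:

```python
import threading

class ThreadSafeIterator:
    """Serialise next() calls from multiple worker threads with a lock."""

    def __init__(self, iterator):
        self.iterator = iterator
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self.lock:
            return next(self.iterator)

# usage (hypothetical): safe_gen = ThreadSafeIterator(my_batch_generator())
```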
Thanks for your answer. I already considered that the image generators are not thread-safe. However, I was able to reproduce the problem with all three worker counts set to 1. Does the order of the filenames maybe not correspond to the order of the scores?
Recently added similar issues are #6540 and #6544.
Plus, @fchollet mentioned here that the ImageDataGenerator supports multiprocessing.
@skoch9 try setting pickle_safe=True. As @joeyearsley mentioned, for me it had to do with workers > 1. I am actually running a custom version of Keras where the evaluate_generator call inside fit_generator uses workers=1. That way I can train with multiple workers but predict/evaluate with a single worker.
@fchollet please make evaluate_generator and predict_generator always use workers=1, or remove the parameter until it is fixed.
Make sure:
1) shuffle=False
2) pickle_safe=True
3) workers=1
Let me know if that gives you consistent results.
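For concreteness, here is roughly how those settings would look applied to the evaluate/predict calls from the first post (a sketch against the Keras 2.0.x argument names; the reset() call assumes the usual flow_from_directory iterator):

```python
# Sketch only: the checklist settings applied to the calls from the first post.
# shuffle=False is already set on validation_generator there.
steps = nb_validation_samples // batch_size

score = model.evaluate_generator(validation_generator, steps,
                                 workers=1, pickle_safe=True)
validation_generator.reset()   # start again from the first file before predicting
scores = model.predict_generator(validation_generator, steps,
                                 workers=1, pickle_safe=True)
```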
Thanks @abnera for your answer. I investigated a little further and found that running evaluate_generator before predict_generator without setting pickle_safe=True messes up the predictions of the latter, even without multiprocessing:
```python
score = model.evaluate_generator(validation_generator, nb_validation_samples / batch_size, pickle_safe=False)
scores = model.predict_generator(validation_generator, nb_validation_samples / batch_size)
```
As for the parameters, when setting workers > 1, shouldn't pickle_safe automatically be set to True for fit_generator as well?
@skoch9 no. pickle_safe is False by default in all Keras generator methods. The only difference between the two modes is that False uses multithreading and True uses multiprocessing.
Sure, but it doesn't make sense to allow invalid parameter configurations. And I would also consider it problematic (i.e. a bug) that running evaluate_generator before predict_generator changes the prediction results.
Hello,
I made a test based on the mnist_cnn.py example. I just replaced model.fit with
```python
gen = matrix_generator(x_train, y_train, batch_size=batch_size)
val_steps = len(y_test) / batch_size
model.fit_generator(gen, epochs=epochs,
                    steps_per_epoch=steps_per_epoch,
                    validation_data=(x_test, y_test))
```
with the generator defined to take sub-parts of the input matrix:
```python
def matrix_generator(x, y, batch_size=1, validation=False):
    import numpy as np
    item = 0
    tot_item, dimx, dimy, dimz = x.shape
    X = np.zeros((batch_size, dimx, dimy, dimz), dtype=np.float32)
    if len(y.shape) == 1:
        Y = np.zeros((batch_size))
    else:
        Y = np.zeros((batch_size, y.shape[1]))
    while True:
        for bs in range(batch_size):
            ximg = x[item, :, :, :]
            yout = y[item, :]
            yout = yout.reshape(1, yout.shape[0])
            Y[bs, :] = yout
            X[bs, :, :, :] = ximg
            item += 1
            if item > tot_item - 1:
                item = 0
        if validation:
            yield X
        else:
            yield X, Y
```
Now if I test the model:
```
model.evaluate(x_test, y_test, verbose=0)
Out[18]: [0.16519621040821075, 0.96389999999999998]
```
but
```
gen_test = matrix_generator(x_test, y_test, batch_size=batch_size)
model.evaluate_generator(gen_test, len(y_test) / batch_size, workers=1)
Out[17]: [0.31920816290837067, 0.93760016025641024]
```
Same learned parameters, same test data, but different results. What should I do?
Many thanks for your help.
To follow up: I get the correct result only if I set max_q_size=1. If I only set workers=1 it does not work either (and gives different results each time).
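A possible explanation for the max_q_size dependence (my own reading of the generator above, not something confirmed in this thread): matrix_generator re-uses the same X and Y buffers and refills them in place, so with max_q_size > 1 batches still sitting in the prefetch queue can be overwritten before the model consumes them. Yielding fresh arrays per batch avoids relying on max_q_size=1; a rough sketch:

```python
import numpy as np

def matrix_generator_copy(x, y, batch_size=1):
    # Same wrap-around logic as matrix_generator above, but yields new arrays
    # each time instead of refilling shared buffers, so queued batches stay intact.
    item = 0
    tot_item = x.shape[0]
    while True:
        idx = [(item + i) % tot_item for i in range(batch_size)]
        item = (item + batch_size) % tot_item
        yield x[idx].astype(np.float32), np.array(y[idx])
```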
Hi,
I am having a similar problem with a binary classifier that uses 2 outputs. model.evaluate_generator() and model.predict() both suggest that I have 50% accuracy (chance) and a loss of over 1.0, while model.fit_generator() always gives me a loss of under 0.5 and an accuracy of over 80%.
```python
for x in range(10000):
    print("starting epoch {0}".format(x))
    mg_acc = mg_model.evaluate_generator(mg_gen, 2,
                                         max_queue_size=1, workers=1, use_multiprocessing=False)
    mg_model.fit_generator(mg_gen, 2, epochs=1, callbacks=mg_callbacks,
                           max_queue_size=1, workers=1, use_multiprocessing=False, verbose=1)
    print(mg_acc)
    print(mg_model.metrics_names)
```
It does not matter if use_multiprocessing is True or False, or the order of the calls (fit_generator before evaluate_generator or vice-versa), or if one call is commented out.
I know that this code will not result in identical generator output (each call gets the next iteration of the generator), but the output of the generator is similar between iterations, and evaluate_generator is always the one that does poorly, regardless of order.
EDIT: SOLVED: My accuracy difference was eliminated when I removed all batch normalization layers (also removed dropout layers, but I don't think that was the cause)
I too have a big difference between the reported fit_generator results and the later evaluate_generator results. I've looked into this a bit and found the following:

- When I use evaluate (without any generators), the output is exactly the same as evaluate_generator without shuffling.
- When I use model.predict and compute the metrics manually, I get the same measurements reported by fit_generator (and the same results as evaluate_generator with shuffling).

Can anyone verify that any of the above happens to them as well?
A note: I use a very simple model, just one dense layer, with no dropout or batch normalization to create any doubts as mentioned by @jeremydr2.
@GalAvineri : Very interesting... I just came across your post, after posting my own frustrations: https://github.com/fchollet/keras/issues/5818. I will try to add shuffling to my evaluate_generator, to see if this makes any difference, and if I get the same results as you do, although I cannot see any sense in shuffling during model evaluation...
I had a similar problem using fit_generator with multiprocessing under Linux:
During training the loss was falling rapidly with implausibly high accuracies. However, these could in no way be reproduced when I tested the model on the same data. Even more strangely, when I turned off multiprocessing, accuracies were suddenly realistic again.
Turns out the problem was a combination of OS behavior and my data generator, which was internally doing some shuffling using np.random. Since Linux uses fork(2) to spawn child processes and the initialization of the data generator was happening outside of the MP part, all workers were using the same seed and were generating identical batches. Note that this wasn't a problem under Windows, since there each child process is spun up independently [1]. The resolution was to seed np.random in __getitem__(self, idx).
Maybe this saves time for some of you.
[1] http://rhodesmill.org/brandon/2010/python-multiprocessing-linux-windows/
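For anyone wondering what that looks like in practice, here is a minimal sketch (my own example, not the code from the comment above) of a keras.utils.Sequence that re-seeds np.random inside __getitem__ so forked workers stop producing identical batches:

```python
import os
import numpy as np
from keras.utils import Sequence

class RandomBatchSequence(Sequence):
    """Sketch: re-seed np.random per __getitem__ call so that forked worker
    processes on Linux do not all inherit the same RNG state."""

    def __init__(self, x, y, batch_size):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        # Mix the process id into the seed so each worker draws different batches.
        np.random.seed((os.getpid() + idx) % (2 ** 32))
        indices = np.random.randint(0, len(self.x), self.batch_size)
        return self.x[indices], self.y[indices]
```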
In validation_generator = test_datagen.flow_from_directory(...), shuffle=True is set by default. I found that predict_generator(validation_generator) returns its results in the shuffled order, while validation_generator.classes and validation_generator.filenames are not shuffled, so an accuracy calculated from the predict_generator output may be wrong.
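A minimal sketch of how to compute the accuracy with this in mind (names follow the original post; the key point is shuffle=False so the prediction order matches .classes and .filenames):

```python
import numpy as np

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary',
    shuffle=False)  # keep file order fixed so predictions line up with .classes

steps = int(np.ceil(validation_generator.samples / float(batch_size)))
probs = model.predict_generator(validation_generator, steps)
preds = (probs[:, 0] > 0.5).astype(int)
print('accuracy:', np.mean(preds == validation_generator.classes))
```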
I also ran into a similar issue where my fit_generator gives an accuracy and loss of 98% and 0.08. evaluate_generator gives the same accuracy if I use rescale=1. / 255, but if I don't, I get an accuracy and loss of 50% and 7.9 respectively. predict_generator always gives only 50% accuracy whether I use rescale=1. / 255 or not. What should I do?
@jeremydr2 When you removed the batch normalization and dropout what was your accuracy and loss? Was it still 80% and 0.5? Because for me it fell to 50%
@GalAvineri I have a similar issue, but I do not think the issue is about shuffling; I think it is because of the rescale=1. / 255 in ImageDataGenerator.
This issue is reproduced regularly while using fit_generator / evaluate_generator, and it seems pretty critical since it makes fit_generator output during training completely useless.
My problem was using loss='binary_crossentropy' instead of loss='categorical_crossentropy'. This caused my accuracy to be 96% before I even started training on my 15 different classes.
Just noting that for future googlers; it might not be relevant for this thread topic specifically. It might be worth warning against this during evaluation in Keras when a model has 2+ outputs and the accuracy metric is binary.
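A tiny illustration of why that happens (my own example, assuming 15 one-hot classes): with loss='binary_crossentropy' Keras picks binary accuracy, which scores all 15 output units independently, so a model that outputs ~0 everywhere already matches the 14 zeros in every one-hot label.

```python
import numpy as np

y_true = np.eye(15)[np.random.randint(0, 15, size=1000)]  # one-hot labels, 15 classes
y_pred = np.zeros_like(y_true)                            # an "untrained" model's output

binary_acc = np.mean((y_pred > 0.5) == y_true)            # what binary accuracy measures
categorical_acc = np.mean(y_pred.argmax(1) == y_true.argmax(1))
print(binary_acc, categorical_acc)                        # roughly 0.93 vs 0.07
```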
Regarding the earlier comment about fork and np.random: I'm not sure I understand your solution. When you say "The resolution was to seed np.random in __getitem__(self, idx)", could you explain how you did this a little more thoroughly? How do you seed np.random in __getitem__(self, idx)?
Hey guys, I'm having similar issues with predict_generator.
When I'm training I get around 92% train acc and 80% val_acc, but when I make predictions and put them through a confusion matrix the accuracy drops to 50%. Any updates on this?
Hi, same here - manually computed accuracy on this Kaggle competition https://www.kaggle.com/c/aerial-cactus-identification gives me 50-65% accuracy (binary classification problem) while predict_generator gives roughly 95%. I tried the drop_duplicates=True, seed=2019, pickle_safe=True, workers=1 args, and I'm using the correct loss='binary_crossentropy' for the model. I'm happy to give full details but the overall picture seems to be the same as above.
```python
truth_generator = datagen.flow_from_dataframe(dataframe=df_truth, directory="test", x_col="id", y_col="has_cactus", class_mode="binary", target_size=(32,32), batch_size=200, drop_duplicates=True, seed=2019, pickle_safe=True, workers=1)
predictions = model.predict_generator(truth_generator, steps=20)
```
You need to add shuffle=False to flow_from_dataframe.
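Something like this (a minimal version of the call from the previous post, with shuffle=False added and only the flow_from_dataframe arguments kept):

```python
truth_generator = datagen.flow_from_dataframe(
    dataframe=df_truth, directory="test", x_col="id", y_col="has_cactus",
    class_mode="binary", target_size=(32, 32), batch_size=200,
    drop_duplicates=True, seed=2019,
    shuffle=False)  # keep row order so predictions line up with df_truth
predictions = model.predict_generator(truth_generator, steps=20)
```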
Hi,
I encountered the same issue recently, and actually the solution is quite simple.
You use validation_generator two times in a row, and I imagine your number of samples isn't exactly divisible by your batch size. Hence, your generator has a shift in its indices after you use it in model.evaluate_generator. So when you call it again, the generator won't yield the samples in the order you expect.
So you should create a second generator to use in model.predict_generator, or only evaluate your model via evaluate_generator or predict_generator:
```python
validation_generator2 = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary', shuffle=False)

score = model.evaluate_generator(validation_generator, nb_validation_samples / batch_size, workers=12)
scores = model.predict_generator(validation_generator2, nb_validation_samples / batch_size, workers=12)
```
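One way to see the shift described above (a sketch that relies on the iterator's internal batch_index attribute; reset() is an alternative to building a second generator):

```python
print(validation_generator.batch_index)   # 0 on a fresh generator
model.evaluate_generator(validation_generator, nb_validation_samples // batch_size)
print(validation_generator.batch_index)   # usually non-zero: partial batches and
                                          # queue prefetching leave it mid-epoch
validation_generator.reset()              # rewind instead of creating a second generator
print(validation_generator.batch_index)   # back to 0
```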
I am experiencing a similar problem: my model is trained using fit_generator, saved using model.save, loaded using load_model and evaluated using evaluate_generator, but its accuracy is similar to the untrained one. However, model.predict (without a generator) works adequately well. Using keras.__version__ = 2.2.4.
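A cross-check I would run in that situation (a sketch; x_val, y_val and 'model.h5' are placeholders for your own validation arrays and saved file): if the accuracy computed from model.predict matches training but evaluate_generator does not, the generator state/ordering is the likely culprit rather than the saved weights.

```python
import numpy as np
from keras.models import load_model

model = load_model('model.h5')                 # placeholder file name
probs = model.predict(x_val, batch_size=32)    # x_val: validation images as an array
manual_acc = np.mean(probs.argmax(axis=1) == y_val.argmax(axis=1))  # y_val: one-hot labels
print('accuracy from predict:', manual_acc)
```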
This still seems to be a problem. One simple fix was not to use multiprocessing.
I have encountered this problem today, and found the solution: set shuffle=False in the generator you pass to the predict_generator() function.
```python
valid_generator = datagen.flow_from_dataframe(
    dataframe=train_df,
    directory="../images/train/",
    x_col="id",
    y_col="label_2",
    subset="validation",
    batch_size=batch_size,
    seed=42,
    shuffle=False,
    class_mode="categorical",
    classes=classes,
    target_size=(input_shape, input_shape))

step_size_valid = np.ceil(valid_generator.n / valid_generator.batch_size)
model.evaluate_generator(generator=valid_generator, steps=step_size_valid)
...
valid_generator.reset()
model.predict_generator(valid_generator, step_size_valid)
```
For me, predict_generator on a custom data generator also produces a different result compared to evaluate_generator and the verbose output from fit_generator. I didn't shuffle the index in predict_generator.
Same issue. Totally confused from the answers above. Any simple solution?
Same issue. Any simple solution from the official authors yet?
Similar observations to @GalAvineri: when setting shuffle=False, my evaluate_generator accuracy goes up to 92%, which is great but unrealistic. Setting it to True yields much more realistic results. I would appreciate some official guidelines on this - there are others who argue otherwise (@lipinski), which makes it incredibly confusing.
Same issue.
OK, I solved the issue.
My code previously:
```python
scores_evaluation = model.evaluate_generator(test_generator.flow(X_test, Y_test, batch_size=32, shuffle=False), len(Y_test)/32)
scores_prediction = model.predict_generator(test_generator.flow(X_test, batch_size=32, shuffle=False), len(Y_test)/32)
```
However, in flow() shuffle defaults to True, so I added shuffle=False to each flow() call here. Then the problem was solved.
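As a sanity check (a sketch assuming a single sigmoid output and 0/1 labels in Y_test): with shuffle=False both calls see the samples in the same fixed order, so accuracy recomputed from the predictions should now match what evaluate_generator reports.

```python
import numpy as np

steps = int(np.ceil(len(Y_test) / 32.0))
probs = model.predict_generator(
    test_generator.flow(X_test, batch_size=32, shuffle=False), steps)
manual_acc = np.mean((probs[:, 0] > 0.5).astype(int) == Y_test)
print(manual_acc, scores_evaluation[1])   # the two numbers should be close
```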