Hello, I run a slightly modified version of the Keras fine-tuning example which only fine-tunes the top layers (with Keras 2.0.3/TensorFlow on Ubuntu with GPU). It looks like the following:
```python
from keras import applications, optimizers
from keras.models import Sequential, Model
from keras.layers import Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator

img_width, img_height = 150, 150
train_data_dir = 'data/train_s'
validation_data_dir = 'data/val_s'
nb_train_samples = 2000
nb_validation_samples = 800
epochs = 10
batch_size = 16

base_model = applications.VGG16(weights='imagenet', include_top=False,
                                input_shape=(img_width, img_height, 3))
top_model = Sequential()
top_model.add(Flatten(input_shape=base_model.output_shape[1:]))
top_model.add(Dense(256, activation='relu'))
top_model.add(Dense(1, activation='sigmoid'))
model = Model(inputs=base_model.input, outputs=top_model(base_model.output))
model.compile(loss='binary_crossentropy',
              optimizer=optimizers.SGD(lr=1e-4, momentum=0.9),
              metrics=['accuracy'])

train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)
test_datagen = ImageDataGenerator(rescale=1. / 255)

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary')
validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary', shuffle=False)

model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=nb_validation_samples // batch_size,
    verbose=2, workers=12)

score = model.evaluate_generator(validation_generator, nb_validation_samples / batch_size, workers=12)
scores = model.predict_generator(validation_generator, nb_validation_samples / batch_size, workers=12)

correct = 0
for i, n in enumerate(validation_generator.filenames):
    if n.startswith("cats") and scores[i][0] <= 0.5:
        correct += 1
    if n.startswith("dogs") and scores[i][0] > 0.5:
        correct += 1

print("Correct:", correct, " Total: ", len(validation_generator.filenames))
print("Loss: ", score[0], "Accuracy: ", score[1])
```
With this, I get unreliable validation accuracy results. For example, predict_generator predicts 640 out of 800 classes correctly (80%), whereas evaluate_generator produces an accuracy score of 95%. Someone in #3477 suggests removing the rescale=1. / 255 parameter from the validation generator; with that change I get 365/800 = 45% from the prediction loop and 89% from evaluate_generator.
Is there something wrong with my evaluation or is this due to a bug? There are many similar issues (e.g. #3849, #6245) where the stated accuracy (during training and afterwards) doesn't match the actual predictions. Could someone experienced maybe shed some light on this problem? Thanks
Having looked into the backend, I believe this is due to the number of workers: if you have more than one worker, there is nothing to ensure consistency of file loading across them. Therefore it is possible that some files are shown multiple times in these methods, as the 12 generators are randomly initialised but don't actually share a file-list state.
Maybe @fchollet or @farizrahman4u can confirm?
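For illustration, this is the kind of locking wrapper (a generic sketch, not Keras code) that people use to make a plain Python generator safe to share across worker threads; without something like it, each worker just pulls whatever batch happens to come next:

```python
import threading

class ThreadSafeIterator:
    """Serialise next() calls from multiple worker threads with a lock."""

    def __init__(self, iterator):
        self.iterator = iterator
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self.lock:
            return next(self.iterator)

# usage (hypothetical): safe_gen = ThreadSafeIterator(my_batch_generator())
```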
Thanks for your answer. I already considered that the image generators are not thread-safe. However, I was able to reproduce the problem with all three worker counts set to 1. Does the order of the filenames maybe not correspond to the order of the scores?
Recently added similar issues are #6540 and #6544.
Plus, @fchollet mentioned here that the ImageDataGenerator supports multiprocessing.
@skoch9 try setting pickle_safe=True. As @joeyearsley mentioned, for me it had to do with workers > 1. I am actually running a custom version of Keras where the evaluate_generator call inside fit_generator uses workers=1. That way I can train with multiple workers but predict/evaluate with a single worker.
@fchollet please make evaluate_generator and predict_generator always use workers=1, or remove the parameter until it is fixed.
Make sure:
1) shuffle=False
2) pickle_safe=True
3) workers=1
Let me know if that gives you consistent results.
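For concreteness, here is roughly how those settings would look applied to the evaluate/predict calls from the first post (a sketch against the Keras 2.0.x argument names; the reset() call assumes the usual flow_from_directory iterator):

```python
# Sketch only: the checklist settings applied to the calls from the first post.
# shuffle=False is already set on validation_generator there.
steps = nb_validation_samples // batch_size

score = model.evaluate_generator(validation_generator, steps,
                                 workers=1, pickle_safe=True)
validation_generator.reset()   # start again from the first file before predicting
scores = model.predict_generator(validation_generator, steps,
                                 workers=1, pickle_safe=True)
```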
Thanks @abnera for your answer. I investigated a little further and found that running evaluate_generator before predict_generator without setting pickle_safe=True messes up the predictions of the latter, even without multiprocessing:
```python
score = model.evaluate_generator(validation_generator, nb_validation_samples / batch_size, pickle_safe=False)
scores = model.predict_generator(validation_generator, nb_validation_samples / batch_size)
```
As for the parameters, when setting workers > 1, shouldn't pickle_safe automatically be set to True for fit_generator as well?
@skoch9 no. pickle_safe is False by default in all Keras generator methods. The only difference between the two modes is that False uses multithreading and True uses multiprocessing.
Sure, but it doesn't make sense to allow invalid parameter configurations. And I would also consider it problematic (i.e. a bug) that running evaluate_generator before predict_generator changes the prediction results.
Hello,
I made a test based on the mnist_cnn.py example. I just replaced model.fit with
```python
gen = matrix_generator(x_train, y_train, batch_size=batch_size)
val_steps = len(y_test) / batch_size
model.fit_generator(gen, epochs=epochs,
                    steps_per_epoch=steps_per_epoch,
                    validation_data=(x_test, y_test))
```
with the generator defined to take sub-parts of the input matrix:
```python
def matrix_generator(x, y, batch_size=1, validation=False):
    import numpy as np
    item = 0
    tot_item, dimx, dimy, dimz = x.shape
    X = np.zeros((batch_size, dimx, dimy, dimz), dtype=np.float32)
    if len(y.shape) == 1:
        Y = np.zeros((batch_size))
    else:
        Y = np.zeros((batch_size, y.shape[1]))
    while True:
        for bs in range(batch_size):
            ximg = x[item, :, :, :]
            yout = y[item, :]
            yout = yout.reshape(1, yout.shape[0])
            Y[bs, :] = yout
            X[bs, :, :, :] = ximg
            item += 1
            if item > tot_item - 1:
                item = 0
        if validation:
            yield X
        else:
            yield X, Y
```
Now if I test the model:
```
model.evaluate(x_test, y_test, verbose=0)
Out[18]: [0.16519621040821075, 0.96389999999999998]
```
but
```
gen_test = matrix_generator(x_test, y_test, batch_size=batch_size)
model.evaluate_generator(gen_test, len(y_test) / batch_size, workers=1)
Out[17]: [0.31920816290837067, 0.93760016025641024]
```
Same learned parameters, same test data, but different results. What should I do?
Many thanks for your help.
To follow up: I get the correct result only if I set max_q_size=1. If I only set workers=1 it does not work either (and gives different results each time).
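A possible explanation for the max_q_size dependence (my own reading of the generator above, not something confirmed in this thread): matrix_generator re-uses the same X and Y buffers and refills them in place, so with max_q_size > 1 batches still sitting in the prefetch queue can be overwritten before the model consumes them. Yielding fresh arrays per batch avoids relying on max_q_size=1; a rough sketch:

```python
import numpy as np

def matrix_generator_copy(x, y, batch_size=1):
    # Same wrap-around logic as matrix_generator above, but yields new arrays
    # each time instead of refilling shared buffers, so queued batches stay intact.
    item = 0
    tot_item = x.shape[0]
    while True:
        idx = [(item + i) % tot_item for i in range(batch_size)]
        item = (item + batch_size) % tot_item
        yield x[idx].astype(np.float32), np.array(y[idx])
```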
Hi,
I am having a similar problem with a binary classifier that uses 2 outputs. model.evaluate_generator() and model.predict() both suggest that I have 50% accuracy (chance) and a loss of over 1.0, while model.fit_generator() always gives me a loss of under 0.5 and an accuracy of over 80%.
```python
for x in range(10000):
    print("starting epoch {0}".format(x))
    mg_acc = mg_model.evaluate_generator(mg_gen, 2,
                                         max_queue_size=1, workers=1, use_multiprocessing=False)
    mg_model.fit_generator(mg_gen, 2, epochs=1, callbacks=mg_callbacks,
                           max_queue_size=1, workers=1, use_multiprocessing=False, verbose=1)
    print(mg_acc)
    print(mg_model.metrics_names)
```
It does not matter if use_multiprocessing is True or False, or the order of the calls (fit_generator before evaluate_generator or vice-versa), or if one call is commented out.
I know that this code will not result in identical generator output (each call gets the next iteration of the generator), but the output of the generator is similar between iterations, and evaluate_generator is always the one that does poorly, regardless of order.
EDIT: SOLVED: My accuracy difference was eliminated when I removed all batch normalization layers (also removed dropout layers, but I don't think that was the cause)
I too have a big difference between the reported fit_generator results and the later evaluate_generator results. I've looked into this a bit and found the following:

- When I use evaluate (without any generators), the output is exactly the same as evaluate_generator without shuffling.
- When I use model.predict and compute the metrics manually, I get the same measurements reported by fit_generator (and the same results as evaluate_generator with shuffling).

Can anyone verify that any of the above happens to them as well?
A note: I use a very simple model, just one dense layer, with no dropout or batch normalization to create any doubts as mentioned by @jeremydr2.
@GalAvineri : Very interesting... I just came across your post, after posting my own frustrations: https://github.com/fchollet/keras/issues/5818. I will try to add shuffling to my evaluate_generator, to see if this makes any difference, and if I get the same results as you do, although I cannot see any sense in shuffling during model evaluation...
I had a similar problem using fit_generator with multiprocessing under Linux:
During training the loss was falling rapidly with implausibly high accuracies. However, these could in no way be reproduced when I tested the model on the same data. Even more strangely, when I turned off multiprocessing, accuracies were suddenly realistic again.
Turns out the problem was a combination of OS behavior and my data generator, which was internally doing some shuffling using np.random. Since Linux uses fork(2) to spawn child processes and the initialization of the data generator was happening outside of the MP part, all workers were using the same seed and were generating identical batches. Note that this wasn't a problem under Windows, since there each child process is spun up independently [1]. The resolution was to seed np.random in __getitem__(self, idx).
Maybe this saves time for some of you.
[1] http://rhodesmill.org/brandon/2010/python-multiprocessing-linux-windows/
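For anyone wondering what that looks like in practice, here is a minimal sketch (my own example, not the code from the comment above) of a keras.utils.Sequence that re-seeds np.random inside __getitem__ so forked workers stop producing identical batches:

```python
import os
import numpy as np
from keras.utils import Sequence

class RandomBatchSequence(Sequence):
    """Sketch: re-seed np.random per __getitem__ call so that forked worker
    processes on Linux do not all inherit the same RNG state."""

    def __init__(self, x, y, batch_size):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        # Mix the process id into the seed so each worker draws different batches.
        np.random.seed((os.getpid() + idx) % (2 ** 32))
        indices = np.random.randint(0, len(self.x), self.batch_size)
        return self.x[indices], self.y[indices]
```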
In validation_generator = test_datagen.flow_from_directory(...), shuffle=True is set by default. I found that predict_generator(validation_generator) returns its results in the shuffled order, while validation_generator.classes and validation_generator.filenames are not shuffled, so an accuracy calculated from the predict_generator output may be wrong.
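A minimal sketch of how to compute the accuracy with this in mind (names follow the original post; the key point is shuffle=False so the prediction order matches .classes and .filenames):

```python
import numpy as np

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary',
    shuffle=False)  # keep file order fixed so predictions line up with .classes

steps = int(np.ceil(validation_generator.samples / float(batch_size)))
probs = model.predict_generator(validation_generator, steps)
preds = (probs[:, 0] > 0.5).astype(int)
print('accuracy:', np.mean(preds == validation_generator.classes))
```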
I also ran into a similar issue where my fit_generator gives an accuracy and loss of 98% and 0.08. evaluate_generator gives the same accuracy if I use rescale=1. / 255, but if I don't, I get an accuracy and loss of 50% and 7.9 respectively. predict_generator always gives only 50% accuracy whether I use rescale=1. / 255 or not. What should I do?
@jeremydr2 When you removed the batch normalization and dropout what was your accuracy and loss? Was it still 80% and 0.5? Because for me it fell to 50%
@GalAvineri I have a similar issue, but I do not think the issue is about shuffling; I think it is because of the rescale=1. / 255 in ImageDataGenerator.
This issue is reproduced regularly while using fit_generator / evaluate_generator, and it seems pretty critical since it makes fit_generator output during training completely useless.
My problem was using loss='binary_crossentropy' instead of loss='categorical_crossentropy'. This caused my accuracy to be 96% before I even started training on my 15 different classes.
Just noting that for future googlers; it might not be relevant for this thread topic specifically. It might be worth warning against this during evaluation in Keras when a model has 2+ outputs and the accuracy metric is binary.
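A tiny illustration of why that happens (my own example, assuming 15 one-hot classes): with loss='binary_crossentropy' Keras picks binary accuracy, which scores all 15 output units independently, so a model that outputs ~0 everywhere already matches the 14 zeros in every one-hot label.

```python
import numpy as np

y_true = np.eye(15)[np.random.randint(0, 15, size=1000)]  # one-hot labels, 15 classes
y_pred = np.zeros_like(y_true)                            # an "untrained" model's output

binary_acc = np.mean((y_pred > 0.5) == y_true)            # what binary accuracy measures
categorical_acc = np.mean(y_pred.argmax(1) == y_true.argmax(1))
print(binary_acc, categorical_acc)                        # roughly 0.93 vs 0.07
```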
Regarding the earlier comment about fork and np.random: I'm not sure I understand your solution. When you say "The resolution was to seed np.random in __getitem__(self, idx)", could you explain how you did this a little more thoroughly? How do you seed np.random in __getitem__(self, idx)?
Hey guys, I'm having similar issues with predict_generator.
When I'm training I get around 92% train acc and 80% val_acc, but when I make predictions and put them through a confusion matrix the accuracy drops to 50%. Any updates on this?
Hi, same here - manually computed accuracy on this Kaggle competition https://www.kaggle.com/c/aerial-cactus-identification gives me 50-65% accuracy (binary classification problem) while predict_generator gives roughly 95%. I tried the drop_duplicates=True, seed=2019, pickle_safe=True, workers=1 args, and I'm using the correct loss='binary_crossentropy' for the model. I'm happy to give full details but the overall picture seems to be the same as above.
```python
truth_generator = datagen.flow_from_dataframe(dataframe=df_truth, directory="test", x_col="id", y_col="has_cactus", class_mode="binary", target_size=(32,32), batch_size=200, drop_duplicates=True, seed=2019, pickle_safe=True, workers=1)
predictions = model.predict_generator(truth_generator, steps=20)
```
You need to add shuffle=False to flow_from_dataframe.
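Something like this (a minimal version of the call from the previous post, with shuffle=False added and only the flow_from_dataframe arguments kept):

```python
truth_generator = datagen.flow_from_dataframe(
    dataframe=df_truth, directory="test", x_col="id", y_col="has_cactus",
    class_mode="binary", target_size=(32, 32), batch_size=200,
    drop_duplicates=True, seed=2019,
    shuffle=False)  # keep row order so predictions line up with df_truth
predictions = model.predict_generator(truth_generator, steps=20)
```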
Hi,
I encountered the same issue recently, and actually the solution is quite simple.
You use validation_generator two times in a row, and I imagine your number of samples isn't exactly divisible by your batch size. Hence, your generator has a shift in its indices after you use it in model.evaluate_generator. So when you call it again, the generator won't yield the samples in the order you expect.
So you should create a second generator to use in model.predict_generator, or only evaluate your model via evaluate_generator or predict_generator:
```python
validation_generator2 = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='binary', shuffle=False)

score = model.evaluate_generator(validation_generator, nb_validation_samples / batch_size, workers=12)
scores = model.predict_generator(validation_generator2, nb_validation_samples / batch_size, workers=12)
```
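One way to see the shift described above (a sketch that relies on the iterator's internal batch_index attribute; reset() is an alternative to building a second generator):

```python
print(validation_generator.batch_index)   # 0 on a fresh generator
model.evaluate_generator(validation_generator, nb_validation_samples // batch_size)
print(validation_generator.batch_index)   # usually non-zero: partial batches and
                                          # queue prefetching leave it mid-epoch
validation_generator.reset()              # rewind instead of creating a second generator
print(validation_generator.batch_index)   # back to 0
```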
I am experiencing a similar problem: my model is trained using fit_generator, saved using model.save, loaded using load_model and evaluated using evaluate_generator, but its accuracy is similar to the untrained one. However, model.predict (without a generator) works adequately well. Using keras.__version__ = 2.2.4.
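A cross-check I would run in that situation (a sketch; x_val, y_val and 'model.h5' are placeholders for your own validation arrays and saved file): if the accuracy computed from model.predict matches training but evaluate_generator does not, the generator state/ordering is the likely culprit rather than the saved weights.

```python
import numpy as np
from keras.models import load_model

model = load_model('model.h5')                 # placeholder file name
probs = model.predict(x_val, batch_size=32)    # x_val: validation images as an array
manual_acc = np.mean(probs.argmax(axis=1) == y_val.argmax(axis=1))  # y_val: one-hot labels
print('accuracy from predict:', manual_acc)
```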
This still seems to be a problem. One simple fix was not to use multiprocessing.
I have encountered this problem today, and found the solution: set shuffle=False in the generator you pass to the predict_generator() function.
```python
valid_generator = datagen.flow_from_dataframe(
    dataframe=train_df,
    directory="../images/train/",
    x_col="id",
    y_col="label_2",
    subset="validation",
    batch_size=batch_size,
    seed=42,
    shuffle=False,
    class_mode="categorical",
    classes=classes,
    target_size=(input_shape, input_shape))

step_size_valid = np.ceil(valid_generator.n / valid_generator.batch_size)
model.evaluate_generator(generator=valid_generator, steps=step_size_valid)
...
valid_generator.reset()
model.predict_generator(valid_generator, step_size_valid)
```
For me, predict_generator on a custom data generator also produces a different result compared to evaluate_generator and the verbose output from fit_generator. I didn't shuffle the index in predict_generator.
Same issue. Totally confused from the answers above. Any simple solution?
Same issue. Any simple solution from the official authors yet?
Similar observations to @GalAvineri: when setting shuffle=False, my evaluate_generator accuracy goes up to 92%, which is great but unrealistic. Setting it to True yields much more realistic results. I would appreciate some official guidelines on this - there are others who argue otherwise (@lipinski), which makes it incredibly confusing.
Same issue.
OK, I solved the issue.
My code previously:
```python
scores_evaluation = model.evaluate_generator(test_generator.flow(X_test, Y_test, batch_size=32, shuffle=False), len(Y_test)/32)
scores_prediction = model.predict_generator(test_generator.flow(X_test, batch_size=32, shuffle=False), len(Y_test)/32)
```
However, in flow() shuffle defaults to True, so I added shuffle=False to each flow() call here. Then the problem was solved.
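As a sanity check (a sketch assuming a single sigmoid output and 0/1 labels in Y_test): with shuffle=False both calls see the samples in the same fixed order, so accuracy recomputed from the predictions should now match what evaluate_generator reports.

```python
import numpy as np

steps = int(np.ceil(len(Y_test) / 32.0))
probs = model.predict_generator(
    test_generator.flow(X_test, batch_size=32, shuffle=False), steps)
manual_acc = np.mean((probs[:, 0] > 0.5).astype(int) == Y_test)
print(manual_acc, scores_evaluation[1])   # the two numbers should be close
```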