Keras: Why are the return values of predict and predict_generator different?

Created on 15 Aug 2016 · 19 comments · Source: keras-team/keras

Hello, I've recently been using ImageDataGenerator so I can work with larger datasets. I am trying to replicate the same example using data augmentation: I get the same accuracy for the two examples, but the return values of model.predict and model.predict_generator are completely different. Does anyone know why this happens? I know they can differ a little, but in this case they are completely different. Below are the two examples with the same architecture.

With data augmentation:

    model.compile(loss="categorical_crossentropy", optimizer="Adadelta", metrics=['accuracy'])
    train_datagen = ImageDataGenerator(rescale=1./255)

    test_datagen = ImageDataGenerator(rescale=1./255)

    train_generator = train_datagen.flow_from_directory(
        trainPath,
        target_size=(blockWidth, blockHeight),
        color_mode='grayscale',
        batch_size=batchSize,
        class_mode='categorical',
        shuffle=True)

    validation_generator = test_datagen.flow_from_directory(
        testPath,
        target_size=(blockWidth, blockHeight),
        color_mode='grayscale',
        batch_size=batchSize,
        class_mode='categorical',
        shuffle=False)

    model.fit_generator(
        train_generator,
        samples_per_epoch=10500,
        nb_epoch=epoch,
        validation_data=validation_generator,
        nb_val_samples=5250)

    scoreSeg = model.evaluate_generator(validation_generator,5250)
    print("Accuracy = ",scoreSeg[1])
    predict = model.predict_generator(validation_generator,5250)

Without data augmentation:

def loadDatabase():

    numSamplesTrain = float(numImgClass*(float(train)/100))
    numSamplesTrain = round(numSamplesTrain)

    dataTrain = []
    labelTrain = []
    dataTest = []
    labelTest = []
    filesTest = []
    filesCount = 0
    patchesCount = 0

    for c in range(1,classes+1):
        filesTest.append([])

        for s in range(1,numImgClass+1):

            if(s < numSamplesTrain+1):
                folderTrainTest = 'Treino/'
            else:
                folderTrainTest = 'Teste/'

            for b in range(1,numblock+1):
                nameImg = preName + str(c).zfill(5) + sep + str(s) + sep + str(b)
                folderClass = preName + str(c).zfill(5) + '/'
                fullPathImg = pathImages + folderClass + folderTrainTest + nameImg + imagesFormat
                image = plt.imread(fullPathImg)

                image = image[np.newaxis]

                if(folderTrainTest == 'Treino/'):
                    dataTrain.append(image)
                    labelTrain.append(c-1)
                else:
                    dataTest.append(image)
                    labelTest.append(c-1)
                    filesTest[filesCount].append(patchesCount)
                    patchesCount += 1
        filesCount+=1

    #train
    dataTrain = np.array(dataTrain)
    labelTrain = np.array(labelTrain)

    LUT = np.arange(len(dataTrain), dtype=int)
    random.shuffle(LUT)
    randomDataTrain = dataTrain[LUT]
    randomLabelTrain = labelTrain[LUT]

    X_Train = randomDataTrain.astype("float32")/255
    Y_Train = np_utils.to_categorical(randomLabelTrain, classes)

    #test       
    dataTest = np.array(dataTest)
    labelTest = np.array(labelTest)
    filesTest = np.array(filesTest)

    X_Test = dataTest.astype("float32")/255
    Y_Test = np_utils.to_categorical(labelTest, classes)

    print("Number samples train: ",numSamplesTrain*numblock*classes)
    print("Number samples test: ",(numImgClass-numSamplesTrain)*numblock*classes)       

    return X_Train, Y_Train, X_Test, Y_Test, filesTest

def main():
    X_Train, Y_Train, X_Test, Y_Test, filesTest = loadDatabase()

    model = buildModel()

    model.compile(loss="categorical_crossentropy", optimizer="Adadelta",metrics=['accuracy'])
    model.fit(X_Train, Y_Train, nb_epoch=epoch, batch_size=batchSize, verbose=1, validation_data=(X_Test, Y_Test))
    scoreSeg = model.evaluate(X_Test, Y_Test, verbose=1)
    predict = model.predict(X_Test, verbose=1)

All 19 comments

Hi, did you get any solution for this?

No. To work around this I implemented my own routine to load batches of images into memory, so I am not using data augmentation.

I have the same problem: predict and predict_generator give very different results. I'm using keras.applications.vgg16. The predict results are correct, whereas the predict_generator results are totally different and wrong. I debugged the input coming from ImageDataGenerator and it looks OK.

Problem found: Must not use rescale=1./255 in ImageDataGenerator.

@sysid Could you explain why we shouldn't use rescale? How else will the model know to rescale the test data? Furthermore, this doesn't really explain the difference between the two functions. Please elaborate if you remember.

@Aakash282 VGG was trained on demeaned but not rescaled data.

OK, sure. But I think this problem extends past that one model/dataset. I trained a CNN using a generator and a validation generator. So far, the only way to get remotely consistent results is to use a generator for my test data as well, but I want to actually know which results map to which images. I tried using batches but that didn't really work either.

@guirighetto: I also had the same problem. After some trials I found that the batch size should be a number that evenly divides the size of both the training data and the validation data. For example, with 1000 training samples and 100 validation samples, a batch size of 10 is allowed. If you use a batch size of 15, only 990 training samples and 90 validation samples will be used, which are the largest numbers divisible by 15.

@ardianumam You can also increment the number of steps by one and select the first n elements, where n is the number of samples.

@sakvaua: Thanks! It works. In my case I have 100 validation images and batch_size = 15, so I call bottleneck_features_validation = model.predict_generator(generator, (validation_samples // batch_size) + 1), where validation_samples // batch_size = 100 // 15 = 6. Without the +1 I only get 6 × 15 = 90 samples. With the +1 I request 7 × 15 = 105; since I only have 100 validation images, does it automatically stop at 100 rather than picking another 5 images at random?
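
A minimal sketch of that step calculation, reusing the names from the comment above (generator, validation_samples, batch_size); this is an illustration, not code from the thread. Depending on the Keras version, the final batch of a directory iterator may be partial (giving exactly validation_samples predictions) or the iterator may wrap around to the start, so trimming the result is a safe guard either way:

import math

# one extra (partial) step covers the leftover samples
steps = int(math.ceil(validation_samples / float(batch_size)))   # e.g. ceil(100 / 15) = 7
bottleneck_features_validation = model.predict_generator(generator, steps)
# keep only the first validation_samples rows in case the iterator wrapped around
bottleneck_features_validation = bottleneck_features_validation[:validation_samples]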

I read over these comments, and I am getting the same issue for vgg16.

The code block uses pre-trained VGG16 to return features for train_img (N=1798).
The output of predict_generator (pred_0) is different from that of predict (pred_1)!

import numpy as np
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import ImageDataGenerator

batch_size = 64

model = VGG16(input_shape=train_img[0].shape, 
              weights='imagenet', 
              include_top=False)

for layer in model.layers:
    layer.trainable = False

gen = ImageDataGenerator()

pred_0 = model.predict_generator(
    gen.flow(train_img, batch_size=batch_size),
    steps=(len(train_img) // batch_size) + 1, # 25
    verbose=1
)
pred_1 = model.predict(train_img, verbose=1)

np.array_equal(pred_0[0], pred_1[0]) # False
pred_0.shape # (1598, 7, 7, 512)
pred_1.shape # (1598, 7, 7, 512)

I'm using Python 3.6 and Keras 2.0.6 with the TF backend.

I had a similar issue, and after some work I found that to get a correct match I had to:

  • remove all transformations (i.e. rescale)
  • set shuffle = False, so that samples come out in the same order as the originals (otherwise the mapping between samples and results is unclear)
  • set batch_size equal to the number of samples, so that exactly the original number of samples is produced

After those changes, the comparison between the "manually" provided samples and the generator predictions matched exactly (see the sketch below).
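
A minimal sketch of those settings, reusing the generator configuration from the original post (testPath, blockWidth, blockHeight); numTestSamples is hypothetical and stands for the size of the test set, and predict_generator is called Keras 2 style with steps (in Keras 1 you would pass the number of samples instead):

from keras.preprocessing.image import ImageDataGenerator

pred_datagen = ImageDataGenerator()              # no transformations, no rescale
pred_generator = pred_datagen.flow_from_directory(
    testPath,
    target_size=(blockWidth, blockHeight),
    color_mode='grayscale',
    class_mode=None,                             # labels are not needed for prediction
    shuffle=False,                               # keep the original file order
    batch_size=numTestSamples)                   # one batch = the whole test set

predictions = model.predict_generator(pred_generator, steps=1)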

Hi All,

I had the same problem, but I solved it! The trick is to find an integer batch_size that divides both nb_train_samples and nb_validation_samples exactly, i.e.,
nb_train_samples % batch_size == 0 and nb_validation_samples % batch_size == 0.
As an example, if:
nb_train_samples = 2400
nb_validation_samples = 1200
then:
batch_size can be 4, 6, ..., 48, ... (any common divisor).

Or, simply use nb_validation_samples // batch_size + 1 steps, i.e., use the following code:

y_pred = model.predict_generator(validation_generator, nb_validation_samples // batch_size + 1, verbose = 1)
y_pred = np.argmax(y_pred, axis = 1)

The outputs from predict() and predict_generator() are actually identical, but they look different because they are labelled differently. You are providing the labels to predict(), whereas predict_generator() infers the labels from the directory structure of the training data (because it uses flow_from_directory() instead of flow()). These can be very different, as I explain in this blog post.

Here is some code to convert the mapping from predict_generator() to the one you supplied to predict() so that you can see that they are identical:

import numpy as np
predictions = model.predict_generator(test_generator)
predictions = np.argmax(predictions, axis=-1) #multiple categories

label_map = (train_generator.class_indices)
label_map = dict((v,k) for k,v in label_map.items()) #flip k,v
predictions = [label_map[k] for k in predictions]

Someone should now close this issue.

If you are getting unexpected results from predict_generator(), look into the following cases:

  1. Make sure that your generator is not augmenting validation/test data when instantiating ImageDataGenerator
  2. Make sure that shuffle is False when calling flow or flow_from_directory for test data
  3. Make sure that flow_from_directory reads the files in the order you expect

Augmenting the data

When you read images using ImageDataGenerator, there are arguments to the class constructor that activate "augmentation" later, when the images are returned as sample points. Here are a bunch of them:

  • rotation_range
  • width_shift_range
  • height_shift_range
  • shear_range
  • zoom_range
  • horizontal_flip

While augmentation is something you want for training data, you don't want it for validation or test data. So make sure you do not set these parameters on the ImageDataGenerator instance that you use to read validation and/or test data (see the sketch below).
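
A minimal sketch of that split, with illustrative (not recommended) augmentation values; the only preprocessing shared by both generators is the rescale:

from keras.preprocessing.image import ImageDataGenerator

# training generator: preprocessing plus augmentation
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    shear_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True)

# validation/test generator: the same preprocessing, no augmentation
test_datagen = ImageDataGenerator(rescale=1./255)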

Shuffling data

Again, this parameter is useful only for training data, not for validation or test data (for validation it does not really matter, but it does no good either). What you need to know is that for training and validation the sample points are accompanied by labels, so when they are shuffled their corresponding labels are shuffled with them and the bond is not broken. But in the case of test data, the relation between sample points and their labels is (usually) based on their order. So if you shuffle the test data, you break that bond (see the sketch below).
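
A minimal sketch of keeping that bond intact, reusing names from the original post (testPath, blockWidth, blockHeight, batchSize) and the test_datagen from the sketch above; with shuffle=False the i-th prediction corresponds to the i-th entry of the generator's filenames list:

test_generator = test_datagen.flow_from_directory(
    testPath,
    target_size=(blockWidth, blockHeight),
    color_mode='grayscale',
    class_mode=None,
    shuffle=False,                  # keep file order so predictions line up with filenames
    batch_size=batchSize)

steps = len(test_generator.filenames) // batchSize + 1
predictions = model.predict_generator(test_generator, steps=steps)

# zip stops at the shorter list, so any wrapped-around extra predictions are ignored
for filename, scores in zip(test_generator.filenames, predictions):
    print(filename, scores.argmax())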

How files are read

This I found through testing and debugging, and it's again related to the order of the test data. Consider that your data points are images stored on disk as files. Usually an order of the files (based on their filenames) is assumed, and the labels are stored (for instance in a CSV file) following exactly that order. But there's a problem here: ordering of filenames (strings in general) can be done in different ways. Consider the following two cases:

  1. 1.jpg
  2. 2.jpg
  3. 10.jpg

And

  1. 1.jpg
  2. 10.jpg
  3. 2.jpg

Your OS's file manager will most likely use the first ordering when you open the containing folder, but the flow_from_directory method will read the files in the second (lexicographic) order. As I'm sure you can see, this leads to the same problem as before: the bond between data points and their labels is broken. The solution is to rename the files so that flow_from_directory reads them as in the first case:

  1. 01.jpg
  2. 02.jpg
  3. 10.jpg

You can do this with a simple piece of code (see the sketch below). You just need to know beforehand how many files you have, so you can left-pad each filename with the right number of zeros.
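
A minimal sketch of such a renaming script; the folder path is hypothetical, and the padding width is derived from the file count:

import os

folder = 'data/test/class_0'                     # hypothetical directory
names = [n for n in os.listdir(folder) if n.endswith('.jpg')]
width = len(str(len(names)))                     # e.g. 3 digits for up to 999 files
for name in names:
    stem, ext = os.path.splitext(name)
    os.rename(os.path.join(folder, name),
              os.path.join(folder, stem.zfill(width) + ext))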

After applying these steps, I managed to get the correct output out of predict_generator.

-- UPDATED
The issue on this page can be separated into two parts with two causes. The first cause is what originally prompted this issue; later, another anomaly with a similar effect but a different cause was observed.
The problem: predict_generator occasionally gives results that conflict with evaluate_generator, with predict, and possibly with other congruency checks we might run, such as manual accuracy checks.

Cause n. 1
Originally the reported issue was recognized as being due to improper sizing of the steps or batch size, so that predict_generator was not collecting the entire sample but was omitting the last, unfilled batch (guirighetto). If N is the number of input images we want to predict, we should set

steps = int(np.ceil(N / float(batch_size)))

or any equivalent calculation, such as N // batch_size + 1 when N is not an exact multiple of batch_size. In any case, we can make a safety check after calling

pred = model.predict_generator(test_iter, verbose=0, steps=steps)
pred.shape
# (N, 1)

If we get a number smaller than N, then we are omitting some images from the sample and the accuracy may be affected.

Cause n. 2
Even after properly setting the batch size and steps, the problem persisted for many.
I must admit that initially I got it wrong as well and my first post was not correct. The error was not easy to find, as many factors contribute to cause n. 2 (I will not split its contributing factors into multiple sub-causes; let's just call it Cause n. 2).

First, the ordering of the data set is important, as highlighted by ziadloo, soumendra and others.
Linked to this, shuffle should be False. There is no point in carefully ordering the dataset in an OS-friendly way and then shuffling it with the iterator!
Calling test_iter.reset() is also wise; we might have run something earlier that left the iterator at a position other than the start.

A safety check on the files ordering is:

test_iter.filenames
['cats_dogs\001.jpg',
'cats_dogs\002.jpg',
'cats_dogs\003.jpg',
...]

If the list of filenames looks disordered or different from what you expect, watch out; to avoid confusion, use robust naming for your files.
While for many the issue then seemed closed, for others it wasn't. There was another trap hiding: rescale=1./255.

It seemed that if we were setting:

test_datagen = ImageDataGenerator(rescale=1./255)

then we were back to the nightmare: nothing matched anymore. Hence so many recommendations of "danger: do not rescale!". Well, after various tests I had concluded the same thing in my original contribution to this post :-)
But I wasn't happy. After all, Aakash282 was right: why not rescale?! It doesn't make any sense; we usually rescale the training data, so we should set the same conditions for testing. Rescaling is not really data augmentation, it is just image pre-processing. If a model was trained on rescaled images, that is its environment, and it should predict on rescaled images. All this is very rational; however, without rescaling there is a miracle: everything fits. Exhausted by so much struggle, many (me as well) thought "OK, it doesn't make sense, but it works, let's move on..."
I initially thought that rescaling images was recommended but at times omitted, so maybe models were not really affected. After some days I ran this simple test:

predict(my_Image) == predict(my_rescaled_Image)
False

Not really a big surprise here, since the input data is different, but seeing it for real makes you focus on the actual issue. I was comparing predict_generator versus predict: predict_generator does much of the preparatory work for us, while with predict we need to load an image, convert it to a numpy array, reshape it, ..., AND rescale it! Back in the code, the rescaling was indeed missing when working with predict, but it was present in predict_generator. After including one line to rescale the images presented to predict: bingo, it worked perfectly, with no need to omit rescaling in predict_generator.

Many other traps may be lurking, but rescaling for sure is not an issue in itself; it only becomes one if you apply it in one pipeline and not in the other that you use for the double check.

Below is the correct code for the predict_generator configuration. Use rescale without fear; it works.

test_datagen = ImageDataGenerator(rescale=1./255)

testdir = 'C:/MyDocuments/'

test_iter = test_datagen.flow_from_directory(
     testdir,
     target_size=(150, 150), shuffle=False,
     batch_size=16,
     class_mode=None)

test_iter.reset()
steps = int(np.ceil(test_iter.samples / 16.0))   # batch_size = 16 as above (see Cause n. 1)
pred = model.predict_generator(test_iter, verbose=0, steps=steps)

Below is the correct code for loading 100 images, preprocessing them, and predicting their labels one by one with predict. st_img_x builds the file name in the friendly order 001.jpg, 002.jpg, ..., 100.jpg.

import numpy as np
from PIL import Image

mylist = []
for i in range(1,101):
     st_img_x = testdir + ('0000' + str(i))[-3:] + '.jpg'
     img_pil_1 = Image.open(st_img_x)
     img_pil_150 = img_pil_1.resize((150,150))
     img_np_150 = np.asarray(img_pil_150)
     img_np_150 = img_np_150.astype('float32')
     img_np_150 = img_np_150/255 # normalized to [0,1]
     img_np_150_rsh = img_np_150.reshape(1,150,150,3)
     mylist.append(model.predict(img_np_150_rsh)[0])
myarr = np.rint(np.array(mylist).flatten()).astype(int)

My missing line was of course img_np_150 = img_np_150/255, which normalizes image values to [0, 1].
I would bet that anyone whose predict_generator and predict results only match when no rescale is set in the iterator is simply not rescaling the images they preprocess for predict.

A small addition:

  1. As @atortoricimontaperto says (thanks!), rescaling can be used with predict_generator, as long as you also apply the rescaling to the inputs you prepare for predict.

  2. In addition: the size of the validation data set does not HAVE to be an exact multiple of the batch size (although that makes things easier). In that case, set shuffle=False and call mygen.reset(), which sets mygen.batch_index = 0, before calling predict_generator(mygen). The generator then starts from index 0 again and iterates alphabetically. Worked for me at least (see the sketch below).

Before that, the results gave me lots of headaches: I had called the generator mygen implicitly beforehand during training. Since the dataset size was not a perfect multiple of batch_size, it seemed to have stopped at an index != 0. When called again, it started there as well, so the predicted labels did not match the true labels at all...
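
A minimal sketch of that reset-before-predict pattern (assuming mygen was created with flow_from_directory and shuffle=False):

import numpy as np

mygen.reset()                                    # back to batch_index = 0
steps = int(np.ceil(mygen.samples / float(mygen.batch_size)))
preds = model.predict_generator(mygen, steps=steps)
preds = preds[:mygen.samples]                    # trim any wrapped-around extras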

@cgebbe hi,
it worked for me.
You cleared up my confusion.
Thank you.

Thanks @ziadloo and @cgebbe, the tips worked for me.
I wrote a small script to rename the files, since my dataset is too large to rename by hand.

The link is below if the function interests someone:

https://github.com/messerzen/Capstone/blob/master/rename_function.py

Don't forget to set shuffle=False, bros!
