Keras: In what order does the 'flow_from_directory' function in Keras take the images?

Created on 23 Jul 2016 · 26 Comments · Source: keras-team/keras

I'm trying to classify images into 10 classes. To get probabilities for the images, I'm using the model.predict_generator() function in Keras. This returns only the prediction values and not the corresponding sample ID (in this case, the image file name).

from keras.preprocessing.image import ImageDataGenerator

test_datagen = ImageDataGenerator(rescale=1./255)

validation_generator = test_datagen.flow_from_directory(
        '/path/',
        target_size=(256, 256),
        batch_size=32,
        class_mode='categorical')

predictions = model.predict_generator(validation_generator, val_samples=10000)

How do I find the corresponding image name/ID for each prediction?
(OR)
In what order does '.flow_from_directory' read the samples?

Do the 'batch_size' and 'val_samples' arguments affect the order of the predictions?

All 26 comments

Try validation_generator.class_indices and validation_generator.classes. pprint them and see how they are useful to you.

@Vivek-B Were you able to resolve the issue? I am also facing a similar problem.
@marcj Do those functions exist in Keras? I tried searching for those, but did not find any matching search results.

@raghuramdr Yes. marcj's suggestion was useful.
Type 'validation_generator.' and then press the Tab key. You'll get all the available functions.

@Vivek-B Thanks for the answer. How do you find the order in which flow_from_directory takes the images? I am asking because the accuracy computed from predict_generator and the accuracy obtained from a confusion matrix do not match.

@raghuramdr I suppose the order is the same as in the generator.filenames list, because that list is used for iteration in the next() method.

Exactly. As @Ershov-Alexander said, generator.filenames solves this problem!

I wish there were an example in the documentation of how to get test data into a CSV file.
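A minimal sketch of what such an example might look like, assuming the predictions come back in the same order as generator.filenames (that is, shuffle=False and a freshly created or reset generator). The function name write_predictions_csv, the column names, and the sample data are hypothetical and not part of Keras; plain lists stand in for generator.filenames and the predict_generator output.

```python
import csv
import io


def write_predictions_csv(filenames, predictions, out):
    """Write one row per image: the file name followed by its class probabilities."""
    if len(filenames) != len(predictions):
        raise ValueError("filenames and predictions must have the same length")
    writer = csv.writer(out)
    num_classes = len(predictions[0]) if len(predictions) > 0 else 0
    # Hypothetical header: one probability column per class index.
    writer.writerow(["filename"] + ["class_%d" % i for i in range(num_classes)])
    for name, probs in zip(filenames, predictions):
        writer.writerow([name] + [float(p) for p in probs])


# Hypothetical stand-ins for generator.filenames and the model's predictions.
filenames = ["cats/001.jpg", "dogs/002.jpg"]
predictions = [[0.9, 0.1], [0.2, 0.8]]

# An in-memory buffer keeps the sketch self-contained; for a real file,
# open("predictions.csv", "w", newline="") could be passed instead.
buffer = io.StringIO()
write_predictions_csv(filenames, predictions, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The length check is there because a mismatch between the number of file names and the number of predictions is usually the first sign that the generator did not cover the dataset exactly once.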

@pengpaiSH @Vivek-B just to confirm: doing

predictions = model.predict_generator(validation_generator, val_samples=total_samples)

will go through the whole dataset once and only once (assuming there are total_samples images in the folder), and the 'real class' of each prediction is the corresponding entry of validation_generator.classes.

I want to use this information to build a confusion matrix.
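As a rough sketch of that idea, assuming the predictions really are in the same order as validation_generator.classes (shuffle=False, one full pass over the data): take the index of the largest probability in each prediction row and count (true class, predicted class) pairs. The helper names and the sample data below are hypothetical; plain lists stand in for validation_generator.classes and the predict_generator output.

```python
def argmax(row):
    """Index of the largest value in a row of class probabilities."""
    return max(range(len(row)), key=lambda i: row[i])


def confusion_matrix(true_classes, predicted_classes, num_classes):
    """Return matrix[t][p] = number of samples of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_classes, predicted_classes):
        matrix[t][p] += 1
    return matrix


# Hypothetical stand-ins: true_classes plays the role of
# validation_generator.classes, predictions the role of the model output.
true_classes = [0, 0, 1, 1, 2]
predictions = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.9, 0.0],
    [0.3, 0.6, 0.1],
    [0.1, 0.2, 0.7],
]

predicted_classes = [argmax(row) for row in predictions]
matrix = confusion_matrix(true_classes, predicted_classes, num_classes=3)
for row in matrix:
    print(row)
```

If the diagonal of such a matrix is far lower than the accuracy reported during training, an ordering mismatch between predictions and labels is one possible explanation worth checking.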

I have exactly the same problem: I'm trying to classify images into N classes. The model is well trained and has ~97% accuracy.

My generator.class_indices are:
{'classXX': 4, 'classXX': 6, 'classXX': 8, 'classXX': 3, 'classXX': 1, 'classXX': 7, 'classXX': 5, 'classXX': 2, 'classXX': 0}
(the classXX names are all different from each other)

But when I predict a single image (even one from the training or validation set), my predictions do not correspond to the classes in class_indices: model.predict(x) returns values that do not correlate with the class_indices above.
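For reference, a small sketch of how an output row can be mapped back to a class name: class_indices maps names to indices, so it has to be inverted before looking up the position of the largest probability. The helper name and the sample values are hypothetical. This does not explain the mismatch by itself; one possibility to rule out (an assumption, not something confirmed in this thread) is that the single image was not preprocessed the same way as the generator's images, for example a missing rescale=1./255.

```python
def predicted_class_name(class_indices, probabilities):
    """Map one row of class probabilities to the name of the most likely class."""
    # class_indices maps name -> index; invert it to index -> name.
    index_to_name = {index: name for name, index in class_indices.items()}
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return index_to_name[best_index]


# Hypothetical stand-ins for generator.class_indices and one row of model.predict(x).
class_indices = {"cats": 0, "dogs": 1, "horses": 2}
probabilities = [0.05, 0.15, 0.80]

name = predicted_class_name(class_indices, probabilities)
print(name)
```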

Something that helped me was to use a new generator each time. So doing

gen = image.ImageDataGenerator(...).flow_from_directory(..., shuffle=False)
preds = model.predict_generator(gen, len(gen.filenames))

works as expected, but if I then use the same gen again for some other task, it seems to shuffle the images (even with shuffle=False).

@hdmetor Oh, thanks for the information!
This helps me a lot.
I always got different results from evaluate_generator and predict_generator.

Now I create the image data generator again before using predict_generator, and the two finally agree (not exactly, but the results make more sense).

Guys, I have a question about how validation_generator creates the labels. What is actually happening under the hood?

gen = image.ImageDataGenerator(...).flow_from_directory(..., shuffle=False)
preds = model.predict_generator(gen, len(gen.filenames))

This worked for me. I set up a test data directory with class folders and the test images in them. However, if I use model.predict on a single image, I get totally different predictions.
Any ideas?

Just another comment, using keras 2.0.6 in a kaggle competition I see the same issue with the predict order.

If I use the code below to generate predictions, I get correct predictions the first time I call it, and apparently shuffled predictions on the second call:

test_gen = validgen.flow_from_directory(
        test_data_dir,
        target_size=(img_height, img_width),
        batch_size=1,
        class_mode='binary',
        seed=0,
        shuffle=False)

preds = model_final.predict_generator(test_gen, 1531)
print(preds[0:10])
print(test_gen.filenames[0:10])

pred1 = model_final.predict_generator(test_gen, 1531)
print(pred1[0:10])
print(test_gen.filenames[0:10])

Found 1531 images belonging to 1 classes.
[[ 1.00000000e+00]
[ 1.81016767e-05]
[ 9.99988794e-01]
[ 8.15555453e-03]
[ 3.15029087e-04]
[ 6.08354285e-02]
[ 5.00569877e-04]
[ 6.83458569e-03]
[ 9.88739491e-01]
[ 2.21080336e-04]]
['x/1305.jpg', 'x/570.jpg', 'x/508.jpg', 'x/1076.jpg', 'x/624.jpg', 'x/94.jpg', 'x/1128.jpg', 'x/137.jpg', 'x/795.jpg', 'x/555.jpg']

[[ 1.81016931e-05]
[ 9.99988794e-01]
[ 8.15555453e-03]
[ 3.15029087e-04]
[ 6.08354136e-02]
[ 5.00570342e-04]
[ 6.83457917e-03]
[ 9.88739491e-01]
[ 2.21080118e-04]
[ 1.81115605e-02]]
['x/1305.jpg', 'x/570.jpg', 'x/508.jpg', 'x/1076.jpg', 'x/624.jpg', 'x/94.jpg', 'x/1128.jpg', 'x/137.jpg', 'x/795.jpg', 'x/555.jpg']

@brad0taylor It seems generator.filenames is always the same, but the generator's output is not always in that order. It's so annoying. Is there an explanation? @fchollet

Another thing I noticed: your pred1 array looks like the preds array shifted by one element.
Maybe that means something?

Going by @chenchennn's comment, maybe you should create the generator test_gen again before using it in another predict_generator call.

Has anyone found a usable solution to this? At least I know that I am not crazy :D
gen = image.ImageDataGenerator(...).flow_from_directory(..., shuffle=False) helped me, thanks for posting @hdmetor

No problem @XRarach
I think there is still no official support for that. Another thing I did in the past (which is NOT efficient) is to grab the file paths from the generator and then pass each image through the network individually. Obviously that will work only for reasonably small datasets.

About this issue, what about overwriting the "filenames" attribute right after building the batches?
From this:

https://github.com/fchollet/keras/blob/4a28dc5debcd0bd790ef29d202342184fde5a1f4/keras/preprocessing/image.py#L1101-L1110

to something like this

        filenames = []
        for i, j in enumerate(index_array):
            fname = self.filenames[j]
            filenames.append(fname)
            img = load_img(os.path.join(self.directory, fname),
                           grayscale=grayscale,
                           target_size=self.target_size,
                           interpolation=self.interpolation)
            x = img_to_array(img, data_format=self.data_format)
            x = self.image_data_generator.random_transform(x)
            x = self.image_data_generator.standardize(x)
            batch_x[i] = x
        self.filenames = filenames

What do you think?

This thread has several topics, but I am having a similar problem as some of you.

TLDR: Simply call datagen.reset() before you call model.predict_generator() to get the same order as datagen.filenames and datagen.classes.

Explanation: I am using flow_from_directory() on my data-generator and I need to know which image is associated with each prediction. The problem is that the data-generator will output batches indefinitely, so even if it is not shuffled, it might start at a different position in the dataset each time we use the generator, thus giving us seemingly different outputs. You would need to match the number of iterations and the batch size exactly to the dataset size for it to restart at the beginning of the dataset on each call to predict_generator(). You can see the internal batch counter using datagen.batch_index.

Somebody suggested creating a new data-generator before each call to predict_generator(). That should work, but it is a workaround, and there is a better and simpler way: call reset() on the generator. If you have set shuffle=False, it should then start over from the beginning of the dataset and give the exact same output each time, so that the ordering matches datagen.filenames and datagen.classes. This solved the problem for me.
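The effect described above can be illustrated with a toy model. This is not Keras code: ToyBatchIterator, next_batch, and take are hypothetical names, and the class only mimics an endless, non-shuffling iterator that remembers its position between calls.

```python
class ToyBatchIterator:
    """Toy stand-in for a non-shuffling directory iterator (not Keras code)."""

    def __init__(self, filenames, batch_size):
        self.filenames = list(filenames)
        self.batch_size = batch_size
        self.position = 0  # index of the next sample to yield

    def reset(self):
        """Start over from the beginning of the dataset."""
        self.position = 0

    def next_batch(self):
        # Wrap around at the end of the dataset, so batches never run out.
        if self.position >= len(self.filenames):
            self.position = 0
        batch = self.filenames[self.position:self.position + self.batch_size]
        self.position += len(batch)
        return batch


def take(iterator, num_batches):
    """Collect the samples from a fixed number of batches, like one predict call."""
    samples = []
    for _ in range(num_batches):
        samples.extend(iterator.next_batch())
    return samples


gen = ToyBatchIterator(["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"], batch_size=2)

first = take(gen, 2)   # stops partway through the dataset
second = take(gen, 2)  # continues from where the first call stopped
gen.reset()
third = take(gen, 2)   # starts from the beginning again

print(first)
print(second)
print(third)
```

In this toy, the second call no longer lines up with the filenames list because the iterator kept its position, while the call after reset() lines up again. Whether the real generator ends a predict call exactly at a dataset boundary depends on the step count, batch size, and Keras version, which is why resetting first is the safer habit.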

The last suggestion about updating the list of filenames would be a major side-effect and it should probably be avoided. If you need something like that and datagen.reset() is not enough for you, then add another variable to the datagen-object, but don't overwrite the existing list of filenames.

Regarding the Keras 2.0.6 comment above (the one with the test_gen code and its output): why does it print "Found 1531 images belonging to 1 classes."?

> Try validation_generator.class_indices and validation_generator.classes. pprint them and see how they are useful to you.

Could you please link an example of a classification problem with more than two classes where these are used?

> @raghuramdr Yes. marcj's suggestion was useful.
> Type 'validation_generator.' and then press the Tab key. You'll get all the available functions.

Could you please link an example of a classification problem with more than two classes where these are used, together with the dataset? I am trying to learn classification with more than two classes. I have a dataset with 100 classes, each containing 10 images, and I need to know how I can apply a CNN to it.

> Try validation_generator.class_indices and validation_generator.classes. pprint them and see how they are useful to you.

It worked for me. I wanted to see which class was assigned to which folder, like this:
print(validation_generator.class_indices)

The following can be done (for 2 classes) with shuffle=False and batch_size=32:

import numpy as np

# training_set is the iterator returned by flow_from_directory(..., shuffle=False, batch_size=32)
print("Number of Classes: ", training_set.num_classes)
print(training_set.class_indices)

# index of the first cat and of the first dog in training_set.labels
cats_indx = np.where(training_set.labels == training_set.class_indices['cats'])[0][0]
dogs_indx = np.where(training_set.labels == training_set.class_indices['dogs'])[0][0]

print("First Index of Cat: ", cats_indx)
print("First Index of Dog: ", dogs_indx)

# print("This is the 6th (indx = 5) batch: ", training_set[5][1])

# finding the actual categorical label of our cats
cat_batch_num = cats_indx // 32
relative_index_of_first_cat = cats_indx % 32  # position of the sample inside its batch
cat_label = training_set[cat_batch_num][1][relative_index_of_first_cat]

# finding the actual categorical label of our dogs
dog_batch_num = dogs_indx // 32
relative_index_of_first_dog = dogs_indx % 32  # position of the sample inside its batch
dog_label = training_set[dog_batch_num][1][relative_index_of_first_dog]

print("Cats = ", cat_label)
print("Dogs = ", dog_label)