Keras: In what order does the 'flow_from_directory' function in Keras take the images?

Created on 23 Jul 2016 · 26 Comments · Source: keras-team/keras

I'm trying to classify images into 10 classes. To get probabilities for the images, I'm using the model.predict_generator() function in Keras. It returns only the prediction values, not the corresponding sample IDs (in this case, the image file names).

test_datagen = ImageDataGenerator(rescale=1./255)

validation_generator = test_datagen.flow_from_directory(
        '/path/',
        target_size=(256, 256),
        batch_size=32,
        class_mode='categorical')

predictions = model.predict_generator(validation_generator, val_samples=10000)

How do I find the corresponding image name/id of the predictions?
(OR)
In what order does '.flow_from_directory' read the samples?

Do the 'batch_size' and 'val_samples' arguments affect the order of the predictions?

Vivek-B

👍25

All 26 comments

Try validation_generator.class_indices and validation_generator.classes. pprint them and see how they are useful to you.

marcj on 23 Jul 2016

👍36 👎14 😕1

@Vivek-B Were you able to resolve the issue? I am also facing a similar problem.
@marcj Do those functions exist in Keras? I tried searching for those, but did not find any matching search results.

raghuramdr on 8 Sep 2016

@raghuramdr Yes. marcj's suggestion was useful.
Type 'validation_generator.' and then press the Tab key. You'll get all the available attributes and methods.

Vivek-B on 8 Sep 2016

👍1

@Vivek-B Thanks for the answer. How do you find the order in which flow_from_directory takes the images? I am asking because the accuracy computed from predict_generator and the accuracy obtained from a confusion matrix do not match.

raghuramdr on 9 Sep 2016

@raghuramdr I suppose the order is the same as in the generator.filenames list, because that list is used for iteration in the next() method.

Ershov-Alexander on 3 Oct 2016

👍19

Exactly. As @Ershov-Alexander said, generator.filenames solves this problem!
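
A minimal sketch of that pairing, assuming the generator was created with shuffle=False and has not been advanced before the prediction call. Plain lists stand in for generator.filenames and for the rows returned by predict_generator; pair_predictions is a hypothetical helper, not part of Keras.

```python
def pair_predictions(filenames, predictions):
    """Pair each filename with the index of its highest-scoring class.

    filenames   -- list of paths, e.g. generator.filenames
    predictions -- one row of class scores per file, in the same order
    """
    if len(filenames) != len(predictions):
        raise ValueError("expected one prediction row per filename")
    paired = []
    for name, row in zip(filenames, predictions):
        scores = list(row)
        # argmax without NumPy: position of the largest score in the row
        best = max(range(len(scores)), key=scores.__getitem__)
        paired.append((name, best))
    return paired


# Made-up values standing in for generator.filenames and the model output.
example_files = ["cats/001.jpg", "cats/002.jpg", "dogs/001.jpg"]
example_preds = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
print(pair_predictions(example_files, example_preds))
```

The length check is there because a mismatch between the number of files and the number of prediction rows usually means the generator did not make exactly one pass.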

pengpaiSH on 29 Nov 2016

👍19

I wish there were an example in the documentation of how to get test-data predictions into a CSV file.
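
Not from the documentation, but a rough sketch of one way to do it with the standard csv module. It assumes the filenames and the prediction rows are in the same order (generator created with shuffle=False); write_predictions_csv and the sample values are made up for illustration.

```python
import csv
import io


def write_predictions_csv(handle, filenames, predictions, class_names):
    """Write one row per file: the filename followed by one score per class."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["filename"] + list(class_names))
    for name, row in zip(filenames, predictions):
        writer.writerow([name] + [f"{score:.4f}" for score in row])


# Made-up values; in practice these would come from the generator and the model.
buffer = io.StringIO()
write_predictions_csv(
    buffer,
    filenames=["test/img_1.jpg", "test/img_2.jpg"],
    predictions=[[0.25, 0.75], [0.5, 0.5]],
    class_names=["c0", "c1"],
)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with open(path, "w", newline="") instead of the StringIO buffer.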

burgalon on 10 Dec 2016

👍4

@pengpaiSH @Vivek-B just to confirm:
Doing
predictions = model.predict_generator(validation_generator, val_samples=total_samples)

will go through the whole dataset once and only once (assuming there are total_samples images in the folder). The 'real class' of each prediction is then the corresponding entry of validation_generator.classes.

I want to use this information to build a confusion matrix.
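
A small sketch of that confusion matrix in plain Python, assuming the predictions really are in the same order as validation_generator.classes (shuffle=False, one full pass). The labels and scores below are invented, and confusion_matrix is a hypothetical helper.

```python
def confusion_matrix(true_classes, predictions, num_classes):
    """Return a num_classes x num_classes table; rows are true, columns predicted."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_idx, row in zip(true_classes, predictions):
        scores = list(row)
        predicted_idx = max(range(len(scores)), key=scores.__getitem__)
        matrix[true_idx][predicted_idx] += 1
    return matrix


# Made-up labels (standing in for generator.classes) and model outputs.
true_classes = [0, 0, 1, 1]
predictions = [[0.9, 0.1], [0.3, 0.7], [0.2, 0.8], [0.4, 0.6]]
matrix = confusion_matrix(true_classes, predictions, num_classes=2)
accuracy = sum(matrix[i][i] for i in range(2)) / len(true_classes)
print(matrix, accuracy)
```

If the accuracy on the diagonal is far below what evaluate_generator reports, the ordering assumption is probably broken.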

hdmetor on 15 Feb 2017

I have exactly the same problem:

I'm trying to classify images into N classes. The model is well trained and has ~97% accuracy.

my generator.class_indices are:
{'classXX': 4, 'classXX': 6, 'classXX': 8, 'classXX': 3, 'classXX': 1, 'classXX': 7, 'classXX': 5, 'classXX': 2, 'classXX': 0}
(the classXX names are all different from each other)

But when I predict a single image (even one from the training or validation set), the predictions do not correspond to the classes in class_indices.
model.predict(x) returns values that do not correlate with the class_indices above.
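
For reference, a sketch of how an output row is usually decoded, assuming the model was trained on labels produced from the same class_indices mapping, so that output column i belongs to the class whose value in class_indices is i. The mapping and scores are invented and decode_prediction is a hypothetical helper. One thing worth checking separately (an assumption, since the preprocessing code is not shown here) is whether the single image gets the same preprocessing as the generator applies, for example the same rescale factor and target size.

```python
def decode_prediction(scores, class_indices):
    """Return the class name whose index matches the largest score."""
    # class_indices maps name -> index, so invert it to look up by index.
    index_to_name = {index: name for name, index in class_indices.items()}
    scores = list(scores)
    best = max(range(len(scores)), key=scores.__getitem__)
    return index_to_name[best]


# Hypothetical mapping in the shape of generator.class_indices.
class_indices = {"bird": 0, "cat": 1, "dog": 2}
print(decode_prediction([0.1, 0.2, 0.7], class_indices))
```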

oleksandr-kovalov on 29 Mar 2017

👍1

Something that helped me was to use a new generator each time. So doing

gen = image.ImageDataGenerator(...).flow_from_directory(..., shuffle=False)
preds = model.predict_generator(gen, len(gen.filenames))

works as expected, but if I then use the same gen again for some other task, it seems to shuffle the images (even with shuffle=False).
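
A side note on the second argument: in Keras 1 it is a sample count (val_samples, as in the original question), while in Keras 2 it counts batches (steps). Under that assumption, one full pass needs the number of samples divided by the batch size, rounded up. steps_for_one_pass is a hypothetical helper that just does this arithmetic.

```python
import math


def steps_for_one_pass(num_samples, batch_size):
    """Number of batches needed to see every sample exactly once."""
    return math.ceil(num_samples / batch_size)


print(steps_for_one_pass(10000, 32))  # last batch is only partly filled
print(steps_for_one_pass(1531, 1))    # batch_size=1: steps equals samples
```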

hdmetor on 30 Mar 2017

👍5

@hdmetor Oh, thanks for the information!
This helps me a lot.
I always got different results from evaluate_generator and predict_generator.

Now I create the image data generator again before calling predict_generator, and the two finally give the same result (not exactly the same, but it makes more sense).

chenchennn on 24 Apr 2017

👍1

Guys, I have a question about how validation_generator creates the labels. What is actually happening under the hood?
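
As far as I understand it, flow_from_directory treats every subdirectory of the target directory as one class, sorts the subdirectory names and numbers them from 0; that mapping is what class_indices exposes, and each image gets the index of the folder it sits in. The sketch below only imitates that mapping with the standard library. infer_class_indices and the folder names are made up, and the order of files inside a class folder is not modeled because it may depend on the Keras version.

```python
import os
import tempfile


def infer_class_indices(directory):
    """Map each subdirectory name to an integer, in sorted order."""
    class_names = sorted(
        entry for entry in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, entry))
    )
    return {name: index for index, name in enumerate(class_names)}


# Build a throwaway directory tree with three class folders.
with tempfile.TemporaryDirectory() as root:
    for name in ["dogs", "cats", "birds"]:
        os.makedirs(os.path.join(root, name))
    mapping = infer_class_indices(root)

print(mapping)
```

A test directory whose images all sit in a single subfolder would therefore be reported as having one class.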

mmnete on 1 Jun 2017

👍1

gen = image.ImageDataGenerator(...).flow_from_directory(..., shuffle=False)
preds = model.predict_generator(gen, len(gen.filenames))

This worked for me. I set up a test data directory with class folders containing the test images. However, if I use model.predict on a single image, I get totally different predictions.
Any ideas?

Flowinger on 18 Jun 2017

Just another comment, using keras 2.0.6 in a kaggle competition I see the same issue with the predict order.

If I use the code below to generate predictions, I get correct predictions the first time I call it, and apparently shuffled predictions on the second call:

test_gen = validgen.flow_from_directory(
        test_data_dir,
        target_size=(img_height, img_width),
        batch_size=1,
        class_mode='binary',
        seed=0,
        shuffle=False)

preds = model_final.predict_generator(test_gen, 1531)
print(preds[0:10])
print(test_gen.filenames[0:10])

pred1 = model_final.predict_generator(test_gen, 1531)
print(pred1[0:10])
print(test_gen.filenames[0:10])

Found 1531 images belonging to 1 classes.
[[ 1.00000000e+00]
[ 1.81016767e-05]
[ 9.99988794e-01]
[ 8.15555453e-03]
[ 3.15029087e-04]
[ 6.08354285e-02]
[ 5.00569877e-04]
[ 6.83458569e-03]
[ 9.88739491e-01]
[ 2.21080336e-04]]
['x/1305.jpg', 'x/570.jpg', 'x/508.jpg', 'x/1076.jpg', 'x/624.jpg', 'x/94.jpg', 'x/1128.jpg', 'x/137.jpg', 'x/795.jpg', 'x/555.jpg']

[[ 1.81016931e-05]
[ 9.99988794e-01]
[ 8.15555453e-03]
[ 3.15029087e-04]
[ 6.08354136e-02]
[ 5.00570342e-04]
[ 6.83457917e-03]
[ 9.88739491e-01]
[ 2.21080118e-04]
[ 1.81115605e-02]]
['x/1305.jpg', 'x/570.jpg', 'x/508.jpg', 'x/1076.jpg', 'x/624.jpg', 'x/94.jpg', 'x/1128.jpg', 'x/137.jpg', 'x/795.jpg', 'x/555.jpg']

brad0taylor on 16 Jul 2017

👍11 ❤1

@brad0taylor It seems generator.filenames is always the same, but the generator's output is not always ordered like that. It's so annoying. Is there an explanation, @fchollet?

Another thing I noticed is that your pred1 array is the preds array shifted by one element.
Maybe that means something?

Going by @chenchennn's comment, maybe you should create the generator test_gen again before using it in another predict_generator call.

ghost on 3 Aug 2017

Has anyone found a usable solution to this? At least I know that I am not crazy :D
gen = image.ImageDataGenerator(...).flow_from_directory(..., shuffle=False) helped me; thanks for posting, @hdmetor.

XRarach on 17 Aug 2017

No problem, @XRarach.
I think there is still no official support for that. Another thing I did in the past (which is NOT efficient) was to grab the file paths from the generator and then pass each file through the network one at a time. Obviously that will only work for reasonably small datasets.

hdmetor on 19 Aug 2017

The listed filenames are fixed.
https://github.com/fchollet/keras/blob/22e6bea8c2e23c6bbd6d98b4d3fe8b2e74c33c3d/keras/preprocessing/image.py#L1008
https://github.com/fchollet/keras/blob/22e6bea8c2e23c6bbd6d98b4d3fe8b2e74c33c3d/keras/preprocessing/image.py#L1004
https://github.com/fchollet/keras/blob/22e6bea8c2e23c6bbd6d98b4d3fe8b2e74c33c3d/keras/preprocessing/image.py#L859

The enumerated filenames are randomly shuffled.
https://github.com/fchollet/keras/blob/22e6bea8c2e23c6bbd6d98b4d3fe8b2e74c33c3d/keras/preprocessing/image.py#L1023
https://github.com/fchollet/keras/blob/22e6bea8c2e23c6bbd6d98b4d3fe8b2e74c33c3d/keras/preprocessing/image.py#L704
https://github.com/fchollet/keras/blob/22e6bea8c2e23c6bbd6d98b4d3fe8b2e74c33c3d/keras/preprocessing/image.py#L709

futurely on 23 Aug 2017

👍2

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

stale[bot] on 21 Nov 2017

About this issue, what about overwriting the "filenames" attribute right after building the batches?
From this:

https://github.com/fchollet/keras/blob/4a28dc5debcd0bd790ef29d202342184fde5a1f4/keras/preprocessing/image.py#L1101-L1110

to something like this

        filenames = []
        for i, j in enumerate(index_array):
            fname = self.filenames[j]
            filenames.append(fname)
            img = load_img(os.path.join(self.directory, fname),
                           grayscale=grayscale,
                           target_size=self.target_size,
                           interpolation=self.interpolation)
            x = img_to_array(img, data_format=self.data_format)
            x = self.image_data_generator.random_transform(x)
            x = self.image_data_generator.standardize(x)
            batch_x[i] = x
        self.filenames = filenames

What do you think?

romaintha on 7 Dec 2017

This thread has several topics, but I am having a similar problem as some of you.

TLDR: Simply call datagen.reset() before you call model.predict_generator() to get the same order as datagen.filenames and datagen.classes.

Explanation: I am using flow_from_directory() on my data-generator and I need to know which image is associated with each prediction. The problem is that the data-generator will output batches for eternity so even if it is not shuffled, it might start at different positions in the dataset each time we use the generator, thus giving us seemingly different outputs. So you would need to exactly match the number of iterations and batches to the dataset-size, in order for it to restart at the beginning of the dataset in each call to predict_generator(). You can see the internal batch-counter using datagen.batch_index.

Somebody suggested to create a new data-generator before each call to predict_generator(). That should work, but it is a hack-around and there is a better and simpler way, which is to call reset() on the generator, and if you have set shuffle=False then it should start over from the beginning of the dataset and give the exact same output each time, so that the ordering now matches datagen.filenames and datagen.classes. This solved the problem for me.
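
The effect can be reproduced with a toy iterator. This is only a sketch under the assumption that the generator keeps a batch counter and wraps around at the end of the data; ToyIterator and next_batch are hypothetical names and this is not the Keras implementation.

```python
class ToyIterator:
    """Endless, unshuffled batch iterator with a reset() method."""

    def __init__(self, num_samples, batch_size):
        self.num_samples = num_samples
        self.batch_size = batch_size
        self.batch_index = 0

    def reset(self):
        # Start the next batch from the beginning of the dataset again.
        self.batch_index = 0

    def next_batch(self):
        start = self.batch_index * self.batch_size
        if start >= self.num_samples:  # wrap around after a full pass
            self.batch_index = 0
            start = 0
        stop = min(start + self.batch_size, self.num_samples)
        self.batch_index += 1
        return list(range(start, stop))


gen = ToyIterator(num_samples=5, batch_size=2)
first_call = [gen.next_batch() for _ in range(2)]   # stops in the middle of a pass
second_call = [gen.next_batch() for _ in range(2)]  # continues where it left off
gen.reset()
after_reset = [gen.next_batch() for _ in range(2)]  # starts from sample 0 again
print(first_call, second_call, after_reset)
```

Without the reset, the second call starts at sample 4 and its output no longer lines up with the fixed list of filenames.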

The last suggestion about updating the list of filenames would be a major side-effect and it should probably be avoided. If you need something like that and datagen.reset() is not enough for you, then add another variable to the datagen-object, but don't overwrite the existing list of filenames.
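
A sketch of what such an extra variable could look like, written as a small standalone wrapper instead of a change to Keras itself. RecordingIterator and served_filenames are hypothetical names, and the index batches are supplied by hand to stand in for the generator's (possibly shuffled) index arrays.

```python
class RecordingIterator:
    """Serve filename batches and remember the order in which they were served,
    leaving the original filenames list untouched."""

    def __init__(self, filenames, index_batches):
        self.filenames = filenames       # original list, never modified
        self.served_filenames = []       # hypothetical extra attribute
        self._index_batches = iter(index_batches)

    def __iter__(self):
        return self

    def __next__(self):
        index_array = next(self._index_batches)
        batch = [self.filenames[j] for j in index_array]
        self.served_filenames.extend(batch)
        return batch


filenames = ["a.jpg", "b.jpg", "c.jpg"]
iterator = RecordingIterator(filenames, index_batches=[[2, 0], [1]])
batches = list(iterator)
print(batches, iterator.served_filenames, iterator.filenames)
```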

Hvass-Labs on 7 Dec 2017

👍16

In @brad0taylor's output, why does it print "Found 1531 images belonging to 1 classes."?

jerevon on 6 Feb 2019

> Try validation_generator.class_indices and validation_generator.classes. pprint it and see how its useful to you.

Could you please link to an example of a classification problem with more than two classes where these attributes are used?

keshavatgithub on 7 Feb 2019

> @raghuramdr Yes. marcj's suggestion was useful.
> Type 'validation_generator.' and then tab button. You'll get all the available functions.

Could you please link to an example of a classification problem with more than two classes where these attributes are used, together with the dataset? I am trying to learn classification with more than two classes. I have a dataset of 100 classes with 10 images each, and I need to know how I can apply a CNN to it.

keshavatgithub on 7 Feb 2019

> Try validation_generator.class_indices and validation_generator.classes. pprint it and see how its useful to you.

It worked for me. I wanted to see which class index was assigned to which folder, like this:
print(validation_generator.class_indices)

yohannarodriguez on 14 Nov 2019

The following could be done (for 2 classes), with shuffle=False and batch_size=32:

import numpy as np

batch_size = 32  # must match the batch_size passed to flow_from_directory

print("Number of Classes: ", training_set.num_classes)
print(training_set.class_indices)

# index of the first sample of each class in the unshuffled dataset
cats_indx = np.where(training_set.labels == training_set.class_indices['cats'])[0][0]
dogs_indx = np.where(training_set.labels == training_set.class_indices['dogs'])[0][0]

print("First Index of Cat: ", cats_indx)
print("First Index of Dog: ", dogs_indx)

# print("This is the 6th (indx = 5) batch: ", training_set[5][1])

# finding the actual categorical label of our cats:
# batch number, then position of the sample inside that batch
cat_batch_num = cats_indx // batch_size
relative_index_of_first_cat = cats_indx % batch_size
cat_label = training_set[cat_batch_num][1][relative_index_of_first_cat]

# finding the actual categorical label of our dogs
dog_batch_num = dogs_indx // batch_size
relative_index_of_first_dog = dogs_indx % batch_size
dog_label = training_set[dog_batch_num][1][relative_index_of_first_dog]

print("Cats = ", cat_label)
print("Dogs = ", dog_label)
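
The batch arithmetic behind this lookup can be checked in isolation. Assuming shuffle=False, sample number i sits in batch i // batch_size at position i % batch_size; locate_sample is a hypothetical helper that returns both values at once.

```python
def locate_sample(sample_index, batch_size):
    """Return (batch number, position inside that batch) for a flat index."""
    return divmod(sample_index, batch_size)


# Sample 70 with batches of 32: batches 0 and 1 hold samples 0..63,
# so sample 70 is in batch 2 at position 6.
print(locate_sample(70, 32))
```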