I've got 10100 images split into 101 folders (classes) that I want to load as a single dataset using Keras's `datagen.flow_from_directory`.
Here is the code:

```python
datagen = ImageDataGenerator(rescale=1./255)
generator = datagen.flow_from_directory(
    train_data_dir,
    target_size=(150, 150),
    batch_size=12,
    class_mode='sparse',
    shuffle=True)
```
Since I want the data shuffled, I expected `generator.classes` to come out in random order, something like:

```
[0, 4, 5, 3, 2, ...]
```

Instead it comes out in sorted order: 100 0's, followed by 100 1's, and so on up to 100 100's, so no randomness is seen at all. Any idea whether this is a bug, or is there a workaround for this?
`generator.classes` gives the class assigned to each sample, based on the sorted order of the folder names (you can check it here). It is just a list of length `nb_samples` (in your case 10100) where each entry is that sample's class index; the entries are not shuffled at this point.
The samples are shuffled within the batch generator (here), so that when a batch is requested by `fit_generator` or `evaluate_generator`, random samples are given.
Hope this helps.
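To make the distinction concrete, here is a minimal numpy-only sketch (the array sizes are made up, and Keras itself is not needed) of how `classes` stays in sorted folder order while shuffling happens through a separate index array:

```python
import numpy as np

# Miniature stand-in: 3 classes with 4 samples each, assigned in
# sorted folder order -- this mirrors what generator.classes holds.
classes = np.repeat(np.arange(3), 4)
# classes is [0 0 0 0 1 1 1 1 2 2 2 2] and never gets reordered.

# Shuffling happens at batch time through a permuted index array;
# the labels of a batch are looked up through that permutation.
rng = np.random.default_rng(0)
index_array = rng.permutation(len(classes))
first_batch_labels = classes[index_array[:4]]
```

So the ordered `classes` array and the shuffled batches are both "correct"; they just answer different questions.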
@varun-bankiti already gave the right answer.
In my case I wanted to get the classes for predictions on data that was flowing from a directory. Since I use different generators for training and testing, I just added the parameter `shuffle=False`, and now it does not shuffle the predictions:

```python
test_generator = ImageDataGenerator()
test_data_generator = test_generator.flow_from_directory(
    "test_directory",
    batch_size=32,
    shuffle=False)
```
This works for both `shuffle=False` and `shuffle=True`:

```python
for batch_x, batch_y in generator:
    # batch_x contains a batch of images
    # batch_y contains a batch of classes in one-hot form
```
Hope this helps.
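One caveat with the loop above: Keras generators cycle endlessly, so in practice you bound the loop by the number of batches you want. A sketch with fake one-hot batches standing in for generator output (the `fake_batches` helper is invented here purely for illustration):

```python
import numpy as np

def fake_batches(n_batches, batch_size, n_classes):
    # Stand-in for a Keras generator: yields (images, one-hot labels).
    for _ in range(n_batches):
        x = np.zeros((batch_size, 150, 150, 3))                  # dummy images
        y = np.eye(n_classes)[np.random.randint(0, n_classes, batch_size)]
        yield x, y

labels = []
for batch_x, batch_y in fake_batches(n_batches=3, batch_size=4, n_classes=5):
    labels.extend(np.argmax(batch_y, axis=1))  # decode one-hots to class indices
```

With a real generator you would replace `fake_batches(...)` with the generator and break out of the loop after `n // batch_size` iterations.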
Could you find a solution for this issue? I also want to shuffle my validation data in order to process a certain amount, and then I need the ground-truth labels for that subset to plot a confusion matrix.
I propose a solution:

```python
import numpy as np

scores_train = np.zeros((0, 10))  # predicted scores (10 classes here)
gt = np.zeros((0, 10))            # ground-truth one-hot labels
check_points = int(nb_train_samples / batch_size)
for check_point in range(check_points):
    print('Batch ' + str(check_point))
    imgs, labels = train_generator.next()
    feats = model.predict(imgs)
    scores_train = np.concatenate((scores_train, feats), axis=0)
    gt = np.concatenate((gt, labels), axis=0)
```

Hopefully the above is useful for you.
I would also like to see this solved more directly. It seems that generating e.g. a confusion matrix from image classification is an extremely common use case and it would be nice not to have to glue something like this together every time.
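For what it's worth, once predictions and ground truth are aligned (via `shuffle=False`, or by collecting batches as above), the confusion matrix itself is a few lines of numpy. A sketch with made-up label arrays:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    # Row = true class, column = predicted class.
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # accumulate counts, repeats included
    return cm

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(y_true, y_pred, n_classes=3)
# cm[i, j] counts samples of true class i predicted as class j.
```

The hard part remains getting `y_true` and `y_pred` into the same order; the matrix itself is cheap.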
I had the same problem, and I looked into the Keras generator source code to find out how exactly it shuffles the data.
The generator has an attribute named `index_array`, which is initialised to `None`. When the generator is first activated (first epoch) it checks whether `index_array` is `None`, and if so it sets `index_array` to a random permutation using the method `_set_index_array`.
So one can initialise `index_array` to a chosen permutation and remember it, like so:
```python
import numpy as np

datagen = ImageDataGenerator(rescale=1./255)
generator = datagen.flow_from_directory(
    train_data_dir,
    target_size=(150, 150),
    batch_size=12,
    class_mode='sparse',
    shuffle=True)

per = np.random.permutation(generator.n)
generator.index_array = per
classes = generator.classes[per]
```
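The alignment this trick relies on can be checked without Keras: fixing the permutation yourself means `classes[per]` gives the ground truth in exactly the order the samples will be yielded. A numpy-only sketch:

```python
import numpy as np

classes = np.repeat(np.arange(4), 3)       # ordered labels: 4 classes x 3 samples
per = np.random.permutation(len(classes))  # our chosen permutation
shuffled_labels = classes[per]             # ground truth in yielded order

# Drawing "batches" through the permutation reproduces shuffled_labels
# batch by batch, which is what lets you match predictions to labels.
batch_size = 3
batches = [classes[per[i:i + batch_size]]
           for i in range(0, len(classes), batch_size)]
```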
But it only works for the first epoch (which is great for predictions/evaluations, but not for training). If you want a solution that works for every epoch, you'll probably have to change the `_set_index_array` method to record the last permutation.
I also feel that this is pretty clunky as it is. The main problem is that the index array is generated per batch and is meant to be local to a thread (if you try to set it as an object attribute in `next()`, really weird things happen, like batches of zero size). But outside the generator we need to be able to keep track of the shuffling, to monitor what happens to each sample.
The best option would be for the batch iterator to report the index array it's using as an additional return value of `next()`, but I imagine this triggers deeper changes whose consequences I wouldn't have a good understanding of.
That said, a workaround is to not shuffle your data, which is fine for validation and test sets. Shuffling is truly effective as a training technique, but image models tend to use a good amount of data augmentation, and while it would be useful to know what the predictions are on those augmented samples, that seems difficult to do. You could create two separate generators for training: one with data augmentation that is shuffled and used for fitting the model, and another with no shuffling (I'd have to think about how augmentation would affect this). Then use the one with no shuffling to save outputs for further analysis.
A final workaround, which is pretty ugly depending on how your project is structured: take a hash of each input image and compare it against the hashes of all the images you know you gave to the training data. It's not very computationally intensive as long as you choose a simple hash function, and it's a constant-time lookup (so only a slight overhead is added to each image prediction).
Simple solution:

```python
val_generator.shuffle = False
val_generator.index_array = None
```

This will cause an internal reset of the index order, without shuffling.
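A toy model of the reset logic (my reading of the generator source described earlier in this thread, not the actual Keras class) shows why this works: with `index_array` cleared and `shuffle` off, the next epoch rebuilds the index in plain sequential order.

```python
import numpy as np

class ToyIterator:
    # Minimal imitation of the relevant Keras Iterator behaviour.
    def __init__(self, n, shuffle=True):
        self.n = n
        self.shuffle = shuffle
        self.index_array = None

    def _set_index_array(self):
        if self.shuffle:
            self.index_array = np.random.permutation(self.n)
        else:
            self.index_array = np.arange(self.n)

    def epoch_order(self):
        if self.index_array is None:   # rebuilt only when cleared
            self._set_index_array()
        return self.index_array

it = ToyIterator(5, shuffle=True)
it.shuffle = False        # the workaround: disable shuffling...
it.index_array = None     # ...and clear the cached index
order = it.epoch_order()  # now sequential: [0, 1, 2, 3, 4]
```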