I've got 10100 images split into 101 folders (classes) that I want to load as a single dataset using Keras's `datagen.flow_from_directory`.
Here is the code:

```python
datagen = ImageDataGenerator(rescale=1./255)
generator = datagen.flow_from_directory(
    train_data_dir,
    target_size=(150, 150),
    batch_size=12,
    class_mode='sparse',
    shuffle=True)
```
Since I want the data shuffled, I expected `generator.classes` to come out in random order, something like:

```
[0, 4, 5, 3, 2, ...]
```

Instead it comes out in sorted order: 100 0's, followed by 100 1's, and so on up to 100 100's, so no randomness is seen at all. Any idea whether this is a bug, or is there a workaround for this?
`generator.classes` gives the class assigned to each sample, based on the sorted order of the folder names (you can check it here). It is just a list of length `nb_samples` (in your case 10100) where each entry is that sample's class index; the entries are not shuffled at this point.
The samples are shuffled within the batch generator (here), so that when a batch is requested by `fit_generator` or `evaluate_generator`, random samples are given.
Hope this helps.
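To make the distinction concrete, here is a minimal numpy-only sketch (the array sizes are made up, and Keras itself is not needed) of how `classes` stays in sorted folder order while shuffling happens through a separate index array:

```python
import numpy as np

# Miniature stand-in: 3 classes with 4 samples each, assigned in
# sorted folder order -- this mirrors what generator.classes holds.
classes = np.repeat(np.arange(3), 4)
# classes is [0 0 0 0 1 1 1 1 2 2 2 2] and never gets reordered.

# Shuffling happens at batch time through a permuted index array;
# the labels of a batch are looked up through that permutation.
rng = np.random.default_rng(0)
index_array = rng.permutation(len(classes))
first_batch_labels = classes[index_array[:4]]
```

So the ordered `classes` array and the shuffled batches are both "correct"; they just answer different questions.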
@varun-bankiti already gave the right answer.
In my case I wanted to get the classes for predictions on data that was flowing from a directory. Since I use different generators for training and testing, I just added the parameter `shuffle=False`, and now it does not shuffle the predictions:

```python
test_generator = ImageDataGenerator()
test_data_generator = test_generator.flow_from_directory(
    "test_directory",
    batch_size=32,
    shuffle=False)
```
This works for both `shuffle=False` and `shuffle=True`:

```python
for batch_x, batch_y in generator:
    # batch_x contains a batch of images
    # batch_y contains a batch of classes in one-hot form
```
Hope this helps.
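One caveat with the loop above: Keras generators cycle endlessly, so in practice you bound the loop by the number of batches you want. A sketch with fake one-hot batches standing in for generator output (the `fake_batches` helper is invented here purely for illustration):

```python
import numpy as np

def fake_batches(n_batches, batch_size, n_classes):
    # Stand-in for a Keras generator: yields (images, one-hot labels).
    for _ in range(n_batches):
        x = np.zeros((batch_size, 150, 150, 3))                  # dummy images
        y = np.eye(n_classes)[np.random.randint(0, n_classes, batch_size)]
        yield x, y

labels = []
for batch_x, batch_y in fake_batches(n_batches=3, batch_size=4, n_classes=5):
    labels.extend(np.argmax(batch_y, axis=1))  # decode one-hots to class indices
```

With a real generator you would replace `fake_batches(...)` with the generator and break out of the loop after `n // batch_size` iterations.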
Could you find a solution for this issue? I also want to shuffle my validation data in order to process a certain amount, and then I need the ground-truth labels for that subset to plot a confusion matrix.
I propose a solution:

```python
import numpy as np

scores_train = np.zeros((0, 10))  # predicted scores (10 classes here)
gt = np.zeros((0, 10))            # ground-truth one-hot labels
check_points = int(nb_train_samples / batch_size)
for check_point in range(check_points):
    print('Batch ' + str(check_point))
    imgs, labels = train_generator.next()
    feats = model.predict(imgs)
    scores_train = np.concatenate((scores_train, feats), axis=0)
    gt = np.concatenate((gt, labels), axis=0)
```

Hopefully the above is useful for you.
I would also like to see this solved more directly. It seems that generating e.g. a confusion matrix from image classification is an extremely common use case and it would be nice not to have to glue something like this together every time.
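For what it's worth, once predictions and ground truth are aligned (via `shuffle=False`, or by collecting batches as above), the confusion matrix itself is a few lines of numpy. A sketch with made-up label arrays:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    # Row = true class, column = predicted class.
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # accumulate counts, repeats included
    return cm

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(y_true, y_pred, n_classes=3)
# cm[i, j] counts samples of true class i predicted as class j.
```

The hard part remains getting `y_true` and `y_pred` into the same order; the matrix itself is cheap.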
I had the same problem, and I looked into the Keras generator source code to find out how exactly it shuffles the data.
The generator has an attribute named `index_array`, which is initialised to `None`. When the generator is first activated (first epoch) it checks whether `index_array` is `None`, and if so it sets `index_array` to a random permutation using the method `_set_index_array`.
So one can initialise `index_array` to a chosen permutation and remember it, like so:
```python
import numpy as np

datagen = ImageDataGenerator(rescale=1./255)
generator = datagen.flow_from_directory(
    train_data_dir,
    target_size=(150, 150),
    batch_size=12,
    class_mode='sparse',
    shuffle=True)

per = np.random.permutation(generator.n)
generator.index_array = per
classes = generator.classes[per]
```
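The alignment this trick relies on can be checked without Keras: fixing the permutation yourself means `classes[per]` gives the ground truth in exactly the order the samples will be yielded. A numpy-only sketch:

```python
import numpy as np

classes = np.repeat(np.arange(4), 3)       # ordered labels: 4 classes x 3 samples
per = np.random.permutation(len(classes))  # our chosen permutation
shuffled_labels = classes[per]             # ground truth in yielded order

# Drawing "batches" through the permutation reproduces shuffled_labels
# batch by batch, which is what lets you match predictions to labels.
batch_size = 3
batches = [classes[per[i:i + batch_size]]
           for i in range(0, len(classes), batch_size)]
```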
But it only works for the first epoch (which is great for predictions/evaluations, but not for training). If you want a solution that works for every epoch, you'll probably have to change the `_set_index_array` method to record the last permutation.
I also feel that this is pretty clunky as it is. The main problem is that the index array is generated per batch and is meant to be local to a thread (if you try to set it as an object attribute in `next()`, really weird things happen, like batches of zero size). But outside the generator we need to be able to keep track of the shuffling, to monitor what happens to each sample.
The best option would be for the batch iterator to report the index array it's using as an additional return value of `next()`, but I imagine this triggers deeper changes whose consequences I wouldn't have a good understanding of.
That said, a workaround is to not shuffle your data, which is fine for validation and test sets. Shuffling is truly effective as a training technique, but image models tend to use a good amount of data augmentation, and while it would be useful to know what the predictions are on those augmented samples, that seems difficult to do. You could create two separate generators for training: one with data augmentation that is shuffled and used for fitting the model, and another with no shuffling (I'd have to think about how augmentation would affect this). Then use the one with no shuffling to save outputs for further analysis.
A final workaround, which is pretty ugly depending on how your project is structured: take a hash of each input image and compare it against the hashes of all the images you know you gave to the training data. It's not very computationally intensive as long as you choose a simple hash function, and it's a constant-time lookup (so only a slight overhead is added to each image prediction).
Simple solution:

```python
val_generator.shuffle = False
val_generator.index_array = None
```

This will cause an internal reset of the index order, without shuffling.
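A toy model of the reset logic (my reading of the generator source described earlier in this thread, not the actual Keras class) shows why this works: with `index_array` cleared and `shuffle` off, the next epoch rebuilds the index in plain sequential order.

```python
import numpy as np

class ToyIterator:
    # Minimal imitation of the relevant Keras Iterator behaviour.
    def __init__(self, n, shuffle=True):
        self.n = n
        self.shuffle = shuffle
        self.index_array = None

    def _set_index_array(self):
        if self.shuffle:
            self.index_array = np.random.permutation(self.n)
        else:
            self.index_array = np.arange(self.n)

    def epoch_order(self):
        if self.index_array is None:   # rebuilt only when cleared
            self._set_index_array()
        return self.index_array

it = ToyIterator(5, shuffle=True)
it.shuffle = False        # the workaround: disable shuffling...
it.index_array = None     # ...and clear the cached index
order = it.epoch_order()  # now sequential: [0, 1, 2, 3, 4]
```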