I am already aware of some discussions on how to use Keras for very large datasets (>1,000,000 images), such as this and this. However, for my scenario, I can't figure out the appropriate way to use the ImageDataGenerator or to write my own data generator.
Specifically, I have the following four questions:
When we call datagen.fit(X_sample), do we assume that X_sample is a big enough chunk of data to calculate the mean and to perform feature centering/normalization and whitening on? X_sample obviously cannot be the entire data, so will the augmentation (i.e. flipping, width/height shift) happen only on partial data? For example, say X_sample = 10,000 out of a total of 1,000,000 pictures. After augmentation, suppose we get 2 * 10,000 more pictures. Note that we are not running datagen.fit() again, so will our augmented data contain only 1,000,000 + 2 * 10,000 samples? How do we augment the entire data (i.e. get 1,000,000 + 2 * 1,000,000 samples)?

My approach for building a data generator (for very large data) that loops indefinitely is as follows (and it fails):
def myGenerator():  # this will give a chunk of 10K pictures; 100 such chunks form the entire dataset
    fileIndex = 0
    while 1:
        # the following loads data from the HDF5 file numbered fileIndex
        (X_train, y_train) = LOAD_HDF5_OF_10K_SAMPLES(fileIndex)
        fileIndex = fileIndex + 1
        if fileIndex == numOfHDF_files:
            fileIndex = 0  # so that fileIndex wraps back and the loop goes on indefinitely
The above code doesn't work in the sense that once execution enters the above function from fit_generator(), it just stays in the while 1 loop forever. A detailed example will help a lot.
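For reference, here is a minimal sketch (not from this thread) of the same HDF5-chunk idea written as a proper generator. The crucial difference is the yield statement; without it, calling the function simply runs the loop forever. LOAD_HDF5_OF_10K_SAMPLES and numOfHDF_files are the placeholders from the snippet above:

def myGenerator():
    # Loops over the HDF5 chunks indefinitely, yielding one chunk per iteration.
    fileIndex = 0
    while 1:
        # Load the chunk of 10K samples stored in the HDF5 file numbered fileIndex
        (X_train, y_train) = LOAD_HDF5_OF_10K_SAMPLES(fileIndex)
        yield X_train, y_train  # hand one chunk back to fit_generator()
        fileIndex = fileIndex + 1
        if fileIndex == numOfHDF_files:
            fileIndex = 0  # wrap around so the loop runs indefinitely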
If we use ImageDataGenerator as in this link (which is preferable to writing our own), should we put (X_train, y_train), (X_test, y_test) = LOAD_10K_SAMPLES_OF_BIG_DATA() in a for loop and call datagen.fit(X_train) and model.fit_generator(datagen.flow(...)) inside that loop?

Hi there, my answers to your questions are below:
I have some follow-up questions, @wongjingping:
Response to 3: I should have been clearer. I am not concerned about one "big" HDF5 file; the question is that the entire data can't be loaded, as you say. You say that my example looks fine, but I think it's wrong. I illustrate that with the following snippet of a data generator (let's leave the data augmentation for later):
def myGenerator():
    # loading data
    (X_train, y_train), (X_test, y_test) = mnist.load_data()
    # some preprocessing
    y_train = np_utils.to_categorical(y_train, 10)
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    X_train = X_train.astype('float32')
    X_test = X_test.astype('float32')
    X_train /= 255
    X_test /= 255
    while 1:
        for i in range(1875):
            if i % 125 == 0:
                print("i = " + str(i))
            yield X_train[i*32:(i+1)*32], y_train[i*32:(i+1)*32]
The above function is called by fit_generator(). Now, it should print up to i=1750 and then train the model. However, it just keeps printing i=0 to i=1750 and then starts again from i=0 _without training the model_.
If I comment out the while 1 line, it runs perfectly, but then it violates the assumption of an infinite loop, doesn't it? Can you clear up my confusion by providing a concrete example or by explaining with respect to this example?
If you want a self-contained code snippet, it is as follows. You can just run it.
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten, Reshape
from keras.layers.convolutional import Convolution1D, Convolution2D, MaxPooling2D
from keras.utils import np_utils

def myGenerator():
    (X_train, y_train), (X_test, y_test) = mnist.load_data()
    y_train = np_utils.to_categorical(y_train, 10)
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    X_train = X_train.astype('float32')
    X_test = X_test.astype('float32')
    X_train /= 255
    X_test /= 255
    while 1:
        for i in range(1875):  # 1875 * 32 = 60000 -> number of training samples
            if i % 125 == 0:
                print("i = " + str(i))
            yield X_train[i*32:(i+1)*32], y_train[i*32:(i+1)*32]
batch_size = 128
nb_classes = 10
nb_epoch = 12
# input image dimensions
img_rows, img_cols = 28, 28
# number of convolutional filters to use
nb_filters = 32
# size of pooling area for max pooling
nb_pool = 2
# convolution kernel size
nb_conv = 3
model = Sequential()
model.add(Convolution2D(nb_filters, nb_conv, nb_conv,
                        border_mode='valid',
                        input_shape=(1, img_rows, img_cols)))
model.add(Activation('relu'))
model.add(Convolution2D(nb_filters, nb_conv, nb_conv))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(nb_pool, nb_pool)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(10))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adadelta')
model.fit_generator(myGenerator(), samples_per_epoch = 60000, nb_epoch = 2, verbose=2, show_accuracy=True, callbacks=[], validation_data=None, class_weight=None, nb_worker=1)
Hi there,
I'm afraid I'm having some problems running your code with the debugger, but from what I can see, I think you need to assign the generator instance to a new variable before passing it to the fit_generator() method, as below:
my_generator = myGenerator()
model.fit_generator(my_generator, samples_per_epoch = 60000, nb_epoch = 2, verbose=2, show_accuracy=True, callbacks=[], validation_data=None, class_weight=None, nb_worker=1)
Let me know if it doesn't work! (I'm not too sure myself)
I can run it in the PyCharm debugger after putting a breakpoint inside the generator function. However, I don't think running with a debugger is a good idea in the case of generators; that's why I am printing the value of i. With the code I had, you should see i going from 0 to 1750, then immediately wrapping back and printing 0 to 1750 again, repeating this _indefinitely_. Ideally, it should print up to 1750, train, then print up to 1750 again, and repeat this NUMBER_OF_EPOCHS times.
Anyway, with your suggestion, the same thing happens: it just keeps repeating indefinitely. However, if I remove while 1, it goes up to 1750, trains, goes back to 0 and up to 1750 again, trains, and then terminates (as desired).
My apologies, I'm having some difficulty with the ipdb debugger in Spyder and resorted to another workaround.
Your model is actually training. You can add this snippet of code to verify (print out) the progress of your model using a callback:
from keras.callbacks import Callback

class printbatch(Callback):
    def on_batch_end(self, batch, logs={}):
        print(logs)
...
pb = printbatch()

# modify the fit_generator call to include the callback pb
model.fit_generator(myGenerator(), samples_per_epoch=60000, nb_epoch=2,
                    verbose=2, show_accuracy=True, callbacks=[pb], validation_data=None,
                    class_weight=None, nb_worker=1)
By i ~ 500 you should observe that the training accuracy printed out is at least 90%, with 100% accuracy appearing more frequently.
Without the callback, it is as you observed, i goes from 0 to 1750 and wraps back for the next epoch.
Hope this clarifies your doubts :)
@wongjingping Thanks. It works. Just one small doubt (rather, an observation): when the logs are printed, the numbers printed next to "Batch:" are actually sample numbers. So if there are 60000 samples per epoch, the logs printed through the callback show "Batch: 59999".
Anyway, I am closing this issue now. I still haven't had success in running the data generators with more than one worker, but I have asked another question for that, Issue #1638. You may take a look at it if time permits; that would be great.
Hi @parag2489,
I suspect there is a bug with fit_generator in determining the batch_size; I have raised this under a separate issue, #1639. Feel free to chip in!
@wongjingping @parag2489 Hi~ May I ask you guys how to specify the batch size if I write my own data generator? fit_generator doesn't have a batch_size parameter, and in my generator I only yield one sample at a time.
@sunshineatnoon You can pass the batch_size as an argument to the generator:
def generate_batch(epoch_size, batch_size):
    i = 0
    while i < epoch_size:
        # add in image reading/augmenting code here
        yield X[i:i+batch_size, ...], y[i:i+batch_size, ...]
        if i + batch_size > epoch_size:
            i = 0
        else:
            i += batch_size
You might want to check out this link that introduces generators.
@wongjingping Thanks! I will look into this. BTW, what does samples_per_epoch in fit_generator mean exactly? Say I use a batch size of 64; does this mean a total of 64 * samples_per_epoch samples is seen every epoch?
@sunshineatnoon sorry for the poor formatting; the samples_per_epoch is the number of examples you expect to see in an epoch, not batch_size * samples_per_epoch :)
@wongjingping So it means that if I use a batch size of 64, I will have samples_per_epoch / 64 batches per epoch? But when I specify batch_size and generate a batch, my network's training time slows down; it seems like it trains on more samples each epoch if I increase the batch_size. Here is my generator:
def generate_batch_data(vocPath, imageNameFile, batch_size):
    sample_number = 5000
    class_num = 20
    while 1:
        for i in range(0, sample_number, batch_size):
            # Read a batch of images from files
            imageList = prepareBatch(i, i+batch_size, imageNameFile, vocPath)
            # process imageList to np arrays images and boxes
            yield np.asarray(images), np.asarray(boxes)
@sunshineatnoon samples_per_epoch determines when fit_generator() stops asking the data generator for samples. This is necessary since the data generator has an infinite loop and has to be stopped somewhere. In other words, samples_per_epoch = batch_size * number_of_batches.

Regarding why your training slows down, it's best to profile your code; there is a feature in Theano for that (I think mode=Profile). You can increase speed if you call prepareBatch() for a large number of samples (large meaning they fit in your CPU RAM but not on the GPU). Also, convert images and boxes to numpy arrays only once, then just yield in batches of 32. In short, prepareBatch and the two calls to np.asarray should go outside the for loop.
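A sketch of that restructuring is below. prepareBatch stands in for the loading code above, and process_image_list is a hypothetical helper that turns the loaded image list into the images and boxes lists:

import numpy as np

def generate_batch_data(vocPath, imageNameFile, batch_size):
    sample_number = 5000
    # Read and decode the whole 5000-sample chunk once, outside the batch loop.
    imageList = prepareBatch(0, sample_number, imageNameFile, vocPath)
    images, boxes = process_image_list(imageList)  # hypothetical helper returning two lists
    images = np.asarray(images)                    # convert to numpy arrays only once
    boxes = np.asarray(boxes)
    while 1:
        for i in range(0, sample_number, batch_size):
            # Only cheap slicing happens per batch; no file I/O or asarray calls here
            yield images[i:i + batch_size], boxes[i:i + batch_size]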
@parag2489 Thanks! It's very nice of you to give such a detailed explanation, I will try to change my code.
This page helps, thanks! BTW, it seems that new versions come out quickly.
Just to be clear, can someone confirm:

1. With model.fit() you specify the batch_size so it knows how to break a finite dataset (the corresponding x and y numpy arrays) into chunks for the gradient calculation. 100% of the dataset gets consumed each epoch.
2. With model.fit_generator() the generator you provide should loop infinitely, as samples_per_epoch is basically giving a bound on the total number of samples to run through. The batch_size isn't specified because each tuple returned from the generator is a single batch; you control the size of the batch via the generator, so if you return one sample per yield, it's like setting a batch size of 1.

Question: what the heck is max_q_size for? If the generator is handling the batching, why do you need another queue?
@raymondjplante
Q1. Your understanding of model.fit() is correct.
Q2. Correct.
Even I am not sure what max_q_size is for. I think this answer mentions the queue; the queue is used to ensure that the generator is thread-safe.
You can also look at #1638 to see how to make the data generator thread-safe.
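For reference, one common pattern for making a generator thread-safe (a sketch of the general idea, not necessarily the exact code in #1638) is to wrap it so that each next() call is guarded by a lock:

import threading

class ThreadSafeIterator(object):
    """Wraps a generator so that calls to next() are serialized with a lock."""
    def __init__(self, generator):
        self.generator = generator
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):          # Python 3
        with self.lock:
            return next(self.generator)

    next = __next__              # Python 2 compatibility

# Usage sketch: model.fit_generator(ThreadSafeIterator(myGenerator()), ...)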
@parag2489 Someone on SO provides a good explanation of the purpose of the generator queue: http://stackoverflow.com/questions/36986815/in-keras-model-fit-generator-method-what-is-the-generator-queue-controlled-pa
@wongjingping @parag2489 For the case of multiple inputs, such as when we have two pathways in the network, each corresponding to a different input, can we still use a data generator to generate image regions in parallel with the training process?
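For what it's worth, a sketch (not confirmed anywhere in this thread): a generator for a two-input model can yield a list of input arrays alongside the targets, e.g. with hypothetical arrays X1, X2 and labels y already in memory:

def two_input_generator(X1, X2, y, batch_size):
    # Yields ([input_1_batch, input_2_batch], target_batch) tuples forever,
    # matching the layout expected by a two-input Keras model.
    n = X1.shape[0]
    while 1:
        for i in range(0, n, batch_size):
            yield [X1[i:i + batch_size], X2[i:i + batch_size]], y[i:i + batch_size]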
The problem I'm facing is that Keras's fit_generator is good for processing images whose collective size is larger than RAM, but what if those files are actually not in an image format? For example, I've taken a huge number of images (500k) and run them through a pre-trained Inception v3 model to extract features. Now each of those files is nothing but a (1, 384, 8, 8) array stored as an .npy file. Any idea how I can use fit_generator to read them in batches, since collectively they won't fit in my RAM and the built-in generators apparently don't recognize anything other than image files?
@tanayz It would be exactly the same as if they were images instead of pickled/numpy data files:
- slice the loaded data so that len(slice) == batch_size,
- i.e. so that shape[0] == batch_size, and yield the data,
- handling the case where batch_size is not a multiple of the number of files, such that the generator will always yield batch_size examples.

A sketch along these lines follows below.

@seasonwang I'm afraid I haven't tried that out before - sorry for the late reply!
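Expanding on the bullet points above with a concrete sketch (not @wongjingping's code; the file layout and names are hypothetical, and each .npy file is assumed to hold one (1, 384, 8, 8) feature array with the labels kept in a separate array):

import numpy as np

def npy_feature_generator(file_paths, labels, batch_size):
    # file_paths: list of .npy paths; labels: array aligned with file_paths
    n = len(file_paths)
    while 1:
        for start in range(0, n, batch_size):
            batch_paths = file_paths[start:start + batch_size]
            # Each file holds a (1, 384, 8, 8) array; concatenating along axis 0
            # gives a (current_batch, 384, 8, 8) batch
            X = np.concatenate([np.load(p) for p in batch_paths], axis=0)
            y = labels[start:start + batch_size]
            # Note: the last slice is smaller than batch_size when n is not a
            # multiple of batch_size (the wrap-around case mentioned above)
            yield X, y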
Is there a way to use train_on_batch with a generator?
You mean?

for batch in generator:
    model.train_on_batch(batch)
Hi,
I have a question about using predict_generator: how do I ensure that the prediction is done on all test samples exactly once?
For example:
predictions = model.predict_generator(
    test_generator,
    steps=int(test_generator.samples / float(batch_size)),  # all samples once
    verbose=1,
    workers=2,
    max_q_size=10,
    pickle_safe=True
)
predicted_classes = np.argmax(predictions, axis=1)
true_classes = test_generator.classes
So the dimensions of predicted_classes and true_classes are different, since the total number of samples is not divisible by the batch size.
The size of my test set is not consistent, so the number of steps in predict_generator would change each time depending on the batch size. I am using flow_from_directory and cannot use predict_on_batch, since my data is organized in a directory structure.
One solution is running with a batch size of 1, but that makes it very slow.
I hope my question is clear. Thanks in advance.
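One possible workaround, sketched below (not a reply from this thread; it assumes flow_from_directory with shuffle=False and the same steps-style predict_generator call used above): round the step count up so the partial last batch is included, then truncate the predictions to the exact sample count.

import math
import numpy as np

# Round up so the final, smaller batch is also fed through the model
steps = int(math.ceil(test_generator.samples / float(batch_size)))
predictions = model.predict_generator(test_generator, steps=steps, verbose=1)

# Keep only one prediction per sample, in case the generator wrapped around
predictions = predictions[:test_generator.samples]
predicted_classes = np.argmax(predictions, axis=1)
true_classes = test_generator.classes  # now the same length as predicted_classes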
The comments and suggestions in this issue and its cousin #1638 were very helpful for me to efficiently process large numbers of images. I wrote it all up in a tutorial fashion that I hope can help others.
Hello,
I am trying to use model.fit_generator with a custom Callback that tries to access the validation data. However, whatever I do, the validation data accessed from within the Callback always equates to None.
class RecallMetrics(Callback):
    def on_train_begin(self, logs=None):
        print('RecallMetrics ... validating')
        self.val_f1s = []
        self.val_recalls = []
        self.val_precisions = []

    def on_epoch_end(self, epoch, logs=None):
        x = self.validation_data[0]
        if x is None:
            print('Error: validation_data is None')
            return
        else:
            val_predict = (np.asarray(self.model.predict(self.validation_data[0]))).round()
            val_targ = self.validation_data[1]
            _val_f1 = f1_score(val_targ, val_predict)
            _val_recall = recall_score(val_targ, val_predict)
            _val_precision = precision_score(val_targ, val_predict)
            self.val_f1s.append(_val_f1)
            self.val_recalls.append(_val_recall)
            self.val_precisions.append(_val_precision)
            print(" — val_f1: % f — val_precision: % f — val_recall % f" % (_val_f1, _val_precision, _val_recall))
        return
history = model.fit_generator(generator=train_gen,
                              validation_data=validate_gen,
                              # validation_data=None,
                              steps_per_epoch=len(train_file_list),
                              validation_steps=len(val_file_list) * 3,
                              verbose=2,
                              epochs=int(tc.config["LUNA16"]["epochs"]),
                              callbacks=callbacks,
                              workers=multiprocessing.cpu_count(),
                              use_multiprocessing=True)
How can I access validation data from a custom Callback when using fit_generator?
Best,
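A common workaround, sketched below (not from this thread): when validation_data is a generator, Keras does not populate self.validation_data in callbacks, so one option is to hand the callback a fixed validation set explicitly. The names mirror the snippet above but are otherwise hypothetical:

import numpy as np
from keras.callbacks import Callback
from sklearn.metrics import f1_score

class RecallMetricsExplicit(Callback):
    def __init__(self, val_x, val_y):
        # Keep our own reference, since self.validation_data stays None
        # when fit_generator receives a validation generator.
        super(RecallMetricsExplicit, self).__init__()
        self.val_x = val_x
        self.val_y = val_y
        self.val_f1s = []

    def on_epoch_end(self, epoch, logs=None):
        val_predict = np.asarray(self.model.predict(self.val_x)).round()
        val_f1 = f1_score(self.val_y, val_predict)
        self.val_f1s.append(val_f1)
        print(" - val_f1: %f" % val_f1)

# Usage sketch: draw one fixed batch from the validation generator up front
# val_x, val_y = next(validate_gen)
# callbacks.append(RecallMetricsExplicit(val_x, val_y))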
You mean?
for batch in generator: model.train_on_batch(batch)
Hi,
I tried using this but got the following error:
dloss_real = disc.train_on_batch(dataBatch, valid)
File "/home/macman/miniconda3/lib/python3.7/site-packages/keras/engine/training.py", line 1211, in train_on_batch
class_weight=class_weight)
File "/home/macman/miniconda3/lib/python3.7/site-packages/keras/engine/training.py", line 751, in _standardize_user_data
exception_prefix='input')
File "/home/macman/miniconda3/lib/python3.7/site-packages/keras/engine/training_utils.py", line 92, in standardize_input_data
data = [standardize_single_array(x) for x in data]
File "/home/macman/miniconda3/lib/python3.7/site-packages/keras/engine/training_utils.py", line 92, in <listcomp>
data = [standardize_single_array(x) for x in data]
File "/home/macman/miniconda3/lib/python3.7/site-packages/keras/engine/training_utils.py", line 27, in standardize_single_array
elif x.ndim == 1:
AttributeError: 'tuple' object has no attribute 'ndim'
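The traceback above suggests that train_on_batch is being handed the whole (x, y) tuple yielded by the generator as its first argument. A minimal sketch of a fix, assuming dataBatch, disc, and valid are the names from the traceback and that the generator yields (images, labels) tuples:

# dataBatch comes from a generator yielding (images, labels) tuples;
# take just the image array before calling train_on_batch, since
# train_on_batch expects inputs and targets as separate arguments.
x_batch, _ = dataBatch
dloss_real = disc.train_on_batch(x_batch, valid)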