I am already aware of some discussions on how to use Keras for very large datasets (>1,000,000 images), such as this and this. However, for my scenario, I can't figure out the appropriate way to use the ImageDataGenerator or to write my own data generator.
Specifically, I have the following four questions:
When we call datagen.fit(X_sample), do we assume that X_sample is a big enough chunk of data to calculate the mean and to perform feature centering/normalization and whitening on? X_sample obviously cannot be the entire data, so will the augmentation (i.e. flipping, width/height shift) happen only on partial data? For example, say X_sample = 10,000 out of a total of 1,000,000 pictures. After augmentation, suppose we get 2 * 10,000 more pictures. Note that we are not running datagen.fit() again, so will our augmented data contain only 1,000,000 + 2 * 10,000 samples? How do we augment the entire data (i.e. get 1,000,000 + 2 * 1,000,000 samples)?

My approach for building a data generator (for very large data) that loops indefinitely is as follows (and it fails):
def myGenerator():  # this will give a chunk of 10K pictures; 100 such chunks form the entire dataset
    fileIndex = 0
    while 1:
        # the following loads data from the HDF5 file numbered fileIndex
        (X_train, y_train) = LOAD_HDF5_OF_10K_SAMPLES(fileIndex)
        fileIndex = fileIndex + 1
        if fileIndex == numOfHDF_files:
            fileIndex = 0  # so that fileIndex wraps back and the loop goes on indefinitely
The above code doesn't work in the sense that once execution enters the above function from fit_generator(), it just stays in the while 1 loop forever. A detailed example will help a lot.
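For reference, here is a minimal sketch (not from this thread) of the same HDF5-chunk idea written as a proper generator. The crucial difference is the yield statement; without it, calling the function simply runs the loop forever. LOAD_HDF5_OF_10K_SAMPLES and numOfHDF_files are the placeholders from the snippet above:

def myGenerator():
    # Loops over the HDF5 chunks indefinitely, yielding one chunk per iteration.
    fileIndex = 0
    while 1:
        # Load the chunk of 10K samples stored in the HDF5 file numbered fileIndex
        (X_train, y_train) = LOAD_HDF5_OF_10K_SAMPLES(fileIndex)
        yield X_train, y_train  # hand one chunk back to fit_generator()
        fileIndex = fileIndex + 1
        if fileIndex == numOfHDF_files:
            fileIndex = 0  # wrap around so the loop runs indefinitely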
If we use ImageDataGenerator as in this link (which is preferable to writing our own), should we put (X_train, y_train), (X_test, y_test) = LOAD_10K_SAMPLES_OF_BIG_DATA() in a for loop and call datagen.fit(X_train) and model.fit_generator(datagen.flow(...)) inside that loop?

Hi there, my answers to your questions are below:
I have some follow-up questions, @wongjingping:
Response to 3: I should have been clearer. I am not concerned about one "big" HDF5 file; the question is that the entire data can't be loaded, as you say. You say that my example looks fine, but I think it's wrong. I illustrate that with the following snippet of a data generator (let's leave the data augmentation for later):
def myGenerator():
    # loading data
    (X_train, y_train), (X_test, y_test) = mnist.load_data()
    # some preprocessing
    y_train = np_utils.to_categorical(y_train, 10)
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    X_train = X_train.astype('float32')
    X_test = X_test.astype('float32')
    X_train /= 255
    X_test /= 255
    while 1:
        for i in range(1875):
            if i % 125 == 0:
                print("i = " + str(i))
            yield X_train[i*32:(i+1)*32], y_train[i*32:(i+1)*32]
The above function is called by fit_generator(). Now, it should print up to i=1750 and then train the model. However, it just keeps printing i=0 to i=1750 and then starts again from i=0 _without training the model_.
If I comment out the while 1 line, it runs perfectly, but then it violates the assumption of an infinite loop, doesn't it? Can you clear up my confusion by providing a concrete example or by explaining with respect to this example?
If you want a self-contained code snippet, it is as follows. You can just run it.
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten, Reshape
from keras.layers.convolutional import Convolution1D, Convolution2D, MaxPooling2D
from keras.utils import np_utils

def myGenerator():
    (X_train, y_train), (X_test, y_test) = mnist.load_data()
    y_train = np_utils.to_categorical(y_train, 10)
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    X_train = X_train.astype('float32')
    X_test = X_test.astype('float32')
    X_train /= 255
    X_test /= 255
    while 1:
        for i in range(1875):  # 1875 * 32 = 60000 -> number of training samples
            if i % 125 == 0:
                print("i = " + str(i))
            yield X_train[i*32:(i+1)*32], y_train[i*32:(i+1)*32]
batch_size = 128
nb_classes = 10
nb_epoch = 12
# input image dimensions
img_rows, img_cols = 28, 28
# number of convolutional filters to use
nb_filters = 32
# size of pooling area for max pooling
nb_pool = 2
# convolution kernel size
nb_conv = 3
model = Sequential()
model.add(Convolution2D(nb_filters, nb_conv, nb_conv,
                        border_mode='valid',
                        input_shape=(1, img_rows, img_cols)))
model.add(Activation('relu'))
model.add(Convolution2D(nb_filters, nb_conv, nb_conv))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(nb_pool, nb_pool)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(10))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adadelta')
model.fit_generator(myGenerator(), samples_per_epoch = 60000, nb_epoch = 2, verbose=2, show_accuracy=True, callbacks=[], validation_data=None, class_weight=None, nb_worker=1)
Hi there,
I'm afraid I'm having some problems running your code with the debugger, but from what I can see, I think you need to assign the generator instance to a new variable before passing it to the fit_generator() method, as below:
my_generator = myGenerator()
model.fit_generator(my_generator, samples_per_epoch = 60000, nb_epoch = 2, verbose=2, show_accuracy=True, callbacks=[], validation_data=None, class_weight=None, nb_worker=1)
Let me know if it doesn't work! (I'm not too sure myself)
I can run it in the PyCharm debugger after putting a breakpoint inside the generator function. However, I don't think running with a debugger is a good idea in the case of generators; that's why I am printing the value of i. With the code I had, you should see i going from 0 to 1750, then immediately wrapping back and printing 0 to 1750 again, repeating this _indefinitely_. Ideally, it should print up to 1750, train, then print up to 1750 again, and repeat this NUMBER_OF_EPOCHS times.
Anyway, with your suggestion, the same thing happens: it just keeps repeating indefinitely. However, if I remove while 1, it goes up to 1750, trains, goes back to 0 and up to 1750 again, trains, and then terminates (as desired).
My apologies, I'm having some difficulty with the ipdb debugger in Spyder and resorted to another workaround.
Your model is actually training. You can add this snippet of code to verify (print out) the progress of your model using a callback:
from keras.callbacks import Callback

class printbatch(Callback):
    def on_batch_end(self, batch, logs={}):
        print(logs)
...
pb = printbatch()

# modify the fit_generator call to include the callback pb
model.fit_generator(myGenerator(), samples_per_epoch=60000, nb_epoch=2,
                    verbose=2, show_accuracy=True, callbacks=[pb], validation_data=None,
                    class_weight=None, nb_worker=1)
By i ~ 500 you should observe that the training accuracy printed out is at least 90%, with 100% accuracy appearing more frequently.
Without the callback, it is as you observed, i goes from 0 to 1750 and wraps back for the next epoch.
Hope this clarifies your doubts :)
@wongjingping Thanks. It works. Just one small doubt (rather, an observation): when the logs are printed, the numbers printed next to "Batch:" are actually sample numbers. So if there are 60000 samples per epoch, the logs printed through the callback show "Batch: 59999".
Anyway, I am closing this issue now. I still haven't had success in running the data generators with more than one worker, but I have asked another question for that, Issue #1638. You may take a look at it if time permits; that would be great.
Hi @parag2489,
I suspect there is a bug with fit_generator in determining the batch_size; I have raised this under a separate issue, #1639. Feel free to chip in!
@wongjingping @parag2489 Hi~ May I ask you guys how to specify the batch size if I write my own data generator? fit_generator doesn't have a batch_size parameter, and in my generator I only yield one sample at a time.
@sunshineatnoon You can pass the batch_size as an argument to the generator:
def generate_batch(epoch_size, batch_size):
    i = 0
    while i < epoch_size:
        # add in image reading/augmenting code here
        yield X[i:i+batch_size, ...], y[i:i+batch_size, ...]
        if i + batch_size > epoch_size:
            i = 0
        else:
            i += batch_size
You might want to check out this link that introduces generators.
@wongjingping Thanks! I will look into this. BTW, what does samples_per_epoch in fit_generator mean exactly? Say I use a batch size of 64; does this mean a total of 64 * samples_per_epoch samples is seen every epoch?
@sunshineatnoon sorry for the poor formatting; the samples_per_epoch is the number of examples you expect to see in an epoch, not batch_size * samples_per_epoch :)
@wongjingping So it means that if I use a batch size of 64, I will have samples_per_epoch / 64 batches per epoch? But when I specify batch_size and generate a batch, my network's training time slows down; it seems like it trains on more samples each epoch if I increase the batch_size. Here is my generator:
def generate_batch_data(vocPath, imageNameFile, batch_size):
    sample_number = 5000
    class_num = 20
    while 1:
        for i in range(0, sample_number, batch_size):
            # Read a batch of images from files
            imageList = prepareBatch(i, i+batch_size, imageNameFile, vocPath)
            # process imageList to np arrays images and boxes
            yield np.asarray(images), np.asarray(boxes)
@sunshineatnoon samples_per_epoch determines when fit_generator() stops asking the data generator for samples. This is necessary since the data generator has an infinite loop and has to be stopped somewhere. In other words, samples_per_epoch = batch_size * number_of_batches.

Regarding why your training slows down, it's best to profile your code; there is a feature in Theano for that (I think mode=Profile). You can increase speed if you call prepareBatch() for a large number of samples (large meaning they fit in your CPU RAM but not on the GPU). Also, convert images and boxes to numpy arrays only once, then just yield in batches of 32. In short, prepareBatch and the two calls to np.asarray should go outside the for loop.
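A sketch of that restructuring is below. prepareBatch stands in for the loading code above, and process_image_list is a hypothetical helper that turns the loaded image list into the images and boxes lists:

import numpy as np

def generate_batch_data(vocPath, imageNameFile, batch_size):
    sample_number = 5000
    # Read and decode the whole 5000-sample chunk once, outside the batch loop.
    imageList = prepareBatch(0, sample_number, imageNameFile, vocPath)
    images, boxes = process_image_list(imageList)  # hypothetical helper returning two lists
    images = np.asarray(images)                    # convert to numpy arrays only once
    boxes = np.asarray(boxes)
    while 1:
        for i in range(0, sample_number, batch_size):
            # Only cheap slicing happens per batch; no file I/O or asarray calls here
            yield images[i:i + batch_size], boxes[i:i + batch_size]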
@parag2489 Thanks! It's very nice of you to give such a detailed explanation, I will try to change my code.
This page helps, thanks! BTW, it seems that new versions come out quickly.
Just to be clear, can someone confirm:

1. With model.fit() you specify the batch_size so it knows how to break a finite dataset (the corresponding x and y numpy arrays) into chunks for the gradient calculation. 100% of the dataset gets consumed each epoch.
2. With model.fit_generator() the generator you provide should loop infinitely, as samples_per_epoch is basically giving a bound on the total number of samples to run through. The batch_size isn't specified because each tuple returned from the generator is a single batch; you control the size of the batch via the generator, so if you return one sample per yield, it's like setting a batch size of 1.

Question: what the heck is max_q_size for? If the generator is handling the batching, why do you need another queue?
@raymondjplante
Q1. Your understanding of model.fit() is correct.
Q2. Correct.
Even I am not sure what max_q_size is for. I think this answer mentions the queue; the queue is used to ensure that the generator is thread-safe.
You can also look at #1638 to see how to make the data generator thread-safe.
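For reference, one common pattern for making a generator thread-safe (a sketch of the general idea, not necessarily the exact code in #1638) is to wrap it so that each next() call is guarded by a lock:

import threading

class ThreadSafeIterator(object):
    """Wraps a generator so that calls to next() are serialized with a lock."""
    def __init__(self, generator):
        self.generator = generator
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):          # Python 3
        with self.lock:
            return next(self.generator)

    next = __next__              # Python 2 compatibility

# Usage sketch: model.fit_generator(ThreadSafeIterator(myGenerator()), ...)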
@parag2489 Someone on SO provides a good explanation of the purpose of the generator queue: http://stackoverflow.com/questions/36986815/in-keras-model-fit-generator-method-what-is-the-generator-queue-controlled-pa
@wongjingping @parag2489 For the case of multiple inputs, such as when we have two pathways in the network, each corresponding to a different input, can we still use a data generator to generate image regions in parallel with the training process?
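For what it's worth, a sketch (not confirmed anywhere in this thread): a generator for a two-input model can yield a list of input arrays alongside the targets, e.g. with hypothetical arrays X1, X2 and labels y already in memory:

def two_input_generator(X1, X2, y, batch_size):
    # Yields ([input_1_batch, input_2_batch], target_batch) tuples forever,
    # matching the layout expected by a two-input Keras model.
    n = X1.shape[0]
    while 1:
        for i in range(0, n, batch_size):
            yield [X1[i:i + batch_size], X2[i:i + batch_size]], y[i:i + batch_size]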
The problem I'm facing is that Keras's fit_generator is good for processing images whose collective size is larger than RAM, but what if those files are actually not in an image format? For example, I've taken a huge number of images (500k) and run them through a pre-trained Inception v3 model to extract features. Now each of those files is nothing but a (1, 384, 8, 8) array stored as an .npy file. Any idea how I can use fit_generator to read them in batches, since collectively they won't fit in my RAM and the built-in generators apparently don't recognize anything other than image files?
@tanayz It would be exactly the same as if they were images instead of pickled/numpy data files:
- slice the loaded data so that len(slice) == batch_size,
- i.e. so that shape[0] == batch_size, and yield the data,
- handling the case where batch_size is not a multiple of the number of files, such that the generator will always yield batch_size examples.

A sketch along these lines follows below.

@seasonwang I'm afraid I haven't tried that out before - sorry for the late reply!
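Expanding on the bullet points above with a concrete sketch (not @wongjingping's code; the file layout and names are hypothetical, and each .npy file is assumed to hold one (1, 384, 8, 8) feature array with the labels kept in a separate array):

import numpy as np

def npy_feature_generator(file_paths, labels, batch_size):
    # file_paths: list of .npy paths; labels: array aligned with file_paths
    n = len(file_paths)
    while 1:
        for start in range(0, n, batch_size):
            batch_paths = file_paths[start:start + batch_size]
            # Each file holds a (1, 384, 8, 8) array; concatenating along axis 0
            # gives a (current_batch, 384, 8, 8) batch
            X = np.concatenate([np.load(p) for p in batch_paths], axis=0)
            y = labels[start:start + batch_size]
            # Note: the last slice is smaller than batch_size when n is not a
            # multiple of batch_size (the wrap-around case mentioned above)
            yield X, y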
Is there a way to use train_on_batch with a generator?
You mean?

for batch in generator:
    model.train_on_batch(batch)
Hi,
I have a question about using predict_generator: how do I ensure that the prediction is done on all test samples exactly once?
For example:
predictions = model.predict_generator(
    test_generator,
    steps=int(test_generator.samples / float(batch_size)),  # all samples once
    verbose=1,
    workers=2,
    max_q_size=10,
    pickle_safe=True
)
predicted_classes = np.argmax(predictions, axis=1)
true_classes = test_generator.classes
So the dimensions of predicted_classes and true_classes are different, since the total number of samples is not divisible by the batch size.
The size of my test set is not consistent, so the number of steps in predict_generator would change each time depending on the batch size. I am using flow_from_directory and cannot use predict_on_batch, since my data is organized in a directory structure.
One solution is running with a batch size of 1, but that makes it very slow.
I hope my question is clear. Thanks in advance.
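One possible workaround, sketched below (not a reply from this thread; it assumes flow_from_directory with shuffle=False and the same steps-style predict_generator call used above): round the step count up so the partial last batch is included, then truncate the predictions to the exact sample count.

import math
import numpy as np

# Round up so the final, smaller batch is also fed through the model
steps = int(math.ceil(test_generator.samples / float(batch_size)))
predictions = model.predict_generator(test_generator, steps=steps, verbose=1)

# Keep only one prediction per sample, in case the generator wrapped around
predictions = predictions[:test_generator.samples]
predicted_classes = np.argmax(predictions, axis=1)
true_classes = test_generator.classes  # now the same length as predicted_classes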
The comments and suggestions in this issue and its cousin #1638 were very helpful for me to efficiently process large numbers of images. I wrote it all up in a tutorial fashion that I hope can help others.
Hello,
I am trying to use model.fit_generator with a custom Callback that tries to access the validation data. However, whatever I do, the validation data accessed from within the Callback always equates to None.
class RecallMetrics(Callback):
    def on_train_begin(self, logs=None):
        print('RecallMetrics ... validating')
        self.val_f1s = []
        self.val_recalls = []
        self.val_precisions = []

    def on_epoch_end(self, epoch, logs=None):
        x = self.validation_data[0]
        if x is None:
            print('Error: validation_data is None')
            return
        else:
            val_predict = (np.asarray(self.model.predict(self.validation_data[0]))).round()
            val_targ = self.validation_data[1]
            _val_f1 = f1_score(val_targ, val_predict)
            _val_recall = recall_score(val_targ, val_predict)
            _val_precision = precision_score(val_targ, val_predict)
            self.val_f1s.append(_val_f1)
            self.val_recalls.append(_val_recall)
            self.val_precisions.append(_val_precision)
            print(" — val_f1: % f — val_precision: % f — val_recall % f" % (_val_f1, _val_precision, _val_recall))
        return
history = model.fit_generator(generator=train_gen,
                              validation_data=validate_gen,
                              # validation_data=None,
                              steps_per_epoch=len(train_file_list),
                              validation_steps=len(val_file_list) * 3,
                              verbose=2,
                              epochs=int(tc.config["LUNA16"]["epochs"]),
                              callbacks=callbacks,
                              workers=multiprocessing.cpu_count(),
                              use_multiprocessing=True)
How can I access validation data from a custom Callback when using fit_generator?
Best,
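A common workaround, sketched below (not from this thread): when validation_data is a generator, Keras does not populate self.validation_data in callbacks, so one option is to hand the callback a fixed validation set explicitly. The names mirror the snippet above but are otherwise hypothetical:

import numpy as np
from keras.callbacks import Callback
from sklearn.metrics import f1_score

class RecallMetricsExplicit(Callback):
    def __init__(self, val_x, val_y):
        # Keep our own reference, since self.validation_data stays None
        # when fit_generator receives a validation generator.
        super(RecallMetricsExplicit, self).__init__()
        self.val_x = val_x
        self.val_y = val_y
        self.val_f1s = []

    def on_epoch_end(self, epoch, logs=None):
        val_predict = np.asarray(self.model.predict(self.val_x)).round()
        val_f1 = f1_score(self.val_y, val_predict)
        self.val_f1s.append(val_f1)
        print(" - val_f1: %f" % val_f1)

# Usage sketch: draw one fixed batch from the validation generator up front
# val_x, val_y = next(validate_gen)
# callbacks.append(RecallMetricsExplicit(val_x, val_y))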
You mean?
for batch in generator: model.train_on_batch(batch)
Hi,
I tried using this but got the following error:
dloss_real = disc.train_on_batch(dataBatch, valid)
File "/home/macman/miniconda3/lib/python3.7/site-packages/keras/engine/training.py", line 1211, in train_on_batch
class_weight=class_weight)
File "/home/macman/miniconda3/lib/python3.7/site-packages/keras/engine/training.py", line 751, in _standardize_user_data
exception_prefix='input')
File "/home/macman/miniconda3/lib/python3.7/site-packages/keras/engine/training_utils.py", line 92, in standardize_input_data
data = [standardize_single_array(x) for x in data]
File "/home/macman/miniconda3/lib/python3.7/site-packages/keras/engine/training_utils.py", line 92, in <listcomp>
data = [standardize_single_array(x) for x in data]
File "/home/macman/miniconda3/lib/python3.7/site-packages/keras/engine/training_utils.py", line 27, in standardize_single_array
elif x.ndim == 1:
AttributeError: 'tuple' object has no attribute 'ndim'
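The traceback above suggests that train_on_batch is being handed the whole (x, y) tuple yielded by the generator as its first argument. A minimal sketch of a fix, assuming dataBatch, disc, and valid are the names from the traceback and that the generator yields (images, labels) tuples:

# dataBatch comes from a generator yielding (images, labels) tuples;
# take just the image array before calling train_on_batch, since
# train_on_batch expects inputs and targets as separate arguments.
x_batch, _ = dataBatch
dloss_real = disc.train_on_batch(x_batch, valid)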