Keras: For large datasets, which to use: fit or train_on_batch?

Created on 12 May 2016 · 24 comments · Source: keras-team/keras

I have a large dataset that does not fit into memory. I've coded a custom class that yields ~10K images + labels at a time. Looking at the Keras documentation, I see that train_on_batch is recommended.

However, in #68, I see that @fchollet is using fit. Right now, my code looks like:

for epoch in np.arange(0, conf["epochs"]):
    print("[PARENT EPOCH] epoch {}...".format(epoch + 1))
    for (images, labels) in trainDG.nextBatch():
        model.fit(
            images, labels,
            batch_size=conf["batch_size"],
            nb_epoch=1,
            verbose=conf["verbose"])

This is to replicate the behavior that fchollet recommended in #68. However, I'm wondering if I should instead be using train_on_batch like the documentation recommends:

for epoch in np.arange(0, conf["epochs"]):
    print("[PARENT EPOCH] epoch {}...".format(epoch + 1))
    for (images, labels) in trainDG.nextBatch():
        model.train_on_batch(images, labels)

Which function, fit or train_on_batch, is more appropriate for this situation?

Most helpful comment

I think the best way to fit a large dataset that cannot fit into memory is to write a custom generator. It yields (images, labels) one batch of samples at a time. You can design your own loading mechanism to suit your memory constraints and use fit_generator as in the example.

def generate_arrays_from_file(path):
    while 1:
        f = open(path)
        for line in f:
            # create numpy arrays of input data
            # and labels, from each line in the file
            x, y = process_line(line)
            img = load_images(x)
            yield (img, y)
        f.close()

model.fit_generator(generate_arrays_from_file('/my_file.txt'),
        samples_per_epoch=10000, nb_epoch=10)

All 24 comments

I think the best way to fit a large dataset that cannot fit into memory is to write a custom generator. It yields (images, labels) one batch of samples at a time. You can design your own loading mechanism to suit your memory constraints and use fit_generator as in the example.

def generate_arrays_from_file(path):
    while 1:
        f = open(path)
        for line in f:
            # create numpy arrays of input data
            # and labels, from each line in the file
            x, y = process_line(line)
            img = load_images(x)
            yield (img, y)
        f.close()

model.fit_generator(generate_arrays_from_file('/my_file.txt'),
        samples_per_epoch=10000, nb_epoch=10)

Thanks for the response @joelthchao. I have already written a custom generator. It accesses images + labels that have been serialized to an HDF5 dataset. I can easily modify it to return single images as well; that's not the issue.

My question is whether fit, train_on_batch, or, as you suggested, fit_generator is the correct function to use here. The reason I wouldn't use fit_generator is that my validation data doesn't fit in memory either, so I have custom code to determine accuracy/loss on the validation data as well.

With fit_generator, you can use a generator for the validation data as well. In general I would recommend using fit_generator, but using train_on_batch works fine too. These methods only exist for the sake of convenience in different use cases; there is no "correct" method.
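
For illustration, a minimal sketch of that combination using the Keras 1.x fit_generator signature from the comment above (the file paths and sample counts here are made up):

train_gen = generate_arrays_from_file('/my_train_file.txt')
val_gen = generate_arrays_from_file('/my_val_file.txt')

model.fit_generator(train_gen,
        samples_per_epoch=10000, nb_epoch=10,
        validation_data=val_gen, nb_val_samples=2000)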


Thanks for the clarification @fchollet! I didn't realize that train_on_batch was simply a convenience function.

I am confused. If fit_generator does not require batch_size, then how is the shape of the input layer determined?

Like @jadielam, we too have problems using the batch size in the generator: we read X and y from two different CSV files and, as in the example in the Keras docs, we yield one sample at a time from the generator, but then fit_generator raises an exception: Exception: Error when checking model input: expected dense_input_2 to have shape (None, 19) but got array with shape (19, 1).

My guess is that fit_generator asks the generator for one sample at a time (batch size = 1).

Our generator's code:

def getData(X_path, y_path):
    while 1:
        with open(X_path, "rb") as csv1, open(y_path, "rb") as csv2:
            reader1 = csv.reader(csv1, delimiter=',')
            reader2 = csv.reader(csv2, delimiter=',')
            for row in zip(reader1, reader2):
                yield (np.array(row[0], dtype=np.float),
                       np.array(row[1], dtype=np.float))
            csv1.close()
            csv2.close()
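
For what it's worth, fit_generator treats whatever the generator yields as one whole batch, so the arrays need a leading batch axis. Below is a minimal sketch of batching the CSV rows; the batch size of 32 is arbitrary, and the 19-feature rows are an assumption based on the error message above:

import csv

import numpy as np

def getBatches(X_path, y_path, batch_size=32):
    while 1:
        with open(X_path) as csv1, open(y_path) as csv2:
            reader1 = csv.reader(csv1, delimiter=',')
            reader2 = csv.reader(csv2, delimiter=',')
            xs, ys = [], []
            for x_row, y_row in zip(reader1, reader2):
                xs.append(np.array(x_row, dtype='float32'))
                ys.append(np.array(y_row, dtype='float32'))
                if len(xs) == batch_size:
                    # each yield is one batch: x has shape (batch_size, 19)
                    yield np.array(xs), np.array(ys)
                    xs, ys = [], []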

Hey @dmdigital, just to make sure: have you called keras.backend.set_image_dim_ordering? If you're using Theano it should be called with "th", and with TensorFlow, "tf". I say this because I got the same error you got before calling this function.

To more directly answer the question, I believe (when working) fit_generator uses something like len(generator.next()) to determine the batch size.

@CosineP Thanks for your answer. No, we don't use keras.backend.set_image_dim_ordering.

@joelthchao commented on May 12, 2016, 4:51 PM GMT+2:

I think the best way to fit a large dataset that cannot fit into memory is to write a custom generator. It yields (images, labels) one batch of samples at a time. You can design your own loading mechanism to suit your memory constraints and use fit_generator as in the example.

def generate_arrays_from_file(path):
    while 1:
        f = open(path)
        for line in f:
            # create numpy arrays of input data
            # and labels, from each line in the file
            x, y = process_line(line)
            img = load_images(x)
            yield (img, y)
        f.close()

model.fit_generator(generate_arrays_from_file('/my_file.txt'),
        samples_per_epoch=10000, nb_epoch=10)

This seems, at a glance, too inefficient, since it has to read every image individually from disk every time (even though a 'batch' fits in memory).

I'm curious how it would work with a generator for data augmentation. Something like generator composition? Or constructing an ImageDataGenerator inside the custom generator and somehow requesting new images from it and yielding those?

@Enamex In the current Keras version, ImageDataGenerator has flow_from_directory, which is quite convenient. It combines an iterator with data augmentation.
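
As a rough sketch of that combination (Keras 2-style arguments; the directory layout and augmentation settings here are just placeholders):

from keras.preprocessing.image import ImageDataGenerator

# augmentation + on-the-fly loading from class subdirectories
datagen = ImageDataGenerator(rescale=1. / 255,
                             rotation_range=10,
                             horizontal_flip=True)

train_gen = datagen.flow_from_directory('data/train',
                                        target_size=(224, 224),
                                        batch_size=32,
                                        class_mode='categorical')

model.fit_generator(train_gen,
                    steps_per_epoch=train_gen.samples // 32,
                    epochs=10)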

Can someone point me to a complete example that does all of the following?

  • Fits batched (and pickled) data in a loop using train_on_batch()
  • Sets aside data from each batch for validation purposes
  • Sets aside test data for accuracy evaluation after all batches have been processed (see last line of my example below).

I'm finding lots of 1 - 5 line code snippets on the internet illustrating how to call train_on_batch() or fit_generator(), but so far nothing that clearly illustrates how to separate out and handle both validation and test data while using train_on_batch().

F. Chollet's great example Cifar10_cnn (https://github.com/fchollet/keras/blob/master/examples/cifar10_cnn.py) does not illustrate all of the points I listed above.

You can say, "Hey, handling test data is your problem. Do it manually." Fine! But I don't understand what these routines do well enough to even know if that is necessary. They are mostly black boxes, and for all I know, they handle validation and test data automagically under the hood. My hope is that a more complete example would clear up the confusion.

For instance, in the example below where I read batches iteratively from pickle files, how would I modify the call to train_on_batch to handle validation_data? How do I set aside test data (test_x & test_y) for purposes of evaluating accuracy at the end of the algorithm?

while 1:
    try:
        batch = np.array(pickle.load(fvecs))
        polarities = np.array(pickle.load(fpols)) 

        # Divide a batch of 1000 documents (movie reviews) into:
        # 800 rows of training data, and
        # 200 rows of test (validation?) data
        train_x, val_x, train_y, val_y = train_test_split(batch, polarities, test_size=0.2)

        doc_size = 30
        x_batch = pad_sequences(train_x, maxlen=doc_size)
        y_batch = train_y

        # Fit the model 
        model.train_on_batch(x_batch, y_batch)
        # model.fit(train_x, train_y, validation_data=(val_x, val_y), epochs=2, batch_size=800, verbose=2)

    except EOFError:
        print("EOF detected.")
        break

# Final evaluation of the model
scores = model.evaluate(test_x, test_y, verbose=0)
print("Accuracy: %.2f%%" % (scores[1] * 100))

@pluviosilla Would you not call model.predict on validation data? And use the model's predictions versus the validation outputs to determine accuracy?

train_on_batch seems to be a lot more convenient than fit_generator, but I can't get my model to learn using it. It is probably a data problem, but it would be nice to have a working example to compare against in order to confirm that.

@LRonHubs, I probably misconstrue all the terms here, but when I hear "validation data", I think of "cross validation", the procedure used to tweak and tune model parameters. That is NOT test data and I got the idea somewhere that the two should not be mixed.

So let me put it in the form of a question: what does the Keras community mean by "validation data"? Cross validation data for purposes of parameter tuning, or test data for purposes of measuring accuracy?

If the latter, I don't understand why it would be acceptable to test accuracy at the batch level while you are still fitting the model.

numpy has a memmap function which lets you use data from a file as if it were a regular ndarray, but only loads the necessary chunks of the file into memory. I am wondering whether fit should be used with this method, or whether I should still use fit_generator? Does anyone have an opinion? Thanks.
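
For context, a rough sketch of the memmap idea being asked about (the file names, shapes, and dtypes are invented, and model is assumed to be an already-compiled Keras model). The memory-mapped arrays behave like ordinary ndarrays, so fit can index into them batch by batch without loading the whole file up front:

import numpy as np

# memory-mapped views of arrays stored on disk
X = np.memmap('features.dat', dtype='float32', mode='r', shape=(1000000, 128))
y = np.memmap('labels.dat', dtype='float32', mode='r', shape=(1000000,))

# fit slices the memmaps per batch instead of loading everything at once
model.fit(X, y, batch_size=64, epochs=5)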

Does train_on_batch() load the model and the mini-batch onto the GPU and, after one gradient update, remove them from the GPU? In other words, if I use train_on_batch(), would it keep loading and unloading the model on the GPU for each iteration?

+1

@pluviosilla did you ever find any good examples? I am having a similar problem. I've designed a bucket generator function where each bucket has batch size N and each batch has input length T. In other words, my batch sizes and time steps vary from bucket to bucket. Thus, when using model.fit, the number of time steps is inferred from the batch size of the particular bucket. But since train_on_batch takes no such argument, there seems to be no way for me to use it.

I assume train_on_batch is similar to training on a single given batch for a single epoch. Using both fit and, alternatively, train_on_batch, I got different loss outcomes, which should be obvious given the lack of the batch-length argument. I'm going to have to carry on with fit, but if anyone has ideas on how to adapt train_on_batch, I'd love to hear them. This has been another helpful post: https://github.com/keras-team/keras/issues/2539

Hello, how are you? I apologize for the inconvenience; could you help me? I'm trying to create my own generator based on the comments above. However, when I apply model.fit_generator, I realize that my network does not use batch_size. For example, if I have 32676 images and a batch_size of 64, I should see 510 iterations per epoch, but my network runs 32676 iterations per epoch. My dataset is large, with two-channel images, so I need to create my own generator. I cannot use ImageDataGenerator, flow_from_directory, and model.fit_generator directly from Keras, because my images have two channels and those commands only work with 1- and 3-channel images. Would it be possible for you to help me?

I also wrote a generator for validation; that's why I use validationGenerator().

I send you my own generator:

######################## Generator ##################################

import os

import imageio
import numpy as np
from keras.utils import np_utils
from skimage.transform import resize
from tqdm import tqdm

def trainingGenerator():
    train_Class1_dir = '/media/HD500/RGB_MIN/train/Class1'
    train_Class2_dir = '/media/HD500/RGB_MIN/train/Class2'

    ############################ Class1 ###############################
    X_trainP = []
    trainP_ids = next(os.walk(train_Class1_dir))[2]
    for n, id_ in tqdm(enumerate(trainP_ids), total=len(trainP_ids)):
        treinamento = train_Class1_dir + '/' + id_
        X_trainP.append(treinamento)
    Y_trainP = np.ones((len(X_trainP), 1), dtype=np.uint8)

    ############################ Class 2 ###########################
    X_trainPN = []
    trainPN_ids = next(os.walk(train_Class2_dir))[2]
    for n, id_ in tqdm(enumerate(trainPN_ids), total=len(trainPN_ids)):
        treinamento = train_Class2_dir + '/' + id_
        X_trainPN.append(treinamento)
    Y_trainPN = np.zeros((len(X_trainPN), 1), dtype=np.uint8)

    ############ Dataset of Train ########################
    X_trainFinal = X_trainP + X_trainPN
    Y_train = np.concatenate((Y_trainP, Y_trainPN), axis=0)
    num_classes = np.unique(Y_train).shape[0]
    Y_train = np_utils.to_categorical(Y_train, num_classes)  # One-hot encode the labels

    ########################### Image #############################
    img_width, img_height, img_channels = 227, 227, 4
    X_train = np.zeros((len(X_trainFinal), img_width, img_height, img_channels), dtype=np.uint8)
    for n, path1 in tqdm(enumerate(X_trainFinal), total=len(X_trainFinal)):
        path = path1
        img = imageio.imread(path)[:, :, :img_channels]
        img = resize(img, (img_height, img_width), mode='constant', preserve_range=True)
        X_train[n] = img

    batch_size = 64
    X_train = X_train.astype('float32')
    X_train /= 255  # scale pixel values to [0, 1]
    while 1:
        for i in range(len(X_train) // batch_size):
            yield (X_train[i * batch_size:(i + 1) * batch_size],
                   Y_train[i * batch_size:(i + 1) * batch_size])

MyTrainingGenerator = trainingGenerator()
MyValidationGenerator = validationGenerator()

Results_Train = model.fit_generator(MyTrainingGenerator,
                                    steps_per_epoch=nb_train_samples // batch_size,
                                    epochs=num_epochs,
                                    validation_data=MyValidationGenerator,
                                    validation_steps=nb_validation_samples // batch_size,
                                    callbacks=[History, checkpointer, csv_logger],
                                    verbose=1)

@pluviosilla did you ever find any good examples? I am having a similar problem.

@guidomocha I ended up writing a tutorial on .fit vs. .fit_generator, including how to write your own custom Keras data generators:

https://www.pyimagesearch.com/2018/12/24/how-to-use-keras-fit-and-fit_generator-a-hands-on-tutorial/

I hope you find it helpful!

thank you @jrosebr1! I read your tutorial, but I still have one big question on this. Do you know if optimizers (Adam, Adagrad, etc.) correctly update learning rates when using train_on_batch from one call to the next? Do the optimizers change step sizes the same way they would when calling fit? I've had a lot of stability issues in the past when using train_on_batch, which seems to be the required way of using Keras for reinforcement learning applications where you need custom control over both the forward and backward passes, and the solution has always been to turn down the initial learning rate, even when the instability may not appear for several million timesteps. Or else people seem to build in clunky explicit learning-rate decay methods. It just seems like something may not be quite right with the behavior of these optimizers when used with train_on_batch, and I was curious whether you might have any insight there.

I'm currently trying to decide which of these two routes to take with a non-RL project, but for various reasons I'm stuck working in Windows for this one, and I know per #10842 and Stack Overflow that multiprocessing may have issues with fit_generator on Windows. That has me leaning towards a solution with train_on_batch and a RabbitMQ queuing system in order to pre-fetch the next batch while the current one trains. fit_generator would be much simpler if I could just hop over to Linux, but my DB is on Windows, and migrating it or putting it on another machine is a process I'd rather avoid, especially since it's quite large.

@M00NSH0T I'm not sure off the top of my head but I don't see why the adaptive optimizers wouldn't work correctly. That said, I do very little work in reinforcement learning so it's very likely I've never experienced the problem you are describing. I also haven't used Windows in a good 10+ years at this point so I unfortunately cannot comment on that either.

If you're going to be using a message-passing approach, I would suggest either RabbitMQ or ZeroMQ and then build your producer/consumer relationship around them. I'm not sure how your machine, DB, network, etc. are set up, but you _could_ run into latency issues where the queuing cannot keep up with the network's demand for training data. Disk space is cheap, so you might want to consider two queues:

  1. One that uses RabbitMQ or ZeroMQ to fetch the data for _M_ batches and writes batches to disk
  2. A custom fit_generator that yields _N_ batches from the disk where M >> N

I hope that helps!
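
A minimal sketch of the consumer side of that two-queue idea (assuming a producer process writes batch_*.npz files, each containing 'x' and 'y' arrays, into a spool directory; the paths and step counts are placeholders):

import glob
import time

import numpy as np

def batches_from_spool(spool_dir):
    # yield (x, y) batches that the producer process has already written to disk
    while True:
        paths = sorted(glob.glob(spool_dir + '/batch_*.npz'))
        if not paths:
            time.sleep(1)  # spool empty: wait for the producer to catch up
            continue
        for path in paths:
            data = np.load(path)
            yield data['x'], data['y']

model.fit_generator(batches_from_spool('/data/spool'),
                    steps_per_epoch=500,
                    epochs=10)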

x, y = process_line(line)
img = load_images(x)
yield (img, y)
f.close()

Surely you return those x and y values, right?
