Keras: Feature Request: Add validation_split to fit_generator

Created on 28 Sep 2016 · 27 comments · Source: keras-team/keras

I'm fairly new to Keras, but it seems that the best way to do validation with fit_generator is to create two separate generators for training and validation data, which requires some careful orchestration. I propose adding a validation_split parameter to model.fit_generator that behaves like the same parameter in model.fit.

For example with validation_split=0.1, fit_generator would train from the generator until it has seen (0.9 * samples_per_epoch) samples and then perform validation from the same generator until the end of the epoch.
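A hypothetical call under this proposal might look like the sketch below. Note that the validation_split argument to fit_generator does not exist; it is exactly what this issue is requesting (the other argument names follow the Keras 1 API of the time):

# Hypothetical usage of the PROPOSED feature -- validation_split is
# not an actual fit_generator argument.
model.fit_generator(
    data_generator,
    samples_per_epoch=10000,  # first 9000 samples per epoch are trained on,
    nb_epoch=10,
    validation_split=0.1)     # ...the remaining 1000 are used for validation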


All 27 comments

That is only an option if the generator guarantees the same iteration order.

+1, I would really like to see this, and would be willing to implement it.

Wouldn't this be pretty simple?

I'm thinking you could just add the parameters validation_split=0. and validation_data=None to ImageDataGenerator.flow(*) and ImageDataGenerator.flow_from_directory(*), the same way we do with Model.fit(*).

Then we could split the train and validation data the same way it's done in Model.fit(*) here: https://github.com/fchollet/keras/blob/master/keras/engine/training.py#L1035.

We would then have separate train and validation datasets and could return a tuple of iterators as (train_generator, validation_generator), where validation_generator is None whenever both optional parameters keep their defaults (validation_split=0., validation_data=None), so backward compatibility is not broken.
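For reference, the split at the linked line is roughly the following (a simplified sketch of the logic in Model.fit, not the verbatim source; it assumes x and y are NumPy arrays):

# Model.fit treats the LAST fraction of the arrays, in their existing
# order, as validation data -- no shuffling happens before the split.
split_at = int(len(x) * (1. - validation_split))
x_train, x_val = x[:split_at], x[split_at:]
y_train, y_val = y[:split_at], y[split_at:]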

+1

+1

+1

+1

+1

+1

+1

+1

+1, but I understand why it hasn't been done yet, and I appreciate the attention to not allowing things that shouldn't be done. If the validation data is not a consistently separate dataset, you will in the worst case overfit from epoch to epoch, and in the best case overfit in a macro sense from training run to training run as you tune.

One idea would be a ledger somewhere that specifies which set each file belongs to and persists across training sessions, like a dot file (.kerasdata) in the img/ root next to all the category folders; a sketch of this follows below. This is not likely the best implementation idea, but (for everyone upvoting) think about the root problem and contribute a solution idea. It's not as simple as it seems, at least to solve this the right way.
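A minimal sketch of that ledger idea (the .kerasdata name and the JSON format are illustrative assumptions, not an existing Keras convention):

import json
import os
import random

def load_or_create_ledger(img_root, validation_split=0.1, seed=1):
    """Persist a file -> split assignment so it survives across runs."""
    ledger_path = os.path.join(img_root, '.kerasdata')  # illustrative name
    if os.path.exists(ledger_path):
        with open(ledger_path) as f:
            return json.load(f)  # e.g. {"cats/1.jpg": "validation", ...}
    rng = random.Random(seed)
    ledger = {}
    for category in os.listdir(img_root):
        cat_dir = os.path.join(img_root, category)
        if not os.path.isdir(cat_dir):
            continue
        for fname in os.listdir(cat_dir):
            rel = os.path.join(category, fname)
            ledger[rel] = ('validation' if rng.random() < validation_split
                           else 'training')
    with open(ledger_path, 'w') as f:
        json.dump(ledger, f)
    return ledger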

+1

+1

+1

+1

+1

+1

+1

+1

+1, so I would like to tackle this problem. It would be a very useful feature.

I have two approaches:

1. Hold the validation data's IDs (maybe as hashes).
This is similar to @brittohalloran's idea: duplicate the train generator (one copy for training, one for validation) and have each one yield a sample only if it belongs to its split. With this approach you could simply pass validation_split to fit_generator; however, it is too slow. (A sketch of this hash-based filtering follows the code example below.)

2. Make a utility that splits a directory into train_dir and validation_dir.
This is just a simple idea: create two temporary directories (whose contents are aliases to the original data) with tempfile.

Unlike the approach above, you have to preprocess with train_valid_split, but you can customize each generator. This is the usage I imagine:

from keras.preprocessing.image import ImageDataGenerator
import keras

# train_valid_split is the PROPOSED utility: all data in train_dir and
# val_dir are aliases to the original data, and both are temporary directories.
train_dir, val_dir = keras.utils.train_valid_split(original_dir, 0.1)

train_datagen = ImageDataGenerator(
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True)

val_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
        train_dir,
        target_size=(150, 150),
        batch_size=32,
        class_mode='binary')

validation_generator = val_datagen.flow_from_directory(
        val_dir,
        target_size=(150, 150),
        batch_size=32,
        class_mode='binary')
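For comparison, a minimal sketch of the hash-based filtering in approach 1 (belongs_to_validation and split_stream are hypothetical helpers, not Keras API):

import zlib

def belongs_to_validation(sample_id, validation_split=0.1):
    """Deterministically assign a sample to the validation split by hashing
    its identifier, so the assignment is stable across epochs and runs."""
    bucket = zlib.crc32(sample_id.encode('utf-8')) % 100
    return bucket < int(validation_split * 100)

def split_stream(generator_fn, validation_split=0.1):
    """Duplicate one source of (sample_id, x, y) items into two generators.
    Both generators must scan every sample and discard the ones that do not
    belong to them, which is the slowness mentioned above."""
    def train_gen():
        for sample_id, x, y in generator_fn():
            if not belongs_to_validation(sample_id, validation_split):
                yield x, y
    def val_gen():
        for sample_id, x, y in generator_fn():
            if belongs_to_validation(sample_id, validation_split):
                yield x, y
    return train_gen(), val_gen()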

I'd like to get your feedback.

+1

Maybe this issue should be closed.
We can now use ImageDataGenerator with validation_split:
https://github.com/keras-team/keras/pull/9745

@kouml @fchollet I don't follow how the conversation on https://github.com/keras-team/keras/pull/9745 applies to newer Keras versions.
Does model.fit_generator() now respect ImageDataGenerator.validation_split? I'm not observing this in this example: https://www.kaggle.com/morenoh149/keras-imagedatagenerator-validation-split

UPDATE: I see from other examples that you should still build a train_generator and a validation_generator configured against different directories. Sadly, Kaggle limits the number of files you can copy onto disk, so I'll have to figure out a way to read the data into train/validation tensors and do the image preprocessing myself. (A sketch of doing that split in memory follows below.)
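One way to do that in-memory split (a sketch under the assumption that the images and labels are already loaded into NumPy arrays x and y, and that scikit-learn is available for the split; model is whatever compiled Keras model you are training):

import numpy as np
from sklearn.model_selection import train_test_split
from keras.preprocessing.image import ImageDataGenerator

# x (images) and y (labels) are assumed to be NumPy arrays loaded by
# whatever means the platform allows.
x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.2, random_state=1)

train_datagen = ImageDataGenerator(rescale=1./255, horizontal_flip=True)
val_datagen = ImageDataGenerator(rescale=1./255)  # no augmentation

train_gen = train_datagen.flow(x_train, y_train, batch_size=32)
val_gen = val_datagen.flow(x_val, y_val, batch_size=32)

model.fit_generator(train_gen, validation_data=val_gen, epochs=10)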

So, to summarize:

datagen = ImageDataGenerator(validation_split=0.2, rescale=1./255)

train_gen = datagen.flow_from_directory(
        data,
        target_size=(150, 150),  # (height, width); channels are not part of target_size
        subset='training')

val_gen = datagen.flow_from_directory(
        data,
        target_size=(150, 150),
        subset='validation')

model.fit_generator(train_gen, validation_data=val_gen)

Right?
Is there a way to add data augmentation to the training data but not to the validation data?

@kampta Use another datagen for the validation data generator; see https://github.com/ClaudeCoulombe/deep-learning-with-python-notebooks/blob/master/5.2-using-convnets-with-small-datasets.ipynb (note that the validation data should not be augmented!).
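Concretely, that can look like the sketch below: both generators get the same validation_split, so the subsets partition the directory consistently (the split is computed deterministically over the sorted file list), and only the training generator gets augmentation. The augmentation parameters here are illustrative.

from keras.preprocessing.image import ImageDataGenerator

# Two generators with the SAME validation_split: the 'training' subset of
# one and the 'validation' subset of the other do not overlap.
train_datagen = ImageDataGenerator(
    rescale=1./255,
    validation_split=0.2,
    shear_range=0.2,        # augmentation on training data only
    zoom_range=0.2,
    horizontal_flip=True)

val_datagen = ImageDataGenerator(
    rescale=1./255,
    validation_split=0.2)   # same split, but no augmentation

train_gen = train_datagen.flow_from_directory(
    data, target_size=(150, 150), subset='training')
val_gen = val_datagen.flow_from_directory(
    data, target_size=(150, 150), subset='validation')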

@kampta @morenoh149
This is a late reply, but I have posted the approach I use in my repo (https://github.com/kouml/keras-split-utils).
When you have just one hierarchical directory, you can split it virtually into a train dir and a valid dir, and you can use different settings for each dataset.

import split_utils
from keras.preprocessing.image import ImageDataGenerator

original_dir = './data/'
batch_size = 32
validation_split = 0.1

# All data in train_dir and val_dir are aliases to the original data
# (both are temporary directories).
# Don't clear base_dir: it holds the temp directories.
base_dir, train_dir, val_dir = split_utils.train_valid_split(original_dir, validation_split, seed=1)

# generator for train data
train_datagen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

train_gen = train_datagen.flow_from_directory(
    train_dir,
    target_size=(28, 28),
    batch_size=batch_size,
    color_mode='grayscale'
)

# generator for validation data
val_datagen = ImageDataGenerator(rescale=1./255)

val_gen = val_datagen.flow_from_directory(
    val_dir,
    target_size=(28, 28),
    batch_size=batch_size,
    color_mode='grayscale'
)

print('the ratio of validation_split is {}'.format(validation_split))
print('the size of train_dir is {}'.format(train_gen.n))
print('the size of val_dir is {}'.format(val_gen.n))

There is one case where the proposed feature would be very useful: when there is more than enough data, so that no sample is ever reused during training (in which case the overfitting concern raised above does not apply).

