Keras: Feature Request: Add validation_split to fit_generator

Created on 28 Sep 2016 · 27 comments · Source: keras-team/keras

I'm fairly new to Keras, but it seems that the best way to do validation with fit_generator is to create two separate generators for training and validation data, which requires some careful orchestration. I propose adding a validation_split parameter to model.fit_generator that behaves like the same parameter in model.fit.

For example with validation_split=0.1, fit_generator would train from the generator until it has seen (0.9 * samples_per_epoch) samples and then perform validation from the same generator until the end of the epoch.
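A hypothetical call under this proposal might look like the sketch below. Note that the validation_split argument to fit_generator does not exist; it is exactly what this issue is requesting (the other argument names follow the Keras 1 API of the time):

# Hypothetical usage of the PROPOSED feature -- validation_split is
# not an actual fit_generator argument.
model.fit_generator(
    data_generator,
    samples_per_epoch=10000,  # first 9000 samples per epoch are trained on,
    nb_epoch=10,
    validation_split=0.1)     # ...the remaining 1000 are used for validation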


All 27 comments

That is only an option if the generator guarantees the same iteration order.

+1, I would really like to see this, and would be willing to implement it.

Wouldn't this be pretty simple?

I'm thinking you could just add the parameters validation_split=0. and validation_data=None to ImageDataGenerator.flow(*) and ImageDataGenerator.flow_from_directory(*), the same way we do with Model.fit(*).

Then we could split the train and validation data the same way it's done in Model.fit(*) here: https://github.com/fchollet/keras/blob/master/keras/engine/training.py#L1035.

We would then have separate train and validation datasets and could return a tuple of iterators as (train_generator, validation_generator), where validation_generator is None whenever both optional parameters keep their defaults (validation_split=0., validation_data=None), so backward compatibility is not broken.
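For reference, the split at the linked line is roughly the following (a simplified sketch of the logic in Model.fit, not the verbatim source; it assumes x and y are NumPy arrays):

# Model.fit treats the LAST fraction of the arrays, in their existing
# order, as validation data -- no shuffling happens before the split.
split_at = int(len(x) * (1. - validation_split))
x_train, x_val = x[:split_at], x[split_at:]
y_train, y_val = y[:split_at], y[split_at:]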

+1

+1

+1

+1

+1

+1

+1

+1

+1, but I understand why it hasn't been done yet, and I appreciate the attention to not allowing things that shouldn't be done. If the validation data is not a consistently separate dataset, you will in the worst case overfit from epoch to epoch, and in the best case overfit in a macro sense from training run to training run as you tune.

One idea would be a ledger somewhere that specifies which set each file belongs to and persists across training sessions, like a dot file (.kerasdata) in the img/ root next to all the category folders; a sketch of this follows below. This is not likely the best implementation idea, but (for everyone upvoting) think about the root problem and contribute a solution idea. It's not as simple as it seems, at least to solve this the right way.
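A minimal sketch of that ledger idea (the .kerasdata name and the JSON format are illustrative assumptions, not an existing Keras convention):

import json
import os
import random

def load_or_create_ledger(img_root, validation_split=0.1, seed=1):
    """Persist a file -> split assignment so it survives across runs."""
    ledger_path = os.path.join(img_root, '.kerasdata')  # illustrative name
    if os.path.exists(ledger_path):
        with open(ledger_path) as f:
            return json.load(f)  # e.g. {"cats/1.jpg": "validation", ...}
    rng = random.Random(seed)
    ledger = {}
    for category in os.listdir(img_root):
        cat_dir = os.path.join(img_root, category)
        if not os.path.isdir(cat_dir):
            continue
        for fname in os.listdir(cat_dir):
            rel = os.path.join(category, fname)
            ledger[rel] = ('validation' if rng.random() < validation_split
                           else 'training')
    with open(ledger_path, 'w') as f:
        json.dump(ledger, f)
    return ledger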

+1

+1

+1

+1

+1

+1

+1

+1

+1, so I would like to tackle this problem. It would be a very useful feature.

I have two approaches:

1. Hold the validation data's IDs (maybe as hashes).
This is similar to @brittohalloran's idea: duplicate the train generator (one copy for training, one for validation) and have each one yield a sample only if it belongs to its split. With this approach you could simply pass validation_split to fit_generator; however, it is too slow. (A sketch of this hash-based filtering follows the code example below.)

2. Make a utility that splits a directory into train_dir and validation_dir.
This is just a simple idea: create two temporary directories (whose contents are aliases to the original data) with tempfile.

Unlike the approach above, you have to preprocess with train_valid_split, but you can customize each generator. This is the usage I imagine:

from keras.preprocessing.image import ImageDataGenerator
import keras

# train_valid_split is the PROPOSED utility: all data in train_dir and
# val_dir are aliases to the original data, and both are temporary directories.
train_dir, val_dir = keras.utils.train_valid_split(original_dir, 0.1)

train_datagen = ImageDataGenerator(
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True)

val_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
        train_dir,
        target_size=(150, 150),
        batch_size=32,
        class_mode='binary')

validation_generator = val_datagen.flow_from_directory(
        val_dir,
        target_size=(150, 150),
        batch_size=32,
        class_mode='binary')
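For comparison, a minimal sketch of the hash-based filtering in approach 1 (belongs_to_validation and split_stream are hypothetical helpers, not Keras API):

import zlib

def belongs_to_validation(sample_id, validation_split=0.1):
    """Deterministically assign a sample to the validation split by hashing
    its identifier, so the assignment is stable across epochs and runs."""
    bucket = zlib.crc32(sample_id.encode('utf-8')) % 100
    return bucket < int(validation_split * 100)

def split_stream(generator_fn, validation_split=0.1):
    """Duplicate one source of (sample_id, x, y) items into two generators.
    Both generators must scan every sample and discard the ones that do not
    belong to them, which is the slowness mentioned above."""
    def train_gen():
        for sample_id, x, y in generator_fn():
            if not belongs_to_validation(sample_id, validation_split):
                yield x, y
    def val_gen():
        for sample_id, x, y in generator_fn():
            if belongs_to_validation(sample_id, validation_split):
                yield x, y
    return train_gen(), val_gen()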

I'd like to get your feedback.

+1

Maybe this issue should be closed.
We can now use ImageDataGenerator with validation_split:
https://github.com/keras-team/keras/pull/9745

@kouml @fchollet I don't follow how the conversation on https://github.com/keras-team/keras/pull/9745 applies to newer Keras versions.
Does model.fit_generator() now respect ImageDataGenerator.validation_split? I'm not observing this in this example: https://www.kaggle.com/morenoh149/keras-imagedatagenerator-validation-split

UPDATE: I see from other examples that you should still build a train_generator and a validation_generator configured against different directories. Sadly, Kaggle limits the number of files you can copy onto disk, so I'll have to figure out a way to read the data into train/validation tensors and do the image preprocessing myself. (A sketch of doing that split in memory follows below.)
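One way to do that in-memory split (a sketch under the assumption that the images and labels are already loaded into NumPy arrays x and y, and that scikit-learn is available for the split; model is whatever compiled Keras model you are training):

import numpy as np
from sklearn.model_selection import train_test_split
from keras.preprocessing.image import ImageDataGenerator

# x (images) and y (labels) are assumed to be NumPy arrays loaded by
# whatever means the platform allows.
x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.2, random_state=1)

train_datagen = ImageDataGenerator(rescale=1./255, horizontal_flip=True)
val_datagen = ImageDataGenerator(rescale=1./255)  # no augmentation

train_gen = train_datagen.flow(x_train, y_train, batch_size=32)
val_gen = val_datagen.flow(x_val, y_val, batch_size=32)

model.fit_generator(train_gen, validation_data=val_gen, epochs=10)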

So, to summarize:

datagen = ImageDataGenerator(validation_split=0.2, rescale=1./255)

train_gen = datagen.flow_from_directory(
        data,
        target_size=(150, 150),  # (height, width); channels are not part of target_size
        subset='training')

val_gen = datagen.flow_from_directory(
        data,
        target_size=(150, 150),
        subset='validation')

model.fit_generator(train_gen, validation_data=val_gen)

Right?
Is there a way to add data augmentation to the training data but not to the validation data?

@kampta Use another datagen for the validation data generator; see https://github.com/ClaudeCoulombe/deep-learning-with-python-notebooks/blob/master/5.2-using-convnets-with-small-datasets.ipynb (note that the validation data should not be augmented!).
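Concretely, that can look like the sketch below: both generators get the same validation_split, so the subsets partition the directory consistently (the split is computed deterministically over the sorted file list), and only the training generator gets augmentation. The augmentation parameters here are illustrative.

from keras.preprocessing.image import ImageDataGenerator

# Two generators with the SAME validation_split: the 'training' subset of
# one and the 'validation' subset of the other do not overlap.
train_datagen = ImageDataGenerator(
    rescale=1./255,
    validation_split=0.2,
    shear_range=0.2,        # augmentation on training data only
    zoom_range=0.2,
    horizontal_flip=True)

val_datagen = ImageDataGenerator(
    rescale=1./255,
    validation_split=0.2)   # same split, but no augmentation

train_gen = train_datagen.flow_from_directory(
    data, target_size=(150, 150), subset='training')
val_gen = val_datagen.flow_from_directory(
    data, target_size=(150, 150), subset='validation')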

@kampta @morenoh149
This is a late reply, but I have posted the approach I use in my repo (https://github.com/kouml/keras-split-utils).
When you have just one hierarchical directory, you can split it virtually into a train dir and a valid dir, and you can use different settings for each dataset.

import split_utils
from keras.preprocessing.image import ImageDataGenerator

original_dir = './data/'
batch_size = 32
validation_split = 0.1

# All data in train_dir and val_dir are aliases to the original data
# (both are temporary directories).
# Don't clear base_dir: it holds the temp directories.
base_dir, train_dir, val_dir = split_utils.train_valid_split(original_dir, validation_split, seed=1)

# generator for train data
train_datagen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

train_gen = train_datagen.flow_from_directory(
    train_dir,
    target_size=(28, 28),
    batch_size=batch_size,
    color_mode='grayscale'
)

# generator for validation data
val_datagen = ImageDataGenerator(rescale=1./255)

val_gen = val_datagen.flow_from_directory(
    val_dir,
    target_size=(28, 28),
    batch_size=batch_size,
    color_mode='grayscale'
)

print('the ratio of validation_split is {}'.format(validation_split))
print('the size of train_dir is {}'.format(train_gen.n))
print('the size of val_dir is {}'.format(val_gen.n))

There is one case where the proposed feature would be very useful: when there is more than enough data, so that no sample is ever reused during training (in which case the overfitting concern raised above does not apply).

