I'm fairly new to Keras, but it seems to me that the best way to do validation with fit_generator is to create two generators, one each for training and validation data, which requires some careful orchestrating. I propose adding a validation_split parameter to model.fit_generator that would behave like the parameter of the same name in model.fit.
For example, with validation_split=0.1, fit_generator would train from the generator until it has seen (0.9 * samples_per_epoch) samples and then perform validation from the same generator for the rest of the epoch.
That is only an option if the generator guarantees the same iteration order.
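For context, the "careful orchestrating" mentioned above usually amounts to fixing the split indices once up front and building two independent generators over disjoint index sets, so that no iteration-order guarantee from a single generator is needed. A minimal sketch (make_split_generators is a hypothetical helper, not a Keras API):

import numpy as np

def make_split_generators(x, y, batch_size=32, validation_split=0.1, seed=0):
    """Hypothetical helper: split (x, y) into disjoint train/validation
    index sets once, then yield batches from each set forever."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(x))
    n_val = int(len(x) * validation_split)
    val_idx, train_idx = idx[:n_val], idx[n_val:]

    def gen(indices, shuffle):
        while True:  # Keras generators are expected to loop indefinitely
            order = rng.permutation(indices) if shuffle else indices
            for start in range(0, len(order), batch_size):
                batch = order[start:start + batch_size]
                yield x[batch], y[batch]

    return gen(train_idx, shuffle=True), gen(val_idx, shuffle=False)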
+1, I would really like to see this, and would be willing to implement it.
Wouldn't this be pretty simple? I'm thinking you could just add the parameters validation_split=0., validation_data=None to ImageDataGenerator.flow(*) and ImageDataGenerator.flow_from_directory(*), the same way Model.fit(*) accepts them. Then we could split the train and validation data the same way it's done in Model.fit(*) here: https://github.com/fchollet/keras/blob/master/keras/engine/training.py#L1035. We would then have separate train and validation datasets and could return a tuple of Iterators, (train_generator, validation_generator), where validation_generator == None when both optional parameters are left at their defaults, so backward compatibility is not broken.
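A rough sketch of what that could look like, reusing the same slice-at-the-end arithmetic Model.fit uses; flow_with_split is hypothetical, and no released Keras version returns a tuple from flow:

def flow_with_split(datagen, x, y, batch_size=32, validation_split=0.):
    """Hypothetical wrapper around ImageDataGenerator.flow implementing the
    proposal: slice off the last fraction of the data for validation, the
    same way Model.fit does with validation_split."""
    if not validation_split:
        return datagen.flow(x, y, batch_size=batch_size), None
    split_at = int(len(x) * (1. - validation_split))
    train_it = datagen.flow(x[:split_at], y[:split_at], batch_size=batch_size)
    # Don't shuffle the validation data, so evaluation order stays stable.
    val_it = datagen.flow(x[split_at:], y[split_at:],
                          batch_size=batch_size, shuffle=False)
    return train_it, val_it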
+1
+1
+1
+1
+1
+1
+1
+1
+1, but I understand why it hasn't been done yet, and I appreciate the attention to not allowing things that shouldn't be done. If the validation data is not a consistently separate dataset, you will in the worst case overfit from epoch to epoch, and in the best case overfit in a macro sense from training run to training run as you tune.
One idea would be a ledger somewhere that specifies which set each file belongs to and persists across training sessions, like a dot file (.kerasdata) in the img/ root next to all the category folders. This is probably not the best implementation idea, but (for everyone upvoting) think about the root problem and contribute a solution idea. It's not as simple as it seems, at least not to solve it the right way.
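For what it's worth, the ledger could be as simple as a JSON file mapping each relative path to its subset, created once and reused on every run. A hypothetical sketch; the .kerasdata name and layout come from the suggestion above, not from anything Keras defines:

import json
import os
import random

def load_or_create_ledger(img_root, validation_split=0.1, seed=0,
                          ledger_name='.kerasdata'):
    """Assign each image to 'training' or 'validation' once and persist the
    assignment next to the data, so the split survives across runs."""
    path = os.path.join(img_root, ledger_name)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    rng = random.Random(seed)
    ledger = {}
    for dirpath, _, filenames in os.walk(img_root):
        for name in filenames:
            if name.startswith('.'):
                continue  # skip hidden files, including the ledger itself
            rel = os.path.relpath(os.path.join(dirpath, name), img_root)
            ledger[rel] = ('validation' if rng.random() < validation_split
                           else 'training')
    with open(path, 'w') as f:
        json.dump(ledger, f, indent=2)
    return ledger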
+1
+1
+1
+1
+1
+1
+1
+1
+1
I would like to tackle this problem; this would be a very useful feature. I have two approaches.
1. Hold an id (maybe a hash) for each validation sample.
This is similar to @brittohalloran's idea: duplicate the train generator (one copy for training, one for validation) and decide per sample whether it belongs to the training or the validation set. With this approach you could simply pass validation_split to fit_generator. However, this approach is too slow (see the sketch below).
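The per-sample decision might look like the following hypothetical check. The hash itself is cheap and reproducible without storing any state; the slowness comes from each duplicated generator having to walk the full dataset and discard the other subset's samples on every pass:

import hashlib

def is_validation(sample_id, validation_split=0.1):
    """Hypothetical per-sample check: a stable hash of the sample's id
    decides its subset, so no split state needs to be stored anywhere."""
    h = int(hashlib.md5(sample_id.encode('utf-8')).hexdigest(), 16)
    return (h % 10**6) / 10**6 < validation_split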
2. Make a utility that splits the data into a train dir and a validation dir.
This is just a simple idea: create two directories (holding aliases to the original data) with tempfile. Unlike the approach above, you have to preprocess with train_valid_split, but in exchange you can customize each generator. Here is the usage I imagine:
train_dir, val_dir = keras.utils.train_valid_split(original_dir, 0.1)
# All files in train_dir are aliases to the original data;
# train_dir itself is a temporary directory.
train_datagen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)
val_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary')
validation_generator = val_datagen.flow_from_directory(
    val_dir,
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary')
I'd like to get your feedback.
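For anyone curious what such a utility might look like internally, here is a minimal sketch using tempfile and symlinks, assuming a class-per-subdirectory layout. train_valid_split here is hypothetical, matching the usage above rather than any real keras.utils function:

import os
import random
import tempfile

def train_valid_split(original_dir, validation_split=0.1, seed=1):
    """Hypothetical sketch: build temporary train/validation directory trees
    whose files are symlinks (aliases) back into original_dir, so no image
    data is copied."""
    rng = random.Random(seed)
    base = tempfile.mkdtemp()
    train_dir = os.path.join(base, 'train')
    val_dir = os.path.join(base, 'validation')
    for class_name in sorted(os.listdir(original_dir)):
        src = os.path.join(original_dir, class_name)
        if not os.path.isdir(src):
            continue
        files = sorted(os.listdir(src))
        rng.shuffle(files)
        n_val = int(len(files) * validation_split)
        for i, name in enumerate(files):
            subset_dir = val_dir if i < n_val else train_dir
            dst = os.path.join(subset_dir, class_name)
            os.makedirs(dst, exist_ok=True)
            os.symlink(os.path.abspath(os.path.join(src, name)),
                       os.path.join(dst, name))
    return train_dir, val_dir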
+1
Maybe this issue should be closed. We can now use ImageDataGenerator with validation_split:
https://github.com/keras-team/keras/pull/9745
@kouml @fchollet I don't follow how the conversation in https://github.com/keras-team/keras/pull/9745 applies on newer Keras versions.
Does model.fit_generator() now respect ImageDataGenerator.validation_split? I'm not observing this in this example: https://www.kaggle.com/morenoh149/keras-imagedatagenerator-validation-split
UPDATE: I see from other examples that you should still build a train_generator and a validation_generator which are configured against different directories. Sadly Kaggle limits the number of files you can copy onto disk, so I'll have to figure out a way to read the data into train/validation tensors and do the image preprocessing myself.
So to summarize:

datagen = ImageDataGenerator(validation_split=0.2, rescale=1./255)
train_gen = datagen.flow_from_directory(
    data,
    target_size=(150, 150),  # target_size is (height, width); channels are not included
    subset='training'
)
val_gen = datagen.flow_from_directory(
    data,
    target_size=(150, 150),
    subset='validation'
)
model.fit_generator(train_gen, validation_data=val_gen)

Right?
Is there a way to apply data augmentation to the training data but not to the validation data?
@kampta use another datagen for the validation data generator. See https://github.com/ClaudeCoulombe/deep-learning-with-python-notebooks/blob/master/5.2-using-convnets-with-small-datasets.ipynb ("Note that the validation data should not be augmented!").
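With the validation_split/subset API from the PR above, the usual pattern is two ImageDataGenerator instances with the same validation_split, one augmented and one not. As far as I understand, the split is derived from the sorted file list, so the two subsets stay disjoint. A sketch, assuming data_dir points at your image root:

from keras.preprocessing.image import ImageDataGenerator

# Augmentation only on the training generator; both generators share the
# same validation_split, so 'training' and 'validation' stay disjoint.
train_datagen = ImageDataGenerator(rescale=1./255,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True,
                                   validation_split=0.2)
val_datagen = ImageDataGenerator(rescale=1./255,
                                 validation_split=0.2)
train_gen = train_datagen.flow_from_directory(data_dir,
                                              target_size=(150, 150),
                                              subset='training')
val_gen = val_datagen.flow_from_directory(data_dir,
                                          target_size=(150, 150),
                                          subset='validation')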
@kampta @morenoh149
This is a late reply, but I have posted the utility I use in my repo: https://github.com/kouml/keras-split-utils
When you have just one hierarchical directory, it can split it into virtual train and validation directories, and you can then use different settings for each dataset.
import split_utils

original_dir = './data/'
batch_size = 32
validation_split = 0.1

# All files in train_dir and val_dir are aliases to the original data
# (both are temporary directories). Don't delete base_dir: it owns the
# temporary directories that train_dir and val_dir live in.
base_dir, train_dir, val_dir = split_utils.train_valid_split(original_dir, validation_split, seed=1)

# generator for training data (with augmentation)
train_datagen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)
train_gen = train_datagen.flow_from_directory(
    train_dir,
    target_size=(28, 28),
    batch_size=batch_size,
    color_mode='grayscale'
)

# generator for validation data (rescaling only)
val_datagen = ImageDataGenerator(rescale=1./255)
val_gen = val_datagen.flow_from_directory(
    val_dir,
    target_size=(28, 28),
    batch_size=batch_size,
    color_mode='grayscale'
)

print('the ratio of validation_split is {}'.format(validation_split))
print('the size of train_dir is {}'.format(train_gen.n))
print('the size of val_dir is {}'.format(val_gen.n))
There is one case where the proposed feature would be very useful: when there is more than enough data, so no data is reused during training.