Hi.
I have a question about the 'validation_split' option in [model.fit].
My questions are:
(1) Does the 'validation_split' option randomly choose the validation samples?
(2) If it chooses them randomly, does it reshuffle them for each epoch?
(3) If it doesn't shuffle, might the 'shuffle' option work to randomize the validation split?
Please answer if anyone has an idea.
Thank you.
I can't be 110% sure, but take a look at models.py and the following code:
if 0 < validation_split < 1:
    do_validation = True
    split_at = int(len(ins[0]) * (1 - validation_split))
    (ins, val_ins) = (slice_X(ins, 0, split_at), slice_X(ins, split_at))
To me, it seems like if validation_split is 0.1, it selects the first 90% as training data and the remaining 10% as validation data.
From looking further in models.py, I believe the training data is then shuffled, but not the validation data.
Correct. The validation data is picked as the last 10% (for instance, if validation_split=0.9) of the input. The training data (the remainder) can optionally be shuffled at every epoch (shuffle argument in fit). That doesn't affect the validation data; obviously, it has to be the same set from epoch to epoch.
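For concreteness, that behaviour looks like this in a fit call (a minimal sketch using the Keras 1-style fit signature; X, y, and model are assumed to already exist):

# the last 10% of (X, y) is held out once, up front; shuffle=True then
# reshuffles only the remaining 90% at the start of every epoch
model.fit(X, y,
          batch_size=32,
          nb_epoch=10,
          validation_split=0.1,
          shuffle=True)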
Thank you for your answers, guys.
By the way, I believe superhans's answer is correct. It splits off the last 10% of the training data when we go with validation_split=0.1.
Seems like I have to shuffle my validation set on my own. Ahh....
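In case it helps, one way to do that by hand is to shuffle once and pass the held-out chunk through validation_data instead of validation_split (a minimal sketch; X, y, and model are assumed to exist):

import numpy as np

# shuffle the indices once, then carve off 10% as a validation set
perm = np.random.permutation(len(X))
n_val = int(0.1 * len(X))
val_idx, train_idx = perm[:n_val], perm[n_val:]

# validation_data bypasses the "last 10%" behaviour of validation_split
model.fit(X[train_idx], y[train_idx],
          validation_data=(X[val_idx], y[val_idx]))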
Potentially dumb question: how would I show loss/accuracy on the validation data after training the model?
There are a few different ways to do this, I think. If your validation data isn't too big, it's very trivial.
import numpy as np
from keras.utils import np_utils

# you have a trained model; perform predictions on the validation data
predict_dict = graph.predict({'data': val_data}, batch_size=500)
predictions = predict_dict['output']

# For classification, predictions is an n x k matrix, where k is the number
# of classes and n is the number of validation samples. Convert that into an
# n x 1 vector where each element is the predicted class id.
predicted_classes = np_utils.categorical_probas_to_classes(predictions)

# how many samples match the ground-truth validation labels?
correct = np.sum(predicted_classes == val_labels)

# accuracy = number correct / total number of validation samples
n_samples = len(val_labels)
accuracy = correct / (n_samples * 1.0)
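Alternatively, for a Sequential model there is a one-liner, assuming the model was compiled with metrics=['accuracy'] (exact return values depend on your Keras version):

# returns [loss, accuracy] when compiled with metrics=['accuracy']
loss, acc = model.evaluate(X_val, y_val, batch_size=500)
print('val loss: %.4f, val accuracy: %.4f' % (loss, acc))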
Hi, can you be a little more specific about how the data is separated from the training set for validation? When you say "the last 10%", do you mean that the last 10% of samples are selected? If I have a class-ordered array, are only the last class's samples selected?
Yes, quite possibly. You could randomly jumble the array before feeding it.
Edit: normally, data is randomly jumbled.
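Something like this, keeping samples and labels aligned (a minimal numpy sketch; X, y, and model are assumptions):

import numpy as np

# jumble samples (and labels, in lockstep) before fit(), so the last 10%
# that validation_split takes is a random subset rather than the tail
perm = np.random.permutation(len(X))
model.fit(X[perm], y[perm], validation_split=0.1)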
fit_generator requires validation data to be provided as a fixed set of tuples. Wouldn't it be better if fit_generator had a validation_split provision like the fit function?
(A fixed validation dataset would not really be representative of the validation loss at each stage, as it might not capture the variation of data across the entire dataset.)
You can also pass in a validation data generator, but you would have to orchestrate the two generators to work correctly.
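For example (a sketch against the Keras 1 fit_generator signature; train_gen and val_gen are assumed to yield (inputs, targets) batches indefinitely):

# validation_data accepts a generator; nb_val_samples says how many
# validation samples to draw from it per epoch
model.fit_generator(train_gen,
                    samples_per_epoch=9000,
                    nb_epoch=10,
                    validation_data=val_gen,
                    nb_val_samples=1000)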
I think having a validation split in fit_generator would be doable. You already pass in samples_per_epoch and an infinite generator. For example, if validation_split=0.1, keep training until you hit samples_per_epoch * 0.9, then just do validation on the rest of the data until the end of the epoch.
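Roughly, that proposal done by hand with train_on_batch / test_on_batch might look like this (a hypothetical sketch, since fit_generator does not actually support validation_split; data_gen, samples_per_epoch, and nb_epoch are assumptions):

import numpy as np

validation_split = 0.1
train_limit = int(samples_per_epoch * (1 - validation_split))

for epoch in range(nb_epoch):
    seen, val_losses = 0, []
    while seen < samples_per_epoch:
        x_batch, y_batch = next(data_gen)
        if seen < train_limit:
            # first 90% of the epoch's samples: train
            model.train_on_batch(x_batch, y_batch)
        else:
            # last 10%: validate; test_on_batch returns the scalar loss
            # when no extra metrics are compiled
            val_losses.append(model.test_on_batch(x_batch, y_batch))
        seen += len(x_batch)
    print('epoch %d - val loss: %.4f' % (epoch, float(np.mean(val_losses))))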
@fchollet you mentioned:
The validation data is picked as the last 10% (for instance, if validation_split=0.9) of the input.
But in the FAQ it says:
If you set the validation_split argument in model.fit to e.g. 0.1, then the validation data used will be the last 10% of the data.
Is this an update since this post?
I'm not sure whether my question fits here, especially since the original question was asked such a long time ago, but when we say the validation data is picked from the last 10%, does it make sure samples are taken from every class, or does it just take the last ones? My data is split into folders with class names, but when I read it, I pass it as a whole to model.fit().
Also, regarding shuffling: in my case I extract features from a sequence of 6 frames before feeding them to an LSTM, so the idea of shuffling worries me, since I need the input sequence to stay in order. Is there anything I need to be careful with?
Thanks a lot!!
@Osumann, as fchollet mentioned in the same post:
The training data (the remainder) can optionally be shuffled at every epoch (shuffle argument in fit). That doesn't affect the validation data, obviously, it has to be the same set from epoch to epoch.
So no, the validation data is not necessarily taken from every class; it is just the last 10% (assuming you ask for 10%) of the data.
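To make the pitfall concrete, a toy illustration with hypothetical class-ordered labels:

import numpy as np

y = np.array([0] * 900 + [1] * 100)   # class-ordered labels
split_at = int(len(y) * (1 - 0.1))    # same arithmetic as in models.py
print(np.unique(y[split_at:]))        # -> [1]: validation sees only class 1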
The second question definitely doesn't fit here ;)
Ok, good stuff. Making sure I understand -- does specifying validation_split induce only one split (i.e. at the beginning, before training), or does it re-split each epoch? Thanks!
Regarding the 0.9 vs. 0.1 discrepancy above: it should be
validation_split=0.1
in that comment. When you run it you'll see something like this:
Train on 15837 samples, validate on 6788 samples
(this is when I configured validation_split to be 0.3). It's just a typo from @fchollet, I believe.