Hi.
I have a question about the 'validation_split' option in [model.fit].
My questions are:
(1) Does the 'validation_split' option randomly choose the validation samples?
(2) If it chooses them randomly, does it reshuffle them for each epoch?
(3) If it doesn't shuffle, might the 'shuffle' option work to randomize the validation split?
Please answer if anyone has an idea.
Thank you.
I can't be 110% sure, but take a look at models.py and the following code:
if 0 < validation_split < 1:
    do_validation = True
    split_at = int(len(ins[0]) * (1 - validation_split))
    (ins, val_ins) = (slice_X(ins, 0, split_at), slice_X(ins, split_at))
To me, it seems like if validation_split is 0.1, it selects the first 90% as training data and the remaining 10% as validation data.
From looking further in models.py, I believe the training data is then shuffled, but not the validation data.
Correct. The validation data is picked as the last 10% (for instance, if validation_split=0.9) of the input. The training data (the remainder) can optionally be shuffled at every epoch (shuffle argument in fit). That doesn't affect the validation data; obviously, it has to be the same set from epoch to epoch.
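For concreteness, that behaviour looks like this in a fit call (a minimal sketch using the Keras 1-style fit signature; X, y, and model are assumed to already exist):

# the last 10% of (X, y) is held out once, up front; shuffle=True then
# reshuffles only the remaining 90% at the start of every epoch
model.fit(X, y,
          batch_size=32,
          nb_epoch=10,
          validation_split=0.1,
          shuffle=True)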
Thank you for your answers, guys.
By the way, I believe superhans's answer is correct. It splits off the last 10% of the training data when we go with validation_split=0.1.
Seems like I have to shuffle my validation set on my own. Ahh....
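In case it helps, one way to do that by hand is to shuffle once and pass the held-out chunk through validation_data instead of validation_split (a minimal sketch; X, y, and model are assumed to exist):

import numpy as np

# shuffle the indices once, then carve off 10% as a validation set
perm = np.random.permutation(len(X))
n_val = int(0.1 * len(X))
val_idx, train_idx = perm[:n_val], perm[n_val:]

# validation_data bypasses the "last 10%" behaviour of validation_split
model.fit(X[train_idx], y[train_idx],
          validation_data=(X[val_idx], y[val_idx]))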
Potentially dumb question: how would I show loss/accuracy on the validation data after training the model?
There are a few different ways to do this, I think. If your validation data isn't too big, it's very trivial.
import numpy as np
from keras.utils import np_utils

# you have a trained model; perform predictions on the validation data
predict_dict = graph.predict({'data': val_data}, batch_size=500)
predictions = predict_dict['output']

# For classification, predictions is an n x k matrix, where k is the number
# of classes and n is the number of validation samples. Convert that into an
# n x 1 vector where each element is the predicted class id.
predicted_classes = np_utils.categorical_probas_to_classes(predictions)

# how many samples match the ground-truth validation labels?
correct = np.sum(predicted_classes == val_labels)

# accuracy = number correct / total number of validation samples
n_samples = len(val_labels)
accuracy = correct / (n_samples * 1.0)
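Alternatively, for a Sequential model there is a one-liner, assuming the model was compiled with metrics=['accuracy'] (exact return values depend on your Keras version):

# returns [loss, accuracy] when compiled with metrics=['accuracy']
loss, acc = model.evaluate(X_val, y_val, batch_size=500)
print('val loss: %.4f, val accuracy: %.4f' % (loss, acc))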
Hi, can you be a little more specific about how the data is separated from the training set for validation? When you say "the last 10%", do you mean that the last 10% of samples are selected? If I have a class-ordered array, are only the last class's samples selected?
Yes, quite possibly. You could randomly jumble the array before feeding it.
Edit: normally, data is randomly jumbled.
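Something like this, keeping samples and labels aligned (a minimal numpy sketch; X, y, and model are assumptions):

import numpy as np

# jumble samples (and labels, in lockstep) before fit(), so the last 10%
# that validation_split takes is a random subset rather than the tail
perm = np.random.permutation(len(X))
model.fit(X[perm], y[perm], validation_split=0.1)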
fit_generator requires validation data to be provided as a fixed set of tuples. Wouldn't it be better if fit_generator had a validation_split provision like the fit function?
(A fixed validation dataset would not really be representative of the validation loss at each stage, as it might not capture the variation of data across the entire dataset.)
You can also pass in a validation data generator, but you would have to orchestrate the two generators to work correctly.
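For example (a sketch against the Keras 1 fit_generator signature; train_gen and val_gen are assumed to yield (inputs, targets) batches indefinitely):

# validation_data accepts a generator; nb_val_samples says how many
# validation samples to draw from it per epoch
model.fit_generator(train_gen,
                    samples_per_epoch=9000,
                    nb_epoch=10,
                    validation_data=val_gen,
                    nb_val_samples=1000)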
I think having a validation split in fit_generator would be doable. You already pass in samples_per_epoch and an infinite generator. For example, if validation_split=0.1, keep training until you hit samples_per_epoch * 0.9, then just do validation on the rest of the data until the end of the epoch.
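Roughly, that proposal done by hand with train_on_batch / test_on_batch might look like this (a hypothetical sketch, since fit_generator does not actually support validation_split; data_gen, samples_per_epoch, and nb_epoch are assumptions):

import numpy as np

validation_split = 0.1
train_limit = int(samples_per_epoch * (1 - validation_split))

for epoch in range(nb_epoch):
    seen, val_losses = 0, []
    while seen < samples_per_epoch:
        x_batch, y_batch = next(data_gen)
        if seen < train_limit:
            # first 90% of the epoch's samples: train
            model.train_on_batch(x_batch, y_batch)
        else:
            # last 10%: validate; test_on_batch returns the scalar loss
            # when no extra metrics are compiled
            val_losses.append(model.test_on_batch(x_batch, y_batch))
        seen += len(x_batch)
    print('epoch %d - val loss: %.4f' % (epoch, float(np.mean(val_losses))))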
@fchollet you mentioned:
The validation data is picked as the last 10% (for instance, if validation_split=0.9) of the input.
But in the FAQ it says:
If you set the validation_split argument in model.fit to e.g. 0.1, then the validation data used will be the last 10% of the data.
Is this an update since this post?
I'm not sure whether my question fits here, especially since the original question was asked such a long time ago, but when we say the validation data is picked from the last 10%, does it make sure samples are taken from every class, or does it just take the last ones? My data is split into folders with class names, but when I read it, I pass it as a whole to model.fit().
Also, regarding shuffling: in my case I extract features from a sequence of 6 frames before feeding them to an LSTM, so the idea of shuffling worries me, since I need the input sequence to stay in order. Is there anything I need to be careful with?
Thanks a lot!!
@Osumann, as fchollet mentioned in the same post:
The training data (the remainder) can optionally be shuffled at every epoch (shuffle argument in fit). That doesn't affect the validation data, obviously, it has to be the same set from epoch to epoch.
So no, the validation data is not necessarily taken from every class; it is just the last 10% (assuming you ask for 10%) of the data.
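To make the pitfall concrete, a toy illustration with hypothetical class-ordered labels:

import numpy as np

y = np.array([0] * 900 + [1] * 100)   # class-ordered labels
split_at = int(len(y) * (1 - 0.1))    # same arithmetic as in models.py
print(np.unique(y[split_at:]))        # -> [1]: validation sees only class 1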
The second question definitely doesn't fit here ;)
Ok, good stuff. Making sure I understand -- does specifying validation_split induce only one split (i.e. at the beginning, before training), or does it re-split each epoch? Thanks!
Regarding the 0.9 vs. 0.1 discrepancy above: it should be
validation_split=0.1
in that comment. When you run it you'll see something like this:
Train on 15837 samples, validate on 6788 samples
(this is when I configured validation_split to be 0.3). It's just a typo from @fchollet, I believe.