Hello all,
I have tried looking in Keras' models.py, but I'm not experienced enough to answer this question myself.
I'm not sure whether this is the expected behaviour, or whether I have misunderstood how to use validation_data, but I get different results for loss and val_loss using the same data.
Train on 100000 samples, validate on 100000 samples
Epoch 0
100000/100000 [==============================] - 144s - loss: 0.8586 - val_loss: 0.8122
Epoch 1
100000/100000 [==============================] - 149s - loss: 0.8168 - val_loss: 0.8112
Epoch 2
100000/100000 [==============================] - 149s - loss: 0.8095 - val_loss: 0.8107
Epoch 3
100000/100000 [==============================] - 149s - loss: 0.8032 - val_loss: 0.8102
Epoch 4
100000/100000 [==============================] - 151s - loss: 0.8001 - val_loss: 0.8069
Epoch 5
100000/100000 [==============================] - 150s - loss: 0.7941 - val_loss: 0.8022
Epoch 6
100000/100000 [==============================] - 144s - loss: 0.7898 - val_loss: 0.7992
Epoch 7
100000/100000 [==============================] - 145s - loss: 0.7872 - val_loss: 0.8005
Here is my code:
X_1 = pd.read_pickle("X_1.pkl")
X_2 = pd.read_pickle("X_2.pkl")
y_train = pd.read_pickle("y_train.pkl")

left = Sequential()
left.add(Embedding(len(X_1), 64))
left.add(LSTM(64, 64, forget_bias_init='one'))
left.add(Dropout(0.1))
left.add(Dense(64, 64))

right = Sequential()
right.add(Masking(mask_value=0.))
right.add(LSTM(X_train.shape[2], 64, forget_bias_init='one', return_sequences=False))
right.add(Dropout(0.1))
right.add(Dense(64, 64))

model = Sequential()
model.add(Merge([left, right], mode='sum'))
model.add(Dense(64, 1, W_regularizer=l1l2(0.01)))
model.add(ParametricSoftplus(1))
model.compile(loss="poisson_loss", optimizer='rmsprop')

components = model.fit([X_1, X_2], y_train, batch_size=128,
                       nb_epoch=20, validation_data=([X_1, X_2], y_train),
                       shuffle=False)
Thank you!
I believe it is the dropout. During training (loss) dropout is on. For validation (val_loss) dropout is off.
Dropout randomly disables a fraction of the network's units at each update. Think of it like a system where part of it fails regularly: the remaining parts have to learn to do their best even when they can't count on their peers. During validation the FULL system is on. This helps you deal with overfitting.
Try training your model without dropout to test this hypothesis. Let me know if I'm wrong.
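A minimal way to see this behaviour (a sketch using the modern tf.keras API rather than the 2015-era API above): the very same Dropout layer zeroes random units when called with training=True and acts as a plain identity when training=False.

import numpy as np
import tensorflow as tf

x = np.ones((1, 8), dtype='float32')
dropout = tf.keras.layers.Dropout(0.5)

# Training-time behaviour: roughly half the units are zeroed and the
# survivors are scaled by 1 / (1 - rate) = 2 to preserve the expected sum.
print(dropout(x, training=True).numpy())

# Validation/inference-time behaviour: everything passes through unchanged.
print(dropout(x, training=False).numpy())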
Thanks for your help @EderSantana, unfortunately I'm testing it right now and I still see a difference between the loss and the validation loss.
(I thought dropout was not used for a forward pass, where we only use the learned weights to predict the outputs, right?)
The other reason that the results are different is because the model is being trained while the "loss" is being computed, whereas the model is fixed while "val_loss" is being computed. Since the model is training, "loss" is typically going to be larger than the true training set loss at the end of the epoch. I.e. "loss" is the average loss during the epoch, and "val_loss" is the average loss after the end of the epoch. Since the model changes during the epoch, the loss changes.
Besides all this, the two loss values are computed on two distinct sets of samples, so you can't expect them to be equal. In fact, the difference between the training loss and the validation/test loss can give great insight into your model's complexity for the problem at hand as well as into the training progress of your model (overfitting, underfitting).
Thanks @dhammack it was my next question.
I was confused because I implemented my first neural nets in Theano, where I looked at the loss at the end of each epoch for both the training and the validation set (with the same data), and there I got exactly the same results.
I'm graphing these losses and I'm not sure whether these two metrics can be compared.
@erfannoury I'm using the same data that's why I was expecting the same results.
model.fit([X_1, X_2], y_train, batch_size=128,
nb_epoch=20, validation_data=([X_1, X_2], y_train),
shuffle=False)
@tboquet I didn't notice.
Well, it's usually not good practice to use training data for validation.
Anyway, the issue you have found is something else.
Maybe the loss displayed for the training data is the loss of the last batch, while the loss for the validation data is the average loss over the whole validation set. It's just a guess; to be sure I would have to take a look at the code.
@erfannoury I didn't want to use training data for validation; I wanted to diagnose my network with a real test set, but I noticed that my training loss was always greater than my validation loss. Investigating further, the value saved in the history doesn't seem to be the loss evaluated with the final weights but rather an average over every minibatch (a lower bound).
I'm not sure about the averaging and I'm still looking through the code to figure out whether that is the case.
I'm just wondering whether we really want that to be the default training loss that gets saved. It's still possible to implement a custom callback and recompute this loss, and I will try that to compare it against the saved training loss :).
Good find @tboquet, I just got confused by this myself. So if we want to evaluate the loss on the training set at the end of every epoch we need to call model.evaluate(...) with the training set in the on_epoch_end callback?
Yep, you should add another history callback and evaluate the model at the end of each epoch! It may take some time though, because you will have to go through all your minibatches. It could be a good idea to sample your training data and pass that to your callback.
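For reference, a rough sketch of such a callback (the class name and arguments are made up for illustration; note that model.evaluate returns a list rather than a scalar if you compiled with extra metrics):

from keras.callbacks import Callback

class EpochEndTrainLoss(Callback):
    # Re-evaluate the training loss with the weights as they stand at the
    # end of each epoch, so it is directly comparable to val_loss.
    def __init__(self, x, y, batch_size=128):
        super(EpochEndTrainLoss, self).__init__()
        self.x = x
        self.y = y
        self.batch_size = batch_size
        self.losses = []

    def on_epoch_end(self, epoch, logs=None):
        loss = self.model.evaluate(self.x, self.y,
                                   batch_size=self.batch_size, verbose=0)
        self.losses.append(loss)
        print(' - train loss at epoch end: %s' % loss)

Then pass an instance to fit, ideally with a subsample of the training data as suggested above, via callbacks=[EpochEndTrainLoss(x_sample, y_sample)].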
I am seeing the same issue. It seems that the answer is the training loss is computed as an average for all the minibatches and the validation loss is computed on the whole set. Is this correct?
As it is stated in https://keras.io/getting-started/faq/#why-is-the-training-loss-much-higher-than-the-testing-loss, during training (loss) dropout is on while for validation (val_loss) dropout is off. Also, the training loss is computed as an average for all the minibatches and the validation loss is computed on the whole set.
@curiale You are correct. However, in the model I was using there was no dropout. I discovered that the problem was related to the EMA in batch normalization -- it was set to the default, which was way too high for my application.
Hi, @isaacgerg. The training loss is computed as an average for all the minibatches and the validation loss is computed on the whole set.
@curiale thank you.
Is there any way to display the batch-wise loss during training, instead of a cumulative average? This can cause a lot of confusion: say you start with a loss of around 4.0 on your first batches, and you then see the loss dropping slowly to 3.0, 2.0, etc. There is no way to know whether the net is actually learning slowly or whether you just witnessed a very abrupt drop in the loss at some point, since you're seeing an average over all your batch losses.
Here is an example where you see similar averages on Keras, but the behaviour of your net is completely different (obviously the right one is very overdone but you see my point)
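One way to get at the raw numbers is a small callback (a hypothetical sketch for the Keras versions discussed in this thread, where logs['loss'] in on_batch_end is the loss of that single batch before it is folded into the running average shown by the progress bar; recent tf.keras versions report stateful, already-averaged metrics instead):

from keras.callbacks import Callback

class BatchLossHistory(Callback):
    # Collect the loss of every individual batch so it can be plotted
    # instead of the cumulative average shown in the progress bar.
    def on_train_begin(self, logs=None):
        self.batch_losses = []

    def on_batch_end(self, batch, logs=None):
        logs = logs or {}
        self.batch_losses.append(float(logs.get('loss')))

Create an instance, pass it to model.fit via callbacks=[...], and plot its batch_losses afterwards.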
Thanks @isaacgerg! In my experiment, I tried to overfit only 2 samples but observed a huge difference between training and validation loss (on the same 2 samples). Inspired by @isaacgerg, I disabled the batchnorm layers and it worked!
@fwtan You're welcome. Regarding the batchnorm layers, I find I have to adjust the averaging coefficients to get good results. The default coefficients don't average enough for me; I use something around 0.6.
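For anyone who wants to try the same adjustment: in Keras 2-style code the EMA coefficient is exposed as the momentum argument of BatchNormalization (default 0.99), so a sketch of the suggestion above would look like the line below; treat the exact value as problem-dependent.

from keras.layers import BatchNormalization

# Assumption: Keras 2 API; momentum is the EMA coefficient for the moving
# mean/variance used at validation/inference time. Lower values make the
# moving statistics track the batch statistics faster.
bn = BatchNormalization(momentum=0.6)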
Hi Guys,
I have run into a weird problem: my training loss for a specific data set is always higher than my validation loss. It does not matter how much data I train on. For example, I tested with just 2 training examples and 3 epochs and expected the model to overfit, but strangely that did not happen. I also used the same set for training and validation, but the loss values were different. Any help is appreciated.
model = Sequential()
model.add(LSTM(60, input_shape=(train_X.shape[1], train_X.shape[2]), return_sequences=False))
model.add(Dense(1, activation='sigmoid'))
sgd = optimizers.SGD(lr=0.01, decay=1e-2, momentum=0.9)
model.compile(loss='mae', optimizer=sgd)  # pass the SGD instance; the string 'sgd' would ignore the settings above

# train_X.shape ---> (1, 5, 32)
history = model.fit(train_X, scaled_train_y, epochs=3, batch_size=2,
                    validation_data=(train_X, scaled_train_y), verbose=2, shuffle=True)
Train on 1 samples, validate on 1 samples
Epoch 1/3
- 14s - loss: 0.3944 - val_loss: 0.3891
Epoch 2/3
- 0s - loss: 0.3891 - val_loss: 0.3839
Epoch 3/3
- 0s - loss: 0.3839 - val_loss: 0.3786