Hello all,
I have tried looking in Keras' models.py, but I'm not experienced enough to answer this question myself.
I'm not sure whether this is the expected behaviour, or whether I have misunderstood how to use validation_data, but I get different results for loss and val_loss using the same data.
Train on 100000 samples, validate on 100000 samples
Epoch 0
100000/100000 [==============================] - 144s - loss: 0.8586 - val_loss: 0.8122
Epoch 1
100000/100000 [==============================] - 149s - loss: 0.8168 - val_loss: 0.8112
Epoch 2
100000/100000 [==============================] - 149s - loss: 0.8095 - val_loss: 0.8107
Epoch 3
100000/100000 [==============================] - 149s - loss: 0.8032 - val_loss: 0.8102
Epoch 4
100000/100000 [==============================] - 151s - loss: 0.8001 - val_loss: 0.8069
Epoch 5
100000/100000 [==============================] - 150s - loss: 0.7941 - val_loss: 0.8022
Epoch 6
100000/100000 [==============================] - 144s - loss: 0.7898 - val_loss: 0.7992
Epoch 7
100000/100000 [==============================] - 145s - loss: 0.7872 - val_loss: 0.8005
Here is my code:
X_1 = pd.read_pickle("X_1.pkl")
X_2 = pd.read_pickle("X_2.pkl")
y_train = pd.read_pickle("y_train.pkl")

left = Sequential()
left.add(Embedding(len(X_1), 64))
left.add(LSTM(64, 64, forget_bias_init='one'))
left.add(Dropout(0.1))
left.add(Dense(64, 64))

right = Sequential()
right.add(Masking(mask_value=0.))
right.add(LSTM(X_train.shape[2], 64, forget_bias_init='one', return_sequences=False))
right.add(Dropout(0.1))
right.add(Dense(64, 64))

model = Sequential()
model.add(Merge([left, right], mode='sum'))
model.add(Dense(64, 1, W_regularizer=l1l2(0.01)))
model.add(ParametricSoftplus(1))
model.compile(loss="poisson_loss", optimizer='rmsprop')

components = model.fit([X_1, X_2], y_train, batch_size=128,
                       nb_epoch=20, validation_data=([X_1, X_2], y_train),
                       shuffle=False)
Thank you!
I believe it is the dropout. During training (loss) dropout is on. For validation (val_loss) dropout is off.
Dropout randomly disables a fraction of the network's units at each update. Think of it like a system where part of it fails regularly: the remaining parts have to learn to do their best even when they can't count on their peers. During validation the FULL system is on. This helps you deal with overfitting.
Try training your model without dropout to test this hypothesis. Let me know if I'm wrong.
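A minimal way to see this behaviour (a sketch using the modern tf.keras API rather than the 2015-era API above): the very same Dropout layer zeroes random units when called with training=True and acts as a plain identity when training=False.

import numpy as np
import tensorflow as tf

x = np.ones((1, 8), dtype='float32')
dropout = tf.keras.layers.Dropout(0.5)

# Training-time behaviour: roughly half the units are zeroed and the
# survivors are scaled by 1 / (1 - rate) = 2 to preserve the expected sum.
print(dropout(x, training=True).numpy())

# Validation/inference-time behaviour: everything passes through unchanged.
print(dropout(x, training=False).numpy())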
Thanks for your help @EderSantana, unfortunately I'm testing it right now and I still see a difference between the loss and the validation loss.
(I thought dropout was not used for a forward pass, where we only use the learned weights to predict the outputs, right?)
The other reason that the results are different is because the model is being trained while the "loss" is being computed, whereas the model is fixed while "val_loss" is being computed. Since the model is training, "loss" is typically going to be larger than the true training set loss at the end of the epoch. I.e. "loss" is the average loss during the epoch, and "val_loss" is the average loss after the end of the epoch. Since the model changes during the epoch, the loss changes.
Besides all this, the two loss values are computed on two distinct sets of samples, so you can't expect them to be equal. In fact, the difference between the training loss and the validation/test loss can give great insight into your model's complexity for the problem at hand as well as into the training progress of your model (overfitting, underfitting).
Thanks @dhammack it was my next question.
I was confused because I implemented my first neural nets in Theano, where I looked at the loss at the end of each epoch for both the training and the validation set (with the same data), and there I got exactly the same results.
I'm graphing these losses and I'm not sure whether these two metrics can be compared.
@erfannoury I'm using the same data that's why I was expecting the same results.
model.fit([X_1, X_2], y_train, batch_size=128,
nb_epoch=20, validation_data=([X_1, X_2], y_train),
shuffle=False)
@tboquet I didn't notice.
Well, it's usually not good practice to use training data for validation.
Anyway, the issue you have found is something else.
Maybe the loss displayed for the training data is the loss of the last batch, while the loss for the validation data is the average loss over the whole validation set. It's just a guess; to be sure I would have to take a look at the code.
@erfannoury I didn't want to use training data for validation; I wanted to diagnose my network with a real test set, but I noticed that my training loss was always greater than my validation loss. Investigating further, the value saved in the history doesn't seem to be the loss evaluated with the final weights but rather an average over every minibatch (a lower bound).
I'm not sure about the averaging and I'm still looking through the code to figure out whether that is the case.
I'm just wondering whether we really want that to be the default training loss that gets saved. It's still possible to implement a custom callback and recompute this loss, and I will try that to compare it against the saved training loss :).
Good find @tboquet, I just got confused by this myself. So if we want to evaluate the loss on the training set at the end of every epoch we need to call model.evaluate(...) with the training set in the on_epoch_end callback?
Yep, you should add another history callback and evaluate the model at the end of each epoch! It may take some time though, because you will have to go through all your minibatches. It could be a good idea to sample your training data and pass that to your callback.
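For reference, a rough sketch of such a callback (the class name and arguments are made up for illustration; note that model.evaluate returns a list rather than a scalar if you compiled with extra metrics):

from keras.callbacks import Callback

class EpochEndTrainLoss(Callback):
    # Re-evaluate the training loss with the weights as they stand at the
    # end of each epoch, so it is directly comparable to val_loss.
    def __init__(self, x, y, batch_size=128):
        super(EpochEndTrainLoss, self).__init__()
        self.x = x
        self.y = y
        self.batch_size = batch_size
        self.losses = []

    def on_epoch_end(self, epoch, logs=None):
        loss = self.model.evaluate(self.x, self.y,
                                   batch_size=self.batch_size, verbose=0)
        self.losses.append(loss)
        print(' - train loss at epoch end: %s' % loss)

Then pass an instance to fit, ideally with a subsample of the training data as suggested above, via callbacks=[EpochEndTrainLoss(x_sample, y_sample)].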
I am seeing the same issue. It seems that the answer is the training loss is computed as an average for all the minibatches and the validation loss is computed on the whole set. Is this correct?
As it is stated in https://keras.io/getting-started/faq/#why-is-the-training-loss-much-higher-than-the-testing-loss, during training (loss) dropout is on while for validation (val_loss) dropout is off. Also, the training loss is computed as an average for all the minibatches and the validation loss is computed on the whole set.
@curiale You are correct. However, in the model I was using there was no dropout. I discovered that the problem was related to the EMA in batch normalization -- it was set to the default, which was way too high for my application.
Hi, @isaacgerg. The training loss is computed as an average for all the minibatches and the validation loss is computed on the whole set.
@curiale thank you.
Is there any way to display the batch-wise loss during training, instead of a cumulative average? This can cause a lot of confusion: say you start with a loss of around 4.0 on your first batches, and you then see the loss dropping slowly to 3.0, 2.0, etc. There is no way to know whether the net is actually learning slowly or whether you just witnessed a very abrupt drop in the loss at some point, since you're seeing an average over all your batch losses.
Here is an example where you see similar averages on Keras, but the behaviour of your net is completely different (obviously the right one is very overdone but you see my point)
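One way to get at the raw numbers is a small callback (a hypothetical sketch for the Keras versions discussed in this thread, where logs['loss'] in on_batch_end is the loss of that single batch before it is folded into the running average shown by the progress bar; recent tf.keras versions report stateful, already-averaged metrics instead):

from keras.callbacks import Callback

class BatchLossHistory(Callback):
    # Collect the loss of every individual batch so it can be plotted
    # instead of the cumulative average shown in the progress bar.
    def on_train_begin(self, logs=None):
        self.batch_losses = []

    def on_batch_end(self, batch, logs=None):
        logs = logs or {}
        self.batch_losses.append(float(logs.get('loss')))

Create an instance, pass it to model.fit via callbacks=[...], and plot its batch_losses afterwards.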
Thanks @isaacgerg! In my experiment, I tried to overfit only 2 samples but observed a huge difference between training and validation loss (on the same 2 samples). Inspired by @isaacgerg, I disabled the batchnorm layers and it worked!
@fwtan You're welcome. Regarding the batchnorm layers, I find I have to adjust the averaging coefficients to get good results. The default coefficients don't average enough for me; I use something around 0.6.
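For anyone who wants to try the same adjustment: in Keras 2-style code the EMA coefficient is exposed as the momentum argument of BatchNormalization (default 0.99), so a sketch of the suggestion above would look like the line below; treat the exact value as problem-dependent.

from keras.layers import BatchNormalization

# Assumption: Keras 2 API; momentum is the EMA coefficient for the moving
# mean/variance used at validation/inference time. Lower values make the
# moving statistics track the batch statistics faster.
bn = BatchNormalization(momentum=0.6)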
Hi Guys,
I have run into a weird problem: my training loss for a specific data set is always higher than my validation loss. It does not matter how much data I train on. For example, I tested with just 2 training examples and 3 epochs and expected the model to overfit, but strangely that did not happen. I also used the same set for training and validation, but the loss values were different. Any help is appreciated.
model = Sequential()
model.add(LSTM(60, input_shape=(train_X.shape[1], train_X.shape[2]), return_sequences=False))
model.add(Dense(1, activation='sigmoid'))
sgd = optimizers.SGD(lr=0.01, decay=1e-2, momentum=0.9)
model.compile(loss='mae', optimizer=sgd)  # pass the SGD instance; the string 'sgd' would ignore the settings above

# train_X.shape ---> (1, 5, 32)
history = model.fit(train_X, scaled_train_y, epochs=3, batch_size=2,
                    validation_data=(train_X, scaled_train_y), verbose=2, shuffle=True)
Train on 1 samples, validate on 1 samples
Epoch 1/3
- 14s - loss: 0.3944 - val_loss: 0.3891
Epoch 2/3
- 0s - loss: 0.3891 - val_loss: 0.3839
Epoch 3/3
- 0s - loss: 0.3839 - val_loss: 0.3786