I am working on 3D image segmentation with a convolutional neural network in Keras 2.1.1 with the TensorFlow backend, and I am having problems with the fit_generator function. Although the error reported during training is relatively low, the error when predicting on a training or test image is comparable to the minimally trained state (so this is not a simple overfitting problem). After training the network very briefly on any given (train/test) sample, the evaluation error matches the error reported during training. Might this be a problem with using fit_generator with a batch size of 1 when the model uses batch normalization?
When you use normalization (samplewise or featurewise) you might get strange results. I don't know your dataset or your split, but if you normalize your train set using 90% of the dataset (subtract the train-set mean and divide by the train-set std), then you get a train set normalized by itself. If you then normalize the test set (or val set, as you like) using only 10% of the dataset, you might get a very different mean or std, so the normalization is not the same and you get bad results.
I don't think Batch Normalization would be an issue if your data is first normalized. If it's not, you can get bad results for the same reason as above, so do the same verification.
Try printing the means and standard deviations of the train set and test set separately. If they are almost the same, that's not the issue; if they aren't, you can normalize all your data before the train-test split.
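For illustration, a minimal sketch of the kind of consistent normalization I mean, where both splits share the same statistics (the toy arrays below are made up, not your data):
import numpy as np
# Toy stand-in for the real images; values are arbitrary.
rng = np.random.RandomState(0)
data = rng.normal(loc=200.0, scale=280.0, size=(1000, 32, 32, 1)).astype('float32')
# 90/10 train/validation split.
x_train, x_val = data[:900], data[900:]
# Statistics are computed on the training split only...
train_mean = x_train.mean()
train_std = x_train.std()
# ...and the SAME statistics are applied to both splits, so the
# validation data is normalized consistently with the training data.
x_train_norm = (x_train - train_mean) / train_std
x_val_norm = (x_val - train_mean) / train_std
print('train mean/std:', x_train_norm.mean(), x_train_norm.std())
print('valid mean/std:', x_val_norm.mean(), x_val_norm.std())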
The means and standard deviations of the train and validation sets are fairly similar.
Without normalization: Train: mean 225.492, std 283.293. Valid: mean 233.008, std 294.333.
With normalization: Train: mean 0.0586495, std 0.973686. Valid: mean 0.0264291, std 1.05183.
I'm working with the mean and std of the training data when normalizing, since the mean and std of test/val data should not be available during training.
Furthermore, when using an identical generator for the validation data which also operates on the training data (with the same augmentation), the problem can still be observed. The loss reported for the training data after 100 epochs is 2.3168, while the loss for the "validation" data is 3.7922 (~4 after the first epoch). Predicting on these training samples yields the ~3.7 loss, but after training on a specific sample for a handful of epochs it is back to ~2.3.
I'm not sure if it will help, but try this:
Build a generator with normalization and, when you fit the generator, give it the entire dataset. Do this for both of your generators. Then train and tell me if you still get something wrong. If the problem goes away, it was the normalization issue I talked about. If it still fails, it only means your network isn't working, whether it's too small or too big. Didn't you get something similar with known working networks, both pre-trained and not?
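Something like this, just to show what I mean by fitting both generators on the entire dataset (toy arrays and a hypothetical setup, not your real data):
import numpy as np
from keras.preprocessing.image import ImageDataGenerator
# Toy stand-in for the real images; shapes and values are arbitrary.
x_all = np.random.rand(100, 32, 32, 3).astype('float32')
y_all = np.random.randint(0, 10, size=(100,))
# Both generators use feature-wise normalization...
train_gen = ImageDataGenerator(featurewise_center=True,
                               featurewise_std_normalization=True,
                               horizontal_flip=True)
val_gen = ImageDataGenerator(featurewise_center=True,
                             featurewise_std_normalization=True)
# ...and both are fitted on the SAME, complete dataset, so they share
# identical normalization statistics.
train_gen.fit(x_all)
val_gen.fit(x_all)
# Then use them as usual, e.g. train_gen.flow(x_all, y_all, batch_size=1).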
I conducted some tests with a small demo model based on https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py. I removed the dropout layers and added three batch normalization layers. Furthermore, I also use a generator for the validation data that is identical to the train generator (it samples the training data with the same augmentation).
When training with a batch size of 20 and thus 2500 steps per epoch the first 5 epochs show this:
Epoch 1/100
2500/2500 [==============================] - 42s - loss: 1.4959 - acc: 0.4666 - val_loss: 1.3534 - val_acc: 0.5166
Epoch 2/100
2500/2500 [==============================] - 41s - loss: 1.1556 - acc: 0.5901 - val_loss: 1.1557 - val_acc: 0.5895
Epoch 3/100
2500/2500 [==============================] - 40s - loss: 1.0115 - acc: 0.6423 - val_loss: 0.9839 - val_acc: 0.6545
Epoch 4/100
2500/2500 [==============================] - 39s - loss: 0.9060 - acc: 0.6824 - val_loss: 0.9007 - val_acc: 0.6838
Epoch 5/100
2500/2500 [==============================] - 39s - loss: 0.8331 - acc: 0.7095 - val_loss: 0.9382 - val_acc: 0.6771
This is in line with the results of the original demo; just adding batch normalization to the model improves the performance. We can also see that the train and val scores match nicely (which is of course expected, as it is the same data, except for small differences in data augmentation).
When training with a batch size of 1 and thus 50000 steps per epoch the first 5 epochs show this:
Epoch 1/100
50000/50000 [==============================] - 620s - loss: 1.4782 - acc: 0.4933 - val_loss: 2.9329 - val_acc: 0.4088
Epoch 2/100
50000/50000 [==============================] - 587s - loss: 1.2395 - acc: 0.5923 - val_loss: 3.6340 - val_acc: 0.3810
Epoch 3/100
50000/50000 [==============================] - 627s - loss: 1.2696 - acc: 0.6002 - val_loss: 3.1556 - val_acc: 0.4210
Epoch 4/100
50000/50000 [==============================] - 617s - loss: 1.3360 - acc: 0.5940 - val_loss: 3.7793 - val_acc: 0.394
Here, I am again using the same data (except for possibly slightly different data augmentation) for training and validation. We clearly see that although it is the same data and the model should (over)fit, the overall train score is a lot worse and the val score is just random/garbage! (We already know, however, that re-training on those images would still lead to a score matching train, but that is beside the point.)
When training with a batch size of 1, but with no batch normalization layer:
Epoch 2/100
50000/50000 [==============================] - 387s - loss: 1.4959 - acc: 0.5051 - val_loss: 1.7481 - val_acc: 0.5102
Epoch 3/100
50000/50000 [==============================] - 397s - loss: 1.5852 - acc: 0.4944 - val_loss: 1.5345 - val_acc: 0.4752
Epoch 4/100
50000/50000 [==============================] - 382s - loss: 1.7295 - acc: 0.4743 - val_loss: 1.5991 - val_acc: 0.4515
Epoch 5/100
50000/50000 [==============================] - 347s - loss: 1.8886 - acc: 0.4375 - val_loss: 1.6135 - val_acc: 0.4212
Here we show that when we remove both dropout and BN, the scores are bad, but at least (as it should be in theory, too) the train and val scores do match!
This clearly shows that there seems to be a problem when using batch normalization with a batch size of 1 and fit_generator. Changing the batch size to 1 makes the validation error (which is also calculated on the training data) blow up completely, while removing batch norm and keeping the batch size at 1 gives bad performance but very similar train and validation errors.
The demo code:
from __future__ import print_function
import keras
from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.layers.normalization import BatchNormalization
import os
batch_size = 1 ## Change here
num_classes = 10
epochs = 100
data_augmentation = True
num_predictions = 20
save_dir = os.path.join(os.getcwd(), 'saved_models')
model_name = 'keras_cifar10_trained_model.h5'
# The data, shuffled and split between train and test sets:
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
# Convert class vectors to binary class matrices.
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same',
                 input_shape=x_train.shape[1:]))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(BatchNormalization()) #New
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(BatchNormalization()) #New
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(BatchNormalization()) #New
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
# initiate RMSprop optimizer
opt = keras.optimizers.rmsprop(lr=0.0001, decay=1e-6)
# Let's train the model using RMSprop
model.compile(loss='categorical_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
if not data_augmentation:
    print('Not using data augmentation.')
    model.fit(x_train, y_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_data=(x_test, y_test),
              shuffle=True)
else:
    print('Using real-time data augmentation.')
    # This will do preprocessing and realtime data augmentation:
    datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=0,  # randomly rotate images in the range (degrees, 0 to 180)
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=True,  # randomly flip images
        vertical_flip=False)  # randomly flip images
    datagen2 = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=0,  # randomly rotate images in the range (degrees, 0 to 180)
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=True,  # randomly flip images
        vertical_flip=False)  # randomly flip images
    # Compute quantities required for feature-wise normalization
    # (std, mean, and principal components if ZCA whitening is applied).
    datagen.fit(x_train)
    datagen2.fit(x_train)
    # Fit the model on the batches generated by datagen.flow().
    model.fit_generator(datagen.flow(x_train, y_train,
                                     batch_size=batch_size),
                        steps_per_epoch=50000,  ## Change w.r.t. batch size
                        epochs=epochs,
                        validation_data=datagen2.flow(x_train, y_train,
                                                      batch_size=batch_size),
                        validation_steps=50000,  ## Change w.r.t. batch size
                        workers=4)
# Save model and weights
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)
model_path = os.path.join(save_dir, model_name)
model.save(model_path)
print('Saved trained model at %s ' % model_path)
# Score trained model.
scores = model.evaluate(x_test, y_test, verbose=1)
print('Test loss:', scores[0])
print('Test accuracy:', scores[1])
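As a side note on the "## Change" markers in the code: both experiments keep the number of samples per epoch fixed at the 50000 training images, so the batch size and the step counts are tied together roughly like this (just a convenience sketch, not part of the run above):
# batch_size = 20 -> steps_per_epoch = validation_steps = 2500
# batch_size = 1  -> steps_per_epoch = validation_steps = 50000
batch_size = 20
steps_per_epoch = 50000 // batch_size
validation_steps = 50000 // batch_size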
Updating to the current master did not solve the problem. However, you are right: the problem also occurs when using the standard fit function:
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_train, y_train),
          shuffle=True,
          verbose=2)
Training for 500 epochs with a batch size of 32 shows the expected behavior when using the same data for validation:
Epoch 500/500
- 13s - loss: 1.1946e-07 - acc: 1.0000 - val_loss: 1.1921e-07 - val_acc: 1.0000
However, training for 500 epochs with a batch size of 1 shows the same problematic behavior:
Epoch 500/500
- 328s - loss: 0.8608 - acc: 0.8789 - val_loss: 4.7222 - val_acc: 0.5357
Did you check how BatchNormalization works?
If it subtracts the mean of the batch from all images, then you always give 0 as input.
Since the "elu" activation is equivalent to BatchNormalization (I think it's the "Network In Network" paper), try using elu activations; it should work.
Batch normalization works with a moving average, so shouldn't subtracting the mean from the image in the batch still produce a meaningful input after a while?
After all, the training error is still converging, and only the model in the test phase is failing, as if the moving average were not used in the test phase.
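To check this, here is a small self-contained sketch (a toy model, not the demo above) that inspects a BN layer's moving statistics and compares the output for the same input in the training phase versus the test phase:
import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense
from keras.layers.normalization import BatchNormalization
# Tiny toy model, only to illustrate the inspection.
model = Sequential()
model.add(Dense(8, input_shape=(4,)))
model.add(BatchNormalization())
model.add(Dense(1))
model.compile(loss='mse', optimizer='rmsprop')
x = np.random.rand(16, 4).astype('float32')
y = np.random.rand(16, 1).astype('float32')
model.fit(x, y, batch_size=1, epochs=2, verbose=0)
# The BN layer's weights are [gamma, beta, moving_mean, moving_variance];
# the last two are the statistics used in the test phase.
gamma, beta, moving_mean, moving_var = model.layers[1].get_weights()
print('moving mean:', moving_mean)
print('moving variance:', moving_var)
# Same input evaluated with learning_phase = 1 (train) and 0 (test).
f = K.function([model.input, K.learning_phase()], [model.output])
print('train-phase output:', f([x[:1], 1])[0])
print('test-phase output: ', f([x[:1], 0])[0])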
I have the same problem. BatchNormalization in both Keras and TensorFlow seems to behave differently in the training and testing phases. Sometimes this greatly influences the validation results.
Exactly the same problem. Have you solved it now?
I am having the exact same problem, where BN behavior differs between train and test for an unknown reason. Has anyone discovered a solution to this?