I am currently on Keras 2.2.4 and TensorFlow 1.12.0. This issue was also observed on Keras 2.1.6 with TF 1.8.0.
I have a UNet with batch normalization trained on my dataset. After training, I use the model to predict segmentation outputs for unseen images.
soft_predictions = self.inference_model.predict(np.vstack(images))
Sometimes I pass multiple images at a time, but sometimes it could be just one image. I notice that the segmentation output of image A differs between two cases: (1) if I pass image A with other images; and (2) if I pass only image A.
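For clarity, the comparison looks roughly like the sketch below (image_A and other_images are hypothetical, already-preprocessed arrays with a leading batch dimension of 1):
import numpy as np

# hypothetical inputs, each of shape (1, H, W, C)
pred_batched = inference_model.predict(np.vstack([image_A] + other_images))[0]
pred_alone = inference_model.predict(image_A)[0]

# these two predictions of image A should be identical, yet they differ
print(np.abs(pred_batched - pred_alone).max())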
With other images:

On its own:

It might not be too obvious here, but the pixel values differ. Also, please excuse the performance of the network: it was trained on only a few images for a few iterations, but that is not the issue here. The inconsistency is observed on well-trained networks too.
Here are some other things that I have experimented on:
(1) Removing batch normalization from the network remedies this issue: the segmentation output is consistent in both scenarios. So I think I can safely say that the source of the issue is the BatchNorm layer. However, not using BatchNorm is not an option.
(2) I have also tried setting layer.trainable = False for all layers in my inference_model, to no avail.
(3) I also tried setting layer._per_input_updates = {} on all BatchNorm layers in inference_model, still to no avail (see the sketch after this list).
(4) Setting training=False when calling the BatchNorm layers in inference_model makes the network output all 1.0 or 0.0.
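For reference, workarounds (2) and (3) amount to something like the sketch below (assuming a standalone-Keras functional model named inference_model; _per_input_updates is a private attribute):
from keras.layers import BatchNormalization

for layer in inference_model.layers:
    # workaround (2): freeze every layer
    layer.trainable = False
    if isinstance(layer, BatchNormalization):
        # workaround (3): drop the BN moving-statistics update ops (private API)
        layer._per_input_updates = {}

# workaround (4) would instead rebuild the graph, calling each BN layer
# with training=False, e.g. x = BatchNormalization()(x, training=False)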
If anybody could give me an idea of how to solve this problem, it would be much appreciated. This issue is really annoying because it makes evaluation and putting the model into production very difficult.
I have a similar issue. When I use predict or evaluate with different batch sizes I get different results (e.g., with predict, batch_size=16 gives a Dice of 0.88 while batch_size=32 gives 0.51).
Also, I noticed that evaluate and predict give distinct results (e.g., for batch_size=32, predict gives a Dice of 0.51 while evaluate gives 0.35). I would expect some minor difference due to truncation errors, but not in this range.
Another issue is that if I run predict or evaluate more than once on the same data I get slightly different results.
Without BatchNormalization these behaviours do not happen (apart from a slight difference between evaluate and predict), so it must be related to BatchNorm.
With what batch size did you train the model?
I trained the model with a batch_size of 20.
I have written a simple example to replicate the issue (batch_size of 40); the full code is attached in
BN2.zip (I am using Keras 2.2.0).
I tested with fcnn, a UNet-like architecture with BatchNorm, and fcnn_no_batch_normalization, which is the same network without BatchNorm.
import numpy as np
from numpy.random import randint
from keras.optimizers import SGD
# fcnn, fcnn_no_batch_normalization and accu are defined in the attached BN2.zip

model = fcnn(47, 47, 47, 2)
# model = fcnn_no_batch_normalization(47, 47, 47, 2)
model.summary(line_length=113)

sgd = SGD(lr=0.01, decay=0, momentum=0.85, nesterov=False)
model.compile(optimizer=sgd,
              loss='categorical_crossentropy',
              metrics=['accuracy'],
              sample_weight_mode='temporal')

# dummy data for fcnn
n_samples = 1000
out_size = 43

# randomly generated inputs
imgs_train = np.float32(randint(0, 1000, (n_samples, 47, 47, 47, 1)))

# targets: pixels with intensity above 800 in a noisy version of imgs_train
msks_train = np.zeros((n_samples, out_size**3, 2))
imgs_train2 = imgs_train + randint(0, 500, (n_samples, 47, 47, 47, 1))  # imgs_train + noise
crop = (2, 2)
imgs_train2_crop = (imgs_train2[:, crop[0]:-crop[1], crop[0]:-crop[1], crop[0]:-crop[1], 0] > 800)
msks_train[..., 1] = imgs_train2_crop.reshape((n_samples, out_size**3))
msks_train[..., 0] = 1 - imgs_train2_crop.reshape((n_samples, out_size**3))

model.fit(imgs_train,
          msks_train,
          epochs=5,
          batch_size=40,
          verbose=True,
          shuffle=True)

# predict accuracy for different batch sizes
batchSizes = [1, 5, 32, 53, 98]
for i in batchSizes:
    print('batch size :', i, 'accuracy :', accu(msks_train, model.predict(imgs_train, batch_size=i)))
The output with fcnn was
batch size : 1 accuracy : 0.8674336976618411
batch size : 5 accuracy : 0.86743371023935
batch size : 32 accuracy : 0.8674336976618411
batch size : 53 accuracy : 0.86743371023935
batch size : 98 accuracy : 0.86743371023935
and the output with fcnn_no_batch_normalization was:
batch size : 1 accuracy : 0.4484741343529501
batch size : 5 accuracy : 0.4484741343529501
batch size : 32 accuracy : 0.4484741343529501
batch size : 53 accuracy : 0.4484741343529501
batch size : 98 accuracy : 0.4484741343529501
In this code the differences are small, but I have a more complex network where the differences are larger (0.1 - 0.5 in accuracy on similar dummy data).
If anyone could help me out that would be great!
Any updates on this? I am surprised that this inconsistency does not bother more people. Or is this inconsistency something inherent in BatchNorm itself? I wonder whether other frameworks show this behavior too.
Any updates on this?
I'm having the same issue, along with https://github.com/tensorflow/tensorboard/issues/1514, which seems to be a Keras bug rather than TF. Is there any Keras version where this works? Need it asap.
Still no updates from my side. This issue is very pronounced when you are doing segmentation on an image in which the interesting structure is only a small part of the image. It seems like it is trying to normalize the output, which results in a high false positive rate.
I am also having the same issue here :(

same here, accuracy never improves
+1 #12851

I have noticed this recently - predictions on the exact same data are very different at test time for segmentation problems.
I resolved my problem by standardizing a sample (in a batch of 32) that I had inadvertently left non-standardized, with sigma=52, which had severely disrupted the BN layers. To suggest a debug method: _monitor_ the BN layers (via e.g. heatmaps) throughout training and see if they drastically change after any particular training iteration; a sketch follows below. Also, mind extreme outliers and other anomalous behaviors.
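A minimal sketch of that debug method, assuming standalone Keras (the BNMonitor callback below is hypothetical): snapshot each BatchNormalization layer's moving mean/variance at the end of every epoch, then plot the history as heatmaps and look for abrupt jumps.
from keras.callbacks import Callback
from keras.layers import BatchNormalization

class BNMonitor(Callback):
    """Snapshot BN moving statistics at the end of every epoch."""
    def __init__(self):
        super(BNMonitor, self).__init__()
        self.history = []  # one dict per epoch: {layer_name: (moving_mean, moving_var)}

    def on_epoch_end(self, epoch, logs=None):
        snapshot = {}
        for layer in self.model.layers:
            if isinstance(layer, BatchNormalization):
                # default BN weight order: [gamma, beta, moving_mean, moving_variance]
                weights = layer.get_weights()
                snapshot[layer.name] = (weights[-2].copy(), weights[-1].copy())
        self.history.append(snapshot)

# usage: monitor = BNMonitor(); model.fit(x, y, callbacks=[monitor])
# then stack monitor.history per layer and render it as a heatmap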
I've run into this issue as well.
Re-using an identical sample for training and validation, the model gets 100% accuracy while training, but validation accuracy oscillates around 60%. Setting momentum to 0.0001 so that the "moving mean/variance" == "last batch mean/variance" did not fix the validation accuracy (momentum semantics sketched below), so BatchNorm must be doing something else that modifies the data at validation time.
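For clarity, a sketch of the momentum semantics relied on here (Keras BatchNormalization): the moving statistics are updated as moving = momentum * moving + (1 - momentum) * batch_stat, so a momentum near 0 makes them track the last batch.
from keras.layers import BatchNormalization

# with momentum=0.0001, the stored moving mean/variance end up
# approximately equal to the statistics of the last training batch
bn = BatchNormalization(momentum=0.0001)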
I'm using a TimeDistributed layer which might make my case unique, but here's the issue anyway with a reproducible example just in case:
https://github.com/tensorflow/tensorflow/issues/30109
I'm using tensorflow-gpu 1.13, Keras 2.2.4
Exact same issue for me. Would love to hear an update regarding this.
I just discovered this same issue. I looked at the closed solution but I don't think it addresses the issue. Batch norm simply shifts and scales the data by a fixed amount derived from the exponential moving averages. This should be fixed at test time and independent of the batch contents.
See: https://www.quora.com/How-does-batch-normalization-behave-differently-at-training-time-and-test-time
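For reference, a minimal numpy sketch of that inference-time transform (per channel, using the layer's stored statistics; eps matches the Keras default epsilon of 1e-3):
import numpy as np

def bn_inference(x, gamma, beta, moving_mean, moving_var, eps=1e-3):
    # a fixed affine transform: it depends only on the stored statistics,
    # not on the other samples in the batch
    return gamma * (x - moving_mean) / np.sqrt(moving_var + eps) + beta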
All,
I have created a very small test fixture which reproduces the error. Can you verify you get the bug? @twinanda, @LukeBolly
import tensorflow as tf
import numpy as np

input1 = tf.keras.layers.Input(shape=(128, 128, 1))
x = tf.keras.layers.BatchNormalization()(input1)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(5, activation='softmax')(x)
model = tf.keras.models.Model(input1, x)

batchSize = 32
x = np.random.rand(batchSize, 128, 128, 1)

print('The following rows should all be equal...')
for k in range(1, batchSize):
    y = model.predict(x[0:k, :, :, :])
    print(y[0, :])
Yes, I can reproduce the bug with this code. The rows are not equal. I have also commented on your original thread on TF. Thank you.
@twinanda Thanks. Have you noticed any patterns in your observations from your original reconstruction experiment? Does it happen on both test set and training set data? If you move the image under test to different positions in the batch, do you see the same issue?
@isaacgerg I have noticed this in my experiment. Let's assume a person segmentation problem. Assume image A has 1 person and image B has 5 persons, and my training data has an average of 4-6 persons per image. I tried the following scenarios:
Note that images A and B are from a static camera (the background is the same). So it does not really make sense why, in the second case, the network segments background as people, while for image A the network performs very well.
I have not really compared the results between passing test and training data, nor varied the position of the image in the batch.
I found a similar problem. I thought it was a batch normalization problem because my previous batch size was too small, but when I changed the batch size to 8 (still not that big), I still encountered the same problem.
There is a significant inconsistency between training and test.

Here you can see two experiments:
The red one, without_BN: I do not use BN.
The dark green one is with BN.
The bottom curves are the total training loss; during training, both curves converge.
However, if we look at the real-time loss, which is computed in test mode, there is a huge difference between without_BN and with_BN. Please check the upper curves.
I set the learning phase to 0 so it behaves in test mode:
eval_dict[K.learning_phase()] = 0
summaries = s.run(merged_summaries, feed_dict=eval_dict)
Because of this, I get very poor performance on the test data when using batch normalization.
Hope the info I provided here can help fix the problem. Thanks.
Wow, not even setting trainable=False fixes the issue. No wonder I can't replicate my PyTorch results.
I've begun to look at the issue, but haven't fully investigated yet. Observations:
@isaacgerg's "all rows equal" code does _not_ reproduce the error in TensorFlow 2.0.0 + Keras 2.3.0/2.3.1; all rows _are_ identical. In TensorFlow 1.14.0 + Keras 2.2.5, however, the rows aren't identical.
Let me know if this solves all parts of the issue.
Here's TF 2.0.0 CPU release version, and its bundled Keras. The bug is evident.
Keras version 2.2.4-tf
TF version 2.0.0
The following rows should all be equal...
[0.25248486 0.00714255 0.29660103 0.34764042 0.09613115]
[0.25248486 0.00714256 0.296601 0.34764037 0.09613117]
[0.25248486 0.00714256 0.296601 0.34764037 0.09613117]
[0.25248486 0.00714256 0.296601 0.34764037 0.09613117]
[0.2524849 0.00714256 0.2966009 0.34764042 0.09613118]
[0.2524849 0.00714256 0.2966009 0.34764042 0.09613118]
[0.2524849 0.00714256 0.2966009 0.34764042 0.09613118]
[0.2524849 0.00714256 0.2966009 0.34764042 0.09613118]
This is with tf.keras.backend.set_learning_phase(0), of course.
Here's "standalone" Keras 2.3.1:
Keras version 2.3.1
TF version 2.0.0
The following rows should all be equal...
[0.22926576 0.3467834 0.18447833 0.18067528 0.05879718]
[0.22926575 0.3467835 0.18447842 0.18067518 0.0587972 ]
[0.22926575 0.3467835 0.18447842 0.18067518 0.0587972 ]
[0.22926575 0.3467835 0.18447842 0.18067518 0.0587972 ]
[0.2292658 0.34678343 0.18447843 0.1806752 0.0587972 ]
[0.2292658 0.34678343 0.18447843 0.1806752 0.0587972 ]
[0.2292658 0.34678343 0.18447843 0.1806752 0.0587972 ]
[0.2292658 0.34678343 0.18447843 0.1806752 0.0587972 ]
[0.2292658 0.34678343 0.18447843 0.1806752 0.0587972 ]
Bug is evident as well. I'm not sure how you were testing @OverLordGoldDragon, please share details of your set-up.
Here's the same test on the GPU version of TF 2.0:
Keras version 2.3.1
TF version 2.0.0
The following rows should all be equal...
2019-11-01 22:22:30.073645: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[0.23161781 0.31788135 0.28815767 0.03564478 0.12669837]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161778 0.3178814 0.28815764 0.03564478 0.12669839]
[0.23161778 0.3178814 0.28815764 0.03564478 0.12669839]
[0.23161778 0.3178814 0.28815764 0.03564478 0.12669839]
[0.23161778 0.3178814 0.28815764 0.03564478 0.12669839]
Uh... I could've sworn they were the same, but now they aren't - I either got lucky, or my visual cortex misfired.
Alright, I'm game - will investigate sometime.
I believe I've figured it out, but no one will like the verdict, myself included: the problem is numeric precision, and BatchNormalization has nothing to do with it. It's a work in progress, but I've narrowed down the exact operation responsible: np.dot - you can follow the work here. Will update later with all relevant testing code, and a solution. I've also made OP's images into a gif for direct comparison:

(P.S. removing BatchNormalization in your tests may "solve" the problem, but it's _not_ BN itself that's problematic, but how it transforms input tensor dimensionalities; will clarify later)
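To illustrate the claim (a sketch of the kind of check involved, not the exact code from the linked investigation): in float32, the first row of a matrix product can differ in the last few bits depending on how many rows the batch has, because the underlying BLAS kernels pick different blocking/accumulation orders; in float64 the discrepancy typically shrinks by orders of magnitude.
import numpy as np

np.random.seed(0)
x = np.random.rand(32, 16384).astype('float32')  # e.g. a flattened (128, 128, 1) batch
w = np.random.rand(16384, 5).astype('float32')   # e.g. a Dense(5) kernel

ref = np.dot(x[:1], w)[0]
for k in (1, 2, 8, 32):
    out = np.dot(x[:k], w)[0]  # same input row, different batch slice
    # may print tiny non-zero differences in float32, depending on the BLAS build
    print(k, np.abs(out - ref).max())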
I was able to produce a similar error without using batch norm on a very simple network: https://github.com/keras-team/keras/issues/13328
@OverLordGoldDragon thanks for pursuing it much further than I ever could. I have another issue which might lead to a clue for the issue, or maybe to a completely different problem.
Unfortunately, I cannot send you my dataset because it is private.
Now riddle me this. Imagine you want to segment a white cat in front of a black background. Your training data includes cats passing by the background and images with just the background (no cats). Simple stuff. Your network converges, and your evaluation shows you have > 0.95 dice score. You are happy.
You pass test images with cats in them, and your network performs amazingly... until you call .predict and pass a single background image (with no cats in it). Suddenly, you see the network trying to segment dust or random structures in the image as a cat. The false positive is strong with this one. gasp
@twinanda You're welcome. On cats: sounds like overfitting, contrary to intuition; _the test set isn't a universal benchmark_ of your neural net. Part of the ability to generalize includes _robustness to noise_, which is often omitted from explicit testing. It's what roots "adversarial attacks"; in the extreme example, the "one-pixel attack", a single pixel in an image is (intelligently) manipulated to trick the (well-trained) NN into thinking that a cat is a motorboat.
... or it's a bug. Can't tell much without model code and at least dataset info (shapes, quality, noise, etc). If you'd like, I can have a look if you post a minimally-reproducible example on StackOverflow.
In the meantime, I have a request specific to you, which does sort of "leak" the crux of my investigation: right after importing backend as K, run K.set_floatx('float64'), and rerun the exact code used to generate your original greyscale image. Let me know if the ultimate difference is nearly as dramatic.
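In case it helps, the requested change is just the following (a sketch, standalone Keras; it must run before the model is built or loaded):
from keras import backend as K

K.set_floatx('float64')  # build/load the model only after this line
# ... then rerun the original prediction code that produced the greyscale images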