I am currently on Keras 2.2.4 and TensorFlow 1.12.0. This issue was also observed on Keras 2.1.6 with TF 1.8.0.
I have a UNet with batch normalization trained on my dataset. After training, I use the model to predict segmentation outputs for unseen images.
soft_predictions = self.inference_model.predict(np.vstack(images))
Sometimes I pass multiple images at a time, but sometimes it could be just one image. I notice that the segmentation output of image A differs between two cases: (1) if I pass image A with other images; and (2) if I pass only image A.
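For clarity, the comparison looks roughly like the sketch below (image_A and other_images are hypothetical, already-preprocessed arrays with a leading batch dimension of 1):
import numpy as np

# hypothetical inputs, each of shape (1, H, W, C)
pred_batched = inference_model.predict(np.vstack([image_A] + other_images))[0]
pred_alone = inference_model.predict(image_A)[0]

# these two predictions of image A should be identical, yet they differ
print(np.abs(pred_batched - pred_alone).max())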
With other images:

On its own:

It might not be too obvious here, but the pixel values differ. Also, please excuse the performance of the network: it was trained on only a few images for a few iterations, but that is not the issue here. The inconsistency is observed on well-trained networks too.
Here are some other things that I have experimented on:
(1) Removing batch normalization from the network remedies this issue: the segmentation output is consistent in both scenarios. So I think I can safely say that the source of the issue is the BatchNorm layer. However, not using BatchNorm is not an option.
(2) I have also tried setting layer.trainable = False for all layers in my inference_model, to no avail.
(3) I also tried setting layer._per_input_updates = {} on all BatchNorm layers in inference_model, still to no avail (see the sketch after this list).
(4) Setting training=False when calling the BatchNorm layers in inference_model makes the network output all 1.0 or 0.0.
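For reference, workarounds (2) and (3) amount to something like the sketch below (assuming a standalone-Keras functional model named inference_model; _per_input_updates is a private attribute):
from keras.layers import BatchNormalization

for layer in inference_model.layers:
    # workaround (2): freeze every layer
    layer.trainable = False
    if isinstance(layer, BatchNormalization):
        # workaround (3): drop the BN moving-statistics update ops (private API)
        layer._per_input_updates = {}

# workaround (4) would instead rebuild the graph, calling each BN layer
# with training=False, e.g. x = BatchNormalization()(x, training=False)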
If anybody could give me an idea of how to solve this problem, it would be much appreciated. This issue is really annoying because it makes evaluation and putting the model into production very difficult.
I have a similar issue. When I use predict or evaluate with different batch sizes I get different results (e.g., with predict, batch_size=16 gives a Dice of 0.88 while batch_size=32 gives 0.51).
Also, I noticed that evaluate and predict give distinct results (e.g., for batch_size=32, predict gives a Dice of 0.51 while evaluate gives 0.35). I would expect some minor difference due to truncation errors, but not in this range.
Another issue is that if I run predict or evaluate more than once on the same data I get slightly different results.
Without BatchNormalization these behaviours do not happen (apart from a slight difference between evaluate and predict), so it must be related to BatchNorm.
With what batch size did you train the model?
I trained the model with a batch_size of 20.
I have written a simple example to replicate the issue (batch_size of 40); the full code is attached in
BN2.zip (I am using Keras 2.2.0).
I tested with fcnn, a UNet-like architecture with BatchNorm, and fcnn_no_batch_normalization, which is the same network without BatchNorm.
import numpy as np
from numpy.random import randint
from keras.optimizers import SGD
# fcnn, fcnn_no_batch_normalization and accu are defined in the attached BN2.zip

model = fcnn(47, 47, 47, 2)
# model = fcnn_no_batch_normalization(47, 47, 47, 2)
model.summary(line_length=113)

sgd = SGD(lr=0.01, decay=0, momentum=0.85, nesterov=False)
model.compile(optimizer=sgd,
              loss='categorical_crossentropy',
              metrics=['accuracy'],
              sample_weight_mode='temporal')

# dummy data for fcnn
n_samples = 1000
out_size = 43

# randomly generated inputs
imgs_train = np.float32(randint(0, 1000, (n_samples, 47, 47, 47, 1)))

# targets: pixels with intensity above 800 in a noisy version of imgs_train
msks_train = np.zeros((n_samples, out_size**3, 2))
imgs_train2 = imgs_train + randint(0, 500, (n_samples, 47, 47, 47, 1))  # imgs_train + noise
crop = (2, 2)
imgs_train2_crop = (imgs_train2[:, crop[0]:-crop[1], crop[0]:-crop[1], crop[0]:-crop[1], 0] > 800)
msks_train[..., 1] = imgs_train2_crop.reshape((n_samples, out_size**3))
msks_train[..., 0] = 1 - imgs_train2_crop.reshape((n_samples, out_size**3))

model.fit(imgs_train,
          msks_train,
          epochs=5,
          batch_size=40,
          verbose=True,
          shuffle=True)

# predict accuracy for different batch sizes
batchSizes = [1, 5, 32, 53, 98]
for i in batchSizes:
    print('batch size :', i, 'accuracy :', accu(msks_train, model.predict(imgs_train, batch_size=i)))
The output with fcnn was
batch size : 1 accuracy : 0.8674336976618411
batch size : 5 accuracy : 0.86743371023935
batch size : 32 accuracy : 0.8674336976618411
batch size : 53 accuracy : 0.86743371023935
batch size : 98 accuracy : 0.86743371023935
and the output with fcnn_no_batch_normalization was:
batch size : 1 accuracy : 0.4484741343529501
batch size : 5 accuracy : 0.4484741343529501
batch size : 32 accuracy : 0.4484741343529501
batch size : 53 accuracy : 0.4484741343529501
batch size : 98 accuracy : 0.4484741343529501
In this code the differences are small, but I have a more complex network where the differences are larger (0.1 - 0.5 in accuracy on similar dummy data).
If anyone could help me out that would be great!
Any updates on this? I am surprised that this inconsistency does not bother more people. Or is this inconsistency something inherent in BatchNorm itself? I wonder whether other frameworks show this behavior too.
Any updates on this?
I'm having the same issue, along with https://github.com/tensorflow/tensorboard/issues/1514, which seems to be a Keras bug rather than TF. Is there any Keras version where this works? Need it asap.
Still no updates from my side. This issue is very pronounced when you are doing segmentation on an image in which the interesting structure is only a small part of the image. It seems like it is trying to normalize the output, which results in a high false positive rate.
I am also having the same issue here :(

same here, accuracy never improves
+1 #12851

I have noticed this recently - predictions on the exact same data are very different at test time for segmentation problems.
I resolved my problem by standardizing a sample (in a batch of 32) that I had inadvertently left non-standardized, with sigma=52, which had severely disrupted the BN layers. To suggest a debug method: _monitor_ the BN layers (via e.g. heatmaps) throughout training and see if they drastically change after any particular training iteration; a sketch follows below. Also, mind extreme outliers and other anomalous behaviors.
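A minimal sketch of that debug method, assuming standalone Keras (the BNMonitor callback below is hypothetical): snapshot each BatchNormalization layer's moving mean/variance at the end of every epoch, then plot the history as heatmaps and look for abrupt jumps.
from keras.callbacks import Callback
from keras.layers import BatchNormalization

class BNMonitor(Callback):
    """Snapshot BN moving statistics at the end of every epoch."""
    def __init__(self):
        super(BNMonitor, self).__init__()
        self.history = []  # one dict per epoch: {layer_name: (moving_mean, moving_var)}

    def on_epoch_end(self, epoch, logs=None):
        snapshot = {}
        for layer in self.model.layers:
            if isinstance(layer, BatchNormalization):
                # default BN weight order: [gamma, beta, moving_mean, moving_variance]
                weights = layer.get_weights()
                snapshot[layer.name] = (weights[-2].copy(), weights[-1].copy())
        self.history.append(snapshot)

# usage: monitor = BNMonitor(); model.fit(x, y, callbacks=[monitor])
# then stack monitor.history per layer and render it as a heatmap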
I've run into this issue as well.
Re-using an identical sample for training and validation, the model gets 100% accuracy while training, but validation accuracy oscillates around 60%. Setting momentum to 0.0001 so that the "moving mean/variance" == "last batch mean/variance" did not fix the validation accuracy (momentum semantics sketched below), so BatchNorm must be doing something else that modifies the data at validation time.
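For clarity, a sketch of the momentum semantics relied on here (Keras BatchNormalization): the moving statistics are updated as moving = momentum * moving + (1 - momentum) * batch_stat, so a momentum near 0 makes them track the last batch.
from keras.layers import BatchNormalization

# with momentum=0.0001, the stored moving mean/variance end up
# approximately equal to the statistics of the last training batch
bn = BatchNormalization(momentum=0.0001)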
I'm using a TimeDistributed layer which might make my case unique, but here's the issue anyway with a reproducible example just in case:
https://github.com/tensorflow/tensorflow/issues/30109
I'm using tensorflow-gpu 1.13, Keras 2.2.4
Exact same issue for me. Would love to hear an update regarding this.
I just discovered this same issue. I looked at the closed solution but I don't think it addresses the issue. Batch norm simply shifts and scales the data by a fixed amount derived from the exponential moving averages. This should be fixed at test time and independent of the batch contents.
See: https://www.quora.com/How-does-batch-normalization-behave-differently-at-training-time-and-test-time
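For reference, a minimal numpy sketch of that inference-time transform (per channel, using the layer's stored statistics; eps matches the Keras default epsilon of 1e-3):
import numpy as np

def bn_inference(x, gamma, beta, moving_mean, moving_var, eps=1e-3):
    # a fixed affine transform: it depends only on the stored statistics,
    # not on the other samples in the batch
    return gamma * (x - moving_mean) / np.sqrt(moving_var + eps) + beta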
All,
I have created a very small test fixture which reproduces the error. Can you verify you get the bug? @twinanda, @LukeBolly
import tensorflow as tf
import numpy as np

input1 = tf.keras.layers.Input(shape=(128, 128, 1))
x = tf.keras.layers.BatchNormalization()(input1)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(5, activation='softmax')(x)
model = tf.keras.models.Model(input1, x)

batchSize = 32
x = np.random.rand(batchSize, 128, 128, 1)

print('The following rows should all be equal...')
for k in range(1, batchSize):
    y = model.predict(x[0:k, :, :, :])
    print(y[0, :])
Yes, I can reproduce the bug with this code. The rows are not equal. I have also commented on your original thread on TF. Thank you.
@twinanda Thanks. Have you noticed any patterns in your observations from your original reconstruction experiment? Does it happen on both test set and training set data? If you move the image under test to different positions in the batch, do you see the same issue?
@isaacgerg I have noticed this in my experiment. Let's assume a person segmentation problem. Assume image A has 1 person and image B has 5 persons, and my training data has an average of 4-6 persons per image. I tried the following scenarios:
Note that images A and B are from a static camera (the background is the same). So it does not really make sense why, in the second case, the network segments background as people, while for image A the network performs very well.
I have not really compared the results between passing test and training data, nor varied the position of the image in the batch.
I found a similar problem. I thought it was a batch normalization problem because my previous batch size was too small, but when I changed the batch size to 8 (still not that big), I still encountered the same problem.
There is a significant inconsistency between training and test.

Here you can see two experiments:
The red one, without_BN: I do not use BN.
The dark green one is with BN.
The bottom curves are the total training loss; during training, both curves converge.
However, if we look at the real-time loss, which is computed in test mode, there is a huge difference between without_BN and with_BN. Please check the upper curves.
I set the learning phase to 0 so it behaves in test mode:
eval_dict[K.learning_phase()] = 0
summaries = s.run(merged_summaries, feed_dict=eval_dict)
Because of this, I get very poor performance on the test data when using batch normalization.
Hope the info I provided here can help fix the problem. Thanks.
Wow, not even setting trainable=False fixes the issue. No wonder I can't replicate my PyTorch results.
I've begun to look at the issue, but haven't fully investigated yet. Observations:
@isaacgerg's "all rows equal" code does _not_ reproduce the error in TensorFlow 2.0.0 + Keras 2.3.0/2.3.1; all rows _are_ identical. In TensorFlow 1.14.0 + Keras 2.2.5, however, the rows aren't identical.
Let me know if this solves all parts of the issue.
Here's TF 2.0.0 CPU release version, and its bundled Keras. The bug is evident.
Keras version 2.2.4-tf
TF version 2.0.0
The following rows should all be equal...
[0.25248486 0.00714255 0.29660103 0.34764042 0.09613115]
[0.25248486 0.00714256 0.296601 0.34764037 0.09613117]
[0.25248486 0.00714256 0.296601 0.34764037 0.09613117]
[0.25248486 0.00714256 0.296601 0.34764037 0.09613117]
[0.2524849 0.00714256 0.2966009 0.34764042 0.09613118]
[0.2524849 0.00714256 0.2966009 0.34764042 0.09613118]
[0.2524849 0.00714256 0.2966009 0.34764042 0.09613118]
[0.2524849 0.00714256 0.2966009 0.34764042 0.09613118]
This is with tf.keras.backend.set_learning_phase(0), of course.
Here's "standalone" Keras 2.3.1:
Keras version 2.3.1
TF version 2.0.0
The following rows should all be equal...
[0.22926576 0.3467834 0.18447833 0.18067528 0.05879718]
[0.22926575 0.3467835 0.18447842 0.18067518 0.0587972 ]
[0.22926575 0.3467835 0.18447842 0.18067518 0.0587972 ]
[0.22926575 0.3467835 0.18447842 0.18067518 0.0587972 ]
[0.2292658 0.34678343 0.18447843 0.1806752 0.0587972 ]
[0.2292658 0.34678343 0.18447843 0.1806752 0.0587972 ]
[0.2292658 0.34678343 0.18447843 0.1806752 0.0587972 ]
[0.2292658 0.34678343 0.18447843 0.1806752 0.0587972 ]
[0.2292658 0.34678343 0.18447843 0.1806752 0.0587972 ]
Bug is evident as well. I'm not sure how you were testing @OverLordGoldDragon, please share details of your set-up.
Here's the same test on the GPU version of TF 2.0:
Keras version 2.3.1
TF version 2.0.0
The following rows should all be equal...
2019-11-01 22:22:30.073645: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[0.23161781 0.31788135 0.28815767 0.03564478 0.12669837]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161776 0.3178814 0.2881577 0.03564478 0.12669833]
[0.23161778 0.3178814 0.28815764 0.03564478 0.12669839]
[0.23161778 0.3178814 0.28815764 0.03564478 0.12669839]
[0.23161778 0.3178814 0.28815764 0.03564478 0.12669839]
[0.23161778 0.3178814 0.28815764 0.03564478 0.12669839]
Uh... I could've sworn they were the same, but now they aren't - I either got lucky, or my visual cortex misfired.
Alright, I'm game - will investigate sometime.
I believe I've figured it out, but no one will like the verdict, myself included: the problem is numeric precision, and BatchNormalization has nothing to do with it. It's a work in progress, but I've narrowed down the exact operation responsible: np.dot - you can follow the work here. Will update later with all relevant testing code, and a solution. I've also made OP's images into a gif for direct comparison:

(P.S. removing BatchNormalization in your tests may "solve" the problem, but it's _not_ BN itself that's problematic, but how it transforms input tensor dimensionalities; will clarify later)
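To illustrate the claim (a sketch of the kind of check involved, not the exact code from the linked investigation): in float32, the first row of a matrix product can differ in the last few bits depending on how many rows the batch has, because the underlying BLAS kernels pick different blocking/accumulation orders; in float64 the discrepancy typically shrinks by orders of magnitude.
import numpy as np

np.random.seed(0)
x = np.random.rand(32, 16384).astype('float32')  # e.g. a flattened (128, 128, 1) batch
w = np.random.rand(16384, 5).astype('float32')   # e.g. a Dense(5) kernel

ref = np.dot(x[:1], w)[0]
for k in (1, 2, 8, 32):
    out = np.dot(x[:k], w)[0]  # same input row, different batch slice
    # may print tiny non-zero differences in float32, depending on the BLAS build
    print(k, np.abs(out - ref).max())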
I was able to produce a similar error without using batch norm on a very simple network: https://github.com/keras-team/keras/issues/13328
@OverLordGoldDragon thanks for pursuing it much further than I ever could. I have another issue which might lead to a clue for the issue, or maybe to a completely different problem.
Unfortunately, I cannot send you my dataset because it is private.
Now riddle me this. Imagine you want to segment a white cat in front of a black background. Your training data includes cats passing by the background and images with just the background (no cats). Simple stuff. Your network converges, and your evaluation shows you have > 0.95 dice score. You are happy.
You pass test images with cats in them, and your network performs amazingly... until you call .predict and pass a single background image (with no cats in it). Suddenly, you see the network trying to segment dust or random structures in the image as a cat. The false positive is strong with this one. gasp
@twinanda You're welcome. On cats: sounds like overfitting, contrary to intuition; _the test set isn't a universal benchmark_ of your neural net. Part of the ability to generalize includes _robustness to noise_, which is often omitted from explicit testing. It's what roots "adversarial attacks"; in the extreme example, the "one-pixel attack", a single pixel in an image is (intelligently) manipulated to trick the (well-trained) NN into thinking that a cat is a motorboat.
... or it's a bug. Can't tell much without model code and at least dataset info (shapes, quality, noise, etc). If you'd like, I can have a look if you post a minimally-reproducible example on StackOverflow.
In the meantime, I have a request specific to you, which does sort of "leak" the crux of my investigation: right after importing backend as K, run K.set_floatx('float64'), and rerun the exact code used to generate your original greyscale image. Let me know if the ultimate difference is nearly as dramatic.
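In case it helps, the requested change is just the following (a sketch, standalone Keras; it must run before the model is built or loaded):
from keras import backend as K

K.set_floatx('float64')  # build/load the model only after this line
# ... then rerun the original prediction code that produced the greyscale images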