I am trying to convert a pre-trained ResNet-50 from Caffe to Torch. I've built the structure in Torch and copied the parameters from Caffe into it. When I compare the layer outputs, they start to diverge after the first batch normalization layer. I noticed that the BN layer in Caffe does not include the scale and bias (those are implemented in a separate Scale layer), so I tried computing the normalization part independently as (inputData - mean) / sqrt(variance + eps), but the BN layer output in Caffe differs from this as well.
Here is the prototxt for the first two layers:
layer {
  name: "conv_1"
  type: "Convolution"
  bottom: "data"
  top: "conv_1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 64
    pad: 3
    kernel_size: 7
    stride: 2
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "bn_1"
  type: "BatchNorm"
  bottom: "conv_1"
  top: "conv_1"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  batch_norm_param {
    use_global_stats: true
  }
}
Here is how I read the parameters of the BN layer:
net.params['bn_1'][0].data  # mean; first element: -1345.0353 (mean for the first channel)
net.params['bn_1'][1].data  # variance; first element: 1692061.5 (variance for the first channel)
net.params['bn_1'][2].data  # not sure whether this is eps or the moving average factor; first and only element: 999.98236084
The input to the BN layer is the output of the convolution layer, which I compute like this:
img = np.full((1, 3, 224, 224), 0.5)  # input with all elements set to 0.5
net.blobs['data'].data[...] = img     # feed the input blob
output = net.forward(start='data', end='conv_1')
The first element of convolution layer output is: -0.340065956
So I expect the first element of the BN layer output to be:
(-0.340065956 - (-1345.0353)) / sqrt(1692061.5) = 1.03375
But the first element of the BN layer output (from net.forward(start='data', end='bn_1')) is actually 0.02443156.
There is a huge difference, and a small eps won't change the result from 1.03375 to 0.02443. I am not sure whether I missed a step or made a miscalculation somewhere. Could anyone help me with this? Thank you!
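For concreteness, the mismatch can be reproduced with the numbers above (a minimal sketch; eps is assumed to be Caffe's default of 1e-5, which is not stated in the thread):

```python
import math

# Naive normalization with the raw stored mean and variance.
x    = -0.340065956   # first element of the conv_1 output
mean = -1345.0353     # net.params['bn_1'][0].data[0]
var  = 1692061.5      # net.params['bn_1'][1].data[0]
eps  = 1e-5           # assumption: Caffe's default eps

naive = (x - mean) / math.sqrt(var + eps)
print(naive)  # about 1.03375, far from the observed 0.02443156
```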
Got it now.
net.params['bn_1'][2].data stores a scale factor, by which the mean and variance of this layer should be divided.
So in Caffe, the output of the BN layer is calculated as: (input - mean / scale_factor) / sqrt(var / scale_factor + eps)
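Plugging in the numbers reported above confirms this (a minimal sketch; eps is assumed to be Caffe's default of 1e-5):

```python
import math

# Values reported earlier in the thread, for the first channel of bn_1.
x    = -0.340065956   # first element of the conv_1 output
mean = -1345.0353     # net.params['bn_1'][0].data[0]
var  = 1692061.5      # net.params['bn_1'][1].data[0]
sf   = 999.98236084   # net.params['bn_1'][2].data[0], the scale factor
eps  = 1e-5           # assumption: Caffe's default eps

# Divide the stored statistics by the scale factor before normalizing.
out = (x - mean / sf) / math.sqrt(var / sf + eps)
print(out)  # closely matches the observed BN output 0.02443156
```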
Hello Xuefei,
I was having the same question until I came across your post. I was wondering how you figured out the Caffe calculation; now that I go back and look at the source code, I can more or less see it. Is reading the source code the only way to figure this out, or is there a better way?
Thank you
@vnalluri Glad this post helps. I first compared the parameters of the BN layer in Torch and Caffe, and noticed that in Caffe BN is represented by two layers and there is one additional parameter. Then I tried to Google the differences between the BN implementations in Torch and Caffe, but found little information. After that, I looked into the source code of both Caffe and Torch and found the difference.
Even though the last thing I tried is what solved the problem, I'd suggest not diving straight into the low-level implementation code. I excluded the other, easier-to-check possibilities first and left the most 'difficult' one for last.
Thank you @XuefeiW. One question I have: I observe that the third blob always has the same value, 999.98236084 (also reported in your comment), across different networks, datasets, and iteration counts. My understanding is that the third blob should hold (roughly) the number of batch-norm iterations, weighted by the moving_average_fraction. Do you also see 999.98236084, or a number close to it, all the time?
@extragoya Nope. I use the ResNet-18 from https://github.com/HolmesShuan/ResNet-18-Caffemodel-on-ImageNet, and the third blob of its first BN layer is 9.99999332.
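A sketch of why these two values differ, based on my reading of Caffe's BatchNorm source: blobs_[2] is updated each training iteration as s <- s * moving_average_fraction + 1, so it converges to 1 / (1 - moving_average_fraction) rather than growing with the iteration count.

```python
# Simulate the accumulation of the third BN blob in Caffe:
# s <- s * moving_average_fraction + 1 on every training iteration.
def scale_factor(moving_average_fraction, iterations):
    s = 0.0
    for _ in range(iterations):
        s = s * moving_average_fraction + 1.0
    return s

# With the default moving_average_fraction of 0.999, the limit is 1000,
# consistent with values like 999.98236084 after enough iterations.
print(scale_factor(0.999, 500000))

# With moving_average_fraction = 0.9, the limit is 10,
# consistent with the 9.99999332 reported for the ResNet-18 model.
print(scale_factor(0.9, 500000))
```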
Thank you so much!! I'd been confused by this for a long time until I saw this!