Darknet: Different results between CPU and GPU (Convolutional layer has problem when dilation > 1 and CUDNN is not active)

Created on 10 Feb 2020 · 11 comments · Source: AlexeyAB/darknet

Hi,

Using this network.cfg.txt, the detection results of the CUDA-enabled build of this repository differ from those of the build without CUDA. The CUDA-enabled build produces better results. With Tiny YOLOv3, there is no difference.

Likely bug


All 11 comments

What params did you use in the Makefile in both cases?

Without CUDA:

```makefile
GPU=0
CUDNN=0
CUDNN_HALF=0
OPENCV=0
AVX=0
OPENMP=0
LIBSO=0
ZED_CAMERA=0
```

With CUDA:

```makefile
GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=1
AVX=0
OPENMP=0
LIBSO=1
ZED_CAMERA=0
```

@AlexeyAB

Could you please point me in the right direction for solving this problem? I am not sure where to start or what to check.

> Different results between CPU and GPU

Show the difference.

Maybe there is a bug in [reorg3d], since it isn't well tested.
I think I should rewrite this layer from scratch.
Try to find the error, or try avoiding [reorg3d]; does that help?

@AlexeyAB
Thanks for your help. I tested the [reorg3d] layer independently with some manual input, but I didn't see any difference between the CPU and GPU outputs of this layer. Next, I added some code to forward_network() and forward_network_gpu() to calculate the total sum of the output elements of each layer and compared them. Here is an example for my .cfg file:

sum[layer] | CPU | GPU | CPU - GPU
--- | --- | --- | ---
sum[0] | 613456.688756 | 613456.68694 | 0.001816
sum[1] | 351479.923 | 351487.230334 | -7.307334
sum[2] | 191303.358827 | 191302.797801 | 0.561026
sum[3] | 393020.180122 | 393037.718248 | -17.538126
sum[4] | 199410.899027 | 199409.187487 | 1.711540
sum[5] | 105360.992388 | 105360.366701 | 0.625687
sum[6] | 205008.054257 | 205008.69806 | -0.643803
sum[7] | 94430.198173 | 94430.778119 | -0.579946
sum[8] | 205949.192374 | 205946.756022 | 2.436352
sum[9] | 97521.177134 | 97517.417314 | 3.759820
sum[10] | 206803.803461 | 213020.145404 | -6216.341943
sum[11] | 97317.056703 | 110235.342102 | -12918.285399
sum[12] | 194838.233836 | 207752.759416 | -12914.525580
sum[13] | 89708.116007 | 99181.952804 | -9473.836797
sum[14] | 201686.877354 | 228163.717362 | -26476.840008
sum[15] | 92507.410713 | 105634.985002 | -13127.574289
sum[16] | 182215.526719 | 204816.937806 | -22601.411087
sum[17] | 103781.770411 | 103379.089134 | 402.681277
sum[18] | 187229.29314 | 196699.370117 | -9470.076977
sum[19] | 87579.457722 | 91369.179553 | -3789.721831
sum[20] | 191361.228132 | 194748.268687 | -3387.040555
sum[21] | 100206.415242 | 104301.907502 | -4095.492260
sum[22] | 50600.791245 | 52192.28841 | -1591.497165
sum[23] | 109251.02559 | 111743.667888 | -2492.642298
sum[24] | 64515.661368 | 61162.404301 | 3353.257067
sum[25] | 143081.962076 | 127353.707183 | 15728.254893
sum[26] | 205949.192374 | 205946.756022 | 2.436352
sum[27] | 44364.052057 | 44367.559203 | -3.507146
sum[28] | 44364.052057 | 44367.559203 | -3.507146
sum[29] | 187446.014133 | 171721.266386 | 15724.747747
sum[30] | 121096.92386 | 120751.287471 | 345.636389
sum[31] | 293955.45623 | 287118.418319 | 6837.037911
sum[32] | -137878.726125 | -132878.892983 | -4999.833142
sum[33] | 11432.234913 | 12133.453549 | -701.218636
sum[34] | 351479.923 | 351487.230334 | -7.307334
sum[35] | 168728.649886 | 168732.069919 | -3.420033
sum[36] | 196935.044232 | 196928.892621 | 6.151611
sum[37] | 97969.597906 | 97965.898785 | 3.699121
sum[38] | 97740.764431 | 97741.244955 | -0.480524
sum[39] | 54131.199064 | 54130.449057 | 0.750007
sum[40] | 41354.888962 | 41360.449196 | -5.560234
sum[41] | 27076.740513 | 27078.459734 | -1.719221
sum[42] | 28407.017031 | 28405.41383 | 1.603201
sum[43] | 64281.085072 | 64278.722779 | 2.362293
sum[44] | -6049.092757 | -6048.547794 | -0.544963
sum[45] | 1011.51904 | 1011.534915 | -0.015875

As can be seen, there is a significant difference between the CPU and GPU outputs starting at layer 10. According to the .cfg file, layer 10 is a convolutional layer with dilated convolution. I ran the same test with Tiny YOLOv3 (which has no dilated convolutions), and the CPU/GPU difference for every layer was smaller than 16. Therefore the problem must be in the calculation of the dilated convolution. I am currently trying to find and fix it, but I haven't had any success yet.
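For reference, the per-layer sums above were produced by instrumentation along these lines (a minimal sketch of my own, assuming darknet's layer fields `l.output`, `l.outputs`, and `l.batch`; `print_layer_sum` is a hypothetical helper):

```c
#include <stdio.h>

/* Called after layer i finishes in forward_network(); in
 * forward_network_gpu() the layer output must first be copied back to the
 * host, e.g. cuda_pull_array(l.output_gpu, l.output, l.outputs * l.batch). */
static void print_layer_sum(int i, const float *output, int n)
{
    double sum = 0.0;  /* accumulate in double to keep rounding noise out of the comparison */
    for (int j = 0; j < n; ++j) sum += output[j];
    printf("sum[%d] | %f\n", i, sum);
}
```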

@AlexeyAB
Well, I think I have found the problem, but I don't know how to fix it. Checking forward_convolutional_layer() and forward_convolutional_layer_gpu(), I noticed that their functionality is the same except when CUDNN is active. I recompiled the code with GPU=1 but CUDNN=0 and CUDNN_HALF=0 and repeated the test above. Without CUDNN, the CPU and GPU results are almost identical. The problem does not exist when dilation = 1 in a convolutional layer.
When CUDNN is active, cudnnSetConvolution2dDescriptor() and cudnnConvolutionForward() are used. When it is not, the only function that uses the dilation is im2col_cpu_ext() for the CPU or im2col_gpu_ext() for the GPU. I suspect that the implementation of dilation in these two functions (which are similar to each other and based on BVLC/caffe) differs from what is implemented in CUDNN, but I don't know how to make them match.
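For context, the BVLC/caffe-style im2col indexing that these functions follow uses dilation roughly as below (an illustrative sketch, not darknet's exact code; `im2col_input_row` is a hypothetical helper):

```c
/* Returns the input row sampled for kernel tap kh at output row oh;
 * columns are computed the same way. Taps that land outside the image
 * act as zero padding. */
static int im2col_input_row(int kh, int oh, int pad_h, int stride_h, int dilation_h)
{
    /* Kernel taps are spread apart by dilation_h, so the kernel's
     * effective extent grows with dilation, and so must the padding if
     * the output is to keep the input's spatial size. */
    return -pad_h + kh * dilation_h + oh * stride_h;
}
```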

I should note that I have a weights file obtained by training with CUDNN enabled. With this weights file, some objects are detected on sample images when CUDNN is active; however, when CUDNN is not active, nothing is detected on the same images.

It seems that the implementation of the convolutional layer is incorrect when dilation > 1 and CUDNN is not active. Here are the results of training the network on the GPU with and without CUDNN:

GPU=1, CUDNN=1, CUDNN_HALF=1:
[training chart: chart_dilation_cuDNN]

GPU=1, CUDNN=0, CUDNN_HALF=0:
[training chart: chart_dilation_gpu]

Do you mean that for dilation=2:

1. `GPU=0 CUDNN=0` - not working properly
2. `GPU=1 CUDNN=0` - not working properly
3. `GPU=1 CUDNN=1` - working properly

> Do you mean that for dilation=2:
>
> 1. `GPU=0 CUDNN=0` - not working properly
> 2. `GPU=1 CUDNN=0` - not working properly
> 3. `GPU=1 CUDNN=1` - working properly

Yes, exactly.

Finally, I was able to fix the problem.

When CUDNN is active, cudnnSetConvolution2dDescriptor() is called in convolutional_layer.c. The second and third arguments of this function are pad_h and pad_w:

```c
cudnnStatus_t cudnnSetConvolution2dDescriptor(
    cudnnConvolutionDescriptor_t    convDesc,
    int                             pad_h,
    int                             pad_w,
    int                             u,
    int                             v,
    int                             dilation_h,
    int                             dilation_w,
    cudnnConvolutionMode_t          mode,
    cudnnDataType_t                 computeType)
```

and the function is called in convolutional_layer.c like this:

```c
cudnnSetConvolution2dDescriptor(l->convDesc,
                                l->pad * l->dilation, l->pad * l->dilation,
                                l->stride_y, l->stride_x,
                                l->dilation, l->dilation,
                                CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT)
```

On the other hand, when CUDNN is not active, im2col_cpu_ext() is called for the CPU in convolutional_layer.c, and im2col_gpu_ext() is called in a similar way for the GPU in convolutional_kernels.cu.

```c
im2col_cpu_ext(im,                      // input
               l.c / l.groups,          // input channels
               l.h, l.w,                // input size (h, w)
               l.size, l.size,          // kernel size (h, w)
               l.pad, l.pad,            // padding (h, w)
               l.stride_y, l.stride_x,  // stride (h, w)
               l.dilation, l.dilation,  // dilation (h, w)
               state.workspace);        // output
```

As can be seen, the padding arguments of cudnnSetConvolution2dDescriptor() are set to l->pad * l->dilation, while the padding arguments of im2col_cpu_ext() and im2col_gpu_ext() are set to l.pad.

As a result, changing the padding arguments of im2col_cpu_ext() and im2col_gpu_ext() from l.pad to l.pad * l.dilation solved the problem, and the results of the CPU, the GPU without CUDNN, and the GPU with CUDNN are now the same.
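Concretely, the CPU-side change amounts to the following (a sketch of the corrected im2col_cpu_ext() call from above; im2col_gpu_ext() in convolutional_kernels.cu is changed the same way):

```c
im2col_cpu_ext(im,                      // input
               l.c / l.groups,          // input channels
               l.h, l.w,                // input size (h, w)
               l.size, l.size,          // kernel size (h, w)
               l.pad * l.dilation,      // padding h, now scaled by dilation
               l.pad * l.dilation,      // padding w, now scaled by dilation
               l.stride_y, l.stride_x,  // stride (h, w)
               l.dilation, l.dilation,  // dilation (h, w)
               state.workspace);        // output
```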

However, I don't know why the padding arguments of cudnnSetConvolution2dDescriptor() are l->pad * l->dilation, or which of l->pad * l->dilation and l->pad is correct.

Also, I'm not sure whether a similar change should be made to im2col_cpu_ext() and col2im_cpu_ext() in backward_convolutional_layer().

I will open a pull request once I am sure the changes are correct. Thanks for your help @AlexeyAB.

> As a result, changing the padding arguments of im2col_cpu_ext() and im2col_gpu_ext() from l.pad to l.pad * l.dilation solved the problem, and the results of the CPU, the GPU without CUDNN, and the GPU with CUDNN are now the same.

Thanks!
`l->pad * l->dilation` is correct.

> Also, I'm not sure whether a similar change should be made to im2col_cpu_ext() and col2im_cpu_ext() in backward_convolutional_layer().

Yes, `l->pad * l->dilation` should be used there too.

You can make a pull request with these changes to this repo, or I will fix it a bit later.
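For what it's worth, the dilation-scaled padding follows from the geometry of dilated kernels. Assuming an odd kernel size $k$, for which darknet sets the padding to $p = \lfloor k/2 \rfloor$, a kernel with dilation $d$ covers an effective extent of

$$k_{\mathrm{eff}} = d\,(k - 1) + 1, \qquad p_{\mathrm{same}} = \frac{k_{\mathrm{eff}} - 1}{2} = d \cdot \frac{k - 1}{2} = d \cdot p,$$

so keeping the output the same size as the input requires scaling the padding by the dilation. For example, size=3 with dilation=2 has an effective extent of 5 and needs padding 2 = pad * dilation.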
