Caffe: OpenCL-branch, GPU mode freezes but CPU mode works

Created on 23 Nov 2016 · 10 Comments · Source: BVLC/caffe

Dear community members,

I'm using caffe-opencl on Linux Mint 17 with an AMD HD7470 (1 GB) and 8 GB of RAM. I've built it twice, with different behaviour each time, and now GPU mode freezes.

In the first build, I followed the compilation instructions from the caffe-master pages (http://caffe.berkeleyvision.org/install_apt.html).
Compilation succeeded, but when I ran LeNet using sh examples/mnist/train_lenet_rmsprop.sh (I had already converted the data to lmdb using the script),
it kept printing the three lines below from data_layer.cpp, without any loss output.

prefetch time 1ms
data time 0.4ms
transform time 0.5ms

(The second build) make runtest showed that NetTest/0.TestSharedWeightsUpdate failed.
Later I found that the OpenCL compilation needs clBLAS, according to https://github.com/amd/OpenCL-caffe/wiki.
I followed the instructions and built clBLAS, but then got confused about how to switch from BLAS/ATLAS to clBLAS using cmake. So I removed all the old build files (rm *) and built again with the same parameters (still using ViennaCL as BLAS). This time I didn't get the endless loop of prefetch time output. However, when I ran LeNet, the GPU load rose to 99% within a few seconds and it froze at

I1123 17:56:16.679487 14177 caffe.cpp:278] Starting Optimization
I1123 17:56:16.679507 14177 solver.cpp:299] Solving LeNet
I1123 17:56:16.679520 14177 solver.cpp:300] Learning Rate Policy: inv
I1123 17:56:16.685703 14177 solver.cpp:358] Iteration 0, Testing net (#0)

The CPU mode runs ok.

Is there something I neglected? Are there any suggestions for the installation? I have now realized that the second set of build instructions is from another repo (but I didn't change the configuration).

Below is the output of caffe device_query

I1123 18:17:56.751840 14380 common.cpp:373] Total devices: 2
I1123 18:17:56.752209 14380 common.cpp:374] CUDA devices: 0
I1123 18:17:56.752226 14380 common.cpp:375] OpenCL devices: 2
I1123 18:17:56.752238 14380 common.cpp:399] Device id:                     0
I1123 18:17:56.752250 14380 common.cpp:401] Device backend:                OpenCL
I1123 18:17:56.752280 14380 common.cpp:403] Backend details:               Advanced Micro Devices, Inc.: OpenCL 2.0 AMD-APP (1598.5)
I1123 18:17:56.752300 14380 common.cpp:405] Device vendor:                 Advanced Micro Devices, Inc.
I1123 18:17:56.752315 14380 common.cpp:407] Name:                          Caicos
I1123 18:17:56.752328 14380 common.cpp:409] Total global memory:           536870912
I1123 18:17:56.752341 14380 common.cpp:399] Device id:                     1
I1123 18:17:56.752352 14380 common.cpp:401] Device backend:                OpenCL
I1123 18:17:56.752367 14380 common.cpp:403] Backend details:               Advanced Micro Devices, Inc.: OpenCL 2.0 AMD-APP (1598.5)
I1123 18:17:56.752379 14380 common.cpp:405] Device vendor:                 GenuineIntel
I1123 18:17:56.752391 14380 common.cpp:407] Name:                          Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz
I1123 18:17:56.752403 14380 common.cpp:409] Total global memory:           8325865472

OpenCL question

kbxu

All 10 comments

Oh, I found out that the Makefile is a lot easier than the cmake method, and I've got it working with clBLAS. BTW ViennaCL still freezes. I think my GPU probably needs to be upgraded. I'm closing this; the issue can be solved by using the Makefile.

kbxu on 23 Nov 2016

@kbxu
Maybe you want to enable USE_LIBDNN in the Makefile for faster speed.
Additionally, you can check whether this ViennaCL fixes the issue: https://github.com/viennacl/viennacl-dev
Finally, yes, clBLAS is the fastest BLAS for your AMD GPU anyway :)

naibaf7 on 23 Nov 2016

@naibaf7
I replaced ViennaCL with the dev version and uncommented USE_LIBDNN in Makefile.config.
The build then stopped with errors when compiling libdnn_conv.cpp:

In file included from src/caffe/greentea/libdnn_conv.cpp:6:0:
src/caffe/greentea/libdnn_conv.cpp: In instantiation of ‘caffe::LibDNNConv<Dtype>::Tune(Dtype*, Dtype*, Dtype*, Dtype*, Dtype*, Dtype*, Dtype*, Dtype*, int32_t) [with Dtype = float; int32_t = int]::__lambda15’:
src/caffe/greentea/libdnn_conv.cpp:1861:7:   required from ‘struct caffe::LibDNNConv<Dtype>::Tune(Dtype*, Dtype*, Dtype*, Dtype*, Dtype*, Dtype*, Dtype*, Dtype*, int32_t) [with Dtype = float; int32_t = int]::__lambda15’
src/caffe/greentea/libdnn_conv.cpp:1859:3:   required from ‘void caffe::LibDNNConv<Dtype>::Tune(Dtype*, Dtype*, Dtype*, Dtype*, Dtype*, Dtype*, Dtype*, Dtype*, int32_t) [with Dtype = float; int32_t = int]’
src/caffe/greentea/libdnn_conv.cpp:1941:1:   required from here
./include/caffe/greentea/libdnn.hpp:104:8: error: ‘bool caffe::LibDNN<Dtype>::CompileKernels() [with Dtype = float]’ is protected
   bool CompileKernels();
        ^
src/caffe/greentea/libdnn_conv.cpp:1862:50: error: within this context
       return self->LibDNN<Dtype>::CompileKernels();
                                                  ^

kbxu on 24 Nov 2016

Also, the GPU seems to be slower than the CPU over 100 iterations (in the build without LibDNN).

I1124 11:17:48.136464 14473 common.cpp:373] Total devices: 2
I1124 11:17:48.136653 14473 common.cpp:374] CUDA devices: 0
I1124 11:17:48.136667 14473 common.cpp:375] OpenCL devices: 2
I1124 11:17:48.136674 14473 common.cpp:399] Device id:                     0
I1124 11:17:48.136684 14473 common.cpp:401] Device backend:                OpenCL
I1124 11:17:48.136701 14473 common.cpp:403] Backend details:               Advanced Micro Devices, Inc.: OpenCL 2.0 AMD-APP (1598.5)
I1124 11:17:48.136714 14473 common.cpp:405] Device vendor:                 Advanced Micro Devices, Inc.
I1124 11:17:48.136725 14473 common.cpp:407] Name:                          Caicos
I1124 11:17:48.136735 14473 common.cpp:409] Total global memory:           536870912
I1124 11:17:48.136745 14473 common.cpp:399] Device id:                     1
I1124 11:17:48.136754 14473 common.cpp:401] Device backend:                OpenCL
I1124 11:17:48.136765 14473 common.cpp:403] Backend details:               Advanced Micro Devices, Inc.: OpenCL 2.0 AMD-APP (1598.5)
I1124 11:17:48.136773 14473 common.cpp:405] Device vendor:                 GenuineIntel
I1124 11:17:48.136782 14473 common.cpp:407] Name:                          Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz
I1124 11:17:48.136791 14473 common.cpp:409] Total global memory:           8325865472
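For readers mapping the -gpu indices used in the benchmark commands to the devices listed above, here is a minimal Python sketch that pulls the device id, vendor and name out of caffe device_query log lines. The regular expression and the helper name parse_devices are illustrative assumptions based only on the log format shown in this thread; they are not part of Caffe.

```python
import re

# A subset of the device_query lines posted in this thread.
LOG = """\
I1124 11:17:48.136674 14473 common.cpp:399] Device id:                     0
I1124 11:17:48.136714 14473 common.cpp:405] Device vendor:                 Advanced Micro Devices, Inc.
I1124 11:17:48.136725 14473 common.cpp:407] Name:                          Caicos
I1124 11:17:48.136745 14473 common.cpp:399] Device id:                     1
I1124 11:17:48.136773 14473 common.cpp:405] Device vendor:                 GenuineIntel
I1124 11:17:48.136782 14473 common.cpp:407] Name:                          Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz
"""

# The log prefix ends with "] "; the payload that follows is "key: value".
LINE_RE = re.compile(r"\]\s+(?P<key>[^:]+):\s+(?P<value>.+)$")


def parse_devices(log_text):
    """Group 'key: value' log lines under the most recent 'Device id'."""
    devices = {}
    current = None
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        key = match.group("key").strip()
        value = match.group("value").strip()
        if key == "Device id":
            current = int(value)
            devices[current] = {}
        elif current is not None:
            devices[current][key] = value
    return devices


devices = parse_devices(LOG)
for dev_id, info in sorted(devices.items()):
    print(dev_id, info["Device vendor"], "|", info["Name"])
```

On this machine, -gpu 0 therefore selects the Caicos GPU, and -gpu 1 selects the Intel CPU exposed through the AMD OpenCL platform.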

caffe time -model=examples/mnist/lenet_train_test.prototxt -iterations 100 -gpu 1
...
I1124 11:16:08.494714 14445 caffe.cpp:465] Average Forward pass: 68.3438 ms.
I1124 11:16:08.494748 14445 caffe.cpp:467] Average Backward pass: 100.068 ms.
I1124 11:16:08.494762 14445 caffe.cpp:469] Average Forward-Backward: 168.602 ms.
I1124 11:16:08.494776 14445 caffe.cpp:471] Total Time: 16860.2 ms.
I1124 11:16:08.494789 14445 caffe.cpp:472] *** Benchmark ends ***

caffe time -model=examples/mnist/lenet_train_test.prototxt -iterations 100 -gpu 0
...
I1124 11:17:13.445297 14455 caffe.cpp:465] Average Forward pass: 152.661 ms.
I1124 11:17:13.445358 14455 caffe.cpp:467] Average Backward pass: 123.195 ms.
I1124 11:17:13.445390 14455 caffe.cpp:469] Average Forward-Backward: 276.549 ms.
I1124 11:17:13.445415 14455 caffe.cpp:471] Total Time: 27654.9 ms.
I1124 11:17:13.445439 14455 caffe.cpp:472] *** Benchmark ends ***

kbxu on 24 Nov 2016

The benchmark of the master version without OpenCL is even better. I'm not sure what makes the OpenCL version slower in my case.

(pulled the master branch and built with CPU_ONLY := 1)

./build/tools/caffe time -model=examples/mnist/lenet_train_test.prototxt -iterations 100
I1124 11:57:08.851390 16116 caffe.cpp:412] Average Forward pass: 29.4132 ms.
I1124 11:57:08.851403 16116 caffe.cpp:414] Average Backward pass: 42.3777 ms.
I1124 11:57:08.851414 16116 caffe.cpp:416] Average Forward-Backward: 71.84 ms.
I1124 11:57:08.851425 16116 caffe.cpp:418] Total Time: 7184 ms.
I1124 11:57:08.851435 16116 caffe.cpp:419] *** Benchmark ends ***
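To put the three benchmark runs side by side, the short Python sketch below divides each average forward-backward time by the CPU_ONLY result. The numbers are copied from the logs above; the dictionary labels are only descriptive names chosen for this example.

```python
# Average forward-backward times (ms) copied from the three benchmark logs.
timings_ms = {
    "OpenCL branch, -gpu 1 (Intel CPU via AMD OpenCL)": 168.602,
    "OpenCL branch, -gpu 0 (Caicos GPU)": 276.549,
    "master branch, CPU_ONLY": 71.84,
}

baseline = timings_ms["master branch, CPU_ONLY"]

# Ratio of each run's time to the CPU_ONLY time (1.0 means equally fast).
slowdowns = {name: ms / baseline for name, ms in timings_ms.items()}

for name, ratio in slowdowns.items():
    print(f"{name}: {timings_ms[name]:.1f} ms, {ratio:.2f}x the CPU_ONLY time")
```

With these figures, the OpenCL CPU run takes roughly 2.3 times as long as the CPU_ONLY build, and the GPU run roughly 3.8 times as long.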

kbxu on 24 Nov 2016

Sorry to tell you, but the performance seems correct.
The HD7450 is a HD6450, so a VLIW chip that is no longer supported by AMD. Its peak performance would be 240 GFLOPS, and it is also limited by slow memory. I think we can get at most 80 GFLOPS out of it (maybe 160 with LibDNN), while your CPU with vector (SSE3/AVX) instructions can probably do 200 GFLOPS in Caffe.
To use GPUs I would recommend at least an R9 280X/380X or RX 460 with GCN architecture (they are at around 2000 GFLOPS).
Well, at least the HD7450 should be power efficient, with around 15 W consumption :)
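The peak figures quoted above are consistent with the usual back-of-the-envelope estimate of execution units × clock × FLOPs per unit per cycle. The Python sketch below relies on hardware figures that are not given in this thread and should be read as assumptions: 160 stream processors at 750 MHz for the Caicos GPU, and 4 cores at 3.2 GHz with 8-wide AVX issuing one add and one multiply per cycle for the i5-3470.

```python
def peak_gflops(units, clock_ghz, flops_per_unit_per_cycle):
    """Theoretical single-precision peak in GFLOPS."""
    return units * clock_ghz * flops_per_unit_per_cycle


# Assumed (not from the thread): 160 stream processors at 0.75 GHz,
# each doing one multiply-add (2 FLOPs) per cycle.
gpu_peak = peak_gflops(160, 0.75, 2)

# Assumed: 4 cores at 3.2 GHz, 8 floats per AVX register,
# one add plus one multiply per cycle (8 * 2 FLOPs per core per cycle).
cpu_peak = peak_gflops(4, 3.2, 8 * 2)

print(f"Caicos GPU peak: {gpu_peak:.0f} GFLOPS")
print(f"i5-3470 peak:    {cpu_peak:.1f} GFLOPS")
```

Under these assumptions the GPU comes out at 240 GFLOPS and the CPU at about 205 GFLOPS, so the two chips are in the same range on paper before memory bandwidth is taken into account.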

To make your CPU faster with OpenCL, use Intel OpenCL and OpenBLAS or MKL. Currently you are using AMD OpenCL on your Intel chip, which is bad. However, even then you won't get past 200 GFLOPS.

I'll look into why LibDNN is not compiling for you.

naibaf7 on 24 Nov 2016

Thank you. BTW my gcc/g++ version is 4.8.4.

kbxu on 24 Nov 2016

@kbxu This should be fixed.

naibaf7 on 6 Dec 2016

@naibaf7 Awesome! I'll try it in a few days.

kbxu on 7 Dec 2016

@kbxu Remember to use OpenBLAS/MKL + Intel OpenCL SDK for the CPU and LibDNN + AMD FGLRX driver for the AMD GPU :)

naibaf7 on 7 Dec 2016
