Opencv_contrib: dnn forward calculation is much slower than caffe original version

Created on 25 Dec 2015 · 24 Comments · Source: opencv/opencv_contrib

When running the VGG_ILSVRC_16_layers model, a forward pass takes more than 10 seconds, much slower than the original Caffe version. Both use the CPU only.

dnn (moved out) feature


shengxingdong

Most helpful comment

@shengxingdong in the recent code of the opencv_dnn module "cblas_sgemm" is already present, but the calls to the gemm function are very confusing. It seems that opencv_dnn is still under development. Fortunately, I have figured out how to change the sources to get a performance boost of up to 6.5x (not 40x as mentioned above).

For those who want to try this method: first, let's trace the gemm call chain, taking the convolutional layer as an example. On forward data propagation the layer calls "dnn::gemm", then "dnn::gemm" calls "cv::gemm" or "cv::gemmCPU", and if "gemmCPU" is called we eventually arrive at its implementation. There, cblas_gemm is called only if HAVE_CBLAS is defined (it must be defined when the library is built; it is one of the parameters controlled at the CMake generation step) and if the "flags" variable is not equal to 0. But let's take a step back. What do we see? The convolutional layer always calls "dnn::gemm" with "flags" equal to zero! So cblas_gemm is never called, and we see no performance boost because GEMMInvoker multiplies the matrices instead. (I guess that GEMMInvoker, which runs when flags equals zero, is the designers' attempt to perform the matrix multiplication manually, but at present it is too slow compared to cblas_gemm.)

So what should we change in the sources? First, remove or comment out the GEMMInvoker calls in gemmCPU. If you rebuild opencv_dnn, you should already get the speedup. But look at dnn::gemm: it can also call "cv::gemm" from the opencv_core module, which, if you build with OpenCL, will call ocl_gemm and other strange stuff. cv::gemm is called only if the layers of your network use OpenCL. The convolutional layer does not use OpenCL by default, but the fully connected layer does (see https://github.com/opencv/opencv_contrib/blob/master/modules/dnn/src/layers/fully_connected_layer.cpp#L75). For some reason cv::gemm with OpenCL is slower than the gemmCPU from opencv_dnn backed by cblas_gemm. So the second thing to do is to explicitly set useOpenCL to false in the .cpp files of all layers that your network uses.

On the VGG Face Descriptor CNN and an Intel Core i5-4210U, these steps reduce the forward propagation time from 5500 ms to 800 ms (on the same machine, caffe.exe in CPU mode shows a forward propagation time of 924 ms).

P.S. Note that if you build the opencv_dnn module in MSVC Visual Studio with MKL, you have to make this manual change explicitly. I spent a whole day finding this out empirically... So do not assume, as I did, that CMake does all the dirty work for you.

pi-null-mezon on 2 Dec 2016

👍5

All 24 comments

Yes, it's really very slow. I tested one of my models: Caffe is about 7x faster than OpenCV DNN.
Also, there seems to be no place to load a mean_file in OpenCV DNN, yet the GoogLeNet demo of OpenCV DNN predicts the correct result without a mean_file. So is the mean_file useless?

piaobuliao on 29 Dec 2015

👍2

I have problems with dnn::forward() too. It's not that it's slow; it's literally crashing with my custom-trained model. Is my code okay? Thanks.

Just submitted the issue [1]

dnn::Net net;
importer->populateNet(net);
importer.release();                     // We don't need importer anymore

Mat img = imread(imageFile);

resize(img, img, Size(224, 224));       // GoogLeNet accepts only 224x224 RGB-images
dnn::Blob inputBlob = dnn::Blob(img);   // Convert Mat to dnn::Blob image batch
net.setBlob(".data", inputBlob);        // Set the network input
net.forward();                          // Compute output (cannot compute output with mine here! ...)
dnn::Blob prob = net.getBlob("prob");   // Gather output of "prob" layer

[1] https://github.com/Itseez/opencv_contrib/issues/517

ghost on 15 Jan 2016

I tested the performance on my i7-4770K desktop and ran into the same problem.
I replaced the gemm function with the MKL gemm function and got an 8.7x performance gain: the running time drops from 480 ms to 55.9 ms. Both run on a single thread.
If you are interested, you can see the modification on my GitHub:
https://github.com/xmchen1987/opencv_contrib/commit/06eed890ccc6aeadbca0844f5b12a4ea41b8c3f7

xmchen1987 on 26 Jan 2016

@xmchen1987 So to use your modification we have to install MKL BLAS. Do you have a solution for OpenBLAS?

tofighi on 17 Feb 2016

@tofighi Yes, for now. I'm working on adding support for Eigen, which has already been added to OpenCV.

xmchen1987 on 18 Feb 2016

👍1

@piaobuliao I can't find a place to load mean_file either.

chapternewscu on 2 Jun 2016

👍1

@shengxingdong @piaobuliao
The main bottleneck is the cv::gemm function, which is used in the Convolution, Deconvolution and FC layers.
Caffe uses the gemm operation from heavily optimized BLAS libraries (MKL, OpenBLAS or ATLAS), while the current cv::gemm implementation is straightforward and weakly optimized.

I replaced cv::gemm with gemm from MKL and OpenBLAS and measured performance on the Convolution layers. The results are awesome: up to 40x speedup.

ludv1x on 24 Jun 2016

👍5

@ludv1x
So, at the end of the day, can you provide a pure C++ example that performs classification at the same speed as Python?

tofighi on 24 Jun 2016

Hello. In fact, this issue is still relevant on MS Windows. Today I cloned the latest master-branch snapshots of both the opencv and opencv_contrib repos. Then I installed the latest Intel MKL, checked opencv_dnn_WITH_BLAS in the CMake project (CMake found the MKL path automatically), and successfully built opencv in MS Visual Studio 2015 x64. But after all that, the performance of the fresh opencv build was five times worse than Caffe. What am I doing wrong? Do I need to "replace cv::gemm" manually, as mentioned above? Or should I check something else in the CMake project?

pi-null-mezon on 30 Nov 2016

@pi-null-mezon , as @ludv1x said, the bottleneck is cv::gemm. You can replace cv::gemm with cblas_sgemm as @xmchen1987 did.

shengxingdong on 1 Dec 2016

@vpisarev @pi-null-mezon Thank you for the detailed explanation and investigation.
The behavior (and performance) of all operations that use gemm for matrix multiplication changed after a recent commit.

It should be fixed.

ludv1x on 7 Dec 2016

@ludv1x I have worked with the latest sources. After a bunch of experiments I have found a better way (than MKL) to speed up opencv_dnn. Today it is the simplest way: no additional libraries are needed, and the matrix multiplication goes to the GPU. To use it, your opencv must be linked with OpenCL. Go to opencv_contrib/modules/dnn/src/layers and, in every layer that your target DNN uses, find the variables that enable OpenCL, such as useOpenCL, and set all such flags to true. Build opencv. Now all dnn::gemm calls will be forwarded to ocl_gemm.

On my laptop (Core i5-4210U / AMD Radeon R5 M330), forward propagation for the VGG Face Descriptor CNN takes 800 ms when dnn::gemm runs on cblas_gemm, and about 1000 ms when it runs on ocl_gemm. So here MKL wins. But on a machine with a Core i5-6600 / Nvidia GeForce 1070 the situation changes to 380 ms / 250 ms. Do not forget that with ocl_gemm the CPU is unloaded, whereas MKL loads the CPU up to 100%. Interestingly, the Caffe utility compiled with Nvidia CUDA does forward propagation on the GeForce 1070 within 20 ms! So if opencv_dnn gets a cuBLAS backend, it could dramatically improve the performance.

pi-null-mezon on 8 Dec 2016

std::vector<cv::Mat> transformedInput;
boost::shared_ptr<caffe::Net<float> > net_;
int batchSize_ = 150;
caffe::Caffe::set_mode(caffe::Caffe::GPU);
caffe::Caffe::SetDevice(0);
net_.reset(new caffe::Net<float>(caffeModelTxt, caffe::TEST));
net_->CopyTrainedLayersFrom(caffeModelBin);

// DNN
int totalNum = transformedInput.size();
svmNodesSets.resize(totalNum);
#pragma omp critical(dnn)
{
time = boost::posix_time::microsec_clock::local_time();
int currSize = 0;
caffe::Blob<float>* input_layer = net_->input_blobs()[0];
caffe::BlobProto blob_proto;
blob_proto.set_channels(3);
blob_proto.set_height(kernelSize_.height);
blob_proto.set_width(kernelSize_.width);
blob_proto.clear_data();
for(int n=0; n<totalNum;){

    // Add image
    for (int c = 0; c < 3; ++c) {
        for (int h = 0; h < kernelSize_.height; ++h) {
            for (int w = 0; w < kernelSize_.width; ++w) {
                blob_proto.add_data(transformedInput[n].at<cv::Vec3f>(h, w)[c]);
            }
        }
    }
    n++;
    currSize++;

    // Batch Size Set
    bool finalSet = n == (totalNum);
    if(currSize ==  batchSize_ || finalSet ){

        // Compute
        blob_proto.set_num(currSize);
        input_layer->FromProto(blob_proto);
        net_->Forward();
        boost::shared_ptr<caffe::Blob<float> > layer = net_->blob_by_name("pool5/7x7_s1");
        int vectorSize = layer->count()/currSize;
        const float* layerCPUData = layer->cpu_data();

        // Assign to SVM (Award if you can understand)
        //cout << "start " << n-currSize << " " << " end " << n << endl;
        for(int i=n-currSize, x=0; i<n; i++, x++){
            for(int j=0; j<vectorSize; j++){
                float data = layerCPUData[vectorSize*x + j];
            }
        }

        // Clear Data
        blob_proto.clear_data();
        currSize=0;
        //cout << "Completion: " << ((float)(n)/(float)totalNum)*100 << "%" << endl;
    }
}
}

I highly recommend using Caffe's C++ API for your personal projects over OpenCV DNN for now. I have found it to be the most stable and the fastest C++ GPU based implementation. I've provided sample code for reference on how you might do batch processing on a forward pass of a network and extract a layer from it.

soulslicer on 21 Feb 2017

@ludv1x If it's been fixed (and looks like the fix is in 3.2.0 release), why isn't this issue closed?

kb1ooo on 4 May 2017

@kb1ooo
Of course, it was fixed.
However, I cannot close PRs.

ludv1x on 9 May 2017

Can you clarify if the latest version runs at par with the Caffe implementation?
Also, does this use the GPU? (might be a naive question but I didn't find the documentation indicating anything explicitly.)

anupamsobti on 23 May 2017

@anupamsobti The latest snapshot of the dnn module does not support the GPU at all (earlier some layers had an OpenCL backend, but not anymore). Now there is only one way to speed up the calculations: compile opencv_dnn with an optimized BLAS library, either OpenBLAS or Intel MKL (slightly faster).

pi-null-mezon on 23 May 2017

Ok. Thanks @pi-null-mezon for clarifying. Can you comment if the CPU-only implementation is equivalent to Caffe's CPU-only implementation?

anupamsobti on 23 May 2017

@anupamsobti If you're asking about equivalence in performance, the answer is almost yes if you're using MKL or OpenBLAS. If the question is about the code itself, they're similar but not the same. Basically, the dnn implementation relies on Caffe, but a few things cause differences. First, support for other frameworks: there are some layers in Caffe and other frameworks that have the same name but compute results in slightly different ways. Second, we've done some optimizations for "the heaviest" layers.

arrybn on 23 May 2017

Thanks!

anupamsobti on 23 May 2017

@arrybn Does it work on Windows?

oostap1 on 24 May 2017

@oostap1 Yes, the dnn module works on Windows

arrybn on 24 May 2017

Hello, sorry for reopening such an old topic, but I'm having trouble speeding up the computation.

I compiled the most recent opencv with MKL, but I still get slow inference. I build opencv with Visual Studio, so I manually changed the dnn module to use parallel MKL as suggested before. Am I missing something else?