Caffe: Batch forward is not concurrent

Created on 12 Apr 2015 · 6 Comments · Source: BVLC/caffe

_Issue_
The batch forward pass via the Python (and MATLAB) interface appears not to run concurrently.

_Details_

  1. I performed an experiment where I passed N=30 images to the predict method of /python/caffe/classifier.py, which calls self.forward_all(**{self.inputs[0]: caffe_in}) to run a batched forward pass.
  2. I then separately called predict for these N images, one by one (see the timing sketch after this list).
  3. The total processing time was similar in steps 1 and 2, which suggests forward_all was not concurrent; it appears the processing is done one image at a time.
  4. My expectation is that the predict method, when called with an array of N images, will run them concurrently, i.e. essentially at the time cost of a forward pass on a single image.
  5. Note that I used a GPU with abundant memory, so I do not believe this is hardware related. I used a standard CaffeNet model.
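
For reference, here is a minimal sketch of the comparison described above, assuming a standard CaffeNet deploy prototxt and caffemodel on disk (the file paths, image size, and pre-processing arguments are placeholders):

```python
import time
import numpy as np
import caffe

caffe.set_mode_gpu()

# Placeholder paths and pre-processing settings for a standard CaffeNet model.
net = caffe.Classifier('deploy.prototxt', 'caffenet.caffemodel',
                       image_dims=(256, 256), raw_scale=255,
                       channel_swap=(2, 1, 0))

N = 30
images = [np.random.rand(256, 256, 3).astype(np.float32) for _ in range(N)]

# Step 1: one predict call with all N images (batched via forward_all).
t0 = time.time()
net.predict(images, oversample=False)
print('batched call: %.3f s' % (time.time() - t0))

# Step 2: N separate predict calls, one image at a time.
t0 = time.time()
for im in images:
    net.predict([im], oversample=False)
print('one-by-one:   %.3f s' % (time.time() - t0))
```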

All 6 comments

Updating the issue with a new symptom: it appears this problem also impacts the MATLAB interface, in addition to the Python interface. Thanks to user "Dinesh" from the Caffe Google Group for adding to this issue.

Because of the wider impact, this suggests a more fundamental issue may be at play. Therefore, I'd like to bump this up on @shelhamer's radar.

Forward is concurrent up to the batch size. Batching inputs together for GPU processing does improve throughput.

> My expectation is that the predict method, when called with an array of N images, will run them concurrently, i.e. essentially at the time cost of a forward pass on a single image.

Right, and this will be true for a batch size of N. Compare the time to compute the forward pass for 100 images at a time with batch size 100 versus doing inference on 1 image at a time with batch size 1, 100 times in sequence.
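
A minimal sketch of that comparison using the raw pycaffe Net interface, assuming a CaffeNet-style deploy prototxt with a 'data' input blob (the file paths and the 227x227 input size are placeholders):

```python
import time
import numpy as np
import caffe

caffe.set_mode_gpu()
# Placeholder model files; any CaffeNet-style deploy net works the same way.
net = caffe.Net('deploy.prototxt', 'caffenet.caffemodel', caffe.TEST)

N = 100
batch = np.random.rand(N, 3, 227, 227).astype(np.float32)

# Batch size N: one forward pass over all images at once.
net.blobs['data'].reshape(N, 3, 227, 227)
net.reshape()
net.blobs['data'].data[...] = batch
t0 = time.time()
net.forward()
print('batch size %d: %.3f s' % (N, time.time() - t0))

# Batch size 1: N forward passes in sequence.
net.blobs['data'].reshape(1, 3, 227, 227)
net.reshape()
t0 = time.time()
for i in range(N):
    net.blobs['data'].data[...] = batch[i]
    net.forward()
print('batch size 1: %.3f s' % (time.time() - t0))
```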

For Python and MATLAB, performance timing is also conflated with pre-processing time. Resizing images in this way can be slow. That's why the classification example reports separate timings for (1) prediction plus pre-processing and (2) prediction alone.

This is not a bug, but a matter of usage. For concurrent pre-processing or other concurrency beyond the batch size, you can consider a Python data layer with multiprocessing or other Python parallelism.
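
For example, here is a minimal sketch of parallel pre-processing with Python's multiprocessing module; the preprocess helper, file list, and 227x227 target size below are placeholders, not part of Caffe:

```python
from multiprocessing import Pool
import numpy as np
import caffe

# Hypothetical helper (not part of Caffe): load one image and convert it to
# the (3, 227, 227) float32 layout the net's input blob expects.
def preprocess(path):
    im = caffe.io.load_image(path)                   # H x W x 3, floats in [0, 1]
    im = caffe.io.resize_image(im, (227, 227))
    return im.transpose(2, 0, 1).astype(np.float32)  # 3 x 227 x 227

if __name__ == '__main__':
    paths = ['image_%d.jpg' % i for i in range(100)]  # placeholder file list

    # Run the CPU-bound pre-processing across worker processes in parallel.
    pool = Pool(processes=4)
    batch = np.asarray(pool.map(preprocess, paths))
    pool.close()
    pool.join()

    # 'batch' can now be copied into the net's input blob and run through a
    # single batched forward pass, as in the earlier timing sketch.
```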

Thank you, Evan.

Sorry, I'm confused. Are images in a batch processed in parallel? Based on the Forward_gpu function in conv_layer.cu, there is a for loop iterating over each element in the batch (num_). It seems to be sequential, not concurrent.

@yuhan210
That is correct. Note that a convolution buffer is needed for the GEMM forward to work. This means if you want to process the whole batch at once, you'd have a huge (n times bigger) convolution buffer and you'd also have to replicate the weights n times.
A more efficient way of doing this is thus to use cuDNN, which does not need the intermediate buffer.
I suggest you look at how im2col and col2im work in order to get a better understanding.
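
To make the buffer cost concrete, here is a rough back-of-the-envelope calculation of the im2col (column) buffer for CaffeNet's conv1 layer (3x227x227 input, 96 11x11 filters, stride 4); the numbers are only an estimate:

```python
# Size of the im2col (convolution) buffer for CaffeNet's conv1.
channels, kernel, stride, in_size = 3, 11, 4, 227
out_size = (in_size - kernel) // stride + 1          # 55

# im2col unrolls one image into a (C*k*k) x (H_out*W_out) matrix,
# so a single GEMM against the filter matrix computes the convolution.
rows = channels * kernel * kernel                    # 363
cols = out_size * out_size                           # 3025
per_image_mb = rows * cols * 4 / 2.0**20             # float32 bytes -> MB
print('per-image col buffer: %.1f MB' % per_image_mb)              # ~4.2 MB

# Doing the whole batch in one GEMM would need n copies of that buffer,
# which is why Forward_gpu loops over the batch and reuses a single buffer.
batch = 256
print('whole-batch col buffer: %.0f MB' % (batch * per_image_mb))  # ~1 GB
```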

@naibaf7
Thank you!
When you say cuDNN does not need an intermediate buffer, what exactly does that mean?
