_Issue_
Batch forward pass via the Python (and MATLAB) interfaces appears not to run concurrently
_Details_
Updating the issue with a new symptom: it appears this problem also impacts the MATLAB interface, in addition to the Python interface. Thanks to user "Dinesh" from the Caffe Google Groups for adding to this issue.
Because of the wider impact, this suggests a more fundamental issue may be at play here. Therefore, I'd like to bump this up on @shelhamer's radar.
Forward is concurrent up to the batch size. Batching inputs together for GPU processing does improve throughput.
My expectation is that the predict method, when called with an array of N images, will process them
concurrently, i.e. essentially at the time cost of a forward pass on a single image.
Right, this will be true for a batch size of N. Compare the time to compute a forward pass for 100 images at once with batch size 100 vs. running inference on one image at a time, in sequence, with batch size 1.
For Python and MATLAB, performance timing is also conflated with pre-processing time. Resizing images in this way can be slow. That's why there are separate timings in the classification example for (1) prediction plus pre-processing and (2) prediction alone.
This is not a bug, but a matter of usage. For concurrent pre-processing or other concurrency beyond the batch size, you can consider a Python data layer with multiprocessing or other Python parallelism.
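As a rough sketch of the parallel pre-processing suggested above (the `preprocess` function here is a hypothetical stand-in for resizing and mean subtraction, not Caffe's transformer), `multiprocessing.Pool` can fan the CPU-bound work out over worker processes before a single batched forward call:

```python
import numpy as np
from multiprocessing import Pool

def preprocess(img):
    # Stand-in for resize/mean-subtraction: assume HxWx3 uint8 input
    # and a net expecting float32 CxHxW scaled to [0, 1].
    return (img.astype(np.float32) / 255.0).transpose(2, 0, 1)

def preprocess_batch(images, workers=4):
    # Run pre-processing in parallel across processes, then stack
    # into one NxCxHxW array for a single batched forward pass.
    with Pool(workers) as pool:
        return np.stack(pool.map(preprocess, images))

if __name__ == "__main__":
    imgs = [np.zeros((8, 8, 3), dtype=np.uint8) for _ in range(10)]
    batch = preprocess_batch(imgs)
    print(batch.shape)  # (10, 3, 8, 8)
```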
Thank you, Evan.
Sorry, I'm confused. Are images in a batch processed in parallel? Based on the Forward_gpu function in conv_layer.cu, there's a for loop iterating over each element in the batch (num_). It seems to be sequential, not concurrent.
@yuhan210
That is correct. Note that a convolution buffer is needed for the GEMM forward to work. This means if you want to process the whole batch at once, you'd have a huge (n times bigger) convolution buffer and you'd also have to replicate the weights n times.
A more efficient way of doing this is thus using cuDNN, which does not need the intermediate buffer.
I suggest you look at how im2col and col2im work in order to get a better understanding.
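For intuition, here is a toy single-channel im2col in plain numpy (a simplification of Caffe's im2col, which also handles channels, stride, and padding): patches are unrolled into the columns of a buffer, so convolution reduces to one matrix multiply. This buffer is allocated per image, which is why processing the whole batch in one GEMM would multiply its size by n.

```python
import numpy as np

def im2col(x, k):
    # Unroll every k x k patch of a single-channel H x W image into a
    # column, so convolution becomes a single matrix multiply.
    H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((k * k, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5))
kernel = rng.standard_normal((3, 3))

# GEMM-based convolution: flattened weights (1 x k*k) times the
# column buffer (k*k x out_h*out_w), reshaped back to the output map.
gemm_out = (kernel.ravel() @ im2col(x, 3)).reshape(3, 3)

# Direct sliding-window convolution for reference.
direct = np.array([[np.sum(x[i:i + 3, j:j + 3] * kernel)
                    for j in range(3)] for i in range(3)])
assert np.allclose(gemm_out, direct)
```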
@naibaf7
Thank you!
When you say cuDNN does not need an intermediate buffer, what exactly does that mean?