Opencv: GpuMat as input/output to cv::dnn::Net

Created on 25 Jan 2020  路  37Comments  路  Source: opencv/opencv

System information (version)

  • OpenCV => 4.2.0
APIs that need changes

Suitable APIs for GpuMat input:

  1. void setInput(InputArray blob, const String& name = "", double scalefactor = 1.0, const Scalar& mean = Scalar());

Suitable APIs for GpuMat output:

  1. void forward(OutputArrayOfArrays outputBlobs, const String& outputName = String());
  2. void forward(OutputArrayOfArrays outputBlobs, const std::vector<String>& outBlobNames);

Modifications to support GpuMat:

  1. void blobFromImage(InputArray image, OutputArray blob, double scalefactor=1.0, const Size& size = Size(), const Scalar& mean = Scalar(), bool swapRB=false, bool crop=false, int ddepth=CV_32F);

Might also require a new API that would allow a stream to be specified (required if blobFromImage needs to be performed asynchronously).

The CUDA backend uses fully packed tensors everywhere and expects a fully packed tensor as input. GpuMat given as input might be using a 2D memory layout which may not be contiguous. blobFromImage must return a contiguous copy of the input.

The outputs of the CUDA backend are also in fully packed tensors. Hence, for further optimal use of the output (in the case it's an image) would require a transformation to 2D memory layout.

Problems:
  1. GpuMat cannot store an arbitrary number of dimensions. How can the batch of images be passed?
RFC dnn gpcuda (contrib) few weeks feature future

Most helpful comment

I'll pick this issue up full time over the Thanksgiving holiday.

All 37 comments

@YashasSamaga Thank you for collecting overview of this problem! Good job!


GpuMat cannot store an arbitrary number of dimensions

Right, current GpuMat implementation supports 2D layout only.
Problem is that existed CUDA API code expects 2D layout only - need to avoid massive refactoring of code (there are tons of user code which we are not able to fix anyway - compatibility problem).


new API that would allow a stream to be specified

Consider using of dedicated cuda (cv:dnn::cuda) namespace.
Also we can introduce thread-local "default" stream API.

@alalek @YashasSamaga Is this issue still open for a PR? I would like to work on it.

  1. GpuMat cannot store an arbitrary number of dimensions. How can the batch of images be passed?

Though I am very new to this side of OpenCV. However, may be we can use row major order at the time of indexing containers with arbitrary number of dimensions and in the memory we can store these containers as linear arrays. Please let me know if I have misinterpreted something wrong here.

@czgdp1807 and how do you plan to store the shape? The DNN module would require a 4D+ tensor and you need to pass this shape information somehow through GpuMat.

Thanks for the response.
IMO, we can store the shape as another linear array. For example, GpuMat can have two data pointers, data_type* shape, and another one through data_type* data_gpu. We can use static methods to initialise both the shape and dimension pointers. This approach can generalise the number of dimensions in the data. Please let me know your views on this.
P.S. - I will do more research on the current architecture of GpuMat and come up with something that can be integrated more easily. I will analyse how the API suggestions in the description of this issue can used to support arbitrary number of dimensions.

I don't have any strong opinions on how GpuMat should be designed or improved (I do not know enough about GpuMat to discuss it anyway).

The fundamental problem with GpuMat currently is that you cannot pass arbitrary dimension tensors (even if the actual memory layout is linear, 2D or 3D). Once this feature is implemented, it's quite easy to support GpuMat as input and output.

Thanks for letting me know. I will look into the current code and API and will try to come up with a new API design for the current and the new use case of arbitrary number of dimensions.

Hi,
I was just going through the code and observed that _InputArray supports 2D containers only. Is there any container already implemented that supports multidimensional arrays, something like Tensor in tensorflow? If not, then may be we can refactor the code to support arbitrary dimensions in basic containers(like _InputArray, _OutputArray) without changing the interface to avoid repeated code.
What do you say?

_InputArray supports 2D containers only

This is not correct. See .sizend() method.
Main problem is GpuMat - it is 2D only.

_InputArray is not a container (in terms of storing buffers). It is "unified" wrapper which stores type/kind and a pointer on real container object. Also it doesn't even hold container (doesn't extend lifetime through reference counters), so it should not be stored anywhere (except local stack).


refactor the code

Please don't try to start changing of such "core" interfaces. It is bad idea in general (we don't want to break users code).

Sure, so probably we need changes only in GpuMat. Or Is supporting arbitrary dimension in GpuMat feasible in the current state. May be, we can just work on the APIs suggested in the OP.

@czgdp1807

Excerpt from alalek's comment https://github.com/opencv/opencv/issues/16433#issuecomment-578436958

Right, current GpuMat implementation supports 2D layout only.
Problem is that existed CUDA API code expects 2D layout only - need to avoid massive refactoring of code (there are tons of user code which we are not able to fix anyway - compatibility problem).

The CUDA backend of the DNN module operates on fully packed tensors (i.e. 1D memory irrespective of the logical shape of the tensor).

Before you can begin working on the APIs suggested in the OP, GpuMat needs to support arbitrary shape with 1D memory layout. The issue currently at hand is that the existing codebase (non-DNN stuff like cuda::resize) assumes that GpuMat data is stored in 2D pitched memory. Hence, when arbitrary dimension support on 1D memory is added to GpuMat, it would require updating the existing codebase to deal with new GpuMat need not necessarily store data in 2D pitched memory. Otherwise, you'll have algorithms operating on 1D memory thinking it's 2D memory which could cause unexpected problems. One way to go about would be to add checks to enforce the 2D layout requirement wherever necessary.

Adding arbitrary dimension support isn't very hard but updating the entire codebase is quite a lot of work.

Adding arbitrary dimension support isn't very hard but updating the entire codebase is quite a lot of work.

Agreed. That's why I asked for feasibility and infact it's now clear to me that the APIs suggested in OP need arbitrary dimension support so first step would be to do that.

One way to go about would be to add checks to enforce the 2D layout requirement wherever necessary.

This is reminding me of a scenario where if-else checks were added in every method of the class to separate handling of two use cases. This suggests that we can create a new class for arbitrary dimension support and use it wherever required. Another alternative is for whatever new methods are added from OP should use the arbitrary dimension support and the already existing are using the 2D layout. We can make changes to the existing method where ever required to use the arbitrary dimension 1D layout. I am assuming that algorithms right now using GpuMat in OpenCV are using 2D layout and we will add new code(not added yet) for the algorithms which will use the arbitrary dimension and in that case what to use from GpuMat.

Hi,
I was thinking about this issue. I have the following two API designs in mind for multidimensional support in GpuMat.

Overloading functions
We can overload functions of GpuMat which will support multiple dimensions. For example,
currently one of the constructors of GpuMat is,

GpuMat(int rows, int cols, int type, Allocator* allocator = defaultAllocator());

Here we have to give, rows, cols and type. We can replace rows and cols with something like, MatShape(I came to know about this while working on a recent PR) or a data pointer, like, int* shape.

GpuMat(MatShape shape, type, Allocator* allocator = defaultAllocator());

or,

GpuMat(int* shape, type, Allocator* allocator = defaultAllocator());

Creating a separate class
To keep the current implementation of GpuMat untouched we can create a new class, Tensor which will suit the purpose of arbitrary dimensions. We can keep it something like torch.tensor.

Tensor(int* shape, int type, Allocator* allocator = defaultAllocator());

May be tensors can go to opencv_contrib till they become more mature.
Please let me know your views. Thanks.

Greetings people. I'm looking for a tesseract ocr coding about find duplicate numbers in an image and marking them. For example I have numbers listed from 1 to 99 in an image and i dont want any number to repeat more than once, but in my paperworks there are some repetitions (like two "65" randomly in the paper like 5 23 65 86 65 12 33 21 etc.) and i want to discard those repeated paperworks. In real life it is a huge struggle to read them all can you guide me sir? I want to make an app for mobile phone to use that with camera is it possible? Or is there an existing program like this scans and finds dublicate numbers/texts? Please God help me I'm so much in a mess...

Perhaps separate new class GpuMatND ('ND' states for n-dimensional) should work great.

Tensor

is a more common name (may conflict with other existed containers in OpenCV or should be designed perfectly to unify all of them)

I would like to make a PR with the interface of GpuMatND and will continue with adding implementation for the same. Please let me know if it's feasible and where should I make the PR, to opencv or opencv_contrib. Thanks.

Wouldn't it be easier to just reuse the existing ndarray/tensor type cv::dnn::cuda4dnn::csl::Tensor and support copies from/to GpuMat?

blob_in = cv.cuda_GpuBlobFromImage(gpu_mat, stream)
blob_in = cv.cuda_GpuBlobFromArray([gpu_mat1, gpu_mat2, gpu_mat3], stream=stream)

net.setInput(blob)
blob_out = net.forward(stream=stream)

gpu_mat_out = cv.cuda_GpuMatFromBlob(blob_out)

The blob type itself could be a completely opaque handle with no public interface to commit to, akin to the existing OpenCV concept of creating a "blob" as input argument from existing Mat types.

@pwuertz cv::dnn::cuda4dnn::csl::Tensor belongs to the CUDA backend's internal API and is not exposed to the end-user. Exposing Tensor through public API will require exposing several other classes from the internal API which doesn't seem like a great idea.

We can maybe use a new lightweight proxy which takes the role of Tensor from blobFromImage to net.forward and vice-versa. But this will complicate the currently straightforward simple-to-use DNN API (which is I think is one of the biggest pros of OpenCV DNN).

@YashasSamaga Thats what I meant by "opaque blob type", a lightweight, i.e. no or minimal public interface handle, internally backed by your Tensor.

Why do you think this will complicate the existing DNN API? The example above looks like a normal OpenCV DNN pattern to me.

Besides, doesn't this project require an alternative API anyway due to the CUDA specific stream argument?

@pwuertz Are you trying to say that we can create, GpuMatND as a wrapper for accessing the functionality of Tensor?

I was proposing that initially, we don't need to design / define any new type at all. For this specific task, an opaque blob handle you can convert from/to Mat/GpuMat would be sufficient. Internally this could be backed by Tensor.

Generally, it would be really nice to have a GpuMatND, and OpenCV shifting more towards ndarray containers instead of 2d focused data objects. But as mentioned by alalek, maybe this is a topic best tackled in conjunction with a generalized OpenCV tensor / ndarray API?

Is there any update on the progress of this?

@MonocleSecurity I am working on it in #16666
I am just waiting for some comments to get a direction in which I should proceed.

I think we should use a lightweight opaque proxy object instead of GpuMatND. The GpuMat (even cv::Mat) interfaces are quite old. Maybe in the near feature, OpenCV will get new C++11 based interface for all these containers. Maybe it's not the right time to add a new container. This is just my opinion.

The proxy object is quite simple to make. It's an RAII object which essentially contains a smart pointer and a shape array. That's it.

We can do it. To the best of my understanding, the proxy object will be a wrapper to GpuMat which uses 2D layout internally, so supporting multi-dimensional arrays will again be problem. May be we can make a proxy to Tensor(with limited access) as suggested in https://github.com/opencv/opencv/issues/16433#issuecomment-591408101

@czgdp1807 You just need one single block of linear memory.

class Proxy {
public:

private:
GpuMat data; // 1D memory
std::vector<int> shape; // store shape here
};

I got your point. Thanks for the hint, GpuMat can be used for 1D memory by keeping either row or col 1. Will update my already open PR with Proxy object.

@czgdp1807 I assume it must be a single row (i.e. not a single column) though. GpuMat does impose certain row-stride constraints like requiring multiple of 512 or something to that effect, right?

@czgdp1807 are you still working on this?

I am keen to see this feature added, I posted a bounty but I am unsure if I am allowed to link to it.

Here is a link to the bounty if anyone is interested: https://www.bountysource.com/issues/87547327-gpumat-as-input-output-to-cv-dnn-net

It's best if we could come up with a plan before anyone begins to write the code. There would be changes in the public API. It will have to be approved by the maintainers. This is my proposal.

A new header #include <opencv2/dnn/cuda.hpp> will be added which won't be included by default by dnn.hpp. All the CUDA specific API will be added to its own namespace in cv::dnn::cuda.

proxy object for cuda4dnn::csl::Tensor:

// public API
namespace cv { namespace dnn { namespace cuda {

    class GpuBlob {
    public:
        GpuBlob();
        ~GpuBlob();

    private:
        friend class GpuBlobAccessor;

        struct C4DTensorProxy;
        std::unique_ptr<C4DTensorProxy> proxy;
    };

}}};

// internal implementation
template <class T>
using Tensor = cuda4dnn::csl::Tensor<T>;

struct GpuBlob::C4DTensorProxy {
    std::variant<Tensor<float>, Tensor<half>> tensor; // or some other type-safe union
};

blobFromImage(s) and ImagesFromBlob

void blobFromImage(const cv::cuda::GpuMat& image, cv::dnn::cuda::GpuBlob& blob, double scalefactor=1.0,
                                  const Size& size = Size(), const Scalar& mean = Scalar(),
                                  bool swapRB=false, bool crop=false, int ddepth=CV_32F);

void blobFromImage(const cv::cuda::GpuMat& image, cv::dnn::cuda::GpuBlob& blob, double scalefactor=1.0,
                                  const Size& size = Size(), const Scalar& mean = Scalar(),
                                  bool swapRB=false, bool crop=false, int ddepth=CV_32F);

void imagesFromBlob(const cv::dnn::cuda GpuBlob& blob, OutputArrayOfArrays images);

We need to provide a way to specify the stream. I think the best way would be to duplicate the three APIs and add a stream argument. This seems to be more user-friendly than having a default argument.

void blobFromImage(cv::cuda::Stream& stream, const cv::cuda::GpuMat& image, cv::dnn::cuda::GpuBlob& blob, double scalefactor=1.0,
                                  const Size& size = Size(), const Scalar& mean = Scalar(),
                                  bool swapRB=false, bool crop=false, int ddepth=CV_32F);

void blobFromImage(cv::cuda::Stream& stream,const cv::cuda::GpuMat& image, cv::dnn::cuda::GpuBlob& blob, double scalefactor=1.0,
                                  const Size& size = Size(), const Scalar& mean = Scalar(),
                                  bool swapRB=false, bool crop=false, int ddepth=CV_32F);

void imagesFromBlob(cv::cuda::Stream& stream, const cv::dnn::cuda GpuBlob& blob, OutputArrayOfArrays images);

The declarations will be present in opencv2/dnn/cuda.hpp. The implementation will be in dnn.cpp which is a wrapper around the actual implementation. The actual implementation should be inside cuda4dnn directory. The dnn.cpp implementation will invoke the actual implementation depending on whether the DNN module was built with the CUDA backend. Something like:

// dnn.cpp

#ifdef HAVE_CUDA
#include "cuda4dnn/blobFromImages.hpp"
#endif

void blobFromImages(InputArrayOfArrays images_, cuda::GpuBlob& blob, ...)
{
    CV_TRACE_FUNCTION();

#ifdef HAVE_CUDA
    std::vector<cv::cuda::GpuMat> images;
    images_.getGpuMatVector(images);
    cuda4dnn::blobFromImages(images, blob, ...);
#else
    CV_Error(Error::GpuNotSupported, "DNN module was not built with CUDA backend");
#endif
}

We could use cudaimgproc module for resizing images but I think we should roll our own kernel for the DNN module to avoid a new dependency (maybe we will get rid of the opencv_contrib dependency altogether in the future; see #17488 ). There are resize kernels in the CUDA backend which are used by ResizeLayer. We might be able to use it as it is or maybe as a template for a new resize kernel.

We could also require new kernels that can perform scaling, mean subtraction and channel swap. It would be best to fuse all these together into one kernel. The operation is memory bound and hence we will probably have enough compute resources to accommodate all the calculations (even if redundant).

Currently, with the FP16 target, the first operation is to convert the input (which is always transferred to the GPU in FP32 format) to FP16 on the GPU. We can fuse this operation with the previous step. The first step during inference would then reduce to copying the tensor in GpuBlob to the input blob. We could directly use the GpuBlob tensor as input as well.

Memory allocations are costly in CUDA. We should avoid any allocation at all costs. We should allocate the memory just once if GpuBlob is empty. The end-user can reuse the same blob. We should also avoid any synchronization in the stream-based APIs.

setInput

void cv::dnn::Net::setInput(const cuda::GpuBlob& blob, const String& name = "");

forward

Mat forward(const String& outputName = String()); // already exists; should work with GPU inputs too and return output in cv::Mat
void forward(OutputArrayOfArrays outputBlobs, const String& outputName = String()); // same as above

void forward(GpuBlob& blob, const String& outputName = String());
void forward(std::vector<GpuBlob>& outputBlobs, const String& outputName = String()); // same as above
void forward(cv::cuda::Stream& stream, GpuBlob& blob, const String& outputName = String());
void forward(cv::cuda::Stream& stream, std::vector<GpuBlob>& outputBlobs, const String& outputName = String()); // same as above

The stream-based API should not cause any synchronization. This will allow users to seamlessly incorporate inference inside their CUDA pipeline.

Example

cv::cuda::Stream stream;
cv::cuda::GpuMat image_d, output_d;

auto net = cv::dnn::readNet("data/yolov4/yolov4.cfg", "data/yolov4/yolov4.weights");
net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);

cv::dnn::cuda::GpuBlob blob_input;
cv::dnn::cuda::GpuBlob blob_output;

{
    // CUDA stuff

    cv::dnn::cuda::blobFromImage(stream, image_d, blob_input);
    net.setInput(blob_input);
    net.forward(stream, blob_output);
    cv::dnn::cuda::imagesFromBlob(stream, blob_output, output_d);

   // CUDA stuff

   // fully asynchronous block will allow DNN inference to be placed nicely into the end-user's stream
}

There is still a problem with the above proposal. In the imagesFromBlob step, we are still limited to GpuMat's capabilities. If we compact detection outputs to a 1D block, you would still have trouble storing the shape.

One solution would be to have member functions that return the shape in GpuBlob but all this will probably make the end-user's code look nasty. We could add some of cv::Mat's shape API to GpuBlob. It will still look awful though.

I am very interested in seeing this feature being added, so I have added another $300 bounty for this feature for a total of $400.

https://www.bountysource.com/issues/87547327-gpumat-as-input-output-to-cv-dnn-net

I'll pick this issue up full time over the Thanksgiving holiday.

Hi @kuberlog any progress?

I think it would be nice to have a more general GpuMat instead of patchwork in the DNN module. The more general GpuMat has many more usecases (postprocessing or NMS on inference results on GPU for example).

Was this page helpful?
0 / 5 - 0 ratings

Related issues

fnoop picture fnoop  路  35Comments

nji9nji9 picture nji9nji9  路  44Comments

opencv-pushbot picture opencv-pushbot  路  60Comments

edelsohn picture edelsohn  路  48Comments

lucaspojo picture lucaspojo  路  34Comments