Dlib: QUESTION : is yolov3 possible in DLIB

Created on 10 Oct 2020  ·  83 Comments  ·  Source: davisking/dlib

I am trying to define yolov3 using dlib's dnn module.
I'm stuck with the darknet53 backbone, as I want it to output the outputs of the last three layers.
So far i have this:

using namespace dlib;

template <int outc, int kern, int stride, typename SUBNET> 
using conv_block = leaky_relu<affine<con<outc,kern,kern,stride,stride,SUBNET>>>;

template <int inc, typename SUBNET>
using resblock = add_prev1<conv_block<inc,3,1,conv_block<inc/2,1,1,tag1<SUBNET>>>>;

template<int nblocks, int outc, typename SUBNET>
using conv_resblock = repeat<nblocks, resblock<outc,
                      conv_block<outc, 3, 2, SUBNET>>>;

template<typename SUBNET>
using darknet53 = tag3<conv_resblock<4, 1024,
                  tag2<conv_resblock<8, 512,
                  tag1<conv_resblock<8, 256,
                  conv_resblock<2, 128,
                  conv_resblock<1, 64,
                  conv_block<32, 3, 1, SUBNET
                  >>>>>>>>>;

Is it possible for darknet53 to output tag1, tag2 and tag3?

Actually, having the tags there is enough to get me going and define the rest of the network. BUT, yolov3 has three yolo layers. So unless i apply grid offsets, anchors, permute dimensions, and concatenate everything at the end, the network will have to output three tensors anyway at the end.

I have never tried it but, if you don't want to do all this reshaping and concatenation madness, and you know the tags of the layers you're interested in, I guess you can always access them directly from the loss layer by doing something like this:

template <
    typename const_label_iterator,
    typename SUBNET
    >
double compute_loss_value_and_gradient (
    const tensor& input_tensor,
    const_label_iterator truth, 
    SUBNET& sub
) const
{
    const tensor& out1 = layer<tag1>(sub).get_output();
    const tensor& out2 = layer<tag2>(sub).get_output();
    const tensor& out3 = layer<tag3>(sub).get_output();
    // ...
}

And then apply the yolo layer to each output.

Do i need a loss layer if i'm only interested in inference? My goal is to port the weights from darknet to a dlib-defined yolov3 network. If not, can i just tag the output layers I want, then forward the input at the front of the network, then get the outputs i want using layer<tagx>(sub).get_output() ?

Right. You would write your loss function so it goes and grabs the tags you are interested in.

But if you don't want to train then yeah. Just access the layer you want and look at its outputs.

Just noticed that repeat only takes template<typename> class as the repeated layer. So it's not letting me use it with resblock as it has template <int inc, typename SUBNET> as template signature. Have i missed something?

All the examples that use repeat have the template<typename> class signature

Yes, the repeat layer only takes a template <typename SUBNET> class. You can have a look at my definition of the Darknet53 backbone here, where I predefine some things to be able to use them with the repeat layer.

So i have this so far:

using namespace dlib;

template <template <typename> class BN>
struct yolo
{
    template <int outc, int kern, int stride, typename SUBNET> 
    using conv_block = leaky_relu<BN<con<outc,kern,kern,stride,stride,SUBNET>>>;

    template <int outc, typename SUBNET>
    using resblock = add_prev1<conv_block<outc,3,1,conv_block<outc/2,1,1,tag1<SUBNET>>>>;

    template <typename SUBNET> using res1024 = resblock<1024,SUBNET>;
    template <typename SUBNET> using res512  = resblock<512,SUBNET>;
    template <typename SUBNET> using res256  = resblock<256,SUBNET>;
    template <typename SUBNET> using res128  = resblock<128,SUBNET>;

    template <typename SUBNET> using block5 = repeat<4,res1024, conv_block<1024,3,2,SUBNET>>;
    template <typename SUBNET> using block4 = repeat<8,res512,  conv_block<512,3,2,SUBNET>>;
    template <typename SUBNET> using block3 = repeat<8,res256,  conv_block<256,3,2,SUBNET>>;
    template <typename SUBNET> using block2 = repeat<2,res128,  conv_block<128,3,2,SUBNET>>;
    template <typename SUBNET> using block1 = resblock<64,conv_block<64,3,2,SUBNET>>;

    using darknet53 = tag1<block5<
                      tag2<block4<
                      tag3<block3<
                      block2<
                      block1<
                      conv_block<32,3,1, 
                      input_rgb_image
                      >>>>>>>>>;    

    template<int outc, int nclasses, int tag, int yolo_tag, typename SUBNET>
    using detection_block = add_tag_layer<yolo_tag, con<3*(nclasses + 5), 1, 1, 1, 1,   //conv7 - yolo output
                            conv_block<outc,   3, 1,                                    //conv6
                            add_tag_layer<tag, conv_block<outc/2, 1, 1,                 //conv5 - branch output
                            conv_block<outc,   3, 1,                                    //conv4
                            conv_block<outc/2, 1, 1,                                    //conv3
                            conv_block<outc,   3, 1,                                    //conv2
                            conv_block<outc/2, 1, 1,                                    //conv1
                            SUBNET
                            >>>>>>>>>;

    template<int nclasses>
    using yolov3 =
            detection_block<256,nclasses,8,12,  //8 is the branch tag (don't care here), 12 is a yolo tag
            concat2<skip7, skip3,                 //concat last layer with tag3 from darknet backbone
            tag7<upsample<2,
            conv_block<128, 1, 1,
            skip6<
            detection_block<512,nclasses,6,11,  //6 is the branch tag, 11 is a yolo tag
            concat2<skip5, skip2,                 //concat last layer with tag2 from darknet backbone
            tag5<upsample<2,
            conv_block<256, 1, 1,
            skip4<                              //pick branch with tag 4
            detection_block<1024,nclasses,4,10, //4 is the branch tag, 10 is a yolo_tag
            skip1<
            darknet53
            >>>>>>>>>>>>>>;
};

This compiles. That's progress. The API is hurting my brain a bit though.

@arrufat @davisking Is there a way to turn bias off in conv_block. Since conv_block has a batchnormalisation layer, which already has a bias term, we don't want double biases.

Yes! That feature was added not that long ago in #2156, you just do:

set_all_bn_inputs_no_bias(net);

And it will do it automatically for the whole network.

Will it do the same to affine layers ?

I'm not training, simply porting weights from darknet. So i don't need to use bn_con layers.

concat_ layers need tags as inputs. It compiles for me with this change:

concat2<tag7, tag3, SUBNET

Cheers thank you. Getting closer to working.

Will it do the same to affine layers ?

No, that visitor only works with bn_ layers that have either con_ or fc_ layers as inputs.

Cheers thank you. Getting closer to working.

I am very interested in this if you manage to deserialize the darknet weights and make them work with dlib.

Also, check the paddings of the 3x3 convolutions with a stride of 2. They are 0 by default in dlib, but they need to be 1 in yolo.
That is why I defined this.
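
To see why the padding matters, here is a small stand-alone sketch of the usual convolution output-size formula. It does not use dlib; the function names are made up for illustration.

```cpp
#include <cassert>

// Output length of a convolution along one axis:
// in = input size, k = kernel size, stride = step, pad = zeros added on each side.
constexpr long conv_out_size(long in, long k, long stride, long pad)
{
    return (in + 2 * pad - k) / stride + 1;
}

// Applies the same 3x3, stride-2 convolution `times` times and returns the final size.
long size_after_downsampling(long in, int times, long pad)
{
    assert(times >= 0);
    for (int i = 0; i < times; ++i)
        in = conv_out_size(in, 3, 2, pad);
    return in;
}
```

With a padding of 1, five stride-2 steps take a 416-pixel axis to 208, 104, 52, 26 and finally 13. With a padding of 0 the first step already gives 207, and the sizes drift away from what yolo expects.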

Presumably, to port the weights, i will have to use a visitor?

Is there a layer for permuting dimensions? Can extract be used? I need to go from a tensor of shape 1x255x13x13 to 1x3x85x13x13, then to 1x13x13x3x85.
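
For reference, the permutation asked about here can also be done outside the network with plain index arithmetic. The sketch below is not a dlib layer. It assumes a flat, row-major buffer for a single image, with the channel index laid out as anchor * attrs + attribute (3 anchors and 85 attributes for the 255-channel case). The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reorders a flat [anchors*attrs, rows, cols] buffer into [rows, cols, anchors, attrs].
std::vector<float> permute_yolo_output(
    const std::vector<float>& src,
    std::size_t anchors, std::size_t attrs,
    std::size_t rows, std::size_t cols)
{
    assert(src.size() == anchors * attrs * rows * cols);
    std::vector<float> dst(src.size());
    for (std::size_t a = 0; a < anchors; ++a)
        for (std::size_t e = 0; e < attrs; ++e)
            for (std::size_t y = 0; y < rows; ++y)
                for (std::size_t x = 0; x < cols; ++x)
                {
                    // position in the channel-first source layout
                    const std::size_t from = ((a * attrs + e) * rows + y) * cols + x;
                    // position in the cell-first destination layout
                    const std::size_t to = ((y * cols + x) * anchors + a) * attrs + e;
                    dst[to] = src[from];
                }
    return dst;
}
```

After the permutation, the 85 values of each anchor box are contiguous in memory, which makes per-box post-processing straightforward.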

Also to get the exact same results as darknet, we need a layer similar to upsample that uses a "nearest" method, not bilinear interpolation.
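
A rough sketch of what a "nearest" upsample computes, written against a plain channel-first buffer rather than a dlib tensor. The function name and the integer scale factor are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Nearest-neighbour upsampling of a [channels, rows, cols] buffer by an integer factor.
std::vector<float> upsample_nearest(
    const std::vector<float>& src,
    std::size_t channels, std::size_t rows, std::size_t cols,
    std::size_t scale)
{
    assert(scale > 0);
    assert(src.size() == channels * rows * cols);
    const std::size_t out_rows = rows * scale;
    const std::size_t out_cols = cols * scale;
    std::vector<float> dst(channels * out_rows * out_cols);
    for (std::size_t c = 0; c < channels; ++c)
        for (std::size_t y = 0; y < out_rows; ++y)
            for (std::size_t x = 0; x < out_cols; ++x)
                // every output pixel copies the source pixel it falls into
                dst[(c * out_rows + y) * out_cols + x] =
                    src[(c * rows + y / scale) * cols + x / scale];
    return dst;
}
```

Unlike bilinear interpolation, no new values are created: each source pixel is simply repeated scale x scale times.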

Presumably, to port the weights, i will have to use a visitor?

Yes, at least that's how I would approach it, in particular I would use visit_layers_backwards.

Is there a layer for permuting dimensions?

You can try with extract_ + some extra manipulation.

I have a WIP project where I try to implement YOLOv1 (as a start) but haven't been very active lately. You can check it out:
https://github.com/arrufat/yolo-dlib

EDIT: it's still WIP and it doesn't work, although the training runs...

I can do reshaping, applying grid offset and anchors post processing using pointer arithmetics and stuff. The only thing left to do is porting weights. This is all an experiment to benchmark yolov3 with dlib. Defining a loss function for yolov3 in dlib is going to be too hard and you can train in darknet or pytorch anyway.
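
As a sketch of that post-processing step, this is how one raw prediction is commonly decoded in YOLOv3: sigmoid plus grid offset for the centre, anchor times exponential for the size. It is not taken from the code shared in this thread. The struct, the function names and the assumption that anchors are given in pixels of the network input are all illustrative.

```cpp
#include <cassert>
#include <cmath>

// Box centre and size, normalised to [0, 1] relative to the network input.
struct box { float cx, cy, w, h; };

inline float sigmoid(float v) { return 1.0f / (1.0f + std::exp(-v)); }

// raw points at tx, ty, tw, th for one anchor at grid cell (col, row).
box decode_box(const float* raw, int col, int row,
               int grid_w, int grid_h,
               float anchor_w, float anchor_h,
               int input_w, int input_h)
{
    assert(raw != nullptr);
    assert(grid_w > 0 && grid_h > 0 && input_w > 0 && input_h > 0);
    box b;
    b.cx = (sigmoid(raw[0]) + static_cast<float>(col)) / static_cast<float>(grid_w);
    b.cy = (sigmoid(raw[1]) + static_cast<float>(row)) / static_cast<float>(grid_h);
    b.w  = anchor_w * std::exp(raw[2]) / static_cast<float>(input_w);
    b.h  = anchor_h * std::exp(raw[3]) / static_cast<float>(input_h);
    return b;
}
```

Objectness and class scores would get a sigmoid as well before thresholding and NMS.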

Oh, and there is the disabling of biases in affine layers, and implementing a "nearest" method for the upsample layer. So 3 things to do.

You can disable bias for affine layers easily using the new style visitor with a lambda.

Ok. Is there an example of this? Also is there a way of setting avg_red, avg_green and avg_blue for input_rgb_image layer?

Actually, just seen the code for bn_conv, looks fine.

Ok. Is there an example of this? Also is there a way of setting avg_red, avg_green and avg_blue for input_rgb_image layer?

https://github.com/davisking/dlib/blob/a1f158379e2f328e8697b63ad653926594c8a771/examples/dnn_dcgan_train_ex.cpp#L135

You should read the documentation of the input layers.

It's possible i've missed something in the docs for the input layers. I've just used this instead:

struct input_rgb_image_zero_means : input_rgb_image
{
    input_rgb_image_zero_means() : input_rgb_image(0,0,0) {}
};

You should also read dnn_introduction2_ex. You will learn that you can initialize the layers of a network by passing them when constructing the network, like this:

net_type net(input_rgb_image(0, 0, 0));

ah ok fair enough. Though using input_rgb_image_zero_means means it's impossible to use it incorrectly.

@arrufat The API doesn't expose the layer parameters reliably. Indeed the get_layer_params function for the affine_ layer spits back empty_params. So i can't set the weights using get_layer_params. It looks like i have to serialize some weights to a temporary stream then call deserialize on that layer using that stream. What do you think?

I have the following visitor:

struct darknet_visitor
{
    darknet_visitor(const char* darknet_weights)
    :   w(darknet_weights, std::ios::binary)
    {
        assert(w.is_open());
        int32_t major, minor, dummy;
        int64_t dummy2;
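        // The weights file is binary, so use read() rather than formatted operator>>.
        w.read(reinterpret_cast<char*>(&major), sizeof(major));
        w.read(reinterpret_cast<char*>(&minor), sizeof(minor));
        w.read(reinterpret_cast<char*>(&dummy), sizeof(dummy));
        if ((major * 10 + minor) >= 2 && major < 1000 && minor < 1000)
            w.read(reinterpret_cast<char*>(&dummy2), sizeof(dummy2));
        else
            w.read(reinterpret_cast<char*>(&dummy), sizeof(dummy));
        cout << "weights file major " << major << " minor " << minor << endl;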
    }

    template<typename T>
    void operator()(size_t idx, T& t)
    {
    }

    template <typename SUBNET>
    void operator()(size_t idx, add_layer<affine_, SUBNET>& l)
    {
        cout << "affine layer " << idx << endl;
        auto& bn    = l.layer_details();
        auto& conv  = l.subnet().layer_details();
        //1. bn bias
        //2. bn weight
        //3. bn running mean
        //4. bn running var
        //5. conv weight
//        tensor& bn_t = bn.get_layer_params(); //THIS IS EMPTY BECAUSE affine_ spits back empty_params
        stringstream ss;
        ss << ...;
        deserialize(bn.get_layer_params(), ss);
        ss << ...;
        deserialize(conv.get_layer_params(), ss);
    }

    template <
        long outc,
        long nr,
        long nc,
        int sy,
        int sx,
        int py,
        int px,
        typename SUBNET
        >
    void operator()(size_t idx, add_layer<con_<outc,nr,nc,sy,sx,py,px>,SUBNET>& l)
    {
        auto& conv = l.layer_details();

        if (!conv.bias_is_disabled())
        {
            cout << "con layer " << idx << endl;
            //1. conv bias
            //2. conv weight
            stringstream ss;
            ss << ...;
            deserialize(conv.get_layer_params(), ss);
        }
    }

    std::ifstream w;
};

which i call using:

visit_layers_backwards(net, darknet_visitor("yolov3.weights"));    

It looks like the get_layer_params() for bn_ and con_ return params. So maybe, i have to first define a model using bn_con, then do the porting of weights, then replace all the bn_con layers with affine.
Hmm, getting complicated.
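
The header handling in the visitor above can be tried in isolation. The sketch below follows the layout implied by that visitor: three 32-bit integers followed by a 64-bit or 32-bit counter, depending on the version. It assumes the file was written with the same endianness as the machine reading it. The struct and function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>  // only needed to try the parser on an in-memory stream

struct darknet_header
{
    std::int32_t ver_major = 0;
    std::int32_t ver_minor = 0;
    std::int32_t revision  = 0;
    std::int64_t seen      = 0;  // counter stored after the version fields
};

// Reads the raw bytes of one value from a binary stream.
template <typename T>
void read_raw(std::istream& in, T& value)
{
    in.read(reinterpret_cast<char*>(&value), sizeof(T));
}

darknet_header read_darknet_header(std::istream& in)
{
    darknet_header h;
    read_raw(in, h.ver_major);
    read_raw(in, h.ver_minor);
    read_raw(in, h.revision);
    if ((h.ver_major * 10 + h.ver_minor) >= 2 && h.ver_major < 1000 && h.ver_minor < 1000)
    {
        read_raw(in, h.seen);       // newer files: 64-bit counter
    }
    else
    {
        std::int32_t seen32 = 0;    // older files: 32-bit counter
        read_raw(in, seen32);
        h.seen = seen32;
    }
    assert(!in.fail());
    return h;
}
```

Everything after the header is a flat sequence of 32-bit floats, in the order listed in the visitor's comments.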

After declaring the network, you can forward some dummy input to initialize the params of the layers, then run the visitor.

but that still doesn't solve the problem with affine_. Do you have to use bn_ first to port the weights? Also, since all the weights are alias_tensor types that use params for storage, and get_layer_params returns params, it's not entirely obvious how to port the weights to params. Do you suggest using serialize and deserialize ? Or maybe there should be new functionality added to the dnn module to make all this possible. For example have a port_weights visitor type, designed for this use case, which is made a friend type for all layers. Then we can have access to all the underlying tensors.

I would not use serialize/deserialize for this. I would do something like:

auto& params = l.get_layer_params();
float* p = params.host();

And then read the weights from the yolo file and store them in p. However, I did not check in which order the weights are stored in darknet, you have to check that and skip or reshape to your needs.
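
A minimal sketch of that copy, written against a raw float pointer so it does not depend on dlib. In the visitor, dst would be the pointer returned by params.host() and count would be params.size(). The helper name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>  // only needed to try the helper on an in-memory stream
#include <vector>

// Reads `count` floats from a binary stream straight into `dst`.
// Returns false if the stream ran out of data first.
bool read_floats(std::istream& in, float* dst, std::size_t count)
{
    assert(dst != nullptr || count == 0);
    const std::streamsize bytes = static_cast<std::streamsize>(count * sizeof(float));
    in.read(reinterpret_cast<char*>(dst), bytes);
    return in.gcount() == bytes;
}
```

Checking the return value for every layer is a cheap way to notice that the network definition and the weights file disagree.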

yep so i know exactly how the weights are stored in darknet format. The problem with what you suggest is that using auto& params = l.get_layer_params(); for affine_ layer will not work since it returns an empty tensor that is never used

Furthermore params is used as a storage tensor. The actual weights inside the layer classes are all alias_tensor types. So setting params correctly is very difficult.

alias_tensor is just a view into the tensor, to be able to access the weights from the convolution kernel and the biases more easily for example. But everything is stored in the same tensor returned by get_layer_params(), as far as I know.
If you initialize the network with a dummy input, then the .get_layer_params() for the affine layer should not be empty, and if you print its values, you will see some ones (gamma) followed by some zeros (beta).
https://github.com/davisking/dlib/blob/a1f158379e2f328e8697b63ad653926594c8a771/dlib/dnn/layers.h#L2166-L2185
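
The "view into one storage tensor" idea can be pictured with a plain-C++ analogy. This is not dlib's alias_tensor, only a sketch of the concept with made-up names, and it says nothing about which layers actually expose their parameters.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A view describes where a block of values lives inside shared storage.
struct param_view
{
    std::size_t offset;
    std::size_t size;

    // Returns a pointer to the first value of this block inside `storage`.
    float* operator()(std::vector<float>& storage) const
    {
        assert(offset + size <= storage.size());
        return storage.data() + offset;
    }
};
```

With gamma as {0, k} and beta as {k, k} over one vector of 2k floats, writing through either view changes the shared storage, which is why filling the storage in the right order is enough to set both.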

If you look inside layers.h, you will see this:

const tensor& get_layer_params() const { return empty_params; }
tensor& get_layer_params() { return empty_params; }

for affine_

And empty_params is never set

Oh, I skipped that, so then you need to define yolo changing the template parameter to bn_con, load the weights and then assign it to the yolo model declared with affine:

yolo<bn_con>::yolov3 net;

// visitor

yolo<affine>::yolov3 net2(net);

And in the visitor you should initialize the missing values from yolo to something sensible.

Ok thought so. That's what i was going on about a few comments ago. Thank you.

In my head affine had a learnable gamma and beta, but it turns out it doesn't, sorry about that.

This is what i have so far. It compiles but doesn't work.
Porting the weights shows that the correct number of bytes is read from the file. So it looks like the network structure is correct and the interpretation of the weights is correct. But could be wrong. Maybe there's an error with endianness. Not sure.
Please try it and see if you can spot the errors.

main.cpp.txt

Warning: it takes roughly 60 seconds to compile main.cpp. Sigh...

I have been able to build it and run it, but at a first glance I didn't see anything odd... I'll have another look later. Thanks for sharing :)

The detections are all wrong. So a bit stuck as to where the errors are. Model size is correct, and the visitor is reading the correct number of bytes. @arrufat if you find a fix, please post

Possibly need to inspect the output of every layer and compare side by side with either darknet or pytorch implementation.

template <
    layer_mode bnmode
    >
affine_(
    const bn_<bnmode>& item
)
{
    gamma = item.gamma;
    beta = item.beta;
    mode = bnmode;

    params.copy_size(item.params);

    auto g = gamma(params,0);
    auto b = beta(params,gamma.size());

    resizable_tensor temp(item.params);
    auto sg = gamma(temp,0);
    auto sb = beta(temp,gamma.size());

    g = pointwise_divide(mat(sg), sqrt(mat(item.running_variances)+item.get_eps()));
    b = mat(sb) - pointwise_multiply(mat(g), mat(item.running_means));
}

Why is this happening:

g = pointwise_divide(mat(sg), sqrt(mat(item.running_variances)+item.get_eps()));
b = mat(sb) - pointwise_multiply(mat(g), mat(item.running_means));

??

This could be my problem. I can't set running_variances or running_means since get_layer_params in bn_con only gives me gamma and beta.

Ok. Fixed it. Had to manually adjust gamma and beta using running_variances and running_means. All works now.
here is the code:

main.cpp.txt
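
The adjustment mentioned above is the usual folding of batch-norm statistics into a per-channel scale and shift, the same arithmetic as the two lines quoted from the affine_ constructor. Below is a stand-alone sketch with plain vectors instead of dlib tensors. The struct and function names are made up, and eps has to match whatever the weights were trained with.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct affine_params
{
    std::vector<float> gamma;  // per-channel scale
    std::vector<float> beta;   // per-channel shift
};

// Folds batch-norm statistics into an equivalent affine transform:
//   gamma * (x - mean) / sqrt(var + eps) + beta  ==  g * x + b
affine_params fold_batch_norm(
    const std::vector<float>& gamma, const std::vector<float>& beta,
    const std::vector<float>& mean,  const std::vector<float>& var,
    float eps)
{
    const std::size_t n = gamma.size();
    assert(beta.size() == n && mean.size() == n && var.size() == n);
    affine_params out;
    out.gamma.resize(n);
    out.beta.resize(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        out.gamma[i] = gamma[i] / std::sqrt(var[i] + eps);
        out.beta[i]  = beta[i] - out.gamma[i] * mean[i];
    }
    return out;
}
```

Because the folded values depend on the running mean and variance, loading only the raw gamma and beta from the file is not enough for inference.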

Now if someone can write the training code with a loss function that uses GIOU, DIOU and CIOU losses, that would be great :) :) (@arrufat ??)
Implementing GIOU and company in a framework that supports auto-grad is trivial. Since in dlib you have to manually write the backward passes, I'm likely to make some mistakes with all those derivatives.

That's great news, I've tried it and it works. Thanks for sharing your progress. I might give a go to implement the loss function at some point.

Also, to reduce the memory usage and data transfer time, you might want to resize the image outside the network:

const size_t img_size = 416;
matrix<rgb_pixel> img, scaled(img_size, img_size);
load_image(img, "/path/to/image.jpg");
resize_image(img, scaled);
net(scaled);

Then replace the resize_to layer in the network with a tag layer (so that you can get the l.subnet().get_output() required by the visitor).

By doing that I went from:

  • FPS: from 47 to 48 (darknet: 64 fps)
  • VRAM: from 1147 MiB to 1145 MiB (darknet: 857 MiB)

It doesn't change that much for the dog image, but for larger images, it will be more noticeable.

I've updated the code here, and added COCO labels:
main.cpp

Hmm, FPS is a bit pants. At least it works and people can use it if they don't want to depend on heavy frameworks.

It would be cool if there was some training code. Computing all the derivatives for CIOU loss and getting it right will be a nightmare though.

Should we keep the issue open until training is supported?

@davisking Maybe this could be added to the examples in the documentation, provided there is some cleanup and some comments.

Maybe. It would need a fair bit of commenting and cleanup. Like the point of the examples is to be basically an essay that helps users understand how to use the tooling. So for instance, it's probably better if the example was simpler and showed the small number of salient details needed to be able to do this, rather than doing it for a big yolo model. The reader will likely get lost in the details otherwise.

Is it possible to do batch inference in C++? Would that help with FPS since you're doing fewer loads to the GPU?

On CPU, linking to openblas, using -O3 -ffast-math, it takes roughly 1.7s per 3x416x416 image. That's almost 10 times slower than opencv. I don't know much about the inner workings of the dnn module in dlib, but it feels like there is a big bottleneck somewhere. @davisking Any ideas?

Yeah I know it's slow. The cpu path could be a lot faster. It's doing something really basic. Dlib's DNN tooling is really meant to be used on a GPU.

@davisking Ok thank you. I just wanted to make sure i wasn't doing something stupid

Just my two cents, but having a visitor_darknet_weights in dlib might be useful, since the visitor doesn't need to change that much. One just needs to have the correct network definition in dlib and call it. I have managed to use YOLOv4 with dlib thanks to @pfeatherstone's visitor :) Thank you!

@arrufat Do you mind sharing the dlib-defined network type for yolov4 ?

Also, in the visitor, when porting weights for a bn layer, you can use matrix instead of tensor for bias, gamma, running_mean and running_variance, since they are temporary and get converted to matrix types anyway. So you can do this instead. This is just a bit of cleanup:

        //bn bias
        matrix<float> temp_b(1, num_b);
        for (size_t i = 0 ; i < num_b ; i++)
            (*this) >> temp_b(i);

        //bn weights
        matrix<float> temp_g(1, num_b);
        for (size_t i = 0 ; i < num_b ; i++)
            (*this) >> temp_g(i);

        //bn running mean
        matrix<float> temp_m(1, num_b);
        for (size_t i = 0 ; i < num_b ; i++)
            (*this) >> temp_m(i);

        //bn running var
        matrix<float> temp_v(1, num_b);
        for (size_t i = 0 ; i < num_b ; i++)
            (*this) >> temp_v(i);

        g = pointwise_divide(temp_g, sqrt(temp_v+DEFAULT_BATCH_NORM_EPS));
        b = temp_b - pointwise_multiply(mat(g), temp_m);

        //conv weight
        auto& conv = l.subnet().layer_details();
        tensor& con_t = conv.get_layer_params();
        assert(conv.bias_is_disabled());

        const size_t num_w = con_t.size();
        float* ptr = con_t.host();
        for (size_t i = 0 ; i < num_w ; i++)
            (*this) >> ptr[i];

Probs doesn't change much speed-wise, but if this is eventually going to make it into some documentation one day so people can learn, use and improve, then this is a tiny first step.

Of course, I was wondering if it's OK for you if I put it into the dlib-users organisation, along with your visitor.
I've made a small example that can use the input from a camera or video file and you can choose between yolov3, yolov4, and yolov4-sam-mish.

Yes absolutely, It's unlicensed. I see this as an investment.

Of course, I was wondering if it's OK for you if I put it into the dlib-users organisation, along with your visitor.
I've made a small example that can use the input from a camera or video file and you can choose between yolov3, yolov4, and yolov4-sam-mish.

Might be worth adding yolov3-spp as it's a simple addition of some max_pool layers and concat, and gives you much better performance, almost on par with yolov4

Yes, YOLOv4 has that (by the way, YOLOv4-SAM-Mish takes more than 3 minutes to build, as there are more than 400 layers)

Doing that quickly now

@pfeatherstone Feel free to modify at will https://github.com/dlib-users/darknet.

I added the YOLOv4 architectures yesterday quickly, so I am not using the repeat layer a lot... Those architectures need a bit of refactoring to repeat the common parts, and be able to reuse as much as possible between YOLO models.

Cool. Also quick shout out to yolov5 family of models. Those are all trained with one of GIOU, DIOU and CIOU loss. Consequently the bounding boxes are really tight. So they are my personal favorite.

@arrufat I don't want to interfere with your repositories too much because of different styles and so on. I don't want to be stepping on anyone's toes. So i'll post things here and there in comments for now if that's ok.

Here is my spp stuff:

    template <typename SUBNET> using tagx1  = add_tag_layer<1000+1, SUBNET>;
    template <typename SUBNET> using tagx2  = add_tag_layer<1000+2, SUBNET>;
    template <typename SUBNET> using tagx3  = add_tag_layer<1000+3, SUBNET>;
    template <typename SUBNET> using tagx4  = add_tag_layer<1000+4, SUBNET>;
    template <typename SUBNET> using tagx5  = add_tag_layer<1000+5, SUBNET>;
    template <typename SUBNET> using skipx1 = add_skip_layer<tagx1, SUBNET>;
    template <typename SUBNET> using skipx2 = add_skip_layer<tagx2, SUBNET>;
    template <typename SUBNET> using skipx3 = add_skip_layer<tagx3, SUBNET>;
    template <typename SUBNET> using skipx4 = add_skip_layer<tagx4, SUBNET>;
    template <typename SUBNET> using skipx5 = add_skip_layer<tagx5, SUBNET>;

    template<int outc, int nclasses, int tag, int yolo_tag, typename SUBNET>
    using detection_block = add_tag_layer<yolo_tag, con<3*(nclasses + 5), 1, 1, 1, 1,   //conv7 - yolo output
                            conv_block<outc,   3, 1,                                    //conv6
                            add_tag_layer<tag, conv_block<outc/2, 1, 1,                 //conv5 - branch output
                            conv_block<outc,   3, 1,                                    //conv4
                            conv_block<outc/2, 1, 1,                                    //conv3
                            conv_block<outc,   3, 1,                                    //conv2
                            conv_block<outc/2, 1, 1,                                    //conv1
                            SUBNET
                            >>>>>>>>>;

    template<int outc, int nclasses, int tag, int yolo_tag, typename SUBNET>
    using detection_block_spp = add_tag_layer<yolo_tag, con<3*(nclasses + 5), 1, 1, 1, 1,   //conv7 - yolo output
                                conv_block<outc,   3, 1,                                    //conv6
                                add_tag_layer<tag, conv_block<outc/2, 1, 1,                 //conv5 - branch output
                                conv_block<outc,   3, 1,                                    //conv4
                                conv_block<outc/2, 1, 1,                                    //spp - conv
                                concat4<tagx4, tagx3, tagx2, tagx1,
                                tagx4<max_pool<13,13,1,1,skipx1<                            //spp - c
                                tagx3<max_pool< 9, 9,1,1,skipx1<                            //spp - b
                                tagx2<max_pool< 5, 5,1,1,skipx1<                            //spp - a
                                tagx1<conv_block<outc/2, 1, 1,                              //conv3
                                conv_block<outc,   3, 1,                                    //conv2
                                conv_block<outc/2, 1, 1,                                    //conv1
                                SUBNET
                                >>>>>>>>>>>>>>>>>>>>>;

    template<int nclasses, template<int, int, int, int, typename> class Detection_block>
    using yolov3_ =
            detection_block<256,nclasses,8,12,  //8 is the branch tag (don't care here), 12 is a yolo tag
            concat2<tag7, tag3,                 //concat last layer with tag3 from darknet backbone
            tag7<upsample<2,
            conv_block<128, 1, 1,
            skip6<
            detection_block<512,nclasses,6,11,  //6 is the branch tag, 11 is a yolo tag
            concat2<tag5, tag2,                 //concat last layer with tag2 from darknet backbone
            tag5<upsample<2,
            conv_block<256, 1, 1,
            skip4<                              //pick branch with tag 4
            Detection_block<1024,nclasses,4,10, //4 is the branch tag, 10 is a yolo_tag
            skip1<
            darknet53
            >>>>>>>>>>>>>>;

    template<int nclasses>
    using yolov3 = yolov3_<nclasses,detection_block>;

    template<int nclasses>
    using yolov3_spp = yolov3_<nclasses,detection_block_spp>;

I am not particularly proud of that code, so please change it :)

Yeah I know it's slow. The cpu path could be a lot faster. It's doing something really basic. Dlib's DNN tooling is really meant to be used on a GPU.

@davisking do you plan on improving CPU performance some time in the future? Or is that low priority ?

It's a low priority.

* FPS: from 47 to 48 (darknet 64 fps)

@arrufat Any ideas why this is the case? I would have thought the convolutions take up most of the compute and those are taken care of by CUDNN. The same argument applies to darknet which also uses CUDNN. So there must be a bottleneck outside CUDNN...

Not yet, I've been playing around with this, but still haven't found why. Also VRAM usage is a bit higher in dlib, and I saw there are some small fluctuations, so maybe some copies are happening somewhere. Sadly I didn't have the time to test it thoroughly. Also, my fps in dlib include painting the "nice" bounding boxes on top of the images using OpenCV... Maybe that's slower than what darknet is using, or maybe we're not even comparing the same thing...

EDIT: sorry, my measurement does _not_ include the bounding box drawing, only detection + nms.

Also, compiling yolov3, yolov3-spp, yolov4 and yolov4-leaky in a single cpp file requires > 16GB of RAM. clang++ crashed with SIGKILL due to out of memory. So I think i experienced a template explosion for the first time.

It's a low priority.

@davisking do you have ideas of what would be required to boost inference speed such that it matched frameworks like onnxruntime and darknet? Would it require a lot of implementation re-design or massive concurrency or something else? I'd be quite excited to use dlib for dnn stuff if its performance was comparable to other frameworks.

It looks like cpu::tensor_conv could be improved by having different code paths depending on strides, padding, groups and dilations. @davisking mentioned that some of the convolutions in image_processing are much faster since they make assumptions on these parameters, like stride==1, padding==nfilters//2, groups==1 and dilation==1. So maybe, depending on the parameters, different functions could be called.

Also, is col2img a bottleneck? I wouldn't think so but never know.

Concurrency-wise, i think onnxruntime computes different nodes of the computational graphs on different threads if possible using a threadpool. @davisking would that also be on a roadmap ?

Using a similar concept to https://github.com/taskflow/taskflow maybe?

Yeah col2img is a huge bottleneck. For convs with large strides it is sort of ok to use col2img and BLAS, but really the whole pattern of col2img is not good for performance. I did it because it's a normal thing people were doing and it's easy to get all the conv operations going that way, and I didn't really care about CPU performance. The CPU codepath is there just as a fallback and for testing really. Mainly this is meant for execution on the GPU.

To make this fast a whole new conv implementation is needed and it would definitely have to have separate code paths for different conv settings.
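
For readers unfamiliar with the pattern being discussed, here is a rough single-channel sketch of the lowering step, usually called im2col (the thread writes col2img). Every kernel-sized patch is copied into a column so that the convolution becomes one matrix multiply. This is not dlib's implementation, and the function name and layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Lowers a [rows, cols] image into a row-major [k*k, out_h*out_w] matrix.
// Entries that fall into the zero padding stay 0.
std::vector<float> im2col(const std::vector<float>& img,
                          long rows, long cols, long k, long stride, long pad)
{
    assert(rows > 0 && cols > 0 && k > 0 && stride > 0 && pad >= 0);
    assert(static_cast<long>(img.size()) == rows * cols);
    const long out_h = (rows + 2 * pad - k) / stride + 1;
    const long out_w = (cols + 2 * pad - k) / stride + 1;
    std::vector<float> col(static_cast<std::size_t>(k * k * out_h * out_w), 0.0f);
    for (long ky = 0; ky < k; ++ky)
        for (long kx = 0; kx < k; ++kx)
            for (long oy = 0; oy < out_h; ++oy)
                for (long ox = 0; ox < out_w; ++ox)
                {
                    const long y = oy * stride + ky - pad;  // source row
                    const long x = ox * stride + kx - pad;  // source column
                    if (y >= 0 && y < rows && x >= 0 && x < cols)
                        col[static_cast<std::size_t>(((ky * k + kx) * out_h + oy) * out_w + ox)] =
                            img[static_cast<std::size_t>(y * cols + x)];
                }
    return col;
}
```

Even in this tiny form the cost is visible: a 3x3 image with a 2x2 kernel turns 9 input values into 16 copied ones, and the blow-up grows with the kernel size when the stride is 1.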

Anyway, I'm not personally interested in writing a faster CPU conv. You could upgrade what's here by deferring to dlib::spatially_filter_image() and its related functions where appropriate. Those are pretty performant conv functions that use SIMD instructions. But they only handle some types of convolution. So if you want to make a PR that adds those things as separate call paths when the conv settings allow, that would be cool.
