Hi,
I would like to apply YOLOv3 to two images at once for detection.
I found a code snippet in issue #483, but the API has changed a lot since then.
Could you show how to do that?
Thanks.
@AlexeyAB @saihv
+1.
I reckon YOLO could benefit from an implementation like the one exemplified here. Following #483 has led me to a brick wall: I'm stuck figuring out the right 'stepping', and none of these work:
- net.h*net.w*3
- l.h*l.w*l.n
- l.h*l.w*l.n*(l.classes+l.coords+1)
- l.output+(l.w*l.h*l.c)
The last bit is from saihv's version; l.c is always '0' in my case anyway...
Here's what I have so far:
net.batch = imgs.size();
float *X = (float *)calloc(net.batch * net.h * net.w * 3, sizeof(float));
for (int i = 0; i < net.batch; ++i) {
    image im;
    im.c = imgs[i].c;
    im.data = imgs[i].data;
    im.h = imgs[i].h;
    im.w = imgs[i].w;
    image sized;
    if (net.w == im.w && net.h == im.h) {
        sized = make_image(im.w, im.h, im.c);
        memcpy(sized.data, im.data, im.w * im.h * im.c * sizeof(float));
    }
    else
        sized = resize_image(im, net.w, net.h);
    memcpy(X + i * net.h * net.w * 3, sized.data, net.h * net.w * 3 * sizeof(float));
}
float *prediction = network_predict(net, X);
layer l = net.layers[net.n - 1];
box *boxes = (box *)calloc(l.w * l.h * l.n, sizeof(box));
float **probs = (float **)calloc(l.w * l.h * l.n, sizeof(float *));
for (int j = 0; j < l.w * l.h * l.n; ++j) probs[j] = (float *)calloc(l.classes, sizeof(float)); // sizeof(float), not sizeof(float *)
std::vector<std::vector<bbox_t>> bbox_vec_batch;
for (int j = 0; j < net.batch; ++j) {
    std::vector<bbox_t> bbox_vec;
    get_region_boxes(l, 1, 1, thresh, probs, boxes, 0, 0);
    if (nms) do_nms(boxes, probs, l.w * l.h * l.n, l.classes, nms);
    for (int k = 0; k < l.w * l.h * l.n; ++k) {
        int const obj_id = max_index(probs[k], l.classes);
        float const prob = probs[k][obj_id];
        if (prob > thresh) {
            bbox_t bbox;
            if (boxes[k].w > 1) {
                bbox.x = 0;
                bbox.w = imgs[j].w;
            }
            else {
                float w = boxes[k].w * imgs[j].w;
                bbox.x = round(boxes[k].x * imgs[j].w - w / 2);
                bbox.w = w;
            }
            if (boxes[k].h > 1) {
                bbox.y = 0;
                bbox.h = imgs[j].h;
            }
            else {
                float h = boxes[k].h * imgs[j].h;
                bbox.y = round(boxes[k].y * imgs[j].h - h / 2);
                bbox.h = h;
            }
            bbox.obj_id = obj_id;
            bbox.prob = prob;
            bbox.track_id = 0;
            bbox_vec.push_back(bbox);
        }
    }
    bbox_vec_batch.push_back(bbox_vec);
    l.output += 0; // unsolved stepping mystery
}
free(boxes);
free_ptrs((void **)probs, l.w * l.h * l.n);
free(X);
With all four candidate steppings, the output is only valid for the first element of the batch, so the 'stepping' is very likely the missing ingredient. For the record, as far as batch scalability is concerned, my own benchmarking on SSD shows a considerable performance gain moving from a Quadro K4200 up to a GeForce GTX 1080.
It seems the current version of YOLOv3 cannot detect two images simultaneously.
Could a multi-threading technique speed up detection for two images? Any related references? Thanks. @AlexeyAB
How about a hack using OpenCV or similar: concatenate both images into a single image, run detection, then split the results?
@panda9095
It seems the current version of YOLOv3 cannot detect two images simultaneously.
Could a multi-threading technique speed up detection for two images? Any related references? Thanks. @AlexeyAB
I think not. Only a large batch size can significantly accelerate detection.
@kmsravindra
How about a hack using OpenCV or similar: concatenate both images into a single image, run detection, then split the results?
I think this will reduce accuracy: the receptive field of each final activation would see context (parts) of the other images.
@jstumpin
The last bit is from saihv's version; l.c is always '0' in my case anyway...
l.output += 0; // unsolved stepping mystery
Use l.outputs
Try to use l.output = l.output + l.outputs; instead of l.output = l.output + (l.w*l.h*l.c);
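For context, a minimal sketch of where that stepping would sit in the batch loop from the snippet above (output_backup is my own addition, not part of the original snippet; the pointer must be restored because the cleanup code expects the original address):

layer l = net.layers[net.n - 1];
float *output_backup = l.output; // remember the original pointer
for (int j = 0; j < net.batch; ++j) {
    // get_region_boxes reads l.output, so it must point at image j's slice
    get_region_boxes(l, 1, 1, thresh, probs, boxes, 0, 0);
    // ... collect bbox_vec for image j as before ...
    l.output = l.output + l.outputs; // advance by one image's worth of outputs
}
l.output = output_backup; // restore before anything frees the layer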
My project is to detect two ROIs of an image simultaneously and in real time. The desired fps is about 10, but the model takes about 50~60 ms per ROI. Would multi-threaded prediction work in this case? Thanks. @AlexeyAB
@panda9095 I think yes.
Just try to run two instances of Darknet yolo in 2 separate terminals on the same PC.
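If both images must be handled in the same process, the same idea can be sketched with two Detector instances from yolo_v2_class.hpp, one per thread (the file names below are placeholders; whether this actually helps depends on how much of the GPU a single instance already occupies):

#include <thread>
#include <vector>
#include "yolo_v2_class.hpp" // Detector, bbox_t

int main() {
    // Each thread owns its own Detector instance.
    Detector det1("yolov3.cfg", "yolov3.weights");
    Detector det2("yolov3.cfg", "yolov3.weights");
    std::vector<bbox_t> res1, res2;
    std::thread t1([&] { res1 = det1.detect("roi1.jpg", 0.2f); });
    std::thread t2([&] { res2 = det2.detect("roi2.jpg", 0.2f); });
    t1.join();
    t2.join();
    return 0;
}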
@AlexeyAB
In addition to shifting these:
box *boxes = (box *)calloc(l.w * l.h * l.n, sizeof(box));
float **probs = (float **)calloc(l.w * l.h * l.n, sizeof(float *));
for (int j = 0; j < l.w * l.h * l.n; ++j) probs[j] = (float *)calloc(l.classes, sizeof(float));
...
free(boxes);
free_ptrs((void **)probs, l.w*l.h*l.n);
free(X);
inside/outside of the loop, using l.outputs does not fare well either. On the contrary, it yields an unstable outcome (sometimes zero detections, sometimes a bunch of noisy detections) on every execution of the program. With l.w*l.h*l.c, on the other hand, I get the first element of the batch correct (the rest are just clones of the first, hence wrong detections), but at least the accuracy is consistent every time.
Finally got it working.
From the YOLODLL_API Detector::Detector constructor in yolo_v2_class.cpp of the yolo_cpp_dll project, specify the batch size (I tried net.batch = batch_size elsewhere; it didn't work):
net = parse_network_cfg_custom(cfgfile, batch_size);
Add/edit the following lines (marked with // fixed comments) relative to the previous post:
float *X = (float *)calloc(net.batch * net.h * net.w * 3, sizeof(float));
for (int i = 0; i < net.batch; ++i) {
    image im;
    im.c = imgs[i].c;
    im.data = imgs[i].data;
    im.h = imgs[i].h;
    im.w = imgs[i].w;
    image sized;
    if (net.w == im.w && net.h == im.h) {
        sized = make_image(im.w, im.h, im.c);
        memcpy(sized.data, im.data, im.w * im.h * im.c * sizeof(float));
    }
    else
        sized = resize_image(im, net.w, net.h);
    memcpy(X + i * net.h * net.w * 3, sized.data, net.h * net.w * 3 * sizeof(float));
    free(sized.data); // fixed memory leak
}
float *prediction = network_predict(net, X);
layer l = net.layers[net.n - 1];
box *boxes = (box *)calloc(l.w * l.h * l.n, sizeof(box));
float **probs = (float **)calloc(l.w * l.h * l.n, sizeof(float *));
for (int j = 0; j < l.w * l.h * l.n; ++j) probs[j] = (float *)calloc(l.classes, sizeof(float)); // sizeof(float), not sizeof(float *)
std::vector<std::vector<bbox_t>> bbox_vec_batch;
for (int j = 0; j < net.batch; ++j) {
    std::vector<bbox_t> bbox_vec;
    get_region_boxes(l, 1, 1, thresh, probs, boxes, 0, 0);
    if (nms) do_nms(boxes, probs, l.w * l.h * l.n, l.classes, nms);
    for (int k = 0; k < l.w * l.h * l.n; ++k) {
        int const obj_id = max_index(probs[k], l.classes);
        float const prob = probs[k][obj_id];
        if (prob > thresh) {
            bbox_t bbox;
            if (boxes[k].w > 1) {
                bbox.x = 0;
                bbox.w = imgs[j].w;
            }
            else {
                float w = boxes[k].w * imgs[j].w;
                bbox.x = round(boxes[k].x * imgs[j].w - w / 2);
                bbox.w = w;
            }
            if (boxes[k].h > 1) {
                bbox.y = 0;
                bbox.h = imgs[j].h;
            }
            else {
                float h = boxes[k].h * imgs[j].h;
                bbox.y = round(boxes[k].y * imgs[j].h - h / 2);
                bbox.h = h;
            }
            bbox.obj_id = obj_id;
            bbox.prob = prob;
            bbox.track_id = 0;
            bbox_vec.push_back(bbox);
        }
    }
    bbox_vec_batch.push_back(bbox_vec);
    l.output += l.h * l.w * l.n * (l.classes + l.coords + 1); // fixed stepping issue
}
free(boxes);
free_ptrs((void **)probs, l.w * l.h * l.n);
free(X);
Benchmarked on 200 samples (the second-to-last and last columns are the average and total run time in seconds, respectively):
NVIDIA Quadro K4200:
batch size = 2
cpu 8.25% (1.25)
mem 3110.154MB
predict 100 0.136870 13.687000
loadimg 200 0.004135 0.827000
main 1 14.531000 14.531000
batch size = 1
cpu 8.44% (2.23)
mem 2485.166MB
predict 200 0.080695 16.139000
loadimg 200 0.004330 0.866000
main 1 17.021000 17.021000
NVIDIA Geforce GTX 1080:
batch size = 2
cpu 9.07% (5.01)
mem 3818.476MB
predict 100 0.034790 3.479000
loadimg 200 0.003320 0.664000
main 1 4.158000 4.158000
batch size = 1
cpu 8.36% (5.65)
mem 3186.131MB
predict 200 0.022865 4.573000
loadimg 200 0.004425 0.885000
main 1 5.471000 5.471000
Thanks @AlexeyAB @panda9095 @saihv
@jstumpin Don't you need to normalize the image data like the pj version does?
im.data[k*w*h + i*w + j] = data[i*step + j*c + k] / 255.;
@wait1988
We do bother: it's being normalized via imdecode (if one uses OpenCV) or load_image_stb (otherwise).
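For anyone who wants to replicate that normalization by hand, here is a minimal sketch (mat_to_planar_float is a hypothetical helper, not a repo function) that converts an interleaved 8-bit OpenCV Mat into darknet's planar float layout in [0,1], mirroring the loop quoted above:

#include <opencv2/opencv.hpp>
#include <vector>

std::vector<float> mat_to_planar_float(const cv::Mat &m) {
    const int w = m.cols, h = m.rows, c = m.channels();
    std::vector<float> out((size_t)w * h * c);
    for (int k = 0; k < c; ++k)          // channel-major (planar) output
        for (int i = 0; i < h; ++i)      // rows
            for (int j = 0; j < w; ++j)  // columns, interleaved input
                out[(size_t)k * w * h + i * w + j] =
                    m.data[i * m.step + (size_t)j * c + k] / 255.0f;
    return out;
}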
@jstumpin Ok, I see, I'll try it now.
@jstumpin I use the pj version and added the code you provided above. It doesn't work:
CUDA Error: invalid argument.
@wait1988
Not sure how it behaves on the original pjreddie repo, but it's quite hard to get a CUDA error with the above code. The worst you can get is either wrong detections or no detections at all (you can reproduce such mishaps by setting parse_network_cfg_custom(cfgfile, 1) or enabling set_batch_network(&net, 1) in the constructor of yolo_v2_class.cpp despite loading more than one image). The code will break, though (the program will stall, but with no CUDA error), if the preassigned network batch size != the current image batch size (e.g. the final image batch < the network batch).
The code will break, though (the program will stall, but with no CUDA error), if the preassigned network batch size != the current image batch size (e.g. the final image batch < the network batch).
The cause for the above glitch:
https://github.com/pjreddie/darknet/issues/915#issue-336229064
The potential solution:
set_batch_network(&net, batch_size);
prior to:
float *X = (float *)calloc(net.batch*net.h*net.w * 3, sizeof(float));
where batch_size is the size of the current image batch.
Patch set_batch_network accordingly:

@@ -362,7 +362,9 @@ void set_batch_network(network *net, int b)
         net->layers[i].batch = b;
 #ifdef CUDNN
         if(net->layers[i].type == CONVOLUTIONAL){
-            cudnn_convolutional_setup(net->layers + i, cudnn_fastest);
+            layer *l = net->layers + i;
+            cudnn_convolutional_setup(l, cudnn_fastest);
+            l->workspace_size = get_workspace_size(*l);
Thanks @fsaxen @AlexeyAB
Finally got it working.
From the YOLODLL_API Detector::Detector constructor in yolo_v2_class.cpp of the yolo_cpp_dll project, specify the batch size (I tried net.batch = batch_size elsewhere; it didn't work):
The solution by @jstumpin works for yolov2, but for yolov3 the strategy for computing bboxes from the network output is quite different: it depends not only on net.layers[net.n-1] but also on the other layers of type YOLO (this layer type only exists in yolov3).
My solution for batch detection on yolov3 is as follows. All input images have been resized (to the network size) and normalized.
// assume channel 3
// img_ptrs is of type std::vector< std::shared_ptr<image_t> > to
// properly transfer image data.
float *X = (float *)calloc(net.batch * net.w * net.h * 3, sizeof(float));
for (int i = 0; i < net.batch; i++)
{
    image im;
    im.c = img_ptrs[i]->c;
    im.w = img_ptrs[i]->w;
    im.h = img_ptrs[i]->h;
    im.data = img_ptrs[i]->data;
    image sized;
    if (net.w == im.w && net.h == im.h)
    {
        sized = make_image(im.w, im.h, im.c);
        memcpy(sized.data, im.data, im.w * im.h * im.c * sizeof(float));
    }
    else sized = resize_image(im, net.w, net.h);
    memcpy(X + i * net.h * net.w * 3, sized.data, net.h * net.w * 3 * sizeof(float));
    free(sized.data);
}
// predict
network_predict(net, X);
layer l = net.layers[net.n - 1]; // needed below for l.classes
// get bbox
std::vector< std::vector<bbox_t> > bbox_vec_batch;
for (int j = 0; j < net.batch; j++)
{
    int nboxes = 0;
    int letterbox = 0;
    float hier_thresh = 0.5;
    float nms = 0.4f; // was 'int nms = 0.4;', which truncates to 0
    detection *dets = get_network_boxes(&net, img_ptrs[j]->w, img_ptrs[j]->h,
                                        thresh, hier_thresh,
                                        0, 1, &nboxes, letterbox);
    do_nms_sort(dets, nboxes, l.classes, nms);
    std::vector<bbox_t> bbox_vec;
    for (int i = 0; i < nboxes; ++i)
    {
        box b = dets[i].bbox;
        const int obj_id = max_index(dets[i].prob, l.classes);
        const float prob = dets[i].prob[obj_id];
        if (prob > thresh) // thresh is given
        {
            bbox_t bbox;
            bbox.x = std::max((double)0, (b.x - b.w / 2.) * img_ptrs[j]->w);
            bbox.y = std::max((double)0, (b.y - b.h / 2.) * img_ptrs[j]->h);
            bbox.w = b.w * img_ptrs[j]->w;
            bbox.h = b.h * img_ptrs[j]->h;
            bbox.obj_id = obj_id;
            bbox.prob = prob;
            bbox.track_id = 0;
            bbox_vec.push_back(bbox);
        }
    }
    bbox_vec_batch.push_back(bbox_vec);
    free_detections(dets, nboxes);
    // stepping: advance each detection-type layer to the next image's output
    for (int k = 0; k < net.n; k++) // renamed from j to avoid shadowing the batch index
    {
        layer &temp_l = net.layers[k];
        if (temp_l.type == YOLO || temp_l.type == REGION || temp_l.type == DETECTION)
        {
            // temp_l.output += temp_l.h*temp_l.w*temp_l.n*(temp_l.classes + temp_l.coords + 1);
            temp_l.output = temp_l.output + temp_l.outputs;
        }
    }
}
for (int j = 0; j < net.n; j++) // reset layer output pointers for free.
{
    layer &temp_l = net.layers[j];
    if (temp_l.type == YOLO || temp_l.type == REGION || temp_l.type == DETECTION)
    {
        for (int i = 0; i < net.batch; i++)
            temp_l.output = temp_l.output - temp_l.outputs;
    }
}
if (X)
    free(X);
The main body of my solution is similar to @jstumpin's. Thanks @jstumpin @AlexeyAB for sharing your code.
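For orientation, a hypothetical way to feed the snippet above, assuming Detector::load_image from yolo_v2_class.hpp (which returns an image_t with normalized float data); the file names are placeholders:

std::vector<std::shared_ptr<image_t>> img_ptrs;
img_ptrs.push_back(std::make_shared<image_t>(detector.load_image("frame0.jpg")));
img_ptrs.push_back(std::make_shared<image_t>(detector.load_image("frame1.jpg")));
float thresh = 0.2f;
// ... run the batch snippet above; bbox_vec_batch[j] then holds image j's boxes.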
My solution (above) works, but when I test it on multiple images, only the first image gives perfect detections. The same thing happens with @jstumpin's solution. Could someone explain that?
Sorry I haven't been following this all along, but:
In the snippet by @jstumpin, l.output += l.h*l.w*l.n*(l.classes + l.coords + 1);
In YOLOv2, l.n*(l.classes + l.coords + 1) should be the same as l.c (see here), which is what I originally used in my version. And note that in v2, l.outputs is essentially the same as l.h*l.w*l.c.
In YOLOv3, the definition of l.outputs changes between YOLO, region, and detection layers; maybe that's a cause for concern? If only the first image is giving you right detections, it means the step size is wrong somehow.
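A quick way to check those identities on a given build is to print each detection-type layer's candidate step sizes next to l.outputs (a throwaway diagnostic using the layer fields discussed in this thread):

for (int i = 0; i < net.n; ++i) {
    layer l = net.layers[i];
    if (l.type == YOLO || l.type == REGION || l.type == DETECTION) {
        printf("layer %d: outputs=%d  h*w*c=%d  h*w*n*(classes+coords+1)=%d\n",
               i, l.outputs, l.h * l.w * l.c,
               l.h * l.w * l.n * (l.classes + l.coords + 1));
    }
}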
I finally got it working. First of all, all the snippets above are correct.
The truth is, in my image batch, all the images had different sizes and totally different backgrounds; they were randomly downloaded from Google.
When I use images from the same video sequence (and thus of the same size), it works well.
I could not figure out the exact reason, but it works now. Thanks @saihv for your kind advice.
For whatever reason, the glaring context of this thread had eluded me; unsurprisingly, the 'solution' only fits YOLOv2.
Unfortunately, I cannot reproduce the snippets contributed by @anl13: the process stalls momentarily at get_network_boxes and then exits. It works just dandy for YOLOv2.
Adapting my YOLOv2 code to @anl13's, based on @saihv's neat explanations, leads nowhere:
for (int k = 0; k < net.n; ++k) {
    layer temp_l = net.layers[k];
    if (temp_l.type == YOLO || temp_l.type == REGION || temp_l.type == DETECTION)
        l.output += l.outputs; // note: this steps the last layer's 'l', not temp_l
}
However, at least the process churns out inexplicable figures until the final batch, a classic sign of a wrong step size. Actually, I don't even have to add that new loop to 'achieve' the gibberish outcome; the existing code already does.
Any pointers?
That's a bit confusing, @jstumpin. I get correct results on both yolov3 and yolov2 with my snippets; I have tested my code on thousands of frames with batch sizes of 2 and 4, so I think l.outputs is the right stepping.
I haven't checked all the details of these layers thoroughly, so I can't tell what's wrong with your results.
By the way, I added some lines of code to my snippets yesterday:
for (int j = 0; j < net.n; j++) // reset layer output pointer for free.
{
    layer &temp_l = net.layers[j];
    if (temp_l.type == YOLO || temp_l.type == REGION || temp_l.type == DETECTION)
    {
        for (int i = 0; i < net.batch; i++)
            temp_l.output = temp_l.output - temp_l.outputs;
    }
}
These lines reset each l.output to its original position; otherwise the later free sees a shifted pointer, which caused a weird segmentation fault.
@anl13 So, just to confirm, the problem was just that you had images of multiple sizes? And once you corrected that, using a step size of l.outputs on the yolo, region, and detection layers made batch detection work?
(Intuitively, yes, the default approach wouldn't be amenable to images of different sizes.)
The default yolo_v2_class.cpp already takes care of standardizing image sizes within the Detector::detect function body, as shown in the various snippets above. In fact, resizing already takes place in functions defined in yolo_v2_class.hpp before the function in question. I assume @anl13 is also using the same constructor as yolo_v2_class.cpp for parsing/loading the config/weights? If so, it adds to the mystery as to why I am the only one facing the 'stalling' curse on YOLOv3 with his snippets. Omitting set_batch_network(&net, batch_size);, which I suggested earlier to handle dynamic batch sizes, changes nothing.
I'd like to do batch inference using the Python wrapper.
My desired interface is:
yolo = YOLOv3(batch_size=32)
images = []
for i in range(32):
    success, image = video_capture.read()
    images.append(image)
images = np.array(images)
results = yolo.batch_inference(images)
Where could I put @anl13 's C/C++ code to enable this?
I think yes @saihv
My constructor is similar to yolo_v2_class.cpp with the batch size modified, @jstumpin, just as the earlier discussion suggested (it is essential to call set_batch_network(&net, batch_size)). Here is my complete code (forked from this repository): https://github.com/anl13/darknet. Maybe it is helpful somehow.
@pawarren I once tried to change the Python wrapper but found it a bit complicated. I think you can integrate the snippets into a C function, export it from the darknet.so library, and define the corresponding data types and functions in darknet.py; then you can use it. I haven't implemented that (I don't have time to work on it these days), but I think it would work.
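To sketch what that could look like: batch_detect below and its calling convention are invented for illustration (not an existing darknet API); only network_predict, get_network_boxes, and the stepping logic come from the snippets in this thread, and the darknet.h header is assumed:

#include <stdlib.h>
#include "darknet.h"

extern "C" detection **batch_detect(network *net, float *X, int batch,
                                    float thresh, float hier_thresh,
                                    int *nboxes /* array of length batch */)
{
    detection **dets = (detection **)calloc(batch, sizeof(detection *));
    network_predict(*net, X); // X: batch * net->w * net->h * 3 normalized floats
    for (int b = 0; b < batch; ++b) {
        dets[b] = get_network_boxes(net, net->w, net->h, thresh, hier_thresh,
                                    0, 1, &nboxes[b], 0);
        for (int i = 0; i < net->n; ++i) { // step to the next image's output
            layer *l = &net->layers[i];
            if (l->type == YOLO || l->type == REGION || l->type == DETECTION)
                l->output += l->outputs;
        }
    }
    for (int i = 0; i < net->n; ++i) { // restore the stepped pointers
        layer *l = &net->layers[i];
        if (l->type == YOLO || l->type == REGION || l->type == DETECTION)
            l->output -= (size_t)batch * l->outputs;
    }
    return dets; // caller: free_detections(dets[b], nboxes[b]) for each b, then free(dets)
}

On the Python side, this would then be declared in darknet.py with ctypes, analogous to the existing bindings.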
Culprit identified as if (l.batch == 2) avg_flipped_yolo(l);, FYI @AlexeyAB. Crisis averted; much kudos to @anl13 for pointing this out in his repo.
At least in my case, set_batch_network(&net, batch_size) is only needed for its original purpose, i.e. dynamic batch sizes.
I also ditched the old get_region_boxes approach and replaced it with get_network_boxes, again per @anl13's snippets; the former proved nearly impossible to step correctly, specifically for YOLOv3.
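For anyone hitting the same wall: the call sits in forward_yolo_layer in yolo_layer.c, where flip-averaging is keyed purely off l.batch == 2, so a genuine two-image batch gets its predictions corrupted. The simplest workaround for batch inference (a sketch; @anl13's fork may solve it differently) is to disable that call:

// yolo_layer.c, forward_yolo_layer:
// avg_flipped_yolo implements test-time flip averaging, but the l.batch == 2
// trigger cannot distinguish a flipped image pair from a real 2-image batch.
// Workaround for batch inference: comment the call out.
// if (l.batch == 2) avg_flipped_yolo(l);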
@anl13 I was going through your darknet repo; I cannot find the prediction output, nor how to input multiple files like we can in @AlexeyAB's repo. Please help.
Stepping does not work as expected with any of the snippets you posted here.
Has anyone tried network_predict_data_multi or network_predict_data? Those functions are also accessible through the Python wrapper, but I couldn't figure out how to specify the argument type DATA and the result type.
predict_multi_image = lib.network_predict_data_multi
predict_multi_image.argtypes =
predict_multi_image.restype =
Hi all,
I tried to add batch inference to the existing codebase in #4099. It worked for me. Please take a look when you have time.