Darknet: yolov3-tiny_xnor.cfg running on ARM

Created on 12 Feb 2019 · 53Comments · Source: AlexeyAB/darknet

Hi @AlexeyAB,

I am trying to run yolov3-tiny_xnor.cfg for detection in a raspberry pi. I have trained the network, tested it on an Intel-based system and it just works fine. However, when I run it on the RPi, nothing is detected! I am using the very same command and the very same version of the framework on both sides. Can you help me figure out what is going on?

I am using the command
./darknet detector test data/coco.data cfg/yolov3-tiny_xnor.cfg yolov3-tiny_xnor_last.weigths data/person.jpg

The content of coco.data is

classes = 80
names   = data/coco/coco.names
backup  = backup/

The content of yolov3-tiny_xnor.cfg

[net]
# Testing
batch=1
subdivisions=1
# Training
# batch=64
# subdivisions=2
width=416
height=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001
burn_in=1000
max_batches = 500200
policy=steps
steps=400000,450000
scales=.1,.1

[convolutional]
batch_normalize=1
filters=16
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
xnor=1
bin_output=1
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
xnor=1
bin_output=1
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
xnor=1
bin_output=1
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
xnor=1
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
xnor=1
bin_output=1
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=1

[convolutional]
xnor=1
bin_output=1
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

###########

[convolutional]
xnor=1
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=255
activation=linear



[yolo]
mask = 3,4,5
anchors = 10,14,  23,27,  37,58,  81,82,  135,169,  344,319
classes=80
num=6
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

[route]
layers = -4

[convolutional]
xnor=1
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 8

[convolutional]
xnor=1
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky


[convolutional]
size=1
stride=1
pad=1
filters=255
activation=linear

[yolo]
mask = 0,1,2
anchors = 10,14,  23,27,  37,58,  81,82,  135,169,  344,319
classes=80
num=6
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

The .weights file can be found here: http://www.mediafire.com/file/vahpux9xefw1tci/yolov3-tiny_xnor_122000.weights

Finally, the person.jpg image is the one already present in the data folder.

Bug fixed

Source

joaomiguelvieira

Most helpful comment

@AlexeyAB, as promised, here’s the paper: https://web.tecnico.ulisboa.pt/~joaomiguelvieira/public/docs/papers/a_product_engine_for_energy-efficient_execution_of_binary_neural_networks_using_resistive_memories.pdf

joaomiguelvieira on 27 Nov 2019

👍2

All 53 comments

@joaomiguelvieira Hi,

What parameters in the Makefile do you use in both cases?
What command do you use?
Does any other yolo-model work fine on RPi?
Just to check, does yolov3-tiny_xnor_122000.weights work on RPi if you comment this line? https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/detector.c#L1246

AlexeyAB on 12 Feb 2019

Hi @AlexeyAB,

I don’t set any parameters on the Makefile. All the accelerations are disabled.

The command I am using is simply ./darknet detector coco.data yolov3-tiny_xnor.cfg yolov3-tiny_xnor_122000.weights data/person.jpg

Every other models I tried on RPi work just fine. The xnor version of yolov3 is the one that doesn’t work.

I will give you feedback about commenting the line in a moment.

joaomiguelvieira on 12 Feb 2019

Hi @AlexeyAB,

Sorry for the late answer. After commenting the line, the process does not seem to finish. It has been running for a long time now. Either way, I will let it run to see if it finishes or not.

joaomiguelvieira on 12 Feb 2019

@joaomiguelvieira Hi,

If the detection-process does not end, try to comment out this line: https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/detector.c#L1302

Also try to set OPENMP=1 in Makefile then it will work faster.

I didn't test it on RPi, so I don't know can it work fine on it.

AlexeyAB on 12 Feb 2019

Hi @AlexeyAB,

After all, it just finished running.

If I comment this line it works just fine and detects the objects. Should I run it with this line commented, then?
https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/detector.c#L1246

joaomiguelvieira on 12 Feb 2019

Should I run it with this line commented, then?

No. It disables XNOR acceleration, so it uses XNOR in floats, just to make sure everything else in code is OK.

Do you use 32-bit or 64-bit OS on RPi?

AlexeyAB on 12 Feb 2019

Hi again @AlexeyAB,

I tried on both (a RPi 2 armv7 and a RPi 3 armv8). Neither works. On RPi 3 the OS is 64-bit and on the RPi 2 the OS is 32-bit.

joaomiguelvieira on 12 Feb 2019

@joaomiguelvieira

Un-comment this line: https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/detector.c#L1246

Try to change this line: https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/gemm.c#L2098
to these:

                int tmp_count = 0;// popcnt_64(c_bit64);
                int z;
                for (z = 0; z < 64; ++z) {
                    tmp_count += (c_bit64 & 1);
                    c_bit64 = c_bit64 >> 1;
                }

Does it work for XNOR?

Also does lscpu show that your RPi 3 is a little endian or big endian?

The ARM architecture can run both little and big endianess, but Raspberry usually uses a little endian mode, the same as x86_64, so it shouldn't be a problem.

AlexeyAB on 12 Feb 2019

Hi @AlexeyAB,

That solved the problem. So it seems that popcnt_64(c_bit64); is not doing what it should.

However, I am guessing that this pop function should be much faster than the for loop, am I right?

joaomiguelvieira on 12 Feb 2019

@joaomiguelvieira

That solved the problem. So it seems that popcnt_64(c_bit64); is not doing what it should.

However, I am guessing that this pop function should be much faster than the for loop, am I right?

Yes. So we just localized the problem.

It looks like a bug.

Rollback all previous changes.

And try to change this line:
https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/gemm.c#L2072
to this:
tmp_count += __builtin_popcount(val64 >> 32);

Do make and try to detect.

If it helps, then do another one change:

Change this line: https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/gemm.c#L2068
to this
#if defined(__x86_64__) || defined(__aarch64__)

AlexeyAB on 12 Feb 2019

@AlexeyAB, this works for ARM 32-bit. I will update you about the 64-bit arm in a moment.

joaomiguelvieira on 12 Feb 2019

@joaomiguelvieira
On ARM 64-bit you can try changes as previous, and temporary comment this line: https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/gemm.c#L2072

to make sure the code uses https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/gemm.c#L2069 instead of https://github.com/AlexeyAB/darknet/blob/00e992a600ba781da635b4f75fc02b2458639c4e/src/gemm.c#L2071-L2072

AlexeyAB on 12 Feb 2019

Hi @AlexeyAB,

This also works for 64-bits.

joaomiguelvieira on 12 Feb 2019

👍1

@joaomiguelvieira Hi,

Thanks for your tests!
I added this fix: https://github.com/AlexeyAB/darknet/commit/449fcfed7547a9203a7f44afd37835d373268201

AlexeyAB on 13 Feb 2019

What are your image and video response times for Pi 3 with this working configuration? I assume it compiled with OpenCV 4x?

spinoza1791 on 14 Feb 2019

Hi @spinoza1791,

My goal is to accelerate the XNORNet of tiny YoloV3 using embedded systems. Therefore, I did not measure response times. Nevertheless, should you be interested in measuring them and I can give you all the support files to do so (I cannot do it myself since I do not have access to the hardware anymore).

joaomiguelvieira on 14 Feb 2019

@joaomiguelvieira

My goal is to accelerate the XNORNet of tiny YoloV3 using embedded systems.

Do you plane to implement XNORnet by using SIMD ARM?

AlexeyAB on 14 Feb 2019

Hi @AlexeyAB,

I plan to do something a bit different than that. I plan to change ARM architecture (using gem5) to include a binary dot product unit. That will accelerate the binary convolution significantly.

joaomiguelvieira on 14 Feb 2019

Hi @spinoza1791,

My goal is to accelerate the XNORNet of tiny YoloV3 using embedded systems. Therefore, I did not measure response times. Nevertheless, should you be interested in measuring them and I can give you all the support files to do so (I cannot do it myself since I do not have access to the hardware anymore).

Yes, any support files you have would be great! andrew.[email protected]

spinoza1791 on 15 Feb 2019

Hi @spinoza1791,

You can find the .weights file I used (trained myself) for coco dataset at https://www.mediafire.com/file/dm45vmfedz6e73z/yolov3-tiny_xnor_last.weights/file

The configuration file that you should use is located in cfg/yolov3-tiny_xnor.cfg. You should, however, edit it before start detecting (set batch=64 and subdivisions=1).

The coco.data file is located under cfg/coco.data. Edit this file and put your own directories.

The coco.names is located under cfg/coco.names.

You can use the images at data/ to detect objects.

Should you need anything else, let me know.

joaomiguelvieira on 15 Feb 2019

@joaomiguelvieira

I plan to do something a bit different than that. I plan to change ARM architecture (using gem5) to include a binary dot product unit. That will accelerate the binary convolution significantly.

Do you want to implement just fused XOR+POPCNT as SIMD instructions for ARM architecture?
Or do you want to implement even SIMD-GEMM like WMMA-GEMM on Tensor Cores on nVidia GPU (wmma::bmma_sync() that does dot-product on 8x8x128 bits matrix).

AlexeyAB on 15 Feb 2019

@AlexeyAB,

I implemented XOR+POPCNT as a single instruction for the ARMv8 architecture. However, it goes a little further than that: some filters are stored in the execution path of the CPU, in a special memory. When those filters are used, the program does not have to load them from the main memory. This can accelerate the execution in the order of 20%. Furthermore, memory accesses are reduced, and this solution also has some benefits in terms of energy efficiency.

This is a research project, so I would not expect such architecture to become available soon. Nevertheless, it is an interesting topic.

joaomiguelvieira on 15 Feb 2019

👍1

Results: 2.8x speedup on Pi 3B (armv7l w/ OpenMP and ARM_NEON optimized in Makefile)
Loading weights from yolov3-tiny_xnor_last.weights...
seen 64
Done!
data/person.jpg: Predicted in 1778.256000 milli-seconds.
person: 37%
VS.
Loading weights from yolov3-tiny.weights...
seen 64
Done!
data/person.jpg: Predicted in 5112.976000 milli-seconds.
dog: 89%
dog: 82%
person: 98%
sheep: 83%

If we combined the xnor work here with the NNPack work of https://github.com/shizukachan/darknet-nnpack (it is currently not updated with working xnor functions), we would have a close to real-time FPS for Pi3. For example, I can get 1.1 FPS (on Pi3) with yolov3-tiny.weights using nnpack (better optimization). If the nnpack were made compatible with xnor, we could potentially see yolov3-tiny_xnor running at ~3-4 FPS!

spinoza1791 on 24 Feb 2019

@spinoza1791

Results: 2.8x speedup on Pi 3B (armv7l w/ OpenMP and ARM_NEON optimized in Makefile)

Loading weights from yolov3-tiny_xnor_last.weights...
data/person.jpg: Predicted in 1778.256000 milli-seconds.
...
Loading weights from yolov3-tiny.weights...
data/person.jpg: Predicted in 5112.976000 milli-seconds.

Did you optimize this repo https://github.com/AlexeyAB/darknet by using ARM_NEON?

AlexeyAB on 24 Feb 2019

@spinoza1791

Results: 2.8x speedup on Pi 3B (armv7l w/ OpenMP and ARM_NEON optimized in Makefile)
Loading weights from yolov3-tiny_xnor_last.weights...
data/person.jpg: Predicted in 1778.256000 milli-seconds.
...
Loading weights from yolov3-tiny.weights...
data/person.jpg: Predicted in 5112.976000 milli-seconds.

Did you optimize this repo https://github.com/AlexeyAB/darknet by using ARM_NEON?

Yes, I used the ARM_NEON opt in my test above. ARM_NEON + OpenMP in my test above is still much slower than ARM_NEON + NNPack (1.1FPS on yolov3_tiny) for Pi3. The NNPack multithreading is far better optimized for Pi than OpenMP. This is why we should add an NNPack opt to this repo, so that we can combined Xnor + ARM_NEON + NNPack for best results on Pi.

spinoza1791 on 24 Feb 2019

@spinoza1791

Did you try Yolo v3 on Pi3 that is implemented inside OpenCV-dnn module? https://github.com/opencv/opencv/blob/8bde6aea4ba19454554aa008922d967b552e79cc/samples/dnn/object_detection.cpp#L192-L222

What time and FPS can you get?

AlexeyAB on 24 Feb 2019

@spinoza1791

Did you try Yolo v3 on Pi3 that is implemented inside OpenCV-dnn module? https://github.com/opencv/opencv/blob/8bde6aea4ba19454554aa008922d967b552e79cc/samples/dnn/object_detection.cpp#L192-L222

What time and FPS can you get?

Here are my benchmarks for Pi3B+ all using OpenCV w/NEON/FPV4/TBB support
Image detector test results, in secs, are best-of-five trials, using "person.jpg":

yolov2-tiny - ocv3.4.0 - DNN module
openmp 1.21s

yolov3-tiny - ocv4.0.1 - DNN module
openmp 1.32s

yolov2-tiny - ocv4.0.1 - DNN module
openmp 1.41s

yolov3-tiny - ocv3.4.0 - darknet(alexeyAB)
openmp 8.46s
openmp+neon 4.90s

yolov3-tiny_xnor - ocv3.4.0 - darknet(alexeyAB)
openmp 1.95s
openmp+neon 1.56s

yolov3-tiny - ocv3.4.0 - darknet-nnpack(shizukachan)
openmp 9.66s
openmp+neon 5.37s
nnpack 0.81s
nnpack+neon 0.75s Best result

spinoza1791 on 2 Mar 2019

👍1

@spinoza1791
Maybe I will try to optimize XNOR for ARM CPU.

AlexeyAB on 2 Mar 2019

❤2

I used his weight and cfg file,in the darknet-no-gpu version,but found the error,
Done!
not used FMA & AVX2
used AVX
error: is no non-optimized version

My cpu support the AVX directive.

andeyeluguo on 28 Apr 2019

@andeyeluguo

What CPU do you use?
Do you compile for x64 & Release? https://hsto.org/webt/uh/fk/-e/uhfk-eb0q-hwd9hsxhrikbokd6u.jpeg

AlexeyAB on 28 Apr 2019

I used Intel Xeon E5 2697 v2
I generate for x64 &release

andeyeluguo on 28 Apr 2019

maybe the reason is that your code support AVX2,but not AVX, I think.

andeyeluguo on 28 Apr 2019

the code in gemm.c in the function im2col_cpu_custom_bin(),shows that you only fulfill the avx2 version,maybe

andeyeluguo on 28 Apr 2019

@andeyeluguo

Try to change this line: https://github.com/AlexeyAB/darknet/blob/cce34712f6928495f1fbc5d69332162fc23491b9/src/gemm.c#L515
to this
#if (defined(__AVX__) && defined(__x86_64__)) || defined(_WIN64_DISABLED)

AlexeyAB on 28 Apr 2019

it can process a image,but it shows that
not compiled with opencv

when I try to run it with a video,
darknet_no_gpu.exe detector demo data/coco.data joao_xnor/yolov3-tiny_xnor.cfg joao_xnor/yolov3-tiny_xnor_122000.weights -i 1 ./video/room.mp4
it shows that
'Demo needs OpenCV for webcam images'

andeyeluguo on 29 Apr 2019

@andeyeluguo Yes, you should compile Darknet with OpenCV to process video from files/cameras: https://github.com/AlexeyAB/darknet#requirements

AlexeyAB on 29 Apr 2019

@joaomiguelvieira Hi,

I plan to do something a bit different than that. I plan to change ARM architecture (using gem5) to include a binary dot product unit. That will accelerate the binary convolution significantly.

Do you have any interesting results?

AlexeyAB on 24 Jul 2019

Hi, @AlexeyAB,

Indeed I have some interesting results. I was able to improve 10% performance and 8% energy efficiency for the Cortex-A53. I will be able to share the artifact very soon, as this work is in the process of being published.

joaomiguelvieira on 24 Jul 2019

👍1

@joaomiguelvieira Thanks, it will be interesting. Can you share your source C code of XNOR_GEMM that you test on your changed ARM architecture?

AlexeyAB on 24 Jul 2019

Hi @AlexeyAB,

Certainly. Please, find the files that I modified attached.
darknet_mod.zip

As soon as I get permission, I will also send you the paper so you can see in detail what I changed.

joaomiguelvieira on 24 Jul 2019

👍2

@joaomiguelvieira Thanks!

AlexeyAB on 24 Jul 2019

Maybe I will try to optimize XNOR for ARM CPU.

Has this optimization been done on the darknet?

EhsanVahab on 27 Nov 2019

joaomiguelvieira on 27 Nov 2019

👍2

Dear @joaomiguelvieira, thanks for your comment
I need some reference to know how can I make a BCNN by darknet and implement it on a raspberry pi.
would it be possible to guide me?

EhsanVahab on 27 Nov 2019

Hello, @EhsanVahab,
Darknet already has some BNNs ready to use out-of-the-box. The corresponding configuration files have the suffix _xnor.cfg.
To create a BNN from an existing CNN configuration file you should add the line xnor=1 in all convolutional layers except for the first one. For instance, compare the files tiny-yolo_xnor.cfg and tiny-yolo.cfg. You will see that the only difference is that in tiny-yolo_xnor.cfg the first line of each convolutional layer except for the first one is xnor=1.
After configuring your BNN, you will have to train it. There is no easy way of binarizing the weights of a pre-trained CNN, so it will have to be trained from scratch. If you want to run the flow and get started with darknet and BNNs before getting to trainning, I suggest you try to use the .weights file that I have supplied earlier in this thread: https://www.mediafire.com/file/dm45vmfedz6e73z/yolov3-tiny_xnor_last.weights/file. These weights refer to the Street View House Numbers dataset.
I hope this may help you.

joaomiguelvieira on 27 Nov 2019

👍1

@joaomiguelvieira , thanks for your quick and complete response to my question.
I will train my model then share the results with you, accuracy and speed on rasp pi.
your comment is very helpful.

EhsanVahab on 27 Nov 2019

@joaomiguelvieira @AlexeyAB nice job done. I was wonder how the xnor_weight came from? Is there some way to get? I try the tiny_v3_pan3.cfg to compare the yolov3_tiny_3l.cfg, the xnor pretrain model the yolov3_tiny_3l I can get from the yolov3_tiny_xnor_last.weights, but how the the tiny_v3_pan3 pretrain model get? Is there some way how to get the xnor pretrain model?

PiseyYou on 23 Mar 2020

@PiseyYou

All v3_tiny... models use the same first several layers. So you can use the same partial pre-traied weights file for: yolov3_tiny, yolov3_tiny_3l, tiny_v3_pan3....

AlexeyAB on 23 Mar 2020

@AlexeyAB Thanks, AlexyAB, I will have a try and will feedback later.

PiseyYou on 8 Apr 2020

@AlexeyAB Hi, I had trained the yolov3-tiny_xnor and wanted to print the data of feature maps of middle layers during 1 image detection. I don't know how to modify your code?

Hugh-Chang on 17 Apr 2020

@Hugh-Chang Un-comment this part of code: https://github.com/AlexeyAB/darknet/blob/342a8d1561c19317f2d5fda0f099449b79b51716/src/network_kernels.cu#L119-L136

of add
if (i == 10) {
and
}
around this code to show only layer-10.

AlexeyAB on 17 Apr 2020

@AlexeyAB Thx. I will take a try.

Hugh-Chang on 17 Apr 2020

@AlexeyAB Hi, I tried ur modifies and it still remains some problems.
The code I modified as below.
if (i == 10) {
cuda_pull_array(l.output_gpu, l.output, l.batchl.outputs);
if (l.out_w >= 0 && l.out_h >= 1 && l.c >= 3) {
int j;
for (j = 0; j < l.out_c; ++j) {
image img = make_image(l.out_w, l.out_h, 3);
memcpy(img.data, l.output + l.out_wl.out_hj, l.out_wl.out_h * 1 * sizeof(float));
memcpy(img.data + l.out_wl.out_h * 1, l.output + l.out_wl.out_hj, l.out_wl.out_h * 1 * sizeof(float));
memcpy(img.data + l.out_wl.out_h * 2, l.output + l.out_wl.out_hj, l.out_wl.out_h * 1 * sizeof(float));
char buff[256];
sprintf(buff, "layer-%d slice-%d", i, j);
show_image(img, buff);
save_image(img, buff);
}
cvWaitKey(0); // wait press-key in console
cvDestroyAllWindows();
}
}

The bugs when I released were
identifier "cvWaitKey" is undefined darknet D:\darknet\darknet-mastersrc\network_kernels.cu 134

identifier "cvDestroyAllWindows" is undefined darknet D:\darknet\darknet-mastersrc\network_kernels.cu 135

MSB3721 命令“"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\nvcc.exe" -gencode=arch=compute_30,code=\"sm_30,compute_30\" -gencode=arch=compute_75,code=\"sm_75,compute_75\" --use-local-env -ccbin "F:\Visual Studio\VC\binx86_amd64" -x cu -IC:\opencv\build\include -IC:\opencv_3.0\opencv\build\include -I....\include -I....\3rdparty\stb\include -I....\3rdparty\pthreads\include -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\include" -I\include -I\include -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\include" --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static -DOPENCV -DCUDNN_HALF -DCUDNN -D_TIMESPEC_DEFINED -D_SCL_SECURE_NO_WARNINGS -D_CRT_SECURE_NO_WARNINGS -D_CRT_RAND_S -DGPU -DWIN32 -D_CONSOLE -D_LIB -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /Fdx64\Release\vc140.pdb /FS /Zi /MD " -o x64\Release\network_kernels.cu.obj "D:\darknet\darknet-mastersrc\network_kernels.cu"”已退出，返回代码为 1。 darknet C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\BuildCustomizations\CUDA 10.0.targets 712

Hugh-Chang on 17 Apr 2020

Was this page helpful?

0 / 5 - 0 ratings