I made this issue so we can track status and discuss the NEON implementation.
Related issue: https://github.com/cmusatyalab/openface/issues/157
I lack SIMD experience, but I'll try to do this at some point in the future if no one does it sooner.
Cool. For the record, I'm not doing it any time soon :)
@grisevg, I have made a very first version. You can check it at https://github.com/fastfastball/arm_neon_for_dlib_simd
@grisevg I've partially used @fastfastball's patch (only the simd4f/simd8f modifications) and had no problems with it. It speeds up the face detector about 3-5x without stability or visible accuracy issues.
@radioneko did you see any performance increase on the dnn detection?
The DNN doesn't use any SIMD instructions, so it won't make a difference.
Although, it's possible to create a DNN that uses HOG features as its first few layers, which is something I've been meaning to do. Then there would be a significant speedup. But that's not in dlib yet.
@radioneko how did you modify simd8f? I haven't found 256-bit SIMD support in NEON yet.
@pythonanonuser, I did not modify simd8f because, as you said, ARM does not support 256-bit SIMD. However, simd8f still gets faster, because its implementation is built on simd4f, and simd4f is accelerated by NEON.
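To illustrate why an 8-lane type benefits from a faster 4-lane type, here is a minimal sketch. The names `vec4f`, `vec8f`, `load4`, `store4`, `load8`, and `store8` are hypothetical and are not dlib's actual classes or functions; the sketch only assumes the standard NEON intrinsics `vld1q_f32`, `vst1q_f32`, and `vaddq_f32` from `<arm_neon.h>`, and falls back to scalar code when NEON is not enabled.

```cpp
#include <cstddef>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#endif

// Hypothetical 4-lane float vector (illustration only, not dlib's simd4f).
struct vec4f {
#ifdef SKETCH_HAVE_NEON
    float32x4_t v;  // one 128-bit NEON register
#else
    float v[4];     // portable scalar fallback
#endif
};

inline vec4f load4(const float* p) {
    vec4f r;
#ifdef SKETCH_HAVE_NEON
    r.v = vld1q_f32(p);
#else
    for (std::size_t i = 0; i < 4; ++i) r.v[i] = p[i];
#endif
    return r;
}

inline void store4(const vec4f& a, float* p) {
#ifdef SKETCH_HAVE_NEON
    vst1q_f32(p, a.v);
#else
    for (std::size_t i = 0; i < 4; ++i) p[i] = a.v[i];
#endif
}

inline vec4f operator+(const vec4f& a, const vec4f& b) {
    vec4f r;
#ifdef SKETCH_HAVE_NEON
    r.v = vaddq_f32(a.v, b.v);
#else
    for (std::size_t i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
#endif
    return r;
}

// Hypothetical 8-lane vector made of two 4-lane halves: every operation
// forwards to the 4-lane type, so it speeds up whenever vec4f does.
struct vec8f { vec4f lo, hi; };

inline vec8f load8(const float* p) { return { load4(p), load4(p + 4) }; }
inline void store8(const vec8f& a, float* p) { store4(a.lo, p); store4(a.hi, p + 4); }
inline vec8f operator+(const vec8f& a, const vec8f& b) { return { a.lo + b.lo, a.hi + b.hi }; }
```

With this layout, `store8(load8(a) + load8(b), out)` adds two 8-float arrays using two 128-bit operations per step on NEON hardware, without any 256-bit instructions.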
@fastfastball While thinking about the 128-bit SIMD limitation on ARM, I came across this somewhat recent development: https://community.arm.com/processors/b/blog/posts/technology-update-the-scalable-vector-extension-sve-for-the-armv8-a-architecture
ARM now has scalable vector extensions (SVE) which might be able to help with porting over the rest of the SIMD code. I am not very experienced with SIMD so I am not sure how well it would work with NEON, but figured I'd give you a heads up.
@pythonanonuser, Thanks for your info. I will take a look to see if we can improve dlib performance further for ARM platform :)
ARM SVE was recently announced and is not yet available in silicon.
For better NEON optimizations, https://github.com/google/gemmlowp, which is optimized for NEON, might be a good example.
3-5x sounds great, but looking at https://groups.google.com/forum/#!topic/cmu-openface/o6u3rCS9XLQ it appears there is likely still a big gap. Any ideas on what needs to be looked at next?
Hi @fastfastball, how do I use your code on ARM? Can you provide an example? Thanks!
@chrisluu, I am using dlib under Android 5.0 on an ARM Cortex-A53 platform. For development under Android, you can first download https://github.com/tzutalin/dlib-android. This project builds under the Android build environment. Build it first, and then replace its dlib folder with mine.
To enable NEON, you can check https://github.com/fastfastball/dlib_for_arm/issues/4 .
@fastfastball Thanks for your reply. I have tested your code on Android 5.0 on an ARM cortex-A52 platform with the NEON option turned on, but didn't get the performance you reported in fastfastball/dlib_for_arm#3. In my experiment, I reduced the original 720x720 input to 180x180, and the result is still not good enough: 70ms for face detection (yours is 70ms per 640x480 image) and 20ms for alignment. I am not sure whether something is wrong with my configuration. Do you have any idea about that?
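When comparing timings like these across devices, it helps to measure the same way everywhere. Below is a minimal sketch of a timing helper; `mean_ms_per_call` is a hypothetical name, not part of dlib, and it only relies on `std::chrono::steady_clock` from the standard library. It does a few untimed warm-up calls first and then averages over several runs.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper (not part of dlib): average wall-clock
// milliseconds per call of `fn`, after `warmup` untimed calls.
template <typename Fn>
double mean_ms_per_call(Fn&& fn, std::size_t runs, std::size_t warmup = 1) {
    for (std::size_t i = 0; i < warmup; ++i) fn();  // untimed warm-up

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < runs; ++i) fn();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    // Guard against division by zero when no timed runs were requested.
    return runs ? elapsed.count() / static_cast<double>(runs) : 0.0;
}
```

A call such as `mean_ms_per_call([&] { detector(img); }, 20)` would then time a detector on one image, assuming `detector` and `img` are objects you have already set up. Reporting the image size and whether NEON was enabled alongside the number makes results easier to compare.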
@fastfastball when will your SIMD NEON additions be merged here?
We have just merged the SIMD/NEON support. Now is a good time to test it.
I have tested on NVidia Tegra TK1 (ARM15) and it works fine for me. Can anyone try the latest dlib version from Git on other ARM processors? My measurements are in #557
@fastfastball, can you check the code (simd4f.h and simd4i.h files only)? Maybe you will have some ideas for making it better. I started with your implementation, but it failed the automatic tests, so the current dlib implementation is different.
@e-fominov are there any special build flags to set with cmake? I also use the Python bindings and build them with setup.py; is there anything I need to set there?
@Climax777, yes, you need to use the "-mfpu=neon" compiler flag, but not every ARM processor supports it. We have seen that the Raspberry Pi ver.1 does not support NEON.
@e-fominov, great work! It definitely speeds things up on the RPI3 (+8s to ~1.4s). Unfortunately, that is still not enough for real-time use for me. I'm sticking with Viola-Jones detection for now.
@Climax777 Please let us know about your results with Viola-Jones or other methods. I'm using an RPI3 Model B too and the results are far from good.
@BarbaL I'm using opencv for the viola-jones facial detection.
There are NEON instructions in the main dlib source now. So if you get the latest source from github and enable NEON with the compiler flag "-mfpu=neon" you will be good to go.
Is there a CMake option to turn on NEON now? Also, is there a way to have some of the speedups for facial landmark detection and face detection committed as well? NEON alone isn't fast enough at this point.
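A quick way to confirm that the flag actually reached the compiler is to check the predefined NEON macros. This is a minimal sketch: `built_with_neon` and `neon_status` are hypothetical helper names, and the only assumption is that GCC and Clang define `__ARM_NEON` (or the older `__ARM_NEON__`) when NEON code generation is enabled. I have not checked here which of the two macros dlib's own headers test.

```cpp
#include <string>

// True when this translation unit was compiled with NEON enabled.
// Hypothetical helper for illustration; not part of dlib.
constexpr bool built_with_neon() {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return true;
#else
    return false;
#endif
}

// Human-readable status string, e.g. for printing at program start.
inline std::string neon_status() {
    return built_with_neon() ? "NEON enabled" : "NEON not enabled";
}
```

Printing `neon_status()` from a small test program built with the same flags as dlib shows whether the switch took effect. On 64-bit ARM (AArch64) toolchains the macro is normally defined without any flag, since Advanced SIMD is part of the baseline there, and GCC for AArch64 does not accept `-mfpu`.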
It's already committed. That's what I'm saying. You don't need any special cmake option, you just use the compiler switch I mentioned.
Great, appreciate it Davis!
I don't know. You type it on the command line with a keyboard? :)
You have to have some idea how to program and use a compiler to use dlib. So before you get into dlib you really need to learn to program.
Oh, forgot you are using python. You can pass options to cmake via setup.py. I forget exactly what you do and it's hard to look up on my phone right now. Might be with --
Looks like you need to edit setup.py to add it. Or use cmake directly to compile dlib which is what I would do. Then you can use ccmake or cmake-gui to set the compiler options to whatever you want.
Hello
I am trying to build dlib on Raspberry Pi 3. Will this command
python3 setup.py install --compiler-flags "-mpfu=neon" --yes USE_AVX_INSTRUCTIONS --no DLIB_USE_CUDA
pass the correct flags to dlib so that neon support is enabled?
Thank you!
No, don't pass any AVX or CUDA switches since the Raspberry Pi doesn't have AVX instructions or CUDA support.
Hello
Thank you @davisking for the reply!
I tried to install the latest master of dlib directly on the Raspberry Pi with the following command
sudo python3 setup.py install --compiler-flags "-mpfu=neon"
Somewhere in the error log I am seeing:
c++: error: unrecognized command line option '-mpfu=neon'
What am I doing wrong?
Here is the complete stacktrace:
running install
Checking .pth file support in /usr/local/lib/python3.4/dist-packages/
/usr/bin/python3 -E -c pass
TEST PASSED: /usr/local/lib/python3.4/dist-packages/ appears to support .pth files
running bdist_egg
running build
Detected Python architecture: 32bit
Detected platform: linux
Removing build directory /home/pi/workspace/dlib/./tools/python/build
Configuring cmake ...
-- The C compiler identification is GNU 4.9.2
-- The CXX compiler identification is unknown
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- broken
CMake Error at /usr/share/cmake-3.6/Modules/CMakeTestCXXCompiler.cmake:54 (message):
The C++ compiler "/usr/bin/c++" is not able to compile a simple test
program.
It fails with the following output:
Change Dir: /home/pi/workspace/dlib/tools/python/build/CMakeFiles/CMakeTmp
Run Build Command:"/usr/bin/make" "cmTC_ad5da/fast"
/usr/bin/make -f CMakeFiles/cmTC_ad5da.dir/build.make
CMakeFiles/cmTC_ad5da.dir/build
make[1]: Entering directory
'/home/pi/workspace/dlib/tools/python/build/CMakeFiles/CMakeTmp'
Building CXX object CMakeFiles/cmTC_ad5da.dir/testCXXCompiler.cxx.o
/usr/bin/c++ -mpfu=neon -o CMakeFiles/cmTC_ad5da.dir/testCXXCompiler.cxx.o
-c
/home/pi/workspace/dlib/tools/python/build/CMakeFiles/CMakeTmp/testCXXCompiler.cxx
c++: error: unrecognized command line option '-mpfu=neon'
CMakeFiles/cmTC_ad5da.dir/build.make:65: recipe for target
'CMakeFiles/cmTC_ad5da.dir/testCXXCompiler.cxx.o' failed
make[1]: *** [CMakeFiles/cmTC_ad5da.dir/testCXXCompiler.cxx.o] Error 1
make[1]: Leaving directory
'/home/pi/workspace/dlib/tools/python/build/CMakeFiles/CMakeTmp'
Makefile:126: recipe for target 'cmTC_ad5da/fast' failed
make: *** [cmTC_ad5da/fast] Error 2
CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
CMakeLists.txt
-- Configuring incomplete, errors occurred!
See also "/home/pi/workspace/dlib/tools/python/build/CMakeFiles/CMakeOutput.log".
See also "/home/pi/workspace/dlib/tools/python/build/CMakeFiles/CMakeError.log".
error: cmake configuration failed!
It's mfpu. You've mistyped it.
Damn! Thank you @Climax777 !! Now it works :-)
@radioneko which files did you modify? Is there any difference from the newest dlib?
Hi @davisking @Climax777
I'm sorry if this annoys you but I've run
sudo python3 setup.py install --compiler-flags "-mfpu=neon"
and that error still occurs:
cc: error: unrecognized command line option ‘-mfpu=neon’
It was run on Jetson Nano, Ubuntu 18.04, CMake 3.10.2
I've spent a whole day on it. Please help me with this issue.