Add optimized support for POWER8 VSX SIMD instructions on PPC64LE Linux to dlib/simd.
$$$$ Financial bounties available. Any reasonable suggested value will be seriously considered.
I welcome contact / replies from developers in the dlib community who are interested in working on this project.
@davisking / maintainers:
There are a few ways to accomplish this. The best would be to use gcc's vector code and let the compiler handle the job. This would work out of the box for at least 4 different base architectures. It would not work with Visual Studio, however, whereas the current SSE intrinsics do.
I could:
I would suggest using separate headers and the gcc vectors as it will have no impact on existing code, and any new gcc with supported targets will see a nice boost.
Please let me know what you prefer.
I don't see how putting the code in cpp files could work out well since inlining is very important for the speed of these functions. So a header only option is a must. Separate headers for different architectures (for the code within the dlib/simd) folder is probably the cleanest way to go about it.
Also, the most important benchmark for speed improvements is probably the fhog based object detector in dlib. So that's what you should test to see if you are getting performance improvements.
Anyway, I would welcome such a patch. You will have to talk to @edelsohn to find out the $$$$ details :)
Thanks for the quick response,
I will go ahead with the inlined headers then,
and I'll make sure to benchmark against the existing code with the fhog object detector you mentioned.
Cool, no problem. Do talk to @edelsohn about the price you want though. He is the one offering the bounty :)
I agree that inlining is important and the functions must be in headers, but I don't understand the objection to using GCC/LLVM vector operations. In fact, SSE/AVX intrinsics (mmintrin.h) are implemented as vector operations, not direct SSE/AVX operations.
I don't think there was any objection to it?
Unless I missed it
Then why are "Separate headers for different architectures (for the code within the dlib/simd folder)" necessary? The vector code isn't architecture-specific.
The existing code supports Microsoft's compiler, and there is a plain C++ version.
So basically I'm going to leave that code as-is and use the preprocessor to ensure that builds which support vector ops use the new version. This ensures pre-existing applications will keep working properly.
I'm fine with anything so long as existing use cases (e.g. visual studio, arch linux, whatever) aren't disrupted. The decision about being in one header or two comes down to what leads to the most readable code. If there is a simple way to do it in one header that's fine, but if something about the structure is difficult to do in a readable way with one header then two are fine as well. I'm not familiar with the GCC/LLVM vector ops so it's not obvious to me which is clearer.
Another consideration is the ARM NEON version of the simd code @fastfastball made (see https://github.com/davisking/dlib/issues/276 and https://github.com/fastfastball/arm_neon_for_dlib_simd). It hasn't been merged into the dlib main codebase yet so it would be nice if whatever changes are made here don't create excessive difficulties when eventually the NEON code is merged into dlib.
I will take a look at the neon code.
@dmiller423 & @davisking, I think the way dmiller423 plans to implement the vector code will not impact the ARM NEON code. I'm now curious to see whether the vector code can outperform the NEON code on ARM.
I also have 2 suggestions after reading your discussion:
1) Test code for simd: I mean tests that verify correctness (at least) and performance. I planned to write these when implementing the NEON code for dlib, but I didn't have enough time to complete them. So if dmiller423 can do this, it will benefit the maintenance of dlib's simd code in the future :)
2) LLVM vs. GCC: I don't know which compiler will turn the vector code into better-performing binaries. It would be nice if dmiller423 could test this and show the results :)
It should be noted the vector code and the intrinsics work much the same, actually I did a very quick check (on power8) and came out with the exact same code. On a larger example I'd expect the compiler may have better optimization opportunities, either way there really won't be much difference. I'm doing it this way because it makes the most sense, as it's the most compatible and the way gcc is moving forward with simd code across multiple architectures.
https://github.com/davisking/dlib/pull/414
Initial gcc vector code is up and running, tested on x86_64 and Power8/64.
I also have a vec_ (power8 only) optimized version as well as a single header version using templates, but it became a royal pain to debug.
Super. Let me know when the PR is ready to review for merge into dlib.
Also, there are now two active pull requests for this issue. Not sure how you guys ( @dmiller423 @edelsohn @barkovv) want to handle this. It probably only makes sense to merge one. In any event, let me know when one of you have a PR that you think is ready and I'll look it over.
Also, since @edelsohn is the requestor, maybe he will want some specific performance tests to decide which implementation is best? I imagine the speed of the fhog object detector is ultimately the metric of interest, but I don't want to assume.
My criteria is ideal speedup, not absolute performance.
How does the x86-64 performance using GCC/LLVM vector operations compare with the original, hand-written x86-64 intrinsics? Is the x86-64 vector code equivalent in performance?
Is the SIMD speedup on POWER8 VSX vs scalar equivalent to x86-64 AVX vs scalar for the same SIMD width?
The pull request is review ready.
I have made a few performance tests on x64 and power8 against scalar code.
I am going to write up a bit of info after doing some more and comparing.
I will post the results when i'm finished.
Sounds good. I'll review it once you post that info. I did notice the code's tabbing is messed up though (https://github.com/davisking/dlib/pull/414/files) since it's a mixture of spaces and tabs. Can you use spaces and make sure the tabbing matches the rest of the dlib code?
@dmiller423 does your code pass the "hash" test?
I checked the code of the "hash" test and ran it with and without VSX optimizations. It seems that VSX doesn't affect this problem, so I created new issue #415 for it.
@davisking
If the pull request from @dmiller423 passes the "hash" test, I guess you should accept his code instead of mine.
PS
Could you recommend a good profiler for ppc64? My pull request is ready, but I can't measure the benefit of the VSX code compared to the original. Thanks
operf or perf are the common profiling solutions on ppc64 linux.
@dmiller423 , @edelsohn
Never mind my results post. I made a mistake in the configuration and was comparing non-optimized against non-optimized :D . Now that I've turned optimization on, I get many compilation errors. Sorry for my silly mistake.
I have results, unfortunately they are Power/VSX only.
I will investigate auto-vectorization further at some point,
the compiler just isn't mixing all of the inlined code as I would have hoped.
It actually has similar results to the C code (and ignores some of the inlining since it's not forced)
Here are final results with vector intrinsics:
[perf report screenshots not preserved]
As you can see the vsx variant (bottom) is fully inlined and comparable with the speed of AVX, which is very surprising. I would check it against SSE w.o AVX but there is really no need at this point is there?
Sorry about the symbol mangling, I can't seem to get perf to unmangle properly on this platform and I haven't taken the time to debug it further.
I don't really understand what those images are showing me. The fraction of time spent in each function? I'm not sure how that tells us what is faster. Maybe the whole program is 10x slower with some particular optimization but spends overall some % less in the functions we are interested in. But that doesn't mean it's faster. The relevant metrics are milliseconds to execute the face detector, fhog extraction, etc. with and without the optimizations.
Well this is the face detector and it waits for key input, I will stop it from waiting and run a simple perf stats to show you. However what it does show is that the cpu time spent while running is redistributed away from the simd8 code. I'll run some time tests now.
Yes, those are useful diagnostics to understand what's happening. However,
no users care about the relative spread of CPU time over different parts of
dlib code :) For the PR to be accepted what matters is seeing a reduction
in execution time (according to the wall clock).
Well the speed is then relative, here's the proof:
ubuntu@ubuntu-16:~/dlib-gcc7-none$ sudo perf stat ./face_detection_ex ./faces/2007_007763.jpg
processing image ./faces/2007_007763.jpg
Number of faces detected: 7
Performance counter stats for './face_detection_ex ./faces/2007_007763.jpg':
811.869408 task-clock (msec) # 0.999 CPUs utilized
5 context-switches # 0.006 K/sec
0 cpu-migrations # 0.000 K/sec
390 page-faults # 0.480 K/sec
2,991,381,796 cycles # 3.685 GHz (66.50%)
48,076,000 stalled-cycles-frontend # 1.61% frontend cycles idle (49.75%)
1,872,272,383 stalled-cycles-backend # 62.59% backend cycles idle (50.21%)
2,857,961,772 instructions # 0.96 insn per cycle
# 0.66 stalled cycles per insn (67.31%)
341,971,733 branches # 421.215 M/sec (50.28%)
18,372,995 branch-misses # 5.37% of all branches (49.81%)
0.812286628 seconds time elapsed
ubuntu@ubuntu-16:~/dlib-gcc7-vsx$ sudo perf stat ./face_detection_ex ./faces/2007_007763.jpg
processing image ./faces/2007_007763.jpg
Number of faces detected: 7
Performance counter stats for './face_detection_ex ./faces/2007_007763.jpg':
408.551834 task-clock (msec) # 0.999 CPUs utilized
1 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
390 page-faults # 0.955 K/sec
1,505,111,030 cycles # 3.684 GHz (66.71%)
28,788,911 stalled-cycles-frontend # 1.91% frontend cycles idle (50.08%)
880,596,128 stalled-cycles-backend # 58.51% backend cycles idle (50.08%)
1,766,269,773 instructions # 1.17 insn per cycle
# 0.50 stalled cycles per insn (66.72%)
186,892,769 branches # 457.452 M/sec (50.05%)
9,504,368 branch-misses # 5.09% of all branches (50.05%)
0.408893233 seconds time elapsed
I suppose I should have just pasted the elapsed time, but it's about 2x faster in wall-clock terms.
What is the relative speedup for x86-64 SIMD of the same SIMD width?
fhog_object_detector ./faces (entire dir)
gcc7-none: 22.372478922 seconds time elapsed
gcc7-vsx: 14.357527564 seconds time elapsed
@edelsohn when I ran valgrind on it with AVX (256b) simd (I meant to do SSE only but forgot to disable AVX) it was around the same (2x). I'm sure AVX wins in the long run for 256b operations, but VSX can certainly hold its own against it here. I can test again for SSE if you like?
AVX (not AVX512) is a good comparison. This testcase runs as double precision floating point?
No, single precision floating point and integer.
AVX is 128 bit. AVX2 is 256 bit. AVX512 is 512 bit.
Single precision floating point and integer are 32 bit, and 4 should pack into VSX and AVX SIMD. I would have expected closer to 4x improvement for both PPC64 and x86-64, but the results are the same for both.
If the speedup is equivalent for both 128 bit SIMD architectures, I'm satisfied.
There is a great deal of overhead. To achieve performance equivalent to that, you have to map all inputs out ahead of time, set them all in SIMD regs, and handle the entire job at once. Basically you have to treat it as a coprocessor, since moves between VRs and GPRs cause pipeline stalls and have high latency.
Don't time the whole program, just the part that does face detection. You don't need to be timing jpeg decoding for instance.
Although I'm not complaining, it looks like it's a lot faster :)
ubuntu@ubuntu-16:~/dlib-gcc7-none$ ./face_detection_ex ./faces/2007_007763.jpg
processing image ./faces/2007_007763.jpg
detector time: 0.575958 second
Number of faces detected: 7
ubuntu@ubuntu-16:~/dlib-gcc7-vsx$ ./face_detection_ex ./faces/2007_007763.jpg
processing image ./faces/2007_007763.jpg
detector time: 0.166241 second
Number of faces detected: 7
strictly on detector speed, it is nearly 3.5x as fast
Awesome :)
@edelsohn ok to close this out?
I'm satisfied. I would expect @davisking to close the issue.
I'm good.
@davisking thanks again for the prompt replies/review
No problem :)
I am trying to compile the face_detection_ex.cpp example on a Power system but GCC complains a lot about the simd.h file. I am using GCC 5.4 on Ubuntu 16.04. To get the code compiled I have to disable the SIMD optimizations but this is not ideal.
/tmp/dlib-19.13/examples# g++ -std=c++11 -O3 -I.. -lpthread -lX11 face_detection_ex.cpp
In file included from ../dlib/image_processing/../image_processing/../image_transforms/../simd/simd4f.h:6:0,
from ../dlib/image_processing/../image_processing/../image_transforms/../simd.h:6,
from ../dlib/image_processing/../image_processing/../image_transforms/spatial_filtering.h:13,
from ../dlib/image_processing/../image_processing/../image_transforms.h:14,
from ../dlib/image_processing/../image_processing/scan_fhog_pyramid.h:8,
from ../dlib/image_processing/frontal_face_detector.h:8,
from face_detection_ex.cpp:40:
../dlib/image_processing/../image_processing/../image_transforms/../simd/simd_check.h: In function '__vector(4) __bool int cpu_has_sse2_instructions()':
../dlib/image_processing/../image_processing/../image_transforms/../simd/simd_check.h:145:79: error: cannot convert 'bool' to '__vector(4) __bool int' in return
inline bool cpu_has_sse2_instructions() { return 0!=(cpuid(1)[3]&(1<<26)); }
^
../dlib/image_processing/../image_processing/../image_transforms/../simd/simd_check.h: In function '__vector(4) __bool int cpu_has_sse3_instructions()':
../dlib/image_processing/../image_processing/../image_transforms/../simd/simd_check.h:146:78: error: cannot convert 'bool' to '__vector(4) __bool int' in return
inline bool cpu_has_sse3_instructions() { return 0!=(cpuid(1)[2]&(1<<0)); }
^
../dlib/image_processing/../image_processing/../image_transforms/../simd/simd_check.h: In function '__vector(4) __bool int cpu_has_sse41_instructions()':
../dlib/image_processing/../image_processing/../image_transforms/../simd/simd_check.h:147:79: error: cannot convert 'bool' to '__vector(4) __bool int' in return
inline bool cpu_has_sse41_instructions() { return 0!=(cpuid(1)[2]&(1<<19)); }
^
../dlib/image_processing/../image_processing/../image_transforms/../simd/simd_check.h: In function '__vector(4) __bool int cpu_has_sse42_instructions()':
../dlib/image_processing/../image_processing/../image_transforms/../simd/simd_check.h:148:79: error: cannot convert 'bool' to '__vector(4) __bool int' in return
inline bool cpu_has_sse42_instructions() { return 0!=(cpuid(1)[2]&(1<<20)); }
^
../dlib/image_processing/../image_processing/../image_transforms/../simd/simd_check.h: In function '__vector(4) __bool int cpu_has_avx_instructions()':
../dlib/image_processing/../image_processing/../image_transforms/../simd/simd_check.h:149:79: error: cannot convert 'bool' to '__vector(4) __bool int' in return
inline bool cpu_has_avx_instructions() { return 0!=(cpuid(1)[2]&(1<<28)); }
I don't see how you can get that error since cpuid() returns a std::array
not __vector. Unless cpuid() is some magic function on your system and
just needs to be renamed to avoid a name clash.
GCC 5.4 is extremely outdated. Why is the code testing for x86 features on Power?
Look at the code, I don't think it is.
I tried with GCC 7.3.1 from Advance-Toolchain 11 but I have the exact same issue.
I also tried renaming cpuid to cpuid2 in dlib/simd/simd_check.h, but that didn't solve the issue.
Is the include dlib/simd/simd_check.h really needed when compiling on Power as it seems to mainly check for x86 cpu features ?
It's not related to powerpc. But the error you are getting doesn't make sense. It's saying that this code:
std::array<unsigned int,4> cpuid(int) { return std::array<unsigned int,4>{}; }
inline bool cpu_has_avx_instructions() { return 0!=(cpuid(1)[2]&(1<<28)); }
is the issue. But that code is fine. Maybe your compiler is just broken. Right, read the error:
error: cannot convert 'bool' to '__vector(4) __bool int' in return
inline bool cpu_has_avx_instructions() { return 0!=(cpuid(1)[2]&(1<<28)); }
That makes no sense since that function doesn't return a vector of bools. It's just returning a bool.
Anyway, maybe there is a workaround. You need to find some minimal piece of code that generates the error and then you will be able to find some workaround based on that. But the underlying issue seems to be some weird compiler bug.
Sorry I've been away. The problem looks like one of two things. Either the compiler is not defaulting to the correct cpu arch,
IE: you're using a generic powerpc gcc and not one built and tuned for power8 w/ vsx (most likely).
Or else something has broken the simd_check header's detection; changes elsewhere can do this, and simd then defaults to SSE because of how dlib simd was originally written.
you can check the preprocessor definitions of default gcc and see if VSX is defined
( which you'll find is necessary in simd_check.h , iirc gcc -dM -E - | grep VSX ought to do it )
Shouldn't take more than 2mins to figure out what the problem is, and changing the whole structure of simd handling or adding extra check scripts to build wasn't really a great option :|
Also note: the __vector(N) __bool types and similar are how powerpc simd internals are referenced by the frontend, while i don't agree with it : it's what we have.. It can be a pain in some cases when mixing with C++ but is not the problem you're having now.
If the default cpu arch/tune are the problem as I suspect you can just add -mcpu -mtune to the build and default to generic power8le.
There is a way to default the whole toolchain w.o rebuilding with a script, It involves digging into dumping the toolchains store and editing and replacing as a script file somewhere in the toolchain fs root : just FYI
@dmiller423 Did you look at the cpuid() code in simd_check.h? I don't see how this can be related to powerpc. The bit of code generating the error is distinct from any of the simd code in dlib, it just happens to be located in the simd_check.h header.
@davisking sure, it's dependent on either supported x86 toolchain or else it just returns an uint32[4] so that extra ifdef's don't have to be used elsewhere to make the project link. It's completely harmless... the point of the simd_check.h is to filter the required definitions for SSE, AVX, VSX etc w.o having to do more complicated build checks... The flipside to this is cases where VSX is not defined, maybe a specific check might be in order
#if defined(__powerpc) && !defined(__VSX__)
#error PowerPC w.o VSX detected check flags
#endif
ppc64le compilers built w.o any specific target end up being defaulted to 800 series hardware that has altivec(vmx) but no vsx and it's more of a pain than dealing w. x86 which is generally tuned to at least p5 w. basic sse and most compilers have runtime checks or cpuid is tested at runtime to determine feature set.
VSX can be tested for on ppc but it differs by OS if you want to do so from user mode, and can differ by arch some and would require root which is obv. not great. If you tell the compiler to just build with the flags it's just as much of a mess... So I of course chose the simplest path, tho perhaps the warning/error is in order?
I still don't follow. The error seems to come from this code:
inline std::array<unsigned int,4> cpuid(int)
{
return std::array<unsigned int,4>{};
}
inline bool cpu_has_sse2_instructions() { return 0!=(cpuid(1)[3]&(1<<26)); }
That code doesn't have anything to do with any kind of SIMD or hardware specific features whatsoever.
I'd have to look a lot more thoroughly, I believe the only reason that is in the code is it's used outside of cpu_has_* functions somewhere as well.. If not the best thing to do would be to cut the cpuid() out of the #else it's in , and then add another wrapping all of the cpu_has_*() instead so it's only built on X86 && X64 ... since there is simd code for ARM and PPC.
or you could change them to macro's or force the inlining since they should only be compiled if DLIB_HAVE_{SSE,AVX} are enabled below ... which is the (one) reason they haven't been a problem in the past: they should never be called on non x86 based architectures anyhow.
cpuid() isn't called outside that code snippet.
I suspect this is some kind of error related to bad support for std::array. But who knows. @geoffrey-pascal seems inactive.
I tried to run gcc -dM -E - | grep VSX but it doesn't return anything. I am using the gcc compiler from Ubuntu 16.04.
$ gcc -x c -E /dev/null -g3 -o -
shows that GCC defines
__VSX__
__POWER8_VECTOR__
GCC seems ok :
$ gcc -x c -E /dev/null -g3 -o - | grep VSX
#define __VSX__ 1
$ gcc -x c -E /dev/null -g3 -o - | grep VECTOR
#define __POWER8_VECTOR__ 1
$gcc -x c -E /dev/null -g3 -o - | grep ALTIVEC
#define __ALTIVEC__ 1
#define __APPLE_ALTIVEC__ 1
Sorry wrong format for preproc dump apparently, was on mobile at the time.
I'm on VM now, give me a min and i'll see if I can reproduce your problem.
@geoffrey-pascal I don't have X on my VM atm, but all of the non UI examples and the library build properly...
You aren't using any of the -DUSE_{AVX,SSE}_INSTRUCTIONS=1; during build by any chance?
There is very little reason that one example should build and another fail, especially another detection example and i've tested those.
I have of course tested with UI support when I added the code as well... please let me know.
It's easy to understand if you pasted the -DUSE_AVX_INSTRUCTIONS=1; like it says in the readme.
matched gcc in version # at least if not all build specs:
gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
@dmiller423 I am using the command below for the build. I am not using -DUSE_AVX_INSTRUCTIONS=1
g++ -std=c++11 -O3 -I.. -lpthread -lX11 face_detection_ex.cpp
I forgot to mention before that I am working on a Power 9 system
What if you put this in a test.cpp file and try to compile it?
#include <array>
inline std::array<unsigned int,4> cpuid(int)
{
return std::array<unsigned int,4>{};
}
inline bool cpu_has_sse2_instructions() { return 0!=(cpuid(1)[3]&(1<<26)); }
int main() { cpu_has_sse2_instructions(); }
Does it compile?
I tried with GCC 4.8 and 5.4 and it compiles for both versions.
What if you just #include simd_check.h?
You should try and find a minimal example that causes the error. For some reason that code works in that test.cpp but not inside simd_check.h. Maybe there is a #define that interferes. Maybe cpuid is #defined to something and it's imported by some header.
When I include simd_check.h I have the error
$ g++ -std=c++11 -O3 test.cpp
In file included from test.cpp:2:0:
simd/simd_check.h: In function '__vector(4) __bool int cpu_has_sse2_instructions()':
simd/simd_check.h:145:80: error: cannot convert 'bool' to '__vector(4) __bool int' in return
inline bool cpu_has_sse2_instructions() { return 0!=(cpuid2(1)[3]&(1<<26)); }
I tried to rename cpuid to cpuid2 but it produces the same error. I also tried to grep cpuid inside the sources to check if it's defined somewhere but I haven't found anything
Ok, so now start deleting stuff from simd_check.h and see what makes the
error go away. You can figure it out :)
Well the error will be a bit tricky to find that way, it's most likely a conversion error (cast) from the cast operator and this constructor ::
inline simd4i(const rawarray& a) : x{a.v} { }
unless the compiler is trying to auto vector it and is extremely broken.
try commenting that out and see if the error changes...
Either way, the options I posted above for eliminating the cpuid check on powerpc have to work, since the check is neither needed nor useful there and does nothing at this point.
Quick impl and tested here on examples:
// ----------------------------------------------------------------------------------------
// Define functions to check, at runtime, what instructions are available
#include <intrin.h>
inline std::array<unsigned int,4> cpuid(int function_id)
{
std::array<unsigned int,4> info;
// Load EAX, EBX, ECX, EDX into info
__cpuid((int*)info.data(), function_id);
return info;
}
#include <cpuid.h>
inline std::array<unsigned int,4> cpuid(int function_id)
{
std::array<unsigned int,4> info;
// Load EAX, EBX, ECX, EDX into info
__cpuid(function_id, info[0], info[1], info[2], info[3]);
return info;
}
inline bool cpu_has_sse2_instructions() { return 0!=(cpuid(1)[3]&(1<<26)); }
inline bool cpu_has_sse3_instructions() { return 0!=(cpuid(1)[2]&(1<<0)); }
inline bool cpu_has_sse41_instructions() { return 0!=(cpuid(1)[2]&(1<<19)); }
inline bool cpu_has_sse42_instructions() { return 0!=(cpuid(1)[2]&(1<<20)); }
inline bool cpu_has_avx_instructions() { return 0!=(cpuid(1)[2]&(1<<28)); }
inline bool cpu_has_avx2_instructions() { return 0!=(cpuid(7)[1]&(1<<5)); }
inline bool cpu_has_avx512_instructions() { return 0!=(cpuid(7)[1]&(1<<16)); }
inline void warn_about_unavailable_but_used_cpu_instructions()
{
if (!cpu_has_avx2_instructions())
std::cerr << "Dlib was compiled to use AVX2 instructions, but these aren't available on your machine." << std::endl;
if (!cpu_has_avx_instructions())
std::cerr << "Dlib was compiled to use AVX instructions, but these aren't available on your machine." << std::endl;
if (!cpu_has_sse41_instructions())
std::cerr << "Dlib was compiled to use SSE41 instructions, but these aren't available on your machine." << std::endl;
if (!cpu_has_sse3_instructions())
std::cerr << "Dlib was compiled to use SSE3 instructions, but these aren't available on your machine." << std::endl;
if (!cpu_has_sse2_instructions())
std::cerr << "Dlib was compiled to use SSE2 instructions, but these aren't available on your machine." << std::endl;
}
I don't think we are all looking at the same thing. @geoffrey-pascal just said that this program causes the error as well:
#include <array>
#include <dlib/simd/simd_check.h>
inline std::array<unsigned int,4> cpuid(int)
{
return std::array<unsigned int,4>{};
}
inline bool cpu_has_sse2_instructions() { return 0!=(cpuid(1)[3]&(1<<26)); }
int main() { cpu_has_sse2_instructions(); }
However, simd_check.h doesn't include anything other than iostream and array. So that program doesn't include simd4i or any other simd code. So how can the simd code have anything to do with the error?
@davisking It's a good question, if it's never actually used in conjunction with the simd class it _shouldn't_ : however I see no other possibility since it compiles fine _without_ the simd headers... Unless including the altivec headers and/or code using vsx intrinsics causes auto vectorization (and a bug there).
So i'm going back to the original use and fix for such: since he wants to use it for dlib: the fix is simple, it removes the function completely as it's not needed.
Note: like the comment suggests, the ifdef checking for vsx is a bit backwards, it's tech not required for ARM either or anything non x86 based.. for now though it was a quick solution with little change / KISS
@davisking I have reported numerous bugs in powerpc gcc < version 8, and I believe this is quite likely another, though I'm unable to reproduce it. If he was using a public VM or it was easy to reproduce I'd report it to Bill Seurer @IBM ( @edelsohn if you want to, it's up to you; I'm not sure what he can do w.o a clear means of reproduction. Maybe @geoffrey-pascal wants to share his VM or the full details of his configuration. Either way gcc 8, after many reports and a big overhaul, is much better and may not have the issue at all. )
That's all fine, but any code change needs to be verified to fix the issue. I don't want to merge any PRs that are just guesses at what might address the problem.
I wasn't suggesting a pull req, since I can't reproduce the only real resolution is to find something that works through the issue he's having. If anyone else has an issue I would happily look into it, seems isolated as I know several people have used it on power8 VMs.
It worked on P8 like a breeze, so I don't understand why GCC is complaining on P9. What do you suggest? Try with the latest GCC?
I've mentioned numerous means of definite fixes, and some more likely to be a better long term solution.
Since I don't have access to the machine or these issues myself, that's really all I can do.