@davisking / maintainers:

There are a few ways to accomplish this. The best would be to use GCC's vector extensions and let the compiler handle the job. This would work out of the box for at least 4 different base architectures. It would not work with Visual Studio, however, whereas the current SSE intrinsics do.

I could:

1) Add VSX intrinsics with further preprocessor checks, similar to the SSE code there now.
2) Split it into separate headers and implement the classes based on availability. There are many options here, with one set of headers containing vector code or intrinsics.
3) Use the header for definitions only and put the implementation in cpp files that are built based on build-time checks (runtime checks could be used, but are unnecessary if the compiler is already building a binary with support).

I would suggest using separate headers and the GCC vectors, as this will have no impact on existing code, and any new GCC with a supported target will see a nice boost.

Please let me know what you prefer.

dmiller423 on 24 Jan 2017
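
For readers unfamiliar with the "gcc vector code" proposed above, here is a minimal sketch of GCC/Clang vector extensions. The names `example_v4f`, `example_madd` and `example_hsum` are hypothetical and are not dlib's simd4f API; the sketch assumes a GCC- or Clang-compatible compiler and will not build with Visual Studio.

```cpp
// Sketch only: GCC/Clang vector extensions, not dlib's actual simd classes.
// Four packed floats (16 bytes). The compiler picks the target's SIMD
// instructions (SSE, VSX, NEON, ...) or falls back to scalar code.
typedef float example_v4f __attribute__((vector_size(16)));

static_assert(sizeof(example_v4f) == 16, "expected a 128-bit vector type");

// Element-wise a*b + c, written with ordinary operators.
inline example_v4f example_madd(example_v4f a, example_v4f b, example_v4f c)
{
    return a * b + c;
}

// Sum the four lanes using subscript access on the vector.
inline float example_hsum(example_v4f v)
{
    return v[0] + v[1] + v[2] + v[3];
}
```

Because the same source compiles on every supported target, no per-architecture intrinsics are needed for simple arithmetic like this.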

I don't see how putting the code in cpp files could work out well, since inlining is very important for the speed of these functions. So a header-only option is a must. Separate headers for different architectures (for the code within the dlib/simd folder) is probably the cleanest way to go about it.

Also, the most important benchmark for speed improvements is probably the fhog based object detector in dlib. So that's what you should test to see if you are getting performance improvements.

Anyway, I would welcome such a patch. You will have to talk to @edelsohn to find out the $$$$ details :)

davisking on 24 Jan 2017

Thanks for the quick response,

I will go ahead with the inlined headers then,
and I'll make sure to benchmark against the existing code with the fhog object detector you mentioned.

dmiller423 on 24 Jan 2017

Cool, no problem. Do talk to @edelsohn about the price you want though. He is the one offering the bounty :)

davisking on 24 Jan 2017

I agree that inlining is important and the functions must be in headers, but I don't understand the objection to using GCC/LLVM vector operations. In fact, SSE/AVX intrinsics (mmintrin.h) are implemented as vector operations, not direct SSE/AVX operations.

edelsohn on 25 Jan 2017

I don't think there was any objection to it?
Unless I missed it

dmiller423 on 25 Jan 2017

Then why are "Separate headers for different architectures (for the code within the dlib/simd folder)" necessary? The vector code isn't architecture-specific.

edelsohn on 25 Jan 2017

They support Microsoft's compiler, and there is a plain C++ version.
So basically I'm going to leave that code as-is, and use the preprocessor to ensure builds that support vector ops use the new version. This ensures pre-existing applications will work properly.

dmiller423 on 25 Jan 2017
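
The preprocessor selection described above could look roughly like the following sketch. The macro `EXAMPLE_HAVE_VECTOR_EXT` and the type `example_simd4f` are hypothetical names chosen for illustration; dlib's real headers are organised differently and also keep the existing SSE intrinsics path, which is omitted here.

```cpp
// Sketch only: choose a vector-extension implementation when the compiler
// supports it, otherwise keep a plain C++ fallback with the same interface.
#if (defined(__GNUC__) || defined(__clang__)) && !defined(_MSC_VER)
    #define EXAMPLE_HAVE_VECTOR_EXT 1   // hypothetical macro, not part of dlib
#endif

#ifdef EXAMPLE_HAVE_VECTOR_EXT

typedef float example_raw4f __attribute__((vector_size(16)));

struct example_simd4f
{
    example_raw4f x;
    example_simd4f(float a, float b, float c, float d)
    {
        example_raw4f t = {a, b, c, d};
        x = t;
    }
    float operator[](int i) const { return x[i]; }
};

inline example_simd4f operator+(const example_simd4f& a, const example_simd4f& b)
{
    example_simd4f r(0.0f, 0.0f, 0.0f, 0.0f);
    r.x = a.x + b.x;   // one vector add
    return r;
}

#else

// Plain C++ version for compilers without vector extensions (e.g. MSVC).
struct example_simd4f
{
    float x[4];
    example_simd4f(float a, float b, float c, float d)
    {
        x[0] = a; x[1] = b; x[2] = c; x[3] = d;
    }
    float operator[](int i) const { return x[i]; }
};

inline example_simd4f operator+(const example_simd4f& a, const example_simd4f& b)
{
    return example_simd4f(a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]);
}

#endif
```

Calling code uses `example_simd4f` the same way in both branches, which is what keeps existing applications unaffected.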

I'm fine with anything so long as existing use cases (e.g. visual studio, arch linux, whatever) aren't disrupted. The decision about being in one header or two comes down to what leads to the most readable code. If there is a simple way to do it in one header that's fine, but if something about the structure is difficult to do in a readable way with one header then two are fine as well. I'm not familiar with the GCC/LLVM vector ops so it's not obvious to me which is clearer.

Another consideration is the ARM NEON version of the simd code @fastfastball made (see https://github.com/davisking/dlib/issues/276 and https://github.com/fastfastball/arm_neon_for_dlib_simd). It hasn't been merged into the dlib main codebase yet so it would be nice if whatever changes are made here don't create excessive difficulties when eventually the NEON code is merged into dlib.

davisking on 25 Jan 2017

I will take a look at the neon code.

dmiller423 on 25 Jan 2017

@dmiller423 & @davisking, I think the way dmiller423 plans to implement the vector code will not impact the ARM NEON code. I am now curious to see whether the vector code can outperform the NEON code on the ARM platform.

And I have 2 suggestions after reading your discussion:
1) Test code for simd: the test code I mention is to verify correctness (at least) and performance. I planned to do this when implementing the NEON code for dlib. However, I don't have enough time to complete it. So, if dmiller423 can do this, it will benefit the maintenance of dlib's simd code in the future :)
2) LLVM vs. GCC: I don't know which compiler will compile the vector code into binary code with better performance. It would be nice if dmiller423 could test it and show the results :)

fastfastball on 25 Jan 2017
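
A correctness test of the kind suggested above can compare each vector result against a scalar reference, lane by lane. The sketch below is an assumption about how such a test might look, not code from dlib's test suite; `example_t4f` and `example_check_mul_sub` are hypothetical names, and a GCC- or Clang-compatible compiler is assumed.

```cpp
#include <cmath>
#include <cstddef>

typedef float example_t4f __attribute__((vector_size(16)));

// Checks that the vector expression a*b - a matches the scalar computation
// for every lane of every group of four floats. A small relative tolerance
// is used because the compiler may fuse the multiply and subtract differently
// in the vector and scalar paths.
inline bool example_check_mul_sub(const float* a, const float* b, std::size_t n_groups)
{
    for (std::size_t g = 0; g < n_groups; ++g)
    {
        const float* pa = a + 4 * g;
        const float* pb = b + 4 * g;
        example_t4f va = {pa[0], pa[1], pa[2], pa[3]};
        example_t4f vb = {pb[0], pb[1], pb[2], pb[3]};
        example_t4f vr = va * vb - va;

        for (int i = 0; i < 4; ++i)
        {
            const float expected = pa[i] * pb[i] - pa[i];
            const float tol = 1e-5f * (1.0f + std::fabs(expected));
            if (std::fabs(vr[i] - expected) > tol)
                return false;
        }
    }
    return true;
}
```

The same pattern extends to other operations by swapping the vector expression and its scalar counterpart.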

It should be noted the vector code and the intrinsics work much the same, actually I did a very quick check (on power8) and came out with the exact same code. On a larger example I'd expect the compiler may have better optimization opportunities, either way there really won't be much difference. I'm doing it this way because it makes the most sense, as it's the most compatible and the way gcc is moving forward with simd code across multiple architectures.

dmiller423 on 25 Jan 2017

https://github.com/davisking/dlib/pull/414

Initial gcc vector code is up and running, tested on x86_64 and Power8/64.

dmiller423 on 27 Jan 2017

I also have a vec_ (power8 only) optimized version as well as a single header version using templates, but it became a royal pain to debug.

dmiller423 on 27 Jan 2017

Super. Let me know when the PR is ready to review for merge into dlib.

Also, there are now two active pull requests for this issue. Not sure how you guys ( @dmiller423 @edelsohn @barkovv) want to handle this. It probably only makes sense to merge one. In any event, let me know when one of you have a PR that you think is ready and I'll look it over.

Also, since @edelsohn is the requestor, maybe he will want some specific performance tests to decide which implementation is best? I imagine the speed of the fhog object detector is ultimately the metric of interest, but I don't want to assume.

davisking on 27 Jan 2017

My criteria is ideal speedup, not absolute performance.

How does the x86-64 performance using GCC/LLVM vector operations compare with the original, hand-written x86-64 intrinsics? Is the x86-64 vector code equivalent in performance?

Is the SIMD speedup on POWER8 VSX vs scalar equivalent to x86-64 AVX vs scalar for the same SIMD width?

edelsohn on 27 Jan 2017

The pull request is review ready.

I have made a few performance tests on x64 and power8 against scalar code.
I am going to write up a bit of info after doing some more and comparing.
I will post the results when i'm finished.

dmiller423 on 27 Jan 2017

Sounds good. I'll review it once you post that info. I did notice the code's tabbing is messed up though (https://github.com/davisking/dlib/pull/414/files) since it's a mixture of spaces and tabs. Can you use spaces and make sure the tabbing matches the rest of the dlib code?

davisking on 27 Jan 2017

@dmiller423 does your code pass the "hash" test?

barkovv on 27 Jan 2017

I checked the code of the "hash" test and ran it with and without VSX optimizations. It seems that VSX doesn't affect this problem, so I created a new issue, #415, for that.

@davisking
If the pull request from @dmiller423 passes the "hash" test, I guess you should accept his code instead of mine.

PS
Could you recommend a good profiler for ppc64? My pull request is ready but I can't measure the gain of the VSX code compared to the original. Thanks

barkovv on 27 Jan 2017

operf or perf are the common profiling solutions on ppc64 linux.

edelsohn on 27 Jan 2017

@dmiller423 , @edelsohn
Never mind my post with results. I made a mistake in the configuration and was comparing non-optimized against non-optimized :D . Now that I have turned optimization on, I get many compilation errors. Sorry for my silly mistake.

barkovv on 29 Jan 2017

I have results; unfortunately they are Power/VSX only.
I will investigate auto-vectorization further at some point;
the compiler just isn't mixing all of the inlined code as I would have hoped.
It actually has similar results to the C code (and ignores some of the inlining since it's not forced).

Here are the final results with vector intrinsics:

[image: dlib_perf_none]
[image: dlib_perf_vsx]

As you can see, the VSX variant (bottom) is fully inlined and comparable with the speed of AVX, which is very surprising. I would check it against SSE without AVX, but there is really no need at this point, is there?

dmiller423 on 1 Feb 2017

Sorry about the symbol mangling, I can't seem to get perf to unmangle properly on this platform and I haven't taken the time to debug it further.

dmiller423 on 1 Feb 2017

I don't really understand what those images are showing me. The fraction of time spent in each function? I'm not sure how that tells us what is faster. Maybe the whole program is 10x slower with some particular optimization but spends overall some % less in the functions we are interested in. But that doesn't mean it's faster. The relevant metrics are milliseconds to execute the face detector, fhog extraction, etc. with and without the optimizations.

davisking on 1 Feb 2017

Well this is the face detector and it waits for key input, I will stop it from waiting and run a simple perf stats to show you. However what it does show is that the cpu time spent while running is redistributed away from the simd8 code. I'll run some time tests now.

dmiller423 on 1 Feb 2017

Yes, those are useful diagnostics to understand what's happening. However,
no users care about the relative spread of CPU time over different parts of
dlib code :) For the PR to be accepted what matters is seeing a reduction
in execution time (according to the wall clock).

davisking on 1 Feb 2017

Well the speed is then relative, here's the proof:

ubuntu@ubuntu-16:~/dlib-gcc7-none$ sudo perf stat ./face_detection_ex ./faces/2007_007763.jpg
processing image ./faces/2007_007763.jpg
Number of faces detected: 7

Performance counter stats for './face_detection_ex ./faces/2007_007763.jpg':

    811.869408      task-clock (msec)         #    0.999 CPUs utilized
             5      context-switches          #    0.006 K/sec
             0      cpu-migrations            #    0.000 K/sec
           390      page-faults               #    0.480 K/sec
 2,991,381,796      cycles                    #    3.685 GHz                      (66.50%)
    48,076,000      stalled-cycles-frontend   #    1.61% frontend cycles idle     (49.75%)
 1,872,272,383      stalled-cycles-backend    #   62.59% backend cycles idle      (50.21%)
 2,857,961,772      instructions              #    0.96  insn per cycle
                                              #    0.66  stalled cycles per insn  (67.31%)
   341,971,733      branches                  #  421.215 M/sec                    (50.28%)
    18,372,995      branch-misses             #    5.37% of all branches          (49.81%)

   0.812286628 seconds time elapsed

ubuntu@ubuntu-16:~/dlib-gcc7-vsx$ sudo perf stat ./face_detection_ex ./faces/2007_007763.jpg
processing image ./faces/2007_007763.jpg
Number of faces detected: 7

Performance counter stats for './face_detection_ex ./faces/2007_007763.jpg':

    408.551834      task-clock (msec)         #    0.999 CPUs utilized
             1      context-switches          #    0.002 K/sec
             0      cpu-migrations            #    0.000 K/sec
           390      page-faults               #    0.955 K/sec
 1,505,111,030      cycles                    #    3.684 GHz                      (66.71%)
    28,788,911      stalled-cycles-frontend   #    1.91% frontend cycles idle     (50.08%)
   880,596,128      stalled-cycles-backend    #   58.51% backend cycles idle      (50.08%)
 1,766,269,773      instructions              #    1.17  insn per cycle
                                              #    0.50  stalled cycles per insn  (66.72%)
   186,892,769      branches                  #  457.452 M/sec                    (50.05%)
     9,504,368      branch-misses             #    5.09% of all branches          (50.05%)

   0.408893233 seconds time elapsed

dmiller423 on 1 Feb 2017

I suppose i just should have pasted the time elapsed, but it's 2x the real speed.

dmiller423 on 1 Feb 2017

What is the relative speedup for x86-64 SIMD of the same SIMD width?

edelsohn on 1 Feb 2017

fhog_object_detector ./faces (entire dir)
gcc7-none: 22.372478922 seconds time elapsed
gcc7-vsx: 14.357527564 seconds time elapsed

dmiller423 on 1 Feb 2017

@edelsohn when I ran valgrind on it with AVX (256b) simd (I meant to do SSE only but forgot to disable AVX) it was around the same (2x). I'm sure AVX wins in the long run for 256b operations, but VSX can certainly hold its own against it here. I can test again for SSE if you like?

dmiller423 on 1 Feb 2017

AVX (not AVX512) is a good comparison. This testcase runs as double precision floating point?

edelsohn on 1 Feb 2017

No, single precision floating point and integer.

dmiller423 on 1 Feb 2017

AVX is 128 bit. AVX2 is 256 bit. AVX512 is 512 bit.

Single precision floating point and integer are 32 bit, and 4 should pack into VSX and AVX SIMD. I would have expected closer to 4x improvement for both PPC64 and x86-64, but the results are the same for both.

If the speedup is equivalent for both 128 bit SIMD architectures, I'm satisfied.

edelsohn on 1 Feb 2017

There is a great deal of overhead. To achieve performance equivalent to that, you have to map all inputs out ahead of time, set them all in simd regs and handle the entire job at once. Basically you have to treat it as a coprocessor, since moves between VRs and GPRs cause pipeline stalls and have high latency.

dmiller423 on 1 Feb 2017
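
The "treat it as a coprocessor" advice above can be illustrated with a batched loop that keeps its accumulator in a vector register and extracts scalars only once at the end. This is a sketch under assumptions, not dlib code: `example_b4f` and `example_dot` are hypothetical names, and GCC/Clang vector extensions are assumed.

```cpp
#include <cstddef>
#include <cstring>

typedef float example_b4f __attribute__((vector_size(16)));

// Dot product that processes four floats per iteration. Data is loaded
// straight into vector values, the running sum stays in a vector, and the
// lanes are moved out to a scalar only once after the main loop.
inline float example_dot(const float* a, const float* b, std::size_t n)
{
    example_b4f acc = {0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        example_b4f va, vb;
        std::memcpy(&va, a + i, sizeof(va));   // load that is safe for unaligned data
        std::memcpy(&vb, b + i, sizeof(vb));
        acc += va * vb;                        // no per-element extraction here
    }

    // Single horizontal reduction after all vector work is done.
    float sum = acc[0] + acc[1] + acc[2] + acc[3];

    // Scalar tail for lengths that are not a multiple of four.
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Extracting a lane inside the loop instead would force the vector-to-scalar move on every iteration, which is the overhead the comment above warns about.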

Don't time the whole program, just the part that does face detection. You don't need to be timing jpeg decoding for instance.

Although I'm not complaining, it looks like it's a lot faster :)

davisking on 1 Feb 2017
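
Timing only the detector call, rather than the whole program, can be done with a small wall-clock helper around the region of interest. The helper name `example_seconds_for` is hypothetical; the sketch assumes nothing about dlib beyond wrapping whatever call is being measured.

```cpp
#include <chrono>

// Runs `work` once and returns the elapsed wall-clock time in seconds.
// steady_clock is used because it is monotonic and unaffected by system
// clock adjustments.
template <typename F>
double example_seconds_for(F&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

In an example program the detector call would go inside the lambda, for instance `example_seconds_for([&]{ dets = detector(img); })`, where `dets`, `detector` and `img` are assumed to be the variables of that program. Image loading and JPEG decoding then stay outside the measured region.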

ubuntu@ubuntu-16:~/dlib-gcc7-none$ ./face_detection_ex ./faces/2007_007763.jpg
processing image ./faces/2007_007763.jpg
detector time: 0.575958 second
Number of faces detected: 7

ubuntu@ubuntu-16:~/dlib-gcc7-vsx$ ./face_detection_ex ./faces/2007_007763.jpg
processing image ./faces/2007_007763.jpg
detector time: 0.166241 second
Number of faces detected: 7

dmiller423 on 1 Feb 2017

strictly on detector speed, it is nearly 3.5x as fast

dmiller423 on 1 Feb 2017
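
The speedups quoted in this thread follow from dividing the baseline time by the optimized time. A small sketch of that arithmetic, using only the timings posted above (the function name `example_speedup` is hypothetical):

```cpp
// Speedup = baseline time / optimized time (both in seconds).
inline double example_speedup(double baseline_seconds, double optimized_seconds)
{
    return baseline_seconds / optimized_seconds;
}

// Timings taken from the thread:
//   detector only:          0.575958 s    vs 0.166241 s    -> about 3.46x
//   whole program (1 image): 0.812286628 s vs 0.408893233 s -> about 1.99x
//   whole directory:        22.372478922 s vs 14.357527564 s -> about 1.56x
```

The gap between the detector-only figure and the whole-program figures reflects the time spent in parts of the program that the SIMD code does not touch, such as image decoding.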

Awesome :)

davisking on 1 Feb 2017

@edelsohn ok to close this out?

dmiller423 on 2 Feb 2017

I'm satisfied. I would expect @davisking to close the issue.

edelsohn on 2 Feb 2017

I'm good.

davisking on 2 Feb 2017

@davisking thanks again for the prompt replies/review

dmiller423 on 2 Feb 2017

No problem :)

davisking on 2 Feb 2017

I am trying to compile the face_detection_ex.cpp example on a Power system but GCC complains a lot about the simd.h file. I am using GCC 5.4 on Ubuntu 16.04. To get the code compiled I have to disable the SIMD optimizations but this is not ideal.

/tmp/dlib-19.13/examples# g++ -std=c++11 -O3 -I.. -lpthread -lX11 face_detection_ex.cpp 
In file included from ../dlib/image_processing/../image_processing/../image_transforms/../simd/simd4f.h:6:0,
                 from ../dlib/image_processing/../image_processing/../image_transforms/../simd.h:6,
                 from ../dlib/image_processing/../image_processing/../image_transforms/spatial_filtering.h:13,
                 from ../dlib/image_processing/../image_processing/../image_transforms.h:14,
                 from ../dlib/image_processing/../image_processing/scan_fhog_pyramid.h:8,
                 from ../dlib/image_processing/frontal_face_detector.h:8,
                 from face_detection_ex.cpp:40:
../dlib/image_processing/../image_processing/../image_transforms/../simd/simd_check.h: In function '__vector(4) __bool int cpu_has_sse2_instructions()':
../dlib/image_processing/../image_processing/../image_transforms/../simd/simd_check.h:145:79: error: cannot convert 'bool' to '__vector(4) __bool int' in return
     inline bool cpu_has_sse2_instructions()   { return 0!=(cpuid(1)[3]&(1<<26)); }
                                                                               ^
../dlib/image_processing/../image_processing/../image_transforms/../simd/simd_check.h: In function '__vector(4) __bool int cpu_has_sse3_instructions()':
../dlib/image_processing/../image_processing/../image_transforms/../simd/simd_check.h:146:78: error: cannot convert 'bool' to '__vector(4) __bool int' in return
     inline bool cpu_has_sse3_instructions()   { return 0!=(cpuid(1)[2]&(1<<0));  }
                                                                              ^
../dlib/image_processing/../image_processing/../image_transforms/../simd/simd_check.h: In function '__vector(4) __bool int cpu_has_sse41_instructions()':
../dlib/image_processing/../image_processing/../image_transforms/../simd/simd_check.h:147:79: error: cannot convert 'bool' to '__vector(4) __bool int' in return
     inline bool cpu_has_sse41_instructions()  { return 0!=(cpuid(1)[2]&(1<<19)); }
                                                                               ^
../dlib/image_processing/../image_processing/../image_transforms/../simd/simd_check.h: In function '__vector(4) __bool int cpu_has_sse42_instructions()':
../dlib/image_processing/../image_processing/../image_transforms/../simd/simd_check.h:148:79: error: cannot convert 'bool' to '__vector(4) __bool int' in return
     inline bool cpu_has_sse42_instructions()  { return 0!=(cpuid(1)[2]&(1<<20)); }
                                                                               ^
../dlib/image_processing/../image_processing/../image_transforms/../simd/simd_check.h: In function '__vector(4) __bool int cpu_has_avx_instructions()':
../dlib/image_processing/../image_processing/../image_transforms/../simd/simd_check.h:149:79: error: cannot convert 'bool' to '__vector(4) __bool int' in return
     inline bool cpu_has_avx_instructions()    { return 0!=(cpuid(1)[2]&(1<<28)); }

geoffrey-pascal on 20 Jun 2018

I don't see how you can get that error since cpuid() returns a std::array
not __vector. Unless cpuid() is some magic function on your system and
just needs to be renamed to avoid a name clash.

davisking on 20 Jun 2018

GCC 5.4 is extremely outdated. Why is the code testing for x86 features on Power?

edelsohn on 20 Jun 2018

Look at the code, I don't think it is.

davisking on 20 Jun 2018

I tried with GCC 7.3.1 from Advance-Toolchain 11 but I have the exact same issue.

I also tried renaming cpuid to cpuid2 in dlib/simd/simd_check.h but that didn't solve the issue.

Is the include dlib/simd/simd_check.h really needed when compiling on Power as it seems to mainly check for x86 cpu features ?

geoffrey-pascal on 20 Jun 2018

It's not related to powerpc. But the error you are getting doesn't make sense. It's saying that this code:

std::array<unsigned int,4> cpuid(int) { return std::array<unsigned int,4>{}; }
inline bool cpu_has_avx_instructions()    { return 0!=(cpuid(1)[2]&(1<<28)); }

is the issue. But that code is fine. Maybe your compiler is just broken. Right, read the error:

error: cannot convert 'bool' to '__vector(4) __bool int' in return
inline bool cpu_has_avx_instructions()    { return 0!=(cpuid(1)[2]&(1<<28)); }

That makes no sense since that function doesn't return a vector of bools. It's just returning a bool.

Anyway, maybe there is a workaround. You need to find some minimal piece of code that generates the error and then you will be able to find some workaround based on that. But the underlying issue seems to be some weird compiler bug.

davisking on 20 Jun 2018

Sorry I've been away. The problem looks like one of two things: either the compiler is not defaulting to the correct cpu arch,
i.e. you're using a generic powerpc gcc and not one built and tuned for power8 w/ vsx (most likely),
or else something has broken the simd_check header's detection. Changes elsewhere can do this, and simd then defaults to SSE because of how dlib simd was originally written.

you can check the preprocessor definitions of default gcc and see if VSX is defined
( which you'll find is necessary in simd_check.h , iirc gcc -dM -E - | grep VSX ought to do it )

Shouldn't take more than 2mins to figure out what the problem is, and changing the whole structure of simd handling or adding extra check scripts to build wasn't really a great option :|

dmiller423 on 25 Jul 2018
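
Besides dumping the preprocessor definitions from the command line, the same check can be made from inside a C++ translation unit built with the exact flags of the failing build. The sketch below is an illustration only; `example_defined_simd_macros` is a hypothetical helper, and the list of macros is a selection of common compiler-predefined names, not an exhaustive one.

```cpp
#include <string>
#include <vector>

// Returns the names of the SIMD-related predefined macros that are visible
// in this translation unit with the current compiler flags.
inline std::vector<std::string> example_defined_simd_macros()
{
    std::vector<std::string> found;
#ifdef __VSX__
    found.push_back("__VSX__");
#endif
#ifdef __ALTIVEC__
    found.push_back("__ALTIVEC__");
#endif
#ifdef __POWER8_VECTOR__
    found.push_back("__POWER8_VECTOR__");
#endif
#ifdef __SSE2__
    found.push_back("__SSE2__");
#endif
#ifdef __AVX__
    found.push_back("__AVX__");
#endif
#ifdef __ARM_NEON
    found.push_back("__ARM_NEON");
#endif
    return found;
}
```

Printing the returned names from a tiny test program shows whether `__VSX__` is actually defined for the build in question.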

Also note: the __vector(N) __bool types and similar are how powerpc simd internals are referenced by the frontend, while i don't agree with it : it's what we have.. It can be a pain in some cases when mixing with C++ but is not the problem you're having now.

If the default cpu arch/tune are the problem as I suspect you can just add -mcpu -mtune to the build and default to generic power8le.

There is a way to default the whole toolchain w.o rebuilding with a script, It involves digging into dumping the toolchains store and editing and replacing as a script file somewhere in the toolchain fs root : just FYI

dmiller423 on 25 Jul 2018

@dmiller423 Did you look at the cpuid() code in simd_check.h? I don't see how this can be related to powerpc. The bit of code generating the error is distinct from any of the simd code in dlib, it just happens to be located in the simd_check.h header.

davisking on 25 Jul 2018

@davisking sure, it's dependent on either supported x86 toolchain or else it just returns an uint32[4] so that extra ifdef's don't have to be used elsewhere to make the project link. It's completely harmless... the point of the simd_check.h is to filter the required definitions for SSE, AVX, VSX etc w.o having to do more complicated build checks... The flipside to this is cases where VSX is not defined, maybe a specific check might be in order

#if defined(__powerpc) && !defined(__VSX__)
#error PowerPC w.o VSX detected check flags
#endif

ppc64le compilers built w.o any specific target end up being defaulted to 800 series hardware that has altivec(vmx) but no vsx and it's more of a pain than dealing w. x86 which is generally tuned to at least p5 w. basic sse and most compilers have runtime checks or cpuid is tested at runtime to determine feature set.

VSX can be tested for on ppc but it differs by OS if you want to do so from user mode, and can differ by arch some and would require root which is obv. not great. If you tell the compiler to just build with the flags it's just as much of a mess... So I of course chose the simplest path, tho perhaps the warning/error is in order?

dmiller423 on 25 Jul 2018

I still don't follow. The error seems to come from this code:

    inline std::array<unsigned int,4> cpuid(int) 
    {
        return std::array<unsigned int,4>{};
    } 
    inline bool cpu_has_sse2_instructions()   { return 0!=(cpuid(1)[3]&(1<<26)); }

That code doesn't have anything to do with any kind of SIMD or hardware specific features whatsoever.

davisking on 25 Jul 2018

I'd have to look a lot more thoroughly, I believe the only reason that is in the code is it's used outside of cpu_has_* functions somewhere as well.. If not the best thing to do would be to cut the cpuid() out of the #else it's in , and then add another wrapping all of the cpu_has_*() instead so it's only built on X86 && X64 ... since there is simd code for ARM and PPC.

dmiller423 on 25 Jul 2018

or you could change them to macro's or force the inlining since they should only be compiled if DLIB_HAVE_{SSE,AVX} are enabled below ... which is the (one) reason they haven't been a problem in the past: they should never be called on non x86 based architectures anyhow.

dmiller423 on 25 Jul 2018
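
One way to read the suggestion above is to define an architecture macro once and compile the cpuid helper and the feature checks only when it is set. The sketch below assumes that reading; `EXAMPLE_ARCH_X86`, `example_cpuid`, `example_cpu_has_sse2` and `example_x86_checks_available` are hypothetical names and none of this is dlib's actual code.

```cpp
#include <array>

// Hypothetical architecture macro: set only for x86 / x86-64 targets.
#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)
    #define EXAMPLE_ARCH_X86 1
#endif

#ifdef EXAMPLE_ARCH_X86

#if defined(_MSC_VER)
#include <intrin.h>
inline std::array<unsigned int,4> example_cpuid(int leaf)
{
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, leaf);   // fills EAX, EBX, ECX, EDX
    std::array<unsigned int,4> r = {{
        static_cast<unsigned int>(regs[0]), static_cast<unsigned int>(regs[1]),
        static_cast<unsigned int>(regs[2]), static_cast<unsigned int>(regs[3]) }};
    return r;
}
#else
#include <cpuid.h>
inline std::array<unsigned int,4> example_cpuid(int leaf)
{
    std::array<unsigned int,4> r = {{0u, 0u, 0u, 0u}};
    __get_cpuid(static_cast<unsigned int>(leaf), &r[0], &r[1], &r[2], &r[3]);
    return r;
}
#endif

// Only exists on x86 builds, so it is never even parsed on PowerPC or ARM.
inline bool example_cpu_has_sse2() { return 0 != (example_cpuid(1)[3] & (1u << 26)); }

#endif // EXAMPLE_ARCH_X86

// Lets callers ask whether the x86 runtime checks were compiled in.
inline bool example_x86_checks_available()
{
#ifdef EXAMPLE_ARCH_X86
    return true;
#else
    return false;
#endif
}
```

With this layout a non-x86 build contains neither the cpuid wrapper nor the `cpu_has_*`-style functions, so they cannot trigger errors there.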

cpuid() isn't called outside that code snippet.

I suspect this is some kind of error related to bad support for std::array. But who knows. @geoffrey-pascal seems inactive.

davisking on 25 Jul 2018

I tried to run gcc -dM -E - | grep VSX but it doesn't return anything. I am using the gcc compiler from Ubuntu 16.04.

geoffrey-pascal on 26 Jul 2018

$ gcc -x c -E /dev/null -g3 -o -
shows that GCC defines

__VSX__
__POWER8_VECTOR__

edelsohn on 26 Jul 2018

GCC seems ok :

$ gcc -x c -E /dev/null -g3 -o - | grep VSX
#define __VSX__ 1

$ gcc -x c -E /dev/null -g3 -o - | grep VECTOR 
#define __POWER8_VECTOR__ 1

$gcc -x c -E /dev/null -g3 -o - | grep ALTIVEC
#define __ALTIVEC__ 1
#define __APPLE_ALTIVEC__ 1

geoffrey-pascal on 26 Jul 2018

Sorry wrong format for preproc dump apparently, was on mobile at the time.
I'm on VM now, give me a min and i'll see if I can reproduce your problem.

dmiller423 on 26 Jul 2018

@geoffrey-pascal I don't have X on my VM atm, but all of the non UI examples and the library build properly...
You aren't using any of the -DUSE_{AVX,SSE}_INSTRUCTIONS=1; during build by any chance?
There is very little reason that one example should build and another fail, especially another detection example and i've tested those.
I have of course tested with UI support when I added the code as well... please let me know.
It's easy to understand if you pasted the -DUSE_AVX_INSTRUCTIONS=1; like it says in the readme.

dmiller423 on 26 Jul 2018

matched gcc in version # at least if not all build specs:
gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)

dmiller423 on 26 Jul 2018

@dmiller423 I am using the command below for the build. I am not using -DUSE_AVX_INSTRUCTIONS=1

g++ -std=c++11 -O3 -I.. -lpthread -lX11 face_detection_ex.cpp

I forgot to mention before that I am working on a Power 9 system

geoffrey-pascal on 27 Jul 2018

What if you put this in a test.cpp file and try to compile it?

  #include <array>
  inline std::array<unsigned int,4> cpuid(int) 
    {
        return std::array<unsigned int,4>{};
    } 
    inline bool cpu_has_sse2_instructions()   { return 0!=(cpuid(1)[3]&(1<<26)); }
    int main() { cpu_has_sse2_instructions(); }

Does it compile?

davisking on 27 Jul 2018

I tried with GCC 4.8 and 5.4 and it compiles for both versions.

geoffrey-pascal on 27 Jul 2018

What if you just #include simd_check.h?

You should try and find a minimal example that causes the error. For some reason that code works in that test.cpp but not inside simd_check.h. Maybe there is a #define that interferes. Maybe cpuid is #defined to something and it's imported by some header.

davisking on 27 Jul 2018
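
The hypothesis above, that some `#define` interferes with the code, can be tested directly by asking the preprocessor whether suspicious names are macros at the point of failure. The sketch below is an assumption about how to do that; `example_interfering_macros` is a hypothetical helper, and the names it checks (`cpuid`, `bool`, `vector`, `pixel`) are guesses at candidates, not a confirmed cause of the error in this thread.

```cpp
#include <string>

// Include the headers under suspicion BEFORE this function, then call it to
// see which of the candidate identifiers have been turned into macros.
// Returns a space-separated list of the names that are defined as macros.
inline std::string example_interfering_macros()
{
    std::string hits;
#ifdef cpuid
    hits += "cpuid ";
#endif
#ifdef bool
    hits += "bool ";
#endif
#ifdef vector
    hits += "vector ";
#endif
#ifdef pixel
    hits += "pixel ";
#endif
    return hits;
}
```

If the returned string is non-empty when the suspect headers are included first, the listed macro is a candidate for the minimal reproduction.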

When I include simd_check.h I have the error

$ g++ -std=c++11 -O3 test.cpp 
In file included from test.cpp:2:0:
simd/simd_check.h: In function '__vector(4) __bool int cpu_has_sse2_instructions()':
simd/simd_check.h:145:80: error: cannot convert 'bool' to '__vector(4) __bool int' in return
     inline bool cpu_has_sse2_instructions()   { return 0!=(cpuid2(1)[3]&(1<<26)); }

I tried to rename cpuid to cpuid2 but it produces the same error. I also tried to grep cpuid inside the sources to check if it's defined somewhere but I haven't found anything

geoffrey-pascal on 27 Jul 2018

Ok, so now start deleting stuff from simd_check.h and see what makes the
error go away. You can figure it out :)

davisking on 27 Jul 2018

Well the error will be a bit tricky to find that way. It's most likely a conversion error (cast) from the cast operator and this constructor:
inline simd4i(const rawarray& a) : x{a.v} { }
unless the compiler is trying to auto-vectorize it and is extremely broken.
Try commenting that out and see if the error changes...

Either way, the options I posted above for eliminating the cpuid check on powerpc have to work, since it's not needed, not useful and does nothing at this point.

Quick impl and tested here on examples:

// ----------------------------------------------------------------------------------------
// Define functions to check, at runtime, what instructions are available

#if defined(_MSC_VER) && (defined(_M_I86) || defined(_M_IX86) || defined(_M_X64) || defined(_M_AMD64) )

#include <intrin.h>

inline std::array<unsigned int,4> cpuid(int function_id)
{
    std::array<unsigned int,4> info;
    // Load EAX, EBX, ECX, EDX into info
    __cpuid((int*)info.data(), function_id);
    return info;
}

#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__i686__) || defined(__amd64__) || defined(__x86_64__))

#include <cpuid.h>

inline std::array<unsigned int,4> cpuid(int function_id)
{
    std::array<unsigned int,4> info;
    // Load EAX, EBX, ECX, EDX into info
    __cpuid(function_id, info[0], info[1], info[2], info[3]);
    return info;
}

#endif

#if !defined(DLIB_HAVE_VSX) // Should prob make a DLIB_ARCH_X86 | X64 and add to above detections instead //

inline bool cpu_has_sse2_instructions()   { return 0!=(cpuid(1)[3]&(1<<26)); }
inline bool cpu_has_sse3_instructions()   { return 0!=(cpuid(1)[2]&(1<<0));  }
inline bool cpu_has_sse41_instructions()  { return 0!=(cpuid(1)[2]&(1<<19)); }
inline bool cpu_has_sse42_instructions()  { return 0!=(cpuid(1)[2]&(1<<20)); }
inline bool cpu_has_avx_instructions()    { return 0!=(cpuid(1)[2]&(1<<28)); }
inline bool cpu_has_avx2_instructions()   { return 0!=(cpuid(7)[1]&(1<<5));  }
inline bool cpu_has_avx512_instructions() { return 0!=(cpuid(7)[1]&(1<<16)); }

#endif

inline void warn_about_unavailable_but_used_cpu_instructions()
{
#if defined(DLIB_HAVE_AVX2)
    if (!cpu_has_avx2_instructions())
        std::cerr << "Dlib was compiled to use AVX2 instructions, but these aren't available on your machine." << std::endl;
#elif defined(DLIB_HAVE_AVX)
    if (!cpu_has_avx_instructions())
        std::cerr << "Dlib was compiled to use AVX instructions, but these aren't available on your machine." << std::endl;
#elif defined(DLIB_HAVE_SSE41)
    if (!cpu_has_sse41_instructions())
        std::cerr << "Dlib was compiled to use SSE41 instructions, but these aren't available on your machine." << std::endl;
#elif defined(DLIB_HAVE_SSE3)
    if (!cpu_has_sse3_instructions())
        std::cerr << "Dlib was compiled to use SSE3 instructions, but these aren't available on your machine." << std::endl;
#elif defined(DLIB_HAVE_SSE2)
    if (!cpu_has_sse2_instructions())
        std::cerr << "Dlib was compiled to use SSE2 instructions, but these aren't available on your machine." << std::endl;
#endif
}

dmiller423 on 28 Jul 2018

I don't think we are all looking at the same thing. @geoffrey-pascal just said that this program causes the error as well:

 #include <array>
 #include <dlib/simd/simd_check.h>
  inline std::array<unsigned int,4> cpuid(int) 
    {
        return std::array<unsigned int,4>{};
    } 
    inline bool cpu_has_sse2_instructions()   { return 0!=(cpuid(1)[3]&(1<<26)); }
    int main() { cpu_has_sse2_instructions(); }

However, simd_check.h doesn't include anything other than iostream and array. So that program doesn't include simd4i or any other simd code. So how can the simd code have anything to do with the error?

davisking on 29 Jul 2018

@davisking It's a good question. If it's never actually used in conjunction with the simd class it _shouldn't_; however, I see no other possibility, since it compiles fine _without_ the simd headers... unless including the altivec headers and/or code using vsx intrinsics causes auto-vectorization (and a bug there).
So I'm going back to the original use and a fix for it: since he wants to use it for dlib, the fix is simple. It removes the function completely, as it's not needed.

Note: like the comment suggests, the ifdef checking for vsx is a bit backwards, it's tech not required for ARM either or anything non x86 based.. for now though it was a quick solution with little change / KISS

dmiller423 on 29 Jul 2018

@davisking I have reported numerous bugs in powerpc gcc < version 8, and I believe this is quite likely another, though I'm unable to reproduce it. If he was using a public VM or it was easy to reproduce I'd report it to Bill Seurer @IBM. ( @edelsohn if you want to, it's up to you; I'm not sure what he can do w.o a clear means of reproduction. Maybe @geoffrey-pascal wants to share his VM or full details of his configuration. Either way gcc 8, after many reports and a big overhaul, is much better and may not have the issue at all. )

dmiller423 on 29 Jul 2018

That's all fine, but any code change needs to be verified to fix the issue. I don't want to merge any PRs that are just guesses at what might address the problem.

davisking on 29 Jul 2018

I wasn't suggesting a pull req, since I can't reproduce the only real resolution is to find something that works through the issue he's having. If anyone else has an issue I would happily look into it, seems isolated as I know several people have used it on power8 VMs.

dmiller423 on 2 Aug 2018

I made it work on P8 like a breeze. I don't understand why GCC is complaining on P9. What do you suggest? Trying with the latest GCC?

geoffrey-pascal on 30 Aug 2018

I've mentioned numerous means of definite fixes, and some more likely to be a better long term solution.
Since I don't have access to the machine or these issues myself, that's really all I can do.

dmiller423 on 5 Sep 2018

Dlib: Optimize dlib for POWER8 VSX
