Beast: cpuid / sse42 broken on gcc/clang

Created on 4 Jul 2017 · 39 Comments · Source: boostorg/beast

It seems this only works on Windows. I need help getting it working elsewhere.

All 39 comments

What exactly is broken?

It compiles, but I don't think I have the compiler flags set to allow the intrinsics to be used. I also don't think the macros for detecting whether intrinsics are available on gcc/clang are right. This all works on Windows by the way, with a 20% speedup. Here it is:
https://github.com/vinniefalco/Beast/blob/v72/include/beast/core/detail/cpu_info.hpp#L11

Thanks for looking!

1) You need -march=<something that has sse4.2> or -msse4.2 directly to make gcc provide these intrinsics. Probably the same with clang. MSVC doesn't have or need this.
2) You should #include <x86intrin.h> (GCC, clang) or #include <intrin.h> (MSVC) to provide the prototypes.
3) The detection is wrong, try this

# if defined(__SSE4_2__) || defined(_MSC_VER)
#  define BEAST_NO_INTRINSICS 0
# else
#  define BEAST_NO_INTRINSICS 1
# endif

4) I find the double negation confusing, why not BEAST_USE_INTRINSICS?

Could I trouble you to submit a pull request? Then Travis will kick in and we'll see if it works

"I find the double negation confusing, why not BEAST_USE_INTRINSICS?"
That's the Boost style

Don't use the __asm stuff! You can use this for gcc and clang:

#include <iostream>
#include <cstdint>  // std::uint32_t
#include <cpuid.h>

int main()
{
    std::uint32_t id = 0x1;
    std::uint32_t eax = 0;
    std::uint32_t ebx = 0;
    std::uint32_t ecx = 0;
    std::uint32_t edx = 0;

    bool supported = __get_cpuid(id,  &eax,  &ebx,  &ecx, &edx);

    std::cout << std::boolalpha << supported << std::endl;
    std::cout << eax << std::endl;
    std::cout << ebx << std::endl;
    std::cout << ecx << std::endl;
    std::cout << edx << std::endl;

    return 0;
}

Tested with

  • Apple LLVM version 7.3.0 (clang-703.0.31)
  • gcc 4.8.1 (wandbox)

I realistically won't have time in the near future, sorry... Next week at the earliest. Maybe @octopus-prime wants to have a go at it?

There's no rush at all

/* Return cpuid data for requested cpuid level, as found in returned
   eax, ebx, ecx and edx registers.  The function checks if cpuid is
   supported and returns 1 for valid cpuid information or 0 for
   unsupported cpuid level.  All pointers are required to be non-null.  */
int __get_cpuid (unsigned int __level,
    unsigned int *__eax, unsigned int *__ebx,
    unsigned int *__ecx, unsigned int *__edx)

I believe you... I think you should submit a pull request against v72! Maybe run the benchmarks before and after to see if its better?

@octopus-prime With GCC 4.8 and clang 3.7.1 there's even __builtin_cpu_supports("sse4.2") :-)

dang, that's awesome

20% speed up is waiting for you if you put in the work :)

@mika-fischer __builtin_cpu_supports("sse4.2") is runtime...

Here is an example:

          if (__builtin_cpu_supports ("popcnt"))
            {
               asm("popcnt %1,%0" : "=r"(count) : "rm"(n) : "cc");
            }
          else
            {
               count = generic_countbits (n); //generic implementation.
            }

If there is a dependency for SSE4.2 (i am not sure about that for cpuid) we should sort them out at compile time using something like:

#include <boost/predef/hardware/simd.h>

#if (BOOST_HW_SIMD_X86 >= BOOST_HW_SIMD_X86_SSE4_2_VERSION)
    ...
#else
    ...
#endif

The way I wrote the code is so that the instructions can be detected at run-time, this way the same binary will work on i386 systems with or without SSE4.2. If the extensions are available, it will use them, otherwise it will just run the regular code. There are two settings here: 1. Whether the extensions are available at compile time (needs a switch I guess) and 2. Whether the extensions are available at run-time (call __cpuid). Beast caches the value at run-time in the cpu_info.
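The run-time half of that scheme could look roughly like the sketch below. It is not Beast's actual cpu_info; the names detect_sse42, cpu_flags and get_cpu_flags are hypothetical and chosen only for illustration. The sketch assumes an x86 target for the CPUID paths and reports "no SSE4.2" everywhere else. The function-local static means the CPUID query runs once per process and later calls read the cached value.

```cpp
#include <cstdint>

#if defined(_MSC_VER)
# include <intrin.h>   // __cpuid
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
# include <cpuid.h>    // __get_cpuid
#endif

// Hypothetical helper: query CPUID leaf 1 and test ECX bit 20 (SSE4.2).
inline bool detect_sse42()
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4] = {};
    __cpuid(regs, 1);
    return (static_cast<std::uint32_t>(regs[2]) & (1u << 20)) != 0;
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if(! __get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;  // leaf 1 not supported
    return (ecx & (1u << 20)) != 0;
#else
    return false;      // not x86, or an unknown compiler: use the regular code
#endif
}

// Hypothetical cache, standing in for the role cpu_info plays in the thread.
struct cpu_flags
{
    bool sse42;
};

inline cpu_flags const& get_cpu_flags()
{
    // Initialized on first call only; initialization is thread-safe in C++11.
    static cpu_flags const flags{detect_sse42()};
    return flags;
}
```

A caller would then branch on get_cpu_flags().sse42 at the point where the intrinsic path and the regular path diverge.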

@octopus-prime cpuid is also runtime. That's the point. That way the executable still works on processors without SSE4.2

Okay, got it now :-)

#include <iostream>
#include <cstdint>  // std::uint32_t
#include <cpuid.h>
#include <chrono>

bool check_sse42_by_cpuid()
{
    std::uint32_t id = 0x1;
    std::uint32_t eax = 0;
    std::uint32_t ebx = 0;
    std::uint32_t ecx = 0;
    std::uint32_t edx = 0;

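    // __get_cpuid returns 0 when the requested leaf is unsupported
    if (!__get_cpuid(id, &eax, &ebx, &ecx, &edx))
        return false;

    return (ecx & (1u << 20)) != 0;  // ECX bit 20 reports SSE4.2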
}

bool check_sse42_by_builtin_cpu_supports()
{
    return __builtin_cpu_supports("sse4.2");
}

template <typename S, typename F>
void test(S string, F check)
{
    using namespace std::chrono;
    auto const t0 = high_resolution_clock::now();
    auto const r = check();
    auto const t1 = high_resolution_clock::now();

    std::cout << string << ": " << std::boolalpha << r << ", " << duration_cast<nanoseconds>(t1 - t0).count() << "ns" << std::endl;
}

int main()
{
    test("by_cpuid", check_sse42_by_cpuid);
    test("by_builtin_cpu_supports", check_sse42_by_builtin_cpu_supports);
    return 0;
}

==>

by_cpuid: true, 3991ns
by_builtin_cpu_supports: true, 216ns

So the good news is builtin_cpu_supports is at least 10x faster than cpuid.
But the bad news is that with clang (at least on OS X using libc++) the code compiles but does not link.

I don't think speed matters too much, since this will be only executed once per process. Also note that the check with cpuid is wrong, see https://github.com/vinniefalco/Beast/blob/v72/include/beast/core/detail/cpu_info.hpp#L87-L92

BTW: is this correct??
sse42 = (ecx & 20) != 0;

https://en.wikipedia.org/wiki/CPUID
says for SSE4.2 bit 20 on ecx

So it should be
sse42 = (ecx & 1 << 20) != 0;
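To make the difference concrete: 20 in binary is 10100, so ecx & 20 tests bits 2 and 4 of ECX rather than bit 20. The small sketch below (the names broken_check and fixed_check are made up for illustration) checks both masks at compile time.

```cpp
#include <cstdint>

// Bit 20 of ECX (CPUID leaf 1) reports SSE4.2.
constexpr std::uint32_t sse42_mask = std::uint32_t{1} << 20;

// The original test: masks with the decimal value 20 (bits 2 and 4).
constexpr bool broken_check(std::uint32_t ecx)
{
    return (ecx & 20) != 0;
}

// The corrected test: masks with bit 20 only.
constexpr bool fixed_check(std::uint32_t ecx)
{
    return (ecx & sse42_mask) != 0;
}

static_assert(20 == ((1 << 2) | (1 << 4)),
    "decimal 20 is bits 2 and 4");
static_assert(broken_check(std::uint32_t{1} << 2),
    "broken check fires on an unrelated feature bit");
static_assert(! broken_check(sse42_mask),
    "broken check ignores the real SSE4.2 bit");
static_assert(fixed_check(sse42_mask),
    "fixed check sees the SSE4.2 bit");
static_assert(! fixed_check(std::uint32_t{1} << 2),
    "fixed check ignores unrelated bits");
```

So the broken version can return true on a CPU that has either of those two unrelated features set, whether or not SSE4.2 is present.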

Good catch! :-)

So the good news is builtin_cpu_supports is at least 10x faster than cpuid.

I think this is not true. Maybe wandbox runs on virtualised hardware...
On native hardware i get this:

by_cpuid: true, 169ns

WOW so you're telling me I got 20% and that's without SSE4.2?!

Probably the broken check also returned true (by chance) and so SSE4.2 got used after all.

Oh :(

Are these the corresponding benchmarks?

  • beast.benchmarks.buffers
  • beast.benchmarks.parser

Yes, you can open buffers.cpp and just comment out the line

BEAST_DEFINE_TESTSUITE(buffers,benchmarks,beast);

To make the tests run faster (buffers doesn't benefit from intrinsics)

Some results here (Linux, gcc-7):

587ms / 668ms = 88% ==> -12% execution time

Without _mm_cmpestri

beast.benchmarks.buffers

count=1024, size=1024            prepare      with hint         random
multi_buffer            :     12120 MB/s     11901 MB/s     11309 MB/s   63ms
flat_buffer             :      6945 MB/s      6442 MB/s      5331 MB/s   121ms
boost::asio::streambuf  :      6790 MB/s      7642 MB/s      5480 MB/s   115ms

count=512, size=4096             prepare      with hint         random
multi_buffer            :     24059 MB/s     23551 MB/s     21762 MB/s   64ms
flat_buffer             :      9961 MB/s      9970 MB/s      7575 MB/s   166ms
boost::asio::streambuf  :      8902 MB/s      8592 MB/s      6480 MB/s   191ms

count=256, size=32768            prepare      with hint         random
multi_buffer            :     28133 MB/s     28111 MB/s     27259 MB/s   215ms
flat_buffer             :      8054 MB/s      7910 MB/s      6892 MB/s   792ms
boost::asio::streambuf  :      7085 MB/s      7079 MB/s      5472 MB/s   929ms

beast.benchmarks.parser
beast.benchmarks.parser Parser speed test, 342764KB in 1000000 messages
sizeof(request parser)  == 56
sizeof(response parser) == 56
http::basic_parser
Trial 1: 668 ms
Trial 2: 668 ms
Trial 3: 668 ms
Trial 4: 668 ms
Trial 5: 668 ms
Trial 6: 669 ms
Trial 7: 668 ms
Trial 8: 668 ms
Trial 9: 668 ms
Trial 10: 668 ms
Longest suite times:
    8.2s beast.benchmarks.buffers
    6.7s beast.benchmarks.parser
14.9s, 2 suites, 3 cases, 10002003 tests total, 0 failures

With _mm_cmpestri

beast.benchmarks.buffers

count=1024, size=1024            prepare      with hint         random
multi_buffer            :     12189 MB/s     11913 MB/s     11245 MB/s   63ms
flat_buffer             :      6916 MB/s      6410 MB/s      5313 MB/s   122ms
boost::asio::streambuf  :      8003 MB/s      8233 MB/s      5754 MB/s   105ms

count=512, size=4096             prepare      with hint         random
multi_buffer            :     25245 MB/s     25024 MB/s     22533 MB/s   61ms
flat_buffer             :     10433 MB/s     10461 MB/s      7826 MB/s   159ms
boost::asio::streambuf  :      9317 MB/s      8882 MB/s      6767 MB/s   183ms

count=256, size=32768            prepare      with hint         random
multi_buffer            :     29570 MB/s     29394 MB/s     27949 MB/s   207ms
flat_buffer             :      8429 MB/s      8422 MB/s      6873 MB/s   766ms
boost::asio::streambuf  :      7381 MB/s      7408 MB/s      5628 MB/s   895ms

beast.benchmarks.parser
beast.benchmarks.parser Parser speed test, 342764KB in 1000000 messages
sizeof(request parser)  == 56
sizeof(response parser) == 56
http::basic_parser
Trial 1: 586 ms
Trial 2: 587 ms
Trial 3: 587 ms
Trial 4: 587 ms
Trial 5: 587 ms
Trial 6: 587 ms
Trial 7: 587 ms
Trial 8: 587 ms
Trial 9: 587 ms
Trial 10: 587 ms
Longest suite times:
    7.9s beast.benchmarks.buffers
    5.9s beast.benchmarks.parser
13.7s, 2 suites, 3 cases, 10002003 tests total, 0 failures
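The headline figure can be rechecked from the parser trial times above. The sketch below is only arithmetic on the reported numbers; the helper names are hypothetical, and the throughput helper assumes that "KB" in the benchmark output means 1024 bytes. The time ratio comes out at about 0.879 (roughly 12.1% less time), which is the same thing as about 13.8% more messages per second.

```cpp
// Ratio of the run time with intrinsics to the run time without.
constexpr double time_ratio(double with_ms, double without_ms)
{
    return with_ms / without_ms;
}

// Relative throughput gain implied by the same two run times.
constexpr double throughput_gain(double with_ms, double without_ms)
{
    return without_ms / with_ms - 1.0;
}

// Parser throughput in MB/s, assuming 1 KB == 1024 bytes (an assumption).
constexpr double megabytes_per_second(double kilobytes, double ms)
{
    return (kilobytes / 1024.0) / (ms / 1000.0);
}

// Values taken from the benchmark output above.
constexpr double with_ms    = 587.0;     // typical trial with _mm_cmpestri
constexpr double without_ms = 668.0;     // typical trial without
constexpr double input_kb   = 342764.0;  // size of the parser test input

constexpr double ratio = time_ratio(with_ms, without_ms);       // about 0.879
constexpr double gain  = throughput_gain(with_ms, without_ms);  // about 0.138
constexpr double mbps_without = megabytes_per_second(input_kb, without_ms);
constexpr double mbps_with    = megabytes_per_second(input_kb, with_ms);
```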

Not bad at all!!! Note that the speedup will be more noticeable with real-world / longer headers, since the intrinsics process 16 bytes at a time.

What are the supported compilers?

  • msvc
  • gcc
  • clang

More?

For now msvc, gcc, clang, but when Beast becomes part of Boost it will go into the test matrix and I will try to support as many compilers as is reasonably possible without creating a ton of work.

As @mika-fischer mentioned: There could be something wrong with BEAST_NO_INTRINSICS.
Can we remove it for now? The supported compilers don't need it.

I would suggest for cpu_info.hpp

#include <cstdint>  // std::uint32_t

#if defined(_MSC_VER)
#include <intrin.h> // __cpuid
#else
#include <cpuid.h>  // __get_cpuid
#endif

namespace beast {
namespace detail {

/*  Portions from Boost,
    Copyright Andrey Semashev 2007 - 2015.
*/
template<class = void>
void
cpuid(
    std::uint32_t id,
    std::uint32_t& eax,
    std::uint32_t& ebx,
    std::uint32_t& ecx,
    std::uint32_t& edx)
{
#if defined(_MSC_VER)
    int regs[4];
    __cpuid(regs, id);
    eax = regs[0];
    ebx = regs[1];
    ecx = regs[2];
    edx = regs[3];
#else
    __get_cpuid(id,  &eax,  &ebx,  &ecx, &edx);
#endif
}

} // detail
} // beast

And for basic_parser.hpp

#include <boost/predef/hardware/simd.h>

#if (BOOST_HW_SIMD_X86 >= BOOST_HW_SIMD_X86_SSE4_2_VERSION)
#include <nmmintrin.h>
#endif

...

#if (BOOST_HW_SIMD_X86 >= BOOST_HW_SIMD_X86_SSE4_2_VERSION)

    std::pair<char const*, bool>
    find_fast(
        char const* buf,
        char const* buf_end,
        char const* ranges,
        size_t ranges_size)
    {
        bool found = false;

        if(BOOST_LIKELY(sse42_))
        {
            if(BOOST_LIKELY(buf_end - buf >= 16))
            {
                __m128i ranges16 = _mm_loadu_si128((__m128i const*)ranges);
                std::size_t left = (buf_end - buf) & ~15;
                do
                {
                    __m128i b16 = _mm_loadu_si128((__m128i const*)buf);
                    int r = _mm_cmpestri(ranges16, ranges_size, b16, 16,
                        _SIDD_LEAST_SIGNIFICANT | _SIDD_CMP_RANGES | _SIDD_UBYTE_OPS);
                    if(BOOST_UNLIKELY(r != 16))
                    {
                        buf += r;
                        found = true;
                        break;
                    }
                    buf += 16;
                    left -= 16;
                }
                while(BOOST_LIKELY(left != 0));
            }
        }
        return {buf, found};
    }

#else

    constexpr std::pair<char const*, bool>
    find_fast(
        char const* buf,
        char const* buf_end,
        char const* ranges,
        size_t ranges_size)
    {
        return {buf, false};
    }

#endif
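For readers unfamiliar with _mm_cmpestri in range mode, here is a scalar reference for what the loop above searches for, next to a guarded SSE4.2 variant. This is a sketch rather than Beast's code: the names find_in_ranges_scalar and find_in_ranges are hypothetical, the guard uses __SSE4_2__ directly instead of the Boost.Predef macros, and unlike find_fast it finishes the tail of the buffer with the scalar loop instead of leaving it to the caller. The ranges argument is a sequence of inclusive [low, high] byte pairs.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

#if defined(__SSE4_2__)
# include <cstring>
# include <nmmintrin.h>
#endif

// Scalar reference: first byte in [buf, buf_end) that falls inside any
// inclusive [low, high] pair stored in ranges.
inline std::pair<char const*, bool>
find_in_ranges_scalar(char const* buf, char const* buf_end,
                      char const* ranges, std::size_t ranges_size)
{
    assert(ranges_size % 2 == 0);
    for(; buf != buf_end; ++buf)
    {
        unsigned char const c = static_cast<unsigned char>(*buf);
        for(std::size_t i = 0; i + 1 < ranges_size; i += 2)
        {
            unsigned char const lo = static_cast<unsigned char>(ranges[i]);
            unsigned char const hi = static_cast<unsigned char>(ranges[i + 1]);
            if(c >= lo && c <= hi)
                return {buf, true};
        }
    }
    return {buf, false};
}

// Same result, using SSE4.2 for full 16-byte blocks when compiled in.
inline std::pair<char const*, bool>
find_in_ranges(char const* buf, char const* buf_end,
               char const* ranges, std::size_t ranges_size)
{
#if defined(__SSE4_2__)
    if(ranges_size <= 16)
    {
        // Copy into a 16-byte array so the vector load never reads past
        // the end of a shorter ranges table.
        unsigned char padded[16] = {};
        std::memcpy(padded, ranges, ranges_size);
        __m128i const ranges16 =
            _mm_loadu_si128(reinterpret_cast<__m128i const*>(padded));
        while(buf_end - buf >= 16)
        {
            __m128i const b16 =
                _mm_loadu_si128(reinterpret_cast<__m128i const*>(buf));
            int const r = _mm_cmpestri(
                ranges16, static_cast<int>(ranges_size), b16, 16,
                _SIDD_LEAST_SIGNIFICANT | _SIDD_CMP_RANGES | _SIDD_UBYTE_OPS);
            if(r != 16)
                return {buf + r, true};  // r is the index of the first match
            buf += 16;
        }
    }
#endif
    // Tail shorter than 16 bytes, or no SSE4.2 at compile time.
    return find_in_ranges_scalar(buf, buf_end, ranges, ranges_size);
}
```

Built without -msse4.2 (or an equivalent -march), only the scalar loop is compiled, so both functions behave identically and the comparison is a no-op.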

"There could be something wrong with BEAST_NO_INTRINSICS. Can we remove it for now? The supported compilers don't need it."

Hmm, no we need this macro but I don't mind if you change how it is set. There are some requirements:

  • <beast/core/detail/cpu_info.hpp> checks and maybe sets BEAST_NO_INTRINSICS

  • If the macro is already set, then use the value

  • If that macro is not set, then the first thing that cpu_info.hpp should do is set the macro based on the availability of intrinsics, look at the preprocessor directives for this (BOOST_HW_SIMD_X86 >= BOOST_HW_SIMD_X86_SSE4_2_VERSION) for example.

  • Everywhere there is code which might use intrinsics, check the macro, i.e. #if ! BEAST_NO_INTRINSICS.

Here's an example, probably wrong but you get the idea:

#if ! defined(BEAST_NO_INTRINSICS)
#include <boost/predef/hardware/simd.h>
# if (BOOST_HW_SIMD_X86 >= BOOST_HW_SIMD_X86_SSE4_2_VERSION)
#  define BEAST_NO_INTRINSICS 0
# else
#  define BEAST_NO_INTRINSICS 1
# endif
#endif

Users need to be able to disable intrinsics if they want, by adding -DBEAST_NO_INTRINSICS=1 to their compile flags.
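Put together, the requirements above amount to two gates: a compile-time macro that the user can override, and a run-time flag. A minimal sketch of how the two might combine follows. It uses __SSE4_2__ rather than Boost.Predef so that it stands alone, and the helper should_use_sse42 is a hypothetical name, not part of Beast.

```cpp
// Gate 1: compile time. A value passed on the command line, for example
// -DBEAST_NO_INTRINSICS=1, wins over the automatic detection.
#ifndef BEAST_NO_INTRINSICS
# if defined(__SSE4_2__)
#  define BEAST_NO_INTRINSICS 0
# else
#  define BEAST_NO_INTRINSICS 1
# endif
#endif

#if ! BEAST_NO_INTRINSICS
# include <nmmintrin.h>  // only pulled in when the intrinsics may be used
#endif

// Gate 2: run time. The intrinsic path is taken only if it was compiled in
// AND the CPU reported SSE4.2 (hypothetical helper for illustration).
constexpr bool should_use_sse42(bool cpu_has_sse42)
{
    return ! BEAST_NO_INTRINSICS && cpu_has_sse42;
}
```

With this arrangement a build made with -DBEAST_NO_INTRINSICS=1 never takes the intrinsic path, regardless of what the CPU reports.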

Please open a pull request :)

Tried v73. SSE4.2 optimization was not enabled (here on Linux).
I am building with -march=native and CPU has SSE4.2.

So can we try another easy variant? Look:

mike@workstation:~/workspace/Beast$ clang++ -msse4.2 -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __SSE_MATH__ 1
#define __SSSE3__ 1

mike@workstation:~/workspace/Beast$ g++ -msse4.2 -dM -E - < /dev/null | egrep "SSE|AVX" | sort
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __SSE_MATH__ 1
#define __SSSE3__ 1

So what about this?

#ifndef BEAST_NO_INTRINSICS
# if defined(_MSC_VER) || defined(__SSE4_2__)
#  define BEAST_NO_INTRINSICS 0
# else
#  define BEAST_NO_INTRINSICS 1
# endif
#endif

I'm in favor of anything that works :)

Here it works fine...
So i could prepare a new pull request. Branch 74?

@octopus-prime https://github.com/vinniefalco/Beast/issues/585#issuecomment-312804972 :-)

The code had to be turned off due to licensing issues anyway. It will come back later in the year.
