Zstd: Potential performance regressions from use of __builtin_expect()

Created on 16 Aug 2020  路  9Comments  路  Source: facebook/zstd

Describe the bug
Clang prints warnings about potential performance regressions with -Wexpect.

To Reproduce
Steps to reproduce the behavior:
Build zstd-pgo with clang and -Wmisexpect. Clang prints warnings about potential performance regressions from use of __builtin_expect().

Expected behavior
Correct usage of __builtin_expect(). No perf regressions.

Screenshots and charts

../lib/decompress/zstd_decompress_block.c:874:17: warning: Potential performance regression from use of __builtin_expect(): Annotation was correct on 90.49% (1747623 / 1931364) of profiled executions. [-Wmisexpect]
if (LIKELY((ofBits == 0))) {
^
../lib/decompress/zstd_decompress_block.c:873:37: warning: Potential performance regression from use of __builtin_expect(): Annotation was correct on 18.57% (324470 / 1747623) of profiled executions. [-Wmisexpect]
U32 const ll0 = (llBase == 0);
^
../lib/decompress/zstd_decompress_block.c:743:9: warning: Potential performance regression from use of __builtin_expect(): Annotation was correct on 87.83% (399520478 / 454868225) of profiled executions. [-Wmisexpect]
if (UNLIKELY(sequence.litLength > 16)) {

Only second warning seems relevant, code at that line:

            U32 const ll0 = (llBase == 0);
            if (LIKELY((ofBits == 0))) {
                if (LIKELY(!ll0))
                    offset = seqState->prevOffset[0];
                else {
                    offset = seqState->prevOffset[1];
                    seqState->prevOffset[1] = seqState->prevOffset[0];
                    seqState->prevOffset[0] = offset;
                }

Desktop (please complete the following information):
Ubuntu 18.04 LTS

All 9 comments

Thanks for report @davidbolvansky !

These stats can swing pretty widely depending on which content is being decompressed.
For the benefit of reproduction, I have a few questions :
what was the file being compressed ? What are the compression parameters ?

cd /programs
CC="clang" make zstd-pgo

./zstd -b19i1
19#Synthetic 50% : 10000000 -> 3146081 (3.179), 1.78 MB/s ,1077.9 MB/s
./zstd -b16i1
16#Synthetic 50% : 10000000 -> 3083182 (3.243), 4.96 MB/s ,1305.1 MB/s
./zstd -b9i2
9#Synthetic 50% : 10000000 -> 3321085 (3.011), 19.8 MB/s , 685.2 MB/s
./zstd -b
3#Synthetic 50% : 10000000 -> 3231410 (3.095), 147.2 MB/s ,1035.3 MB/s
./zstd -b7i2
7#Synthetic 50% : 10000000 -> 3324298 (3.008), 20.2 MB/s , 804.8 MB/s
./zstd -b5
5#Synthetic 50% : 10000000 -> 3335646 (2.998), 39.3 MB/s , 891.0 MB/s

Thanks for feedback @davidbolvansky ,
ok these are synthetic samples.

These samples "try" to mimic the behavior of a compressible source, but when it comes to specific metrics, such as the probability of !ll0 when ofBits==0, they may come short, and generate patterns which are not that representative of real data.

I had a cursory look at it, using various sources, and it seems that we would simply be better off by removing any expect directive on line 875. Measurement can be complex, especially as code alignment side effect can result in larger impact than the code modification itself. I never noticed a big speed difference, but it leans towards positive territory, in the ~1% range (almost within noise margin).

As this specific segment of code was updated in #1892, it's likely better to let @terrelln have a look at it first.

Data for PROFILE_WITH=silesia/*

../lib/decompress/../common/bitstream.h:409:9: warning: Potential performance regression from use of __builtin_expect(): Annotation was correct on 99.94% (60549783 / 60587310) of profiled executions. [-Wmisexpect]
if (UNLIKELY(bitD->ptr < bitD->limitPtr))
^
../lib/decompress/../common/bitstream.h:409:9: warning: Potential performance regression from use of __builtin_expect(): Annotation was correct on 99.94% (60549684 / 60587310) of profiled executions. [-Wmisexpect]
../lib/decompress/../common/bitstream.h:409:9: warning: Potential performance regression from use of __builtin_expect(): Annotation was correct on 99.94% (60550435 / 60587310) of profiled executions. [-Wmisexpect]
../lib/decompress/../common/bitstream.h:409:9: warning: Potential performance regression from use of __builtin_expect(): Annotation was correct on 99.90% (60527152 / 60587310) of profiled executions. [-Wmisexpect]
4 warnings generated.
../lib/decompress/zstd_decompress_block.c:874:17: warning: Potential performance regression from use of __builtin_expect(): Annotation was correct on 95.92% (97629335 / 101785589) of profiled executions. [-Wmisexpect]
if (LIKELY((ofBits == 0))) {
^
../lib/decompress/zstd_decompress_block.c:873:37: warning: Potential performance regression from use of __builtin_expect(): Annotation was correct on 19.23% (18774128 / 97629335) of profiled executions. [-Wmisexpect]
U32 const ll0 = (llBase == 0);
^
../lib/decompress/zstd_decompress_block.c:743:9: warning: Potential performance regression from use of __builtin_expect(): Annotation was correct on 99.04% (710227257 / 717119407) of profiled executions. [-Wmisexpect]
if (UNLIKELY(sequence.litLength > 16)) {
^
../lib/decompress/zstd_decompress_block.c:774:9: warning: Potential performance regression from use of __builtin_expect(): Annotation was correct on 98.26% (704648073 / 717119407) of profiled executions. [-Wmisexpect]
if (LIKELY(sequence.offset >= WILDCOPY_VECLEN)) {

Thanks @davidbolvansky , it's a nice confirmation.

Just for my understanding,
I'm a bit confused by several warning messages,
when report is also able to generate something like :

warning: Potential performance regression from use of __builtin_expect(): 
Annotation was correct on 99.94% (60549783 / 60587310) of profiled executions

I mean, 99.94%, this should be pretty good. What's the issue ? Is every expect statement condemned to be flagged by -Wmisexpect ?

Is every expect statement condemned to be flagged by -Wmisexpect ?

Yeah, sadly yes. I created PR at llvm bugzilla to be able set threshold value when to not warn - like -Wexpect=30.

Zstd has only one suspicious case with 19%.

We can try to remove the !ll0 likely statement. They're generally only added when they impacted performance, but some of them could have just caused better alignment, rather the impacting performance directly, so could be removed.

I doubt there will be significant performance changes (excepting alignment issues). This code is only executed for repcodes, which generally aren't very common (make up maybe <= 10%). Though it may matter on particular types of data that are very repcode heavy.

I've measured the impact of the first two changes independently and together. They generally can help clang, but both all the combinations hurt gcc performance. I've experimented with branch-free repcode updates using vector permutations, but so far haven't found anything that improves performance. I bet there is a way to do it, but I haven't figured it out yet.

The third suggestion seems to help clang, and be neutral for gcc. It might just be alignment. I will measure it on a few more platforms and put up a diff if it continues to help clang.

The 3rd suggestion was neutral or maybe slightly positive but within measurement noise on the other two platforms. So I think we just leave it as-is, until we get more definitive signal that it is a positive change, not just alignment noise.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

ga92yup picture ga92yup  路  3Comments

Hedda picture Hedda  路  4Comments

robert3005 picture robert3005  路  4Comments

rgdoliveira picture rgdoliveira  路  3Comments

pjebs picture pjebs  路  3Comments