Rust: 2x performance regression on nightly 2018-11-10 vs 2018-11-11

Created on 16 Nov 2018 · 18 Comments · Source: rust-lang/rust

I noticed a 2x performance regression in the nightlies while writing new code for num-integer.
I bisected the regression to 2018-11-11.

To reproduce:

rustup install nightly-2018-11-10
rustup install nightly-2018-11-11
git clone https://github.com/mohrezaei/num-integer
cd num-integer
set your RUSTFLAGS="-C target-cpu=native"
cargo +nightly-2018-11-10 bench benchp10_u64_10000_random
cargo +nightly-2018-11-11 bench benchp10_u64_10000_random

The first line of the benchmark output shows the regression. On my machine it goes from ~4300 ns/iter to ~9300 ns/iter.

A-LLVM I-slow

All 18 comments

Extra info, in case it's relevant:

Default host: x86_64-pc-windows-msvc

installed toolchains
--------------------

stable-x86_64-pc-windows-msvc
nightly-2018-11-10-x86_64-pc-windows-msvc
nightly-2018-11-11-x86_64-pc-windows-msvc
nightly-x86_64-pc-windows-msvc

active toolchain
----------------

stable-x86_64-pc-windows-msvc (default)
rustc 1.30.1 (1433507eb 2018-11-07)
> rustup run nightly-2018-11-10 rustc --version
rustc 1.32.0-nightly (36a50c29f 2018-11-09)
> rustup run nightly-2018-11-11 rustc --version
rustc 1.32.0-nightly (6e9b84296 2018-11-10)

@ljedrz I noticed you did a lot of LLVM backend work on 2018-11-10. Do you know if this is related/expected?

I don't think so; those changes were mostly ergonomics.

@ljedrz thanks for taking a look.

I can see a different PR you could take a look at: https://github.com/rust-lang/rust/pull/55650. It has a change related to integer shifts.

Taking the diff between the two nightly hashes: https://github.com/rust-lang/rust/compare/36a50c29f...6e9b84296

The only change affecting codegen in there (not counting emscripten) is #55650. And seeing how https://github.com/mohrezaei/num-integer/commit/5be6ea1efbd504935022d2183dab303f3818b154#diff-1d6fa51a8a2182204ba56d113464a194R80 uses rotate_right, this is quite likely indeed the culprit.

@nikic @ljedrz I think you're right. Edit: It's either that or an inlining issue. When I try --emit=asm, I get the same slow results because --emit=asm disables LTO. I'll have a closer look at rotate.

Apparently this only happens with -C target-cpu=native (I really wish RUSTFLAGS were not an env variable; it makes keeping track of these things very hard).

The optimizations happening here are really amazing. We have a simple piece of code:

// POWER10_HASH_U64 is a 32-entry lookup table defined elsewhere in num-integer
fn is_pow10_u64(v: u64) -> bool {
    let mut hash: u32 = v as u32;
    hash ^= hash >> 3;
    hash = hash.rotate_right(1);
    hash ^= hash >> 23;
    v == POWER10_HASH_U64[(hash & 31) as usize]
}

The two versions produce very different assembly from this. 2018-11-10 performs a very clever optimization (described below) and then manages to vectorize the optimized code. 2018-11-11 translates the code nearly verbatim and doesn't do any vectorization, just plain unrolling.

This is the relevant assembly for that function, annotated:

before (11-10):
    movq    (%r10), %rdx
    addq    $8, %r10
    movl    %edx, %edi
    shrl    $3, %edi // hash >> 3
    xorl    %edx, %edi
    movl    %edi, %eax
    shrl    %eax // hash >> 1 !? see below
    shrl    $24, %edi // hash >> 24 !? see below
    xorl    %eax, %edi
    andl    $31, %edi // hash & 31
    xorl    %eax, %eax
    cmpq    %rdx, (%r8,%rdi,8)
    sete    %al
    addl    %eax, %ecx
    cmpq    %r10, %r9


after (11-11):
    movq    (%r11,%rdx,8), %rcx
    movl    %ecx, %esi
    shrl    $3, %esi // hash >> 3
    xorl    %ecx, %esi
    rorl    $1, %esi // hash.rotate_right(1)
    movl    %esi, %eax
    shrl    $23, %eax // hash >> 23
    xorl    %esi, %eax
    andl    $31, %eax // hash & 31
    xorl    %esi, %esi
    cmpq    %rcx, (%r9,%rax,8)
    sete    %sil
    addl    %esi, %edi
    addq    $1, %rdx
    cmpq    %rdx, %r8

You'll recognize the constants in the 11-11 output pretty easily, but the 11-10 output looks bizarre as it doesn't have those constants. The optimization performed in 11-10, which is missing in 11-11, is rewriting the function like so:

fn is_pow10_u64(v: u64) -> bool {
    let mut hash: u32 = v as u32;
    hash ^= hash >> 3;
    hash = (hash >> 24) ^ (hash >> 1);
    v == POWER10_HASH_U64[(hash & 31) as usize]
}

Which is why there is no rotate or shift by 23. If I hand-optimize the method, both 11-10 and 11-11 produce the same speed via SIMD vectorization.

Hope that helps narrow this down. That's as far as I can take the analysis.
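To double-check the rewrite above: only the low 5 bits of the hash are demanded by the final `& 31` mask, and under that mask `rotate_right(1)` followed by `^ (hash >> 23)` collapses to `(hash >> 24) ^ (hash >> 1)`. A standalone sketch of mine (not code from num-integer) confirming the two hash computations select the same table index:

```rust
// Hash as written in num-integer, reduced to the demanded index bits.
fn hash_original(v: u32) -> u32 {
    let mut hash = v;
    hash ^= hash >> 3;
    hash = hash.rotate_right(1);
    hash ^= hash >> 23;
    hash & 31
}

// Hash after the demanded-bits rewrite performed by the 11-10 compiler.
fn hash_rewritten(v: u32) -> u32 {
    let mut hash = v;
    hash ^= hash >> 3;
    hash = (hash >> 24) ^ (hash >> 1);
    hash & 31
}

fn main() {
    // Sample a spread of values, including boundary patterns; the two
    // hashes must agree because only the low 5 bits are demanded.
    for i in 0..=1000u32 {
        for &base in &[0u32, 1, 0x1234_5678, 0xdead_beef, u32::MAX - 1000] {
            let v = base.wrapping_add(i).wrapping_mul(0x9e37_79b9);
            assert_eq!(hash_original(v), hash_rewritten(v));
        }
    }
    println!("hashes agree");
}
```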

Okay, looks like demanded bits simplification is not performed for funnel shifts yet. I can submit a patch for LLVM, but as it's going to take some time until that propagates back to us, it's probably best to revert the use of funnel shift intrinsics for now. It should be enough to drop the calls in libcore, the rest of the implementation work can stay.
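For reference, a rotate can be expressed with plain shifts and an or, which is roughly what dropping the funnel-shift intrinsic calls falls back to. A sketch under that assumption (not the actual libcore implementation):

```rust
// Shift-based right rotate, roughly what a rotate lowers to without
// the llvm.fshr funnel-shift intrinsic. The modular arithmetic keeps
// both shift amounts in 0..32, so n == 0 and n == 32 are well-defined.
fn rotate_right_u32(x: u32, n: u32) -> u32 {
    let n = n % 32;
    (x >> n) | (x << (32u32.wrapping_sub(n) % 32))
}

fn main() {
    assert_eq!(rotate_right_u32(0b1011, 1), 0x8000_0005);
    assert_eq!(rotate_right_u32(0xdead_beef, 0), 0xdead_beef);
    assert_eq!(rotate_right_u32(1, 32), 1);
    println!("ok");
}
```

This shift form is exactly the shape the existing demanded-bits simplification already understands, which is why reverting the intrinsic calls recovers the optimization.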

@nikic we definitely can update our LLVM mid-cycle, though I don't think funnel shifts justify that.

I've submitted https://reviews.llvm.org/D54666 upstream. Unfortunately it's not actually able to handle the case reported here, because the rotate result has more than one use.

Some support for funnel shift vectorization has also landed recently, not sure if the LLVM upgrade in #55835 has those changes yet.

Just a random thought here: would it make sense to have an optimization that converts rotates into shifts first (based on demanded bits)?

Never mind, I see the LLVM code is already meant to do that.

Does the original code for this still exist? I wanted to check whether the recent LLVM update mitigates this by vectorizing the rotation, but the current code in num-integer doesn't seem to use rotations anymore.

I did make a branch locally to preserve it. I've now pushed it to GitHub. You'll have to clone my fork and run git checkout 5be6ea1

https://reviews.llvm.org/D55563 should be the last piece to handle the case encountered here.

I've also tested locally, and the last LLVM upgrade didn't change anything here wrt vectorization. Locally I can only reproduce a 20% rather than 2x difference, probably my CPU is too old.

The recent LLVM update pulled in the fix mentioned above. I've checked in the asm that the relaxation from rotate to shift does work now.

Unfortunately I can no longer reproduce any perf difference between nightly-2018-11-10, nightly-2018-11-11 and current nightly using -C target-cpu=native. I remember seeing a difference when I tried this before, but now it's all the same.

Maybe you could give this another try and check whether you get comparable performance from nightly-2018-11-10 and the current one?

Looks good. Procedure to reproduce:

rustup install nightly-2018-11-10
rustup install nightly-2018-11-11
rustup install nightly-2019-01-29
git clone https://github.com/mohrezaei/num-integer
cd num-integer
git checkout 5be6ea1
set your RUSTFLAGS="-C target-cpu=native"
cargo +nightly-2018-11-10 bench benchp10_u64_10000_random
cargo +nightly-2018-11-11 bench benchp10_u64_10000_random
cargo +nightly-2019-01-29 bench benchp10_u64_10000_random

My results are as follows:

2018-11-10:
test benchp10_u64_10000_random        ... bench:       4,176 ns/iter (+/- 53)

2018-11-11:
test benchp10_u64_10000_random        ... bench:       8,323 ns/iter (+/- 14)

2019-01-29:
test benchp10_u64_10000_random        ... bench:       4,241 ns/iter (+/- 26)

Thanks for following up on this and feel free to close the ticket when appropriate.
:smile:
