I noticed a 2x performance regression in the nightlies while writing new code for num-integer.
I bisected the regression to 2018-11-11.
To reproduce:
rustup install nightly-2018-11-10
rustup install nightly-2018-11-11
git clone https://github.com/mohrezaei/num-integer
cd num-integer
set your RUSTFLAGS="-C target-cpu=native"
cargo +nightly-2018-11-10 bench benchp10_u64_10000_random
cargo +nightly-2018-11-11 bench benchp10_u64_10000_random
The first line of the benchmark output shows the regression. On my machine it goes from ~4300 ns/iter to ~9300 ns/iter.
Extra info, in case it's relevant:
Default host: x86_64-pc-windows-msvc
installed toolchains
--------------------
stable-x86_64-pc-windows-msvc
nightly-2018-11-10-x86_64-pc-windows-msvc
nightly-2018-11-11-x86_64-pc-windows-msvc
nightly-x86_64-pc-windows-msvc
active toolchain
----------------
stable-x86_64-pc-windows-msvc (default)
rustc 1.30.1 (1433507eb 2018-11-07)
> rustup run nightly-2018-11-10 rustc --version
rustc 1.32.0-nightly (36a50c29f 2018-11-09)
> rustup run nightly-2018-11-11 rustc --version
rustc 1.32.0-nightly (6e9b84296 2018-11-10)
@ljedrz I noticed you did a lot of LLVM backend work on 2018-11-10. Do you know if this is related/expected?
I don't think so, these were mostly ergonomics.
@ljedrz thanks for taking a look.
I can see a different PR you could take a look at: https://github.com/rust-lang/rust/pull/55650. It has a change related to integer shifts.
Taking the diff between the two nightly hashes: https://github.com/rust-lang/rust/compare/36a50c29f...6e9b84296
The only change affecting codegen in there (not counting emscripten) is #55650. And seeing how https://github.com/mohrezaei/num-integer/commit/5be6ea1efbd504935022d2183dab303f3818b154#diff-1d6fa51a8a2182204ba56d113464a194R80 uses rotate_right, this is quite likely indeed the culprit.
@nikic @ljedrz I think you're right; I'll have a closer look at rotate. Edit: It's either that or an inlining issue. When I try --emit=asm, I get the same slow results on both toolchains, because --emit=asm disables LTO.
Apparently this only happens with -C target-cpu=native (I really wish RUSTFLAGS were not an env variable; it makes keeping track of these things very hard).
The optimizations happening here are really amazing. We have a simple piece of code:
fn is_pow10_u64(v: u64) -> bool {
    let mut hash: u32 = v as u32;
    hash ^= hash >> 3;
    hash = hash.rotate_right(1);
    hash ^= hash >> 23;
    v == POWER10_HASH_U64[(hash & 31) as usize]
}
The two versions produce very different assembly from this. 2018-11-10 does a very clever optimization (described below) and then manages to vectorize the optimized code. 2018-11-11 translates the code nearly verbatim and doesn't do any vectorization, just plain unrolling.
This is the relevant assembly for that function, annotated:
before (11-10):
movq (%r10), %rdx
addq $8, %r10
movl %edx, %edi
shrl $3, %edi // hash >> 3
xorl %edx, %edi
movl %edi, %eax
shrl %eax // hash >> 1 !? see below
shrl $24, %edi // hash >> 24 !? see below
xorl %eax, %edi
andl $31, %edi // hash & 31
xorl %eax, %eax
cmpq %rdx, (%r8,%rdi,8)
sete %al
addl %eax, %ecx
cmpq %r10, %r9
after (11-11):
movq (%r11,%rdx,8), %rcx
movl %ecx, %esi
shrl $3, %esi // hash >> 3
xorl %ecx, %esi
rorl $1, %esi // hash.rotate_right(1)
movl %esi, %eax
shrl $23, %eax // hash >> 23
xorl %esi, %eax
andl $31, %eax // hash & 31
xorl %esi, %esi
cmpq %rcx, (%r9,%rax,8)
sete %sil
addl %esi, %edi
addq $1, %rdx
cmpq %rdx, %r8
You'll recognize the constants in the 11-11 output pretty easily, but the 11-10 output looks bizarre because those constants are gone. The optimization performed in 11-10, which is missing in 11-11, is rewriting the function like so:
fn is_pow10_u64(v: u64) -> bool {
    let mut hash: u32 = v as u32;
    hash ^= hash >> 3;
    hash = (hash >> 24) ^ (hash >> 1);
    v == POWER10_HASH_U64[(hash & 31) as usize]
}
Which is why there is no rotate or shift by 23 in the 11-10 output. If I hand-optimize the method this way myself, both 11-10 and 11-11 produce the same speed via SIMD vectorization.
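For what it's worth, the rewrite is sound because the `& 31` mask only demands the low 5 bits of the hash, so the high bits contributed by the rotate and the shift by 23 never matter. A standalone sketch (not code from the crate; the function names here are my own) that checks both forms produce the same table index:

```rust
// Standalone check (not from num-integer): the demanded-bits rewrite is
// valid because only the low 5 bits of `hash` survive the `& 31` mask.

// Index computation as written in the crate, using rotate_right.
fn index_original(v: u64) -> usize {
    let mut hash: u32 = v as u32;
    hash ^= hash >> 3;
    hash = hash.rotate_right(1);
    hash ^= hash >> 23;
    (hash & 31) as usize
}

// Index computation as rewritten by the 11-10 optimizer: the rotate and
// the shift by 23 collapse into two plain shifts under the 5-bit mask.
fn index_rewritten(v: u64) -> usize {
    let mut hash: u32 = v as u32;
    hash ^= hash >> 3;
    hash = (hash >> 24) ^ (hash >> 1);
    (hash & 31) as usize
}

fn main() {
    // Spot-check with the powers of ten and a few arbitrary values.
    let mut samples: Vec<u64> = (0..20).map(|i| 10u64.pow(i)).collect();
    samples.extend([0, 1, 7, 0xdead_beef, u64::MAX]);
    for v in samples {
        assert_eq!(index_original(v), index_rewritten(v));
    }
    println!("indices agree");
}
```

The reasoning: after the rotate, bit i (for i < 5) comes from bit i+1 of the pre-rotate hash, and bit i of the value shifted right by 23 comes from bit i+24, so masking with 31 leaves exactly `((h >> 1) ^ (h >> 24)) & 31`.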
Hope that helps narrow this down. That's as far as I can take the analysis.
Okay, looks like demanded bits simplification is not performed for funnel shifts yet. I can submit a patch for LLVM, but as it's going to take some time until that propagates back to us, it's probably best to revert the use of funnel shift intrinsics for now. It should be enough to drop the calls in libcore, the rest of the implementation work can stay.
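For context (my summary, not from the thread): #55650 made `rotate_right` lower to LLVM's funnel shift intrinsic `llvm.fshr`, which shifts the concatenation of two words right and keeps the low word; a rotate is the special case where both inputs are the same value. A sketch of the semantics in plain Rust (the function name `fshr32` is illustrative, not a real API):

```rust
// Illustrative model of llvm.fshr on 32-bit values: shift the 64-bit
// concatenation hi:lo right by s and return the low 32 bits.
fn fshr32(hi: u32, lo: u32, s: u32) -> u32 {
    let s = s & 31; // the shift amount is taken modulo the bit width
    if s == 0 {
        lo
    } else {
        (lo >> s) | (hi << (32 - s))
    }
}

fn main() {
    let x: u32 = 0x8000_0001;
    // A rotate is a funnel shift with both inputs equal.
    assert_eq!(fshr32(x, x, 1), x.rotate_right(1));
    assert_eq!(fshr32(x, x, 31), x.rotate_right(31));
    println!("fshr(x, x, s) == rotate_right(x, s)");
}
```

Because the rotate now arrives at the optimizer as an opaque intrinsic rather than the shift-or pattern, InstCombine's demanded-bits analysis no longer sees through it, which is why the `& 31` simplification above stopped firing.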
@nikic we definitely can update our LLVM mid-cycle, though I don't think funnel shifts justify that.
I've submitted https://reviews.llvm.org/D54666 upstream. Unfortunately it's not actually able to handle the case reported here, because the rotate result has more than one use.
Some support for funnel shift vectorization has also landed recently, not sure if the LLVM upgrade in #55835 has those changes yet.
Just a random thought here: would it make sense to have an optimization that converts rotates into shifts first (based on demanded bits)?
Never mind, I see the llvm code is already meant to do that.
Does the original code for this still exist? I wanted to check whether the recent LLVM update mitigates this by vectorizing the rotation, but the current code in num-integer doesn't seem to use rotations anymore.
I did make a branch locally to preserve it. I've now pushed it to github. You'll have to clone my fork and git checkout 5be6ea1
https://reviews.llvm.org/D55563 should be the last piece to handle the case encountered here.
I've also tested locally, and the last LLVM upgrade didn't change anything here wrt vectorization. Locally I can only reproduce a 20% rather than 2x difference, probably my CPU is too old.
The recent LLVM update pulled in the fix mentioned above. I've checked in the asm that the relaxation from rotate to shift does work now.
Unfortunately I can no longer reproduce any perf difference between nightly-2018-11-10, nightly-2018-11-11 and current nightly using -C target-cpu=native. I remember seeing a difference when I tried this before, but now it's all the same.
Maybe you could give this another try and check whether you get comparable performance from nightly-2018-11-10 and the current one?
Looks good. Procedure to reproduce:
rustup install nightly-2018-11-10
rustup install nightly-2018-11-11
rustup install nightly-2019-01-29
git clone https://github.com/mohrezaei/num-integer
cd num-integer
git checkout 5be6ea1
set your RUSTFLAGS="-C target-cpu=native"
cargo +nightly-2018-11-10 bench benchp10_u64_10000_random
cargo +nightly-2018-11-11 bench benchp10_u64_10000_random
cargo +nightly-2019-01-29 bench benchp10_u64_10000_random
My results are as follows:
2018-11-10:
test benchp10_u64_10000_random ... bench: 4,176 ns/iter (+/- 53)
2018-11-11:
test benchp10_u64_10000_random ... bench: 8,323 ns/iter (+/- 14)
2019-01-29:
test benchp10_u64_10000_random ... bench: 4,241 ns/iter (+/- 26)
Thanks for following up on this and feel free to close the ticket when appropriate.
:smile: