Rust: Small perf regression in #69635

Created on 4 Mar 2020 · 13 comments · Source: rust-lang/rust

#69635 (a rollup) caused a small performance regression. Since the regression has persisted through the last half dozen perf runs, it is likely not spurious. There are no outstanding perf runs near the regression (2020-03-02), so I am fairly confident in this attribution. That rollup consists of the following PRs:

  • #68682 (Add documentation to compiler intrinsics)
  • #69544 (Unrevert "Remove checked_add in Layout::repeat")
  • #69617 (constify mem::forget)
  • #69622 (Rename syntax in librustc_ast/README.md)
  • #69623 (stash API: remove panic to fix ICE.)
  • #69624 (Toolstate: Don't block beta week on already broken tools.)

I would not expect any of them to negatively impact performance, although I did do a perf run with an unchecked_add in Layout::repeat to no avail (see #69679). The regression is only noticeable in clean/patched incremental builds, so I don't think this is high priority, but it would be nice to know what's going on here.

cc @nnethercote @jonas-schievink @Mark-Simulacrum

I-compiletime T-compiler

All 13 comments

The perf results for #69241, the original revert of #67174, are basically the mirror of the regression mentioned above. #67220, the first rollup to include #67174, also caused a performance regression, but it was attributed to a different PR in that rollup.

Therefore, I'm confident that #67174/#69544 is the root cause of this regression.

#69679 suggests that the optimizer does not benefit from being able to assume that the addition at the start of Layout::repeat does not wrap. Assuming that the invariant cited in #67174 does in fact hold, my only hypothesis is that removing the ? operator makes Layout::repeat small enough to become eligible for inlining, which in turn makes one of its callers large enough to exceed the inlining threshold.
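For reference, here is a self-contained sketch of the checked variant under discussion, built on the public Layout API rather than the libcore internals (the helper names `padding_for` and `repeat_checked` are mine; the real code uses the unstable `padding_needed_for` and private fields):

```rust
use std::alloc::Layout;

// Rounds `size` up to the next multiple of `align` (a power of two) and
// returns the padding needed; this mirrors what padding_needed_for computes.
fn padding_for(size: usize, align: usize) -> usize {
    size.wrapping_neg() & (align - 1)
}

// Checked variant, as the code stood after the revert in #69544 was undone:
// the first addition goes through checked_add and `?` instead of a plain `+`.
fn repeat_checked(layout: Layout, n: usize) -> Option<(Layout, usize)> {
    let padded_size = layout
        .size()
        .checked_add(padding_for(layout.size(), layout.align()))?;
    let alloc_size = padded_size.checked_mul(n)?;
    let repeated = Layout::from_size_align(alloc_size, layout.align()).ok()?;
    Some((repeated, padded_size))
}

fn main() {
    // Three u32 elements: each is 4 bytes with no tail padding needed,
    // so the array layout is 12 bytes with alignment 4 and a 4-byte stride.
    let elem = Layout::new::<u32>();
    let (arr, padded) = repeat_checked(elem, 3).unwrap();
    println!("{} {} {}", arr.size(), arr.align(), padded); // prints "12 4 4"
}
```

The hypothesized faster variant differs only in replacing the first `checked_add(...)?` with an unchecked `+`, which shrinks the function body slightly.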

In any case, this seems like it should interest @nnethercote, since it indicates that the code in Layout is hot enough to benefit from micro-optimization. TBH, I'd also like to learn what approach they would use to investigate this kind of issue.

Here is some output from a Callgrind profile that I had lying around. This is for a Check-CleanIncr run of ucd, which saw a 4.6% regression:

        .               pub fn repeat(&self, n: usize) -> Result<(Self, usize), LayoutErr> {
        .                   // Warning, removing the checked_add here led to segfaults in #67174. Further
        .                   // analysis in #69225 seems to indicate that this is an LTO-related
        .                   // miscompilation, so #67174 might be able to be reapplied in the future.
        .                   let padded_size = self
        .                       .size()
        .                       .checked_add(self.padding_needed_for(self.align()))
        .                       .ok_or(LayoutErr { private: () })?;
1,313,433 ( 0.01%)          let alloc_size = padded_size.checked_mul(n).ok_or(LayoutErr { private: () })?;
        .
        .                   unsafe {
        .                       // self.align is already known to be valid and alloc_size has been
        .                       // padded already.
        .                       Ok((Layout::from_size_align_unchecked(alloc_size, self.align()), padded_size))
        .                   }
        .               }

It suggests two things:

  • This function is not at all hot, so it's surprising that changing it would have a large perf effect.
  • It (the old version) is getting inlined. (If it wasn't inlined, there would be non-zero instruction counts on the fn line and the closing brace line.)

I can do a proper analysis tomorrow if that would be helpful. That will likely involve running Cachegrind on ucd-Check-CleanIncr with both versions of the compiler, and then doing a diff with cg_diff, and seeing what that finds.

But it might just be simpler to revert the change again? First it caused miscompilation, then it caused a perf regression... maybe this code should just be left alone :)

I'll open a PR that reverts #69544. I'd like to spend some time investigating it as well, since I have no idea why this would affect performance and it seems like a good opportunity to learn. If you feel the urge to look into this, however, don't let me stop you! I'll be pretty slow.

@rust-timer queue

Awaiting bors try build completion

@rust-timer build 4bfc61c90b9066ee396739e074d6d05abcfda04e

Queued 4bfc61c90b9066ee396739e074d6d05abcfda04e with parent c79f5f064725535f7520e693e69c65c3d0f2730f, future comparison URL.

For anyone who's curious, here are the results of cg_diff master rerevert-67174 on a clean ucd-check run. Negative numbers mean the count decreased after reverting #67174.

Ir          I1mr   ILmr   Dr          D1mr      DLmr    Dw          D1mw    DLmw    
--------------------------------------------------------------------------------
-16,399,587 48,614 -1,909 -15,203,345 1,677,474 -18,799 -14,933,944 939,273 145,091  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir          I1mr   ILmr  Dr          D1mr      DLmr    Dw          D1mw    DLmw     file:function
--------------------------------------------------------------------------------
-12,941,229    644    12 -14,697,113    22,667  33,296 -14,697,093 196,198 103,505  /usr/src/debug/glibc-2.30-34-g994e529a37/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms
  5,834,640    309   116   1,361,416    22,558   8,013   1,944,880       0       0  ???:_ZN4core3ptr13drop_in_place17h2d334451ff604121E.llvm.3047232575757559832
 -5,834,640   -309   -93  -1,361,416   -22,305  -8,147  -1,944,880       0       0  ???:_ZN4core3ptr13drop_in_place17h2d334451ff604121E.llvm.14951046493754649665
 -4,704,001 -6,786  -275  -1,031,214   -29,888   6,301    -802,849  13,182   4,175  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:_int_malloc
  2,660,698  2,227     0     958,237        -6       0     481,776      20       0  ???:_ZN5cargo4util8progress5State5print17h25c4eb5dec857e24E.llvm.17714790172995471728
   -826,762  1,803   -15    -153,015   -82,049 -46,742    -153,358   1,001     647  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:malloc_consolidate
   -806,184   -651   -24    -201,566   -64,090       9      52,412  -6,989     -10  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:malloc
   -591,222  5,241   -46    -148,617    -4,060     753      84,645  10,875    -157  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:_int_free
    403,708     40    -2         510         0       0         255       0       0  ???:core::intrinsics::is_nonoverlapping
   -301,845   -275   -26    -115,970   -43,771 -44,619     -33,584       0       0  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:unlink_chunk.isra.0
    196,931    677   108      54,384    -2,842      -2      43,974   2,948     -19  ???:alloc::raw_vec::RawVec<T,A>::reserve
    186,666    802   -34      40,289    -4,337    -187      14,859     -53      -6  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:realloc
    146,690 -1,963   -44      31,542     7,476    -668      22,101     -61      -4  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:_int_realloc
    131,682    342     0      33,170        13       0      32,318      -1       0  ???:_ZN5cargo4util8progress5State4tick17h24f78f2baccd5adbE.llvm.17714790172995471728
    -58,890   -820   203     -14,193    58,285  -2,498      -3,741  -3,269      -2  ???:hashbrown::map::RawEntryBuilderMut<K,V,S>::from_hash
    -31,850    777   -10      -7,350      -721     -38           0       0       0  ???:hashbrown::raw::RawTable<T>::bucket
     25,476      0     0      10,615         0       0           0       0       0  /rustc/7760cd0fbbbf2c59a625e075a5bdfa88b8e30f8a//src/libstd/sys/unix/alloc.rs:__rdl_realloc
     16,984      1     0           0         0       0      10,615       0       0  /rustc/7760cd0fbbbf2c59a625e075a5bdfa88b8e30f8a//src/libstd/alloc.rs:__rdl_realloc

No idea how to attribute this. Maybe it's worth noting that alloc::RawVec::reserve calls Layout::repeat via the following chain:

alloc::RawVec::reserve
alloc::RawVec::reserve_internal
Layout::array
Layout::repeat
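The endpoint of that chain is observable from the public API: in current Rust, Layout::array (a thin wrapper over the internal repeat) is stable, and RawVec's reserve path is what makes it hot during compilation. A minimal demonstration:

```rust
use std::alloc::Layout;

fn main() {
    // What RawVec::reserve ultimately requests when growing a Vec<u64>
    // to capacity 8: 8 elements of 8 bytes each, alignment 8.
    let l = Layout::array::<u64>(8).unwrap();
    println!("{} {}", l.size(), l.align()); // prints "64 8"
}
```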

The Cachegrind diff basically says that the checked_add removal caused additional memcpy and malloc calls. I'm not sure what to make of that; I can't think of an explanation.

(BTW, I usually run Cachegrind with --cache-sim=no because 99% of the time I find instruction counts (Ir) to be the only useful metric.)

Finished benchmarking try commit 4bfc61c90b9066ee396739e074d6d05abcfda04e, comparison URL.

The performance difference might simply come down to the fact that different code is being compiled and optimized, since Layout is part of the standard library.

It would be interesting to measure compile time with a different core but exactly the same rustc.

The regression ultimately disappeared after #69879 (likely #69799).

@TimDiekmann It appears that minor changes to innocent-looking code in alloc.rs can result in relatively large perf changes. I suspect this will continue as alloc.rs is refactored. I think it would be a good idea to mark PRs that touch the allocator traits as rollup=never, or at least do a perf run on them, so we can more precisely attribute perf regressions. To clarify, this does not mean that changes should be blocked if they do indeed cause a regression.

The next refactoring of AllocRef also regressed performance slightly: #69889
