I would not expect any of them to negatively impact performance, although I did do a perf run with an `unchecked_add` in `Layout::repeat` to no avail (see #69679). The regression is only noticeable in clean/patched incremental builds, so I don't think this is high priority, but it would be nice to know what's going on here.
cc @nnethercote @jonas-schievink @Mark-Simulacrum
The perf results for #69241, the original revert of #67174, are basically the mirror of the regression mentioned above. #67220, the first rollup to include #67174, also caused a performance regression, but it was attributed to a different PR in that rollup.
Therefore, I'm confident that #67174/#69544 is the root cause of this regression.
`Layout::repeat` does not wrap. Assuming that the invariant cited in #67174 does in fact hold, my only hypothesis is that removing the `?` operator makes `Layout::repeat` eligible for inlining, which in turn makes one of its callers large enough to exceed the inlining threshold.

In any case, this seems like it should interest @nnethercote, since it indicates that the code in `Layout` is hot enough to benefit from micro-optimization. TBH, I'd also like to learn what approach they would use to investigate this kind of issue.
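For context on why `?` affects function size: each use of `?` on a `Result` desugars roughly to a match with an early return, so every `?` adds a branch plus an error-construction path to the function body, which counts against inlining heuristics. A minimal sketch (the function names here are made up for illustration):

```rust
// `a.checked_add(b)?` in a Result-returning function desugars roughly
// to this match with an early return. Removing the `?` (and the checked
// call) removes this branch and the Err path from the function body.
fn checked(a: usize, b: usize) -> Result<usize, ()> {
    let sum = match a.checked_add(b) {
        Some(s) => s,
        None => return Err(()),
    };
    Ok(sum)
}

fn main() {
    assert_eq!(checked(2, 3), Ok(5));
    assert_eq!(checked(usize::MAX, 1), Err(()));
    println!("ok");
}
```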
Here is some output from a Callgrind profile that I had lying around. This is for a Check-CleanIncr run of ucd, which saw a 4.6% regression:
```
        .      pub fn repeat(&self, n: usize) -> Result<(Self, usize), LayoutErr> {
        .          // Warning, removing the checked_add here led to segfaults in #67174. Further
        .          // analysis in #69225 seems to indicate that this is an LTO-related
        .          // miscompilation, so #67174 might be able to be reapplied in the future.
        .          let padded_size = self
        .              .size()
        .              .checked_add(self.padding_needed_for(self.align()))
        .              .ok_or(LayoutErr { private: () })?;
1,313,433 ( 0.01%) let alloc_size = padded_size.checked_mul(n).ok_or(LayoutErr { private: () })?;
        .
        .          unsafe {
        .              // self.align is already known to be valid and alloc_size has been
        .              // padded already.
        .              Ok((Layout::from_size_align_unchecked(alloc_size, self.align()), padded_size))
        .          }
        .      }
```
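For reference, the change under discussion swaps the `checked_add` on the padding for a plain add, relying on the invariant that size plus padding cannot overflow. A standalone sketch of both variants, using a stand-in `MiniLayout` type since `LayoutErr`'s field is private (names here are illustrative, not the real std internals):

```rust
// Stand-in types: not the real std::alloc types.
#[derive(Debug, PartialEq)]
struct LayoutError;

#[derive(Clone, Copy)]
struct MiniLayout {
    size: usize,
    align: usize, // assumed to be a power of two
}

impl MiniLayout {
    // Padding needed to round `size` up to a multiple of `align`.
    fn padding_needed_for(&self, align: usize) -> usize {
        let rounded = self.size.wrapping_add(align - 1) & !(align - 1);
        rounded.wrapping_sub(self.size)
    }

    /// Checked variant, as in the code quoted above: both the padding
    /// addition and the multiplication by `n` are checked.
    fn repeat_checked(&self, n: usize) -> Result<(usize, usize), LayoutError> {
        let padded_size = self
            .size
            .checked_add(self.padding_needed_for(self.align))
            .ok_or(LayoutError)?;
        let alloc_size = padded_size.checked_mul(n).ok_or(LayoutError)?;
        Ok((alloc_size, padded_size))
    }

    /// Plain-add variant, in the spirit of #67174: relies on the invariant
    /// that size + padding cannot overflow, so only the multiply is checked.
    fn repeat_plain_add(&self, n: usize) -> Result<(usize, usize), LayoutError> {
        let padded_size = self.size + self.padding_needed_for(self.align);
        let alloc_size = padded_size.checked_mul(n).ok_or(LayoutError)?;
        Ok((alloc_size, padded_size))
    }
}

fn main() {
    let layout = MiniLayout { size: 10, align: 4 };
    // 10 rounds up to 12; 12 * 3 = 36.
    assert_eq!(layout.repeat_checked(3), Ok((36, 12)));
    assert_eq!(layout.repeat_plain_add(3), Ok((36, 12)));
    // A huge count overflows the multiplication in both variants.
    assert_eq!(layout.repeat_checked(usize::MAX), Err(LayoutError));
    println!("ok");
}
```

Note the checked variant carries two `?` early-return paths, which is exactly what the inlining hypothesis above is about.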
It suggests two things: … the `fn` line and the closing brace line.)

I can do a proper analysis tomorrow if that would be helpful. That will likely involve running Cachegrind on ucd-Check-CleanIncr with both versions of the compiler, then diffing the results with `cg_diff` and seeing what that finds.
But it might just be simpler to revert the change again? First it caused miscompilation, then it caused a perf regression... maybe this code should just be left alone :)
I'll open a PR that reverts #69544. I'd like to spend some time investigating it as well, since I have no idea why this would affect performance and it seems like a good opportunity to learn. If you feel the urge to look into this, however, don't let me stop you! I'll be pretty slow.
@rust-timer queue
Awaiting bors try build completion
@rust-timer build 4bfc61c90b9066ee396739e074d6d05abcfda04e
Queued 4bfc61c90b9066ee396739e074d6d05abcfda04e with parent c79f5f064725535f7520e693e69c65c3d0f2730f, future comparison URL.
For anyone who's curious, here are the results of `cg_diff master rerevert-67174` on a clean ucd-check run. Negative numbers mean the count decreased after reverting #67174.
```
         Ir    I1mr  ILmr          Dr    D1mr    DLmr          Dw    D1mw    DLmw
--------------------------------------------------------------------------------
-16,399,587  48,614 -1,909 -15,203,345 1,677,474 -18,799 -14,933,944 939,273 145,091  PROGRAM TOTALS
--------------------------------------------------------------------------------
         Ir    I1mr  ILmr          Dr    D1mr    DLmr          Dw    D1mw    DLmw  file:function
--------------------------------------------------------------------------------
-12,941,229     644    12 -14,697,113  22,667  33,296 -14,697,093 196,198 103,505  /usr/src/debug/glibc-2.30-34-g994e529a37/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms
  5,834,640     309   116   1,361,416  22,558   8,013   1,944,880       0       0  ???:_ZN4core3ptr13drop_in_place17h2d334451ff604121E.llvm.3047232575757559832
 -5,834,640    -309   -93  -1,361,416 -22,305  -8,147  -1,944,880       0       0  ???:_ZN4core3ptr13drop_in_place17h2d334451ff604121E.llvm.14951046493754649665
 -4,704,001  -6,786  -275  -1,031,214 -29,888   6,301    -802,849  13,182   4,175  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:_int_malloc
  2,660,698   2,227     0     958,237      -6       0     481,776      20       0  ???:_ZN5cargo4util8progress5State5print17h25c4eb5dec857e24E.llvm.17714790172995471728
   -826,762   1,803   -15    -153,015 -82,049 -46,742    -153,358   1,001     647  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:malloc_consolidate
   -806,184    -651   -24    -201,566 -64,090       9      52,412  -6,989     -10  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:malloc
   -591,222   5,241   -46    -148,617  -4,060     753      84,645  10,875    -157  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:_int_free
    403,708      40    -2         510       0       0         255       0       0  ???:core::intrinsics::is_nonoverlapping
   -301,845    -275   -26    -115,970 -43,771 -44,619     -33,584       0       0  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:unlink_chunk.isra.0
    196,931     677   108      54,384  -2,842      -2      43,974   2,948     -19  ???:alloc::raw_vec::RawVec<T,A>::reserve
    186,666     802   -34      40,289  -4,337    -187      14,859     -53      -6  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:realloc
    146,690  -1,963   -44      31,542   7,476    -668      22,101     -61      -4  /usr/src/debug/glibc-2.30-34-g994e529a37/malloc/malloc.c:_int_realloc
    131,682     342     0      33,170      13       0      32,318      -1       0  ???:_ZN5cargo4util8progress5State4tick17h24f78f2baccd5adbE.llvm.17714790172995471728
    -58,890    -820   203     -14,193  58,285  -2,498      -3,741  -3,269      -2  ???:hashbrown::map::RawEntryBuilderMut<K,V,S>::from_hash
    -31,850     777   -10      -7,350    -721     -38           0       0       0  ???:hashbrown::raw::RawTable<T>::bucket
     25,476       0     0      10,615       0       0           0       0       0  /rustc/7760cd0fbbbf2c59a625e075a5bdfa88b8e30f8a//src/libstd/sys/unix/alloc.rs:__rdl_realloc
     16,984       1     0           0       0       0      10,615       0       0  /rustc/7760cd0fbbbf2c59a625e075a5bdfa88b8e30f8a//src/libstd/alloc.rs:__rdl_realloc
```
No idea how to attribute this. Maybe it's worth noting that `alloc::RawVec::reserve` calls `Layout::repeat` via the following chain:
```
alloc::RawVec::reserve
└─ alloc::RawVec::reserve_internal
   └─ Layout::array
      └─ Layout::repeat
```
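The top of that chain is reached from ordinary `Vec` growth, but the `Layout::array` step can also be exercised directly from stable Rust, which just illustrates the values flowing into `Layout::repeat` (this demonstrates the API, not the regression itself):

```rust
use std::alloc::Layout;

fn main() {
    // Layout::array computes the layout of [T; n]; at the time of this
    // issue it was implemented in terms of Layout::repeat.
    let layout = Layout::array::<u64>(4).unwrap();
    assert_eq!(layout.size(), 32); // 4 elements * 8 bytes each
    assert_eq!(layout.align(), 8);

    // The overflow path returns an error rather than wrapping, which is
    // the checked_mul branch in the profile above.
    assert!(Layout::array::<u64>(usize::MAX).is_err());
    println!("ok");
}
```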
The Cachegrind diff basically says that the `checked_add` removal caused additional memcpy and malloc calls. I'm not sure what to make of that; I can't think of an explanation.
(BTW, I usually run Cachegrind with --cache-sim=no because 99% of the time I find instruction counts (Ir) to be the only useful metric.)
Finished benchmarking try commit 4bfc61c90b9066ee396739e074d6d05abcfda04e, comparison URL.
The performance difference might simply come from the fact that different code is being compiled and optimized, since `Layout` is part of the standard library. It would be interesting to test compile times with a different `core` but exactly the same `rustc`.
The regression ultimately disappeared after #69879 (likely #69799).
@TimDiekmann It appears that minor changes to innocent-looking code in `alloc.rs` can result in relatively large perf changes. I suspect this will continue as `alloc.rs` is refactored. I think it would be a good idea to use `rollup=never`, or at least do a perf run, on PRs that touch the allocator traits so we can more precisely attribute perf regressions. To clarify, this does not mean that changes should be blocked if they do indeed cause a regression.
The next upcoming refactoring of AllocRef also regressed the performance slightly: #69889