Hi!
I am developing an application for an armv7 embedded device. I'm holding the application state in a stack-allocated enum, where one of the enum variations is 16k (total memory is 64k)
Switching to the huge app state, my microcontroller hard faults. I have played with a short repro on godbolt.org and I found that a particular pattern reserves stack space where I don't believe is necessary.
I tried this code:
pub fn init(a: usize) -> [usize; 1600] {
[a; 1600]
}
pub fn fun(bar: &mut [usize; 1600], w: bool, a: usize) {
*bar = init(a);
}
I expected to see this happen:
generated machine code fills bar directly
Instead, this happened:
generated machine code fills stack then copies to bar
Code generated with -C opt-level=3 --target=armv7-unknown-linux-gnueabihf on godbolt.org
example::fun:
push {r7, lr}
sub.w sp, sp, #2000 ; I don't like this stack alloc
movs r1, #0
mov r3, sp
.LBB1_1:
str r2, [r3, r1]
adds r1, #4
cmp.w r1, #2000
bne .LBB1_1
mov r1, sp
mov.w r2, #2000
bl __aeabi_memcpy4
add.w sp, sp, #2000
pop {r7, pc}
rustc --version --verbose:
1.43 and newer
fun has to pass a return value pointer to init, what you鈥檙e observing is that the return value pointer that gets passed is a pointer to a stack allocated temporary. The solution to this would be to detect situations in which return values are immediately written and pass the target of the write as the return address pointer instead.
This is a classic example of something that would be fixed by an NRVO pass.
@jonas-schievink is this something that the MIR optimisation wg is looking at?
edit
This comment is wrong. #72205 has no effect on this code.
FWIW, compiling this with #72205 applied results in the following assembly for fun (here renamed to foo) on x86_64.
00000000000502d0 <_ZN4test3foo17h1991a29ca1711cdfE>:
502d0: 41 56 push r14
502d2: 53 push rbx
502d3: 50 push rax
502d4: 49 89 f6 mov r14,rsi
502d7: 48 89 fb mov rbx,rdi
502da: ba 00 04 00 00 mov edx,0x400
502df: 31 f6 xor esi,esi
502e1: ff 15 d1 d1 0d 00 call QWORD PTR [rip+0xdd1d1] # 12d4b8 <memset@GLIBC_2.2.5>
502e7: 48 89 df mov rdi,rbx
502ea: 41 ff d6 call r14
502ed: 48 89 d8 mov rax,rbx
502f0: 48 83 c4 08 add rsp,0x8
502f4: 5b pop rbx
502f5: 41 5e pop r14
502f7: c3 ret
502f8: 0f 1f 84 00 00 00 00 nop DWORD PTR [rax+rax*1+0x0]
502ff: 00
Versus nightly:
00000000000492b0 <_ZN4test3foo17h1991a29ca1711cdfE>:
492b0: 41 57 push r15
492b2: 41 56 push r14
492b4: 53 push rbx
492b5: 48 81 ec 00 04 00 00 sub rsp,0x400
492bc: 49 89 f6 mov r14,rsi
492bf: 48 89 fb mov rbx,rdi
492c2: 49 89 e7 mov r15,rsp
492c5: ba 00 04 00 00 mov edx,0x400
492ca: 4c 89 ff mov rdi,r15
492cd: 31 f6 xor esi,esi
492cf: ff 15 43 93 0b 00 call QWORD PTR [rip+0xb9343] # 102618 <memset@GLIBC_2.2.5>
492d5: 4c 89 ff mov rdi,r15
492d8: 41 ff d6 call r14
492db: ba 00 04 00 00 mov edx,0x400
492e0: 48 89 df mov rdi,rbx
492e3: 4c 89 fe mov rsi,r15
492e6: ff 15 c4 95 0b 00 call QWORD PTR [rip+0xb95c4] # 1028b0 <memcpy@GLIBC_2.14>
492ec: 48 89 d8 mov rax,rbx
492ef: 48 81 c4 00 04 00 00 add rsp,0x400
492f6: 5b pop rbx
492f7: 41 5e pop r14
492f9: 41 5f pop r15
492fb: c3 ret
492fc: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
Generated with rustc --crate-type=dylib -O test.rs.
@ecstatic-morse How did you get those results, I cannot reproduce this simply merging #72205 and compiling the given code while RUSTC_STAGE=1 is set in the environment.
I also fail to see how NRVO would apply anyway. The extra copy is due to the temporary created in the caller, which uses an out-pointer and has a return type of (), so NRVO doesn't seem useful, and the callee is already writing straight to the return value anyway. Any pointers?
There is a redundant assignment in fun, which can be eliminated by destination propagation (we tend to call this NRVO, which we should probably fix since it isn't as limited).
I would not expect #72205 to catch that though, it intentionally only looks at assignments to _0.
How embarrassing. I can also no longer reproduce. I suspect I forgot to write the test.rs file before running the command I used to generate these results, which by the way is:
rustc +nightly --crate-type=dylib -O test.rs && objdump -d -Mintel libtest.so | awk -v RS='' '/foo/'
I believe I reposted the results of the nrvo-simple test introduced in #72205. Note the size of the stack allocation, which is 0x400 (1024) instead of 0x640 (1600). Thanks for checking my work.
If @jonas-schievink is correct and there is a missed optimization here, #72205 is too dumb to enable it.
@jonas-schievink so a slight misuse of terms that had me confused, ok :-)
@ecstatic-morse that explains it then. I think I have a well established history of getting slightly over-excited about my results and accidentally posting wrong results ;-)
There clearly is a missed optimization here, which the existing copy propagation pass could have picked up, if it wasn't limited to locals as destinations.
That said, I wonder why call slot propagation in LLVM doesn't handle it either. I'll try to look into that...
Ok, so inlining runs first, thus the call slot optimization never triggers here.
If we mark init() as inline(never), then the call slot optimization runs but fails, because the mutable reference bar doesn't get marked as noalias (see https://github.com/rust-lang/rust/issues/54878), so the optimizer can't be sure that init() doesn't access the memory referenced by bar.h Using -Zmutable-noalias=yes fixes this and gets rid of the memcpy at the cost of not having init() inlined.
When init() gets inlined, we unfortunately get IR like this:
loop_block:
; yadda yadda
br %done, label %memcpy_block, label %loop_block
memcpy_block:
call void @llvm.memcpy(...)
Which means that we have a memcpy dependency that crosses a basic block boundary, and LLVM's memcpy optimizer doesn't handle these. There have been attempts to make it do that (for example by Patrick Walton, years ago), but AFAIK nothing has made to into LLVM (yet?).
@dotdash No cross-BB memcpy opt in LLVM yet. The last attempt was reverted due to miscompiles, and this has not been followup up on. I believe the current stance on this is that it should only get handled once memcpyopt is migrated to MSSA. As there's ongoing work on MSSA-based DSE, I have some small hopes that this might actually happen.
Most helpful comment
edit
This comment is wrong. #72205 has no effect on this code.
FWIW, compiling this with #72205 applied results in the following assembly forfun(here renamed tofoo) on x86_64.Versus nightly:
Generated with
rustc --crate-type=dylib -O test.rs.