Rust: Huge stack allocation is generated when assigning a huge piece of memory to a reference

Created on 14 May 2020  路  10Comments  路  Source: rust-lang/rust

Hi!

I am developing an application for an armv7 embedded device. I'm holding the application state in a stack-allocated enum, where one of the enum variations is 16k (total memory is 64k)

Switching to the huge app state, my microcontroller hard faults. I have played with a short repro on godbolt.org and I found that a particular pattern reserves stack space where I don't believe is necessary.

I tried this code:

pub fn init(a: usize) -> [usize; 1600] {
    [a; 1600]
}

pub fn fun(bar: &mut [usize; 1600], w: bool, a: usize) {
    *bar = init(a);
}

I expected to see this happen:
generated machine code fills bar directly

Instead, this happened:
generated machine code fills stack then copies to bar

Code generated with -C opt-level=3 --target=armv7-unknown-linux-gnueabihf on godbolt.org

example::fun:
        push    {r7, lr}
        sub.w   sp, sp, #2000 ; I don't like this stack alloc
        movs    r1, #0
        mov     r3, sp
.LBB1_1:
        str     r2, [r3, r1]
        adds    r1, #4
        cmp.w   r1, #2000
        bne     .LBB1_1
        mov     r1, sp
        mov.w   r2, #2000
        bl      __aeabi_memcpy4
        add.w   sp, sp, #2000
        pop     {r7, pc}

rustc --version --verbose:

1.43 and newer
A-mir-opt A-mir-opt-nrvo C-enhancement I-heavy I-slow T-compiler

Most helpful comment

edit

This comment is wrong. #72205 has no effect on this code.

FWIW, compiling this with #72205 applied results in the following assembly for fun (here renamed to foo) on x86_64.

00000000000502d0 <_ZN4test3foo17h1991a29ca1711cdfE>:
   502d0:   41 56                   push   r14
   502d2:   53                      push   rbx
   502d3:   50                      push   rax
   502d4:   49 89 f6                mov    r14,rsi
   502d7:   48 89 fb                mov    rbx,rdi
   502da:   ba 00 04 00 00          mov    edx,0x400
   502df:   31 f6                   xor    esi,esi
   502e1:   ff 15 d1 d1 0d 00       call   QWORD PTR [rip+0xdd1d1]        # 12d4b8 <memset@GLIBC_2.2.5>
   502e7:   48 89 df                mov    rdi,rbx
   502ea:   41 ff d6                call   r14
   502ed:   48 89 d8                mov    rax,rbx
   502f0:   48 83 c4 08             add    rsp,0x8
   502f4:   5b                      pop    rbx
   502f5:   41 5e                   pop    r14
   502f7:   c3                      ret    
   502f8:   0f 1f 84 00 00 00 00    nop    DWORD PTR [rax+rax*1+0x0]
   502ff:   00 

Versus nightly:

00000000000492b0 <_ZN4test3foo17h1991a29ca1711cdfE>:
   492b0:   41 57                   push   r15
   492b2:   41 56                   push   r14
   492b4:   53                      push   rbx
   492b5:   48 81 ec 00 04 00 00    sub    rsp,0x400
   492bc:   49 89 f6                mov    r14,rsi
   492bf:   48 89 fb                mov    rbx,rdi
   492c2:   49 89 e7                mov    r15,rsp
   492c5:   ba 00 04 00 00          mov    edx,0x400
   492ca:   4c 89 ff                mov    rdi,r15
   492cd:   31 f6                   xor    esi,esi
   492cf:   ff 15 43 93 0b 00       call   QWORD PTR [rip+0xb9343]        # 102618 <memset@GLIBC_2.2.5>
   492d5:   4c 89 ff                mov    rdi,r15
   492d8:   41 ff d6                call   r14
   492db:   ba 00 04 00 00          mov    edx,0x400
   492e0:   48 89 df                mov    rdi,rbx
   492e3:   4c 89 fe                mov    rsi,r15
   492e6:   ff 15 c4 95 0b 00       call   QWORD PTR [rip+0xb95c4]        # 1028b0 <memcpy@GLIBC_2.14>
   492ec:   48 89 d8                mov    rax,rbx
   492ef:   48 81 c4 00 04 00 00    add    rsp,0x400
   492f6:   5b                      pop    rbx
   492f7:   41 5e                   pop    r14
   492f9:   41 5f                   pop    r15
   492fb:   c3                      ret    
   492fc:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]

Generated with rustc --crate-type=dylib -O test.rs.

All 10 comments

fun has to pass a return value pointer to init, what you鈥檙e observing is that the return value pointer that gets passed is a pointer to a stack allocated temporary. The solution to this would be to detect situations in which return values are immediately written and pass the target of the write as the return address pointer instead.

This is a classic example of something that would be fixed by an NRVO pass.

@jonas-schievink is this something that the MIR optimisation wg is looking at?

edit

This comment is wrong. #72205 has no effect on this code.

FWIW, compiling this with #72205 applied results in the following assembly for fun (here renamed to foo) on x86_64.

00000000000502d0 <_ZN4test3foo17h1991a29ca1711cdfE>:
   502d0:   41 56                   push   r14
   502d2:   53                      push   rbx
   502d3:   50                      push   rax
   502d4:   49 89 f6                mov    r14,rsi
   502d7:   48 89 fb                mov    rbx,rdi
   502da:   ba 00 04 00 00          mov    edx,0x400
   502df:   31 f6                   xor    esi,esi
   502e1:   ff 15 d1 d1 0d 00       call   QWORD PTR [rip+0xdd1d1]        # 12d4b8 <memset@GLIBC_2.2.5>
   502e7:   48 89 df                mov    rdi,rbx
   502ea:   41 ff d6                call   r14
   502ed:   48 89 d8                mov    rax,rbx
   502f0:   48 83 c4 08             add    rsp,0x8
   502f4:   5b                      pop    rbx
   502f5:   41 5e                   pop    r14
   502f7:   c3                      ret    
   502f8:   0f 1f 84 00 00 00 00    nop    DWORD PTR [rax+rax*1+0x0]
   502ff:   00 

Versus nightly:

00000000000492b0 <_ZN4test3foo17h1991a29ca1711cdfE>:
   492b0:   41 57                   push   r15
   492b2:   41 56                   push   r14
   492b4:   53                      push   rbx
   492b5:   48 81 ec 00 04 00 00    sub    rsp,0x400
   492bc:   49 89 f6                mov    r14,rsi
   492bf:   48 89 fb                mov    rbx,rdi
   492c2:   49 89 e7                mov    r15,rsp
   492c5:   ba 00 04 00 00          mov    edx,0x400
   492ca:   4c 89 ff                mov    rdi,r15
   492cd:   31 f6                   xor    esi,esi
   492cf:   ff 15 43 93 0b 00       call   QWORD PTR [rip+0xb9343]        # 102618 <memset@GLIBC_2.2.5>
   492d5:   4c 89 ff                mov    rdi,r15
   492d8:   41 ff d6                call   r14
   492db:   ba 00 04 00 00          mov    edx,0x400
   492e0:   48 89 df                mov    rdi,rbx
   492e3:   4c 89 fe                mov    rsi,r15
   492e6:   ff 15 c4 95 0b 00       call   QWORD PTR [rip+0xb95c4]        # 1028b0 <memcpy@GLIBC_2.14>
   492ec:   48 89 d8                mov    rax,rbx
   492ef:   48 81 c4 00 04 00 00    add    rsp,0x400
   492f6:   5b                      pop    rbx
   492f7:   41 5e                   pop    r14
   492f9:   41 5f                   pop    r15
   492fb:   c3                      ret    
   492fc:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]

Generated with rustc --crate-type=dylib -O test.rs.

@ecstatic-morse How did you get those results, I cannot reproduce this simply merging #72205 and compiling the given code while RUSTC_STAGE=1 is set in the environment.

I also fail to see how NRVO would apply anyway. The extra copy is due to the temporary created in the caller, which uses an out-pointer and has a return type of (), so NRVO doesn't seem useful, and the callee is already writing straight to the return value anyway. Any pointers?

There is a redundant assignment in fun, which can be eliminated by destination propagation (we tend to call this NRVO, which we should probably fix since it isn't as limited).

I would not expect #72205 to catch that though, it intentionally only looks at assignments to _0.

How embarrassing. I can also no longer reproduce. I suspect I forgot to write the test.rs file before running the command I used to generate these results, which by the way is:

rustc +nightly --crate-type=dylib -O test.rs && objdump -d -Mintel libtest.so | awk -v RS='' '/foo/' 

I believe I reposted the results of the nrvo-simple test introduced in #72205. Note the size of the stack allocation, which is 0x400 (1024) instead of 0x640 (1600). Thanks for checking my work.

If @jonas-schievink is correct and there is a missed optimization here, #72205 is too dumb to enable it.

@jonas-schievink so a slight misuse of terms that had me confused, ok :-)

@ecstatic-morse that explains it then. I think I have a well established history of getting slightly over-excited about my results and accidentally posting wrong results ;-)

There clearly is a missed optimization here, which the existing copy propagation pass could have picked up, if it wasn't limited to locals as destinations.

That said, I wonder why call slot propagation in LLVM doesn't handle it either. I'll try to look into that...

Ok, so inlining runs first, thus the call slot optimization never triggers here.

If we mark init() as inline(never), then the call slot optimization runs but fails, because the mutable reference bar doesn't get marked as noalias (see https://github.com/rust-lang/rust/issues/54878), so the optimizer can't be sure that init() doesn't access the memory referenced by bar.h Using -Zmutable-noalias=yes fixes this and gets rid of the memcpy at the cost of not having init() inlined.

When init() gets inlined, we unfortunately get IR like this:

loop_block:
   ; yadda yadda
   br %done, label %memcpy_block, label %loop_block

memcpy_block:
    call void @llvm.memcpy(...)

Which means that we have a memcpy dependency that crosses a basic block boundary, and LLVM's memcpy optimizer doesn't handle these. There have been attempts to make it do that (for example by Patrick Walton, years ago), but AFAIK nothing has made to into LLVM (yet?).

@dotdash No cross-BB memcpy opt in LLVM yet. The last attempt was reverted due to miscompiles, and this has not been followup up on. I believe the current stance on this is that it should only get handled once memcpyopt is migrated to MSSA. As there's ongoing work on MSSA-based DSE, I have some small hopes that this might actually happen.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

Leo1003 picture Leo1003  路  898Comments

Mark-Simulacrum picture Mark-Simulacrum  路  681Comments

alexcrichton picture alexcrichton  路  240Comments

Centril picture Centril  路  382Comments

nikomatsakis picture nikomatsakis  路  268Comments