Rust: Weird copy happening during initialization

Created on 24 Nov 2018 · 14 comments · Source: rust-lang/rust

This code:

#[inline(never)]
pub fn do_ants() {
    do_item(&Outer {
                item: SpecificDisplayItem2::PopStackingContext,
                info: LayoutPrimitiveInfo2::new(),
            });
}

#[derive(Copy, Clone)]
pub struct LayoutPrimitiveInfo2 {
    pub rect: [f32; 7],
    pub is_backface_visible: u16,
}

impl LayoutPrimitiveInfo2 {
    fn new() -> Self {
        Self {
            rect: [0.; 7],
            is_backface_visible: 0,
        }
    }
}

pub enum SpecificDisplayItem2 {
    PopStackingContext,
    Other([f64; 22]),
}

struct Outer {
    item: SpecificDisplayItem2,
    info: LayoutPrimitiveInfo2
}

fn do_item(di: &Outer) { unsafe { ext(di) } }
extern {
    fn ext(di: &Outer);
}
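
For context on why the `u16` field matters, the struct's layout can be checked directly. A minimal sketch (the struct is copied from the report above; the assertions are my own):

```rust
use std::mem::{align_of, size_of};

#[derive(Copy, Clone)]
pub struct LayoutPrimitiveInfo2 {
    pub rect: [f32; 7],           // 28 bytes, align 4
    pub is_backface_visible: u16, // 2 bytes
}

fn main() {
    // 28 + 2 = 30 bytes of fields, rounded up to the 4-byte alignment,
    // leaving 2 bytes of trailing padding that LLVM treats as undef.
    assert_eq!(align_of::<LayoutPrimitiveInfo2>(), 4);
    assert_eq!(size_of::<LayoutPrimitiveInfo2>(), 32);
    println!("ok");
}
```

With `u32` instead of `u16`, all 32 bytes are occupied by fields and there is no padding, which matches the saner codegen below.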

compiles to

bad_pop_stacking_context::do_ants:
 mov     rbp, rsp
 sub     rsp, 256
 xorps   xmm0, xmm0
 movaps  xmmword ptr [rbp - 32], xmm0
 mov     qword ptr [rbp - 10], 0
 mov     qword ptr [rbp - 16], 0
 mov     qword ptr [rbp - 248], 0
 mov     rax, qword ptr [rbp - 32]
 mov     rcx, qword ptr [rbp - 24]
 mov     qword ptr [rbp - 64], rax
 mov     qword ptr [rbp - 56], rcx
 mov     rax, qword ptr [rbp - 16]
 mov     qword ptr [rbp - 48], rax
 mov     rax, qword ptr [rbp - 8]
 mov     qword ptr [rbp - 40], rax
 lea     rdi, [rbp - 248]
 call    _ext
 add     rsp, 256
 pop     rbp
 ret

When the u16 is changed to a u32 we get the saner:

bad_pop_stacking_context::do_ants:
 mov     rbp, rsp
 sub     rsp, 224
 mov     qword ptr [rbp - 216], 0
 mov     qword ptr [rbp - 8], 0
 mov     qword ptr [rbp - 16], 0
 mov     qword ptr [rbp - 24], 0
 mov     qword ptr [rbp - 32], 0
 lea     rdi, [rbp - 216]
 call    _ext
 add     rsp, 224
 pop     rbp
 ret
A-LLVM I-slow

All 14 comments

You should be able to reproduce any of these with C++ code in clang and report them as LLVM bugs.

But also, it helps to explain what is weird here. In this case, my guess is it's the reads back out of the stack? Also, we should ignore the lack of SSE usage in the saner example, at least for now, right?

cc @rust-lang/wg-codegen Can someone look at the LLVM IR here and try to figure out if there's anything we could tell LLVM that would make it not do this?

The code we generate for the structure and its initialization looks something like this:

%LayoutPrimitiveInfo2 = type { [0 x i32], [7 x float], [0 x i16], i16, [1 x i16] }

; in a function
; loop to initialize all floats to 0
; then...
%9 = getelementptr inbounds %LayoutPrimitiveInfo2, %LayoutPrimitiveInfo2* %0, i32 0, i32 3
store i16 0, i16* %9, align 2
; and implicitly
; %10 = getelementptr inbounds %LayoutPrimitiveInfo2, %LayoutPrimitiveInfo2* %0, i32 0, i32 4
; store [1 x i16] undef, [1 x i16]* %10, align 2

This undef seems to be the thing that prevents proper optimisation from taking place. Explicitly adding initialization of the padding to 0:

%10 = getelementptr inbounds %LayoutPrimitiveInfo2, %LayoutPrimitiveInfo2* %0, i32 0, i32 4
store [1 x i16] zeroinitializer, [1 x i16]* %10, align 2

makes GVN generate a memset for the whole structure rather than only part of it, which in turn makes instcombine happier and better able to optimize things away.
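
At the Rust level, that zero-initialized padding can be approximated by making the padding an explicit field. This is an illustrative workaround sketch, not something proposed in the thread (`LayoutPrimitiveInfo2Padded` and `_pad` are hypothetical names):

```rust
#[derive(Copy, Clone)]
pub struct LayoutPrimitiveInfo2Padded {
    pub rect: [f32; 7],
    pub is_backface_visible: u16,
    _pad: u16, // explicit stand-in for the trailing padding, always zeroed
}

impl LayoutPrimitiveInfo2Padded {
    fn new() -> Self {
        Self {
            rect: [0.; 7],
            is_backface_visible: 0,
            _pad: 0, // gives LLVM a fully-defined 32 bytes instead of undef padding
        }
    }
}

fn main() {
    let info = LayoutPrimitiveInfo2Padded::new();
    assert_eq!(std::mem::size_of_val(&info), 32); // same size as the padded original
    assert_eq!(info.is_backface_visible, 0);
}
```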

Hmm, can we indicate to LLVM that it could put anything in the padding?
I'd rather not initialize padding ourselves since that can be a deoptimization.

EDIT: my bad, was looking at the wrong thing. LLVM is able to reach the code below with is_backface_visible: u32.



FWIW we do not reach the "ideal" codegen in either scenario. Ideally the code would look like this

do_ants:
# %bb.0:
    subq    $216, %rsp
    movq    $0, (%rsp)
    xorps   %xmm0, %xmm0
    movups  %xmm0, 200(%rsp)
    movups  %xmm0, 184(%rsp)
    movq    %rsp, %rdi
    callq   ext
    addq    $216, %rsp
    retq

which can be achieved by replacing the initialization with unsafe { ::std::mem::zeroed() }.
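
A sketch of that replacement, assuming (as holds here) that the all-zero bit pattern is a valid value for every field:

```rust
#[derive(Copy, Clone)]
pub struct LayoutPrimitiveInfo2 {
    pub rect: [f32; 7],
    pub is_backface_visible: u16,
}

impl LayoutPrimitiveInfo2 {
    fn new() -> Self {
        // Sound here because 0.0f32 and 0u16 are valid values; zeroed()
        // also zeroes the trailing padding, so LLVM sees one fully-defined
        // 32-byte region instead of per-field stores plus undef padding.
        unsafe { ::std::mem::zeroed() }
    }
}

fn main() {
    let info = LayoutPrimitiveInfo2::new();
    assert!(info.rect.iter().all(|&x| x == 0.0));
    assert_eq!(info.is_backface_visible, 0);
}
```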


Hmm, can we indicate to LLVM that it could put anything in the padding?

LLVM already knows (the padding is undef), but that is exactly what prevents optimizations from kicking in from what I can see. I cannot really tell the exact underlying analysis/optimisation that falls short, and I’m probably not going to spend time on this any time soon.

@nikic might be interested, though.

LLVM already knows (the padding is undef), but that is exactly what prevents optimizations from kicking in from what I can see.

Does it know it could write there, though? That is, undef vs "value is unmodified".
Or is it optimizing all of this on an alloca?

is it optimizing all of this on an alloca?

Yes.

I don't think that memset formation is the right place to optimize here. In the general case extending a partial memset to a memset of the full structure may not necessarily be advantageous.

It would be preferable to handle this during memset-memcpy forwarding. I have something like https://gist.github.com/nikic/3ca2f08c70bf8d9c0a634d42d5533f17 in mind, which would be enough to cover this case. This will fall short though if there is interior padding.
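
To illustrate the interior-padding caveat with a hypothetical layout (not from the thread): with `#[repr(C)]` and the `u16` placed between two arrays, the padding sits in the middle of the struct, where a single memset-forwarding step cannot cover the whole copied range:

```rust
// Hypothetical layout illustrating interior padding; the field names are
// made up for this sketch.
#[repr(C)]
pub struct InteriorPadding {
    pub head: [f32; 3], // bytes 0..12
    pub flag: u16,      // bytes 12..14, then 2 bytes of interior padding
    pub tail: [f32; 4], // bytes 16..32 (realigned to 4)
}

fn main() {
    // A partial memset of the leading fields stops at byte 14, so the
    // undef gap at 14..16 splits the region a later memcpy would read.
    assert_eq!(std::mem::size_of::<InteriorPadding>(), 32);
    println!("ok");
}
```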

The equivalent C++ generates saner code regardless of the type of is_backface_visible:

#include <stdint.h>
#include <cstring>

struct LayoutPrimitiveInfo2 {
        LayoutPrimitiveInfo2() {
                memset(rect, 0, sizeof(rect));
                is_backface_visible = 0;
        }
        float rect[7];
        uint16_t is_backface_visible;
};

struct SpecificDisplayItem2 {
    static SpecificDisplayItem2 PopStackingContext() {
            SpecificDisplayItem2 ret;
            ret.p.a.disc = 0;
            return ret;
    }
    union {
            struct A {
                    uint64_t disc;
            } a;
            struct B {
                    uint64_t disc;
                    double f[22];
            } b;
    } p;
};

struct Outer {
    SpecificDisplayItem2 item;
    LayoutPrimitiveInfo2 info;
};

extern int do_item(Outer &o);

void do_ants() {
    Outer o =  {
                SpecificDisplayItem2::PopStackingContext(),
                LayoutPrimitiveInfo2(),
            };
    do_item(o);
}

do_ants(): # @do_ants()
  sub rsp, 216
  mov qword ptr [rsp], 0
  xorps xmm0, xmm0
  movups xmmword ptr [rsp + 198], xmm0
  movups xmmword ptr [rsp + 184], xmm0
  mov rdi, rsp
  call do_item(Outer&)
  add rsp, 216
  ret

I've submitted https://reviews.llvm.org/D55120.

Ultimately this is just papering things over though. I think a more general solution would be to instead check if a memcpy can be hoisted upwards past a store/memset chain and absorbed into an alloca. Basically what call slot optimization does, just not for calls.

Patch landed upstream, so this is waiting on an LLVM update now.

Would it be possible to cherry-pick this into Rust's LLVM ahead of a new LLVM release?


Yes, I think it would be reasonable to cherry-pick this change. I'd like to wait for a few miscompile fixes to land first, so we can batch them together.

@nikic the llvm patch has baked for a while now. Do you think you could cherry pick it into rust's llvm?

Fixed by #57675

We now get:

playground::do_ants: # @playground::do_ants
# %bb.0:
    subq    $216, %rsp
    movq    $0, (%rsp)
    xorps   %xmm0, %xmm0
    movups  %xmm0, 198(%rsp)
    movups  %xmm0, 184(%rsp)
    movq    %rsp, %rdi
    callq   *ext@GOTPCREL(%rip)
    addq    $216, %rsp
    retq