See https://github.com/rust-lang/rust/pull/69033#issuecomment-605357808; @kpp reports a 200%+ regression on some futures microbenchmarks, e.g.:
future::ready/async_combinators
time: [1.5699 ns 1.5743 ns 1.5793 ns]
change: [+279.32% +282.34% +285.20%] (p = 0.00 < 0.05)
Performance has regressed.
The benchmarks are at https://github.com/kpp/futures-async-combinators/.
I bisected this to 57e1da59cd0761330b4ea8d47b16340a78eeafa9, which most likely implicates #69837.
Can reproduce. The regression comes from changing the discriminant type from u32 to u8 (or whatever fits best, which is almost always u8); exposing the niche alone does not regress performance much.
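For intuition, here is a minimal illustration of niche reuse in general (my own example, not the generator case itself): Option<NonZeroU32> needs no separate discriminant byte at all, because the forbidden zero value encodes None.

use std::mem::size_of;
use std::num::NonZeroU32;

fn main() {
    // Niche optimization: Option<NonZeroU32> reuses the forbidden zero
    // value as the None encoding, so no discriminant byte is added.
    assert_eq!(size_of::<Option<NonZeroU32>>(), size_of::<NonZeroU32>());
}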
Isn't x86 generally less efficient when dealing with words narrower than 32 bits? In that case we might want to consider doing this optimization conditionally, since it can be a somewhat significant RAM saving on smaller microcontroller targets. I could imagine using -C opt-level=s to decide this, but I don't think there's any precedent for doing something like that at the moment.
I think the slowdown is likely some sort of bug in our codegen or LLVM perhaps?
This is a single iteration on nightly (unrolled 8 times in the actual benchmark code):
0.00 : 18a260: mov BYTE PTR [rsp+0x8],0x0
0.80 : 18a265: movzx ecx,BYTE PTR [rsp]
0.99 : 18a269: movzx ecx,BYTE PTR [rsp+0x1]
1.09 : 18a26e: movzx ecx,BYTE PTR [rsp+0x2]
0.87 : 18a273: movzx ecx,BYTE PTR [rsp+0x3]
0.97 : 18a278: movzx ecx,BYTE PTR [rsp+0x4]
0.83 : 18a27d: movzx ecx,BYTE PTR [rsp+0x5]
1.32 : 18a282: movzx ecx,BYTE PTR [rsp+0x6]
0.87 : 18a287: movzx ecx,BYTE PTR [rsp+0x7]
0.84 : 18a28c: movzx ecx,BYTE PTR [rsp+0x9]
0.86 : 18a291: movzx ecx,BYTE PTR [rsp+0xa]
1.01 : 18a296: movzx ecx,BYTE PTR [rsp+0xb]
1.04 : 18a29b: movzx ecx,BYTE PTR [rsp+0x8]
On the good commit, it looks like this (and is only unrolled 4 times):
17.76 : 183f62: mov DWORD PTR [rsp],0x0
0.00 : 183f69: mov edx,DWORD PTR [rsp+0x4]
3.26 : 183f6d: mov edx,DWORD PTR [rsp+0x8]
0.00 : 183f71: mov edx,DWORD PTR [rsp]
Trying to reduce this down, I came up with this example:
use std::future::Future;

pub async fn foo(v: i32) -> i32 { v }

macro_rules! wrap {
    ($v:expr) => {
        async move { ($v).await }
    };
}

pub fn bar(v: i32) -> impl Future<Output = i32> {
    wrap!(wrap!(wrap!(wrap!(wrap!(wrap!(foo(v)))))))
}
In godbolt, the nightly has an offset of 52 -- and that increases by 8 with each additional wrap! -- whereas stable stays constant at 4, the size of i32 (presumably). I think that's the bug here, though I'm not sure how the discriminant size managed to affect that. Maybe previously we were getting split into two separate LLVM "words" via SROA, and now we aren't since we fit into a single register, or something like that? That could plausibly lead to problems here; my assumption is that the benchmark is loading many more words from the stack on nightly because the "size" of the future has increased drastically. OTOH, size_of_val etc. do not report a size increase -- so I'm not sure why the offset is increasing. But I suspect it's for the same reason we're seeing more stack loads.
Nightly:
example::foo:
mov eax, edi
ret
example::bar:
mov rax, rdi
mov dword ptr [rdi], esi
mov byte ptr [rdi + 52], 0
ret
and on stable:
example::foo:
mov eax, edi
ret
example::bar:
mov rax, rdi
mov dword ptr [rdi], esi
mov dword ptr [rdi + 4], 0
ret
Thanks, that's helpful!
The dword vs. byte moves can be explained by the alignment changing from 4 (due to using u32 as the discriminant) to 1 (since we now use u8 as the discriminant), as long as nothing with a larger alignment is contained.
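As a toy illustration of that effect (my own sketch, not the generator layout itself): a fieldless enum's explicit discriminant repr sets its alignment, which is the same effect the PR has on generators whose fields are all 1-aligned.

use std::mem::{align_of, size_of};

// The discriminant's integer type sets the enum's alignment when no
// field demands more.
#[repr(u32)]
enum WideTag { A, B }   // 4-byte tag -> size 4, align 4

#[repr(u8)]
enum NarrowTag { A, B } // 1-byte tag -> size 1, align 1

fn main() {
    assert_eq!((size_of::<WideTag>(), align_of::<WideTag>()), (4, 4));
    assert_eq!((size_of::<NarrowTag>(), align_of::<NarrowTag>()), (1, 1));
}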
In godbolt, the nightly has an offset of 52 -- and that increases by 8 with each additional wrap! -- whereas stable stays constant at 4, the size of i32 (presumably).
Not entirely certain, but I think this is just automatic field reordering doing its thing. The discriminant is now a u8 with alignment 1, so it gets moved to the very end of the memory layout to not waste space with padding. Previously it was a u32 with alignment 4, just like all the other fields in the generator, so reordering anything there doesn't make sense.
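A sketch of that reordering tendency on a plain struct (hypothetical field names; this uses std::mem::offset_of!, stable in recent Rust, and the repr(Rust) half only illustrates typical behavior since that layout is unspecified):

use std::mem::offset_of;

#[repr(C)]
struct Ordered { tag: u8, payload: [u32; 13] }   // declaration order kept

struct Reordered { tag: u8, payload: [u32; 13] } // compiler may reorder

fn main() {
    // repr(C): tag at offset 0, payload pushed to offset 4 by padding.
    assert_eq!(offset_of!(Ordered, tag), 0);
    assert_eq!(offset_of!(Ordered, payload), 4);
    // repr(Rust): in practice the payload comes first and the 1-aligned
    // tag lands at the end (offset 52 here), avoiding interior padding.
    println!("tag at {}", offset_of!(Reordered, tag));
    println!("payload at {}", offset_of!(Reordered, payload));
}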
If I take your example and replace i32 with bool, the size of foo goes from 8 to 2 bytes (and its alignment from 4 to 1), while bar goes from 56 to 14 bytes (its alignment also goes from 4 to 1).
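A quick way to check such numbers (a sketch with a single level of wrapping for brevity; the exact sizes depend on the compiler version):

use std::future::Future;
use std::mem::{align_of_val, size_of_val};

async fn foo(v: bool) -> bool { v }

fn bar(v: bool) -> impl Future<Output = bool> {
    // one level of wrapping; each extra level grows the future
    async move { foo(v).await }
}

fn main() {
    let f = foo(true);
    let b = bar(true);
    println!("foo: size {}, align {}", size_of_val(&f), align_of_val(&f));
    println!("bar: size {}, align {}", size_of_val(&b), align_of_val(&b));
}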
So far this looks like a fairly expected outcome of my PR; it's pretty well-known that smaller alignment can generate less efficient code (OTOH, apparently modern x86 chips can do unaligned moves just as fast as aligned ones, so maybe the byte-wise moves should not be emitted, but in the end this is completely up to LLVM). Since, AFAICT, it should only affect futures that do not store any data with larger alignment in them, the impact should be pretty minimal in practice, so we can also consider closing this without a fix.
(this does make me wish that we had better layout debugging capabilities in rustc though)
this does make me wish that we had better layout debugging capabilities in rustc
A #[rustc_layout(debug)] attribute was added in #69901 (edit: heh, just noticed you're the first comment on it ^^). With the help of type_alias_impl_trait, rustc can output some layout information for bar's future: playground
Ah, nice to see that land, that's certainly a step in the right direction
@jonas-schievink a quick test to see if this is indeed an alignment problem is to stick a #[repr(align(4))]-equivalent onto the generator.
That does not seem to do anything, hmm... Maybe I'm doing it wrong though. Would be useful to get an MCVE that exhibits the inefficient codegen.
I've just checked the benchmark code, and it does not actually benchmark anything except copying the futures around. They are never polled because they are never spawned onto an executor.
The benchmark results would make sense here if the alignment is really the cause.
How so?
executor::block_on(async {
    let mut group = c.benchmark_group("future::ready");
    group.bench_function("futures", |b| {
        b.iter(move || async {
            black_box(futures::future::ready(42)).await
        })
    });
});
We've got block_on and .await.
Enabling Criterion's real_blackbox feature significantly speeds up the future::ready benchmarks. The default implementation does more copying. With the feature enabled, the benchmarks now take <1 ns, so yeah, they were really just benchmarking the memcpy in black_box.
@kpp block_on only blocks on the outer async block, which does not await anything. The futures created by the closure passed to iter are just dropped without being spawned or awaited (and creating and dropping them is what is actually getting benchmarked here). You can test this by putting a panic!() inside the inner async blocks. It is never reached.
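A minimal way to confirm that (my own sketch, using futures::executor):

use futures::executor;

fn main() {
    executor::block_on(async {
        // Created and dropped, but never polled: the panic never fires.
        let fut = async { panic!("unreachable: never polled") };
        drop(fut);
    });
    println!("finished without panicking");
}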
Oops. Never felt more embarrassed.
Jonas, does this sound better to you?
fn bench_ready(c: &mut Criterion) {
    let mut group = c.benchmark_group("future::ready");
    group.bench_function("futures", |b| {
        b.iter(|| {
            executor::block_on(async {
                black_box(futures::future::ready(42)).await;
            })
        })
    });
}
I'd wrap the block_on in a black_box to make sure the entire thing gets evaluated (and not just the expression producing the future).
black_box(futures::future::ready(42).await); ?
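A sketch of the fully wrapped version (assuming Criterion's black_box and futures::executor, as used elsewhere in the thread):

use criterion::{black_box, Criterion};
use futures::executor;

fn bench_ready(c: &mut Criterion) {
    let mut group = c.benchmark_group("future::ready");
    group.bench_function("futures", |b| {
        b.iter(|| {
            // black_box around block_on keeps the whole poll-to-completion,
            // not just the construction of the future.
            black_box(executor::block_on(async {
                black_box(futures::future::ready(42)).await
            }))
        })
    });
    group.finish();
}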
use std::{mem, ptr};
use std::future::Future;

async fn ready<T>(t: T) -> T {
    t
}

fn black_box<T>(dummy: T) -> T {
    unsafe {
        let ret = ptr::read_volatile(&dummy);
        mem::forget(dummy);
        ret
    }
}

#[inline(never)]
fn iter<O, R>(mut routine: R)
where
    R: FnMut() -> O,
{
    loop {
        black_box(routine());
    }
}

pub unsafe fn ready_bench() {
    iter(move || async {
        black_box(ready(42)).await
    });
}
On stable:
playground::iter:
sub rsp, 16
.LBB0_1:
mov dword ptr [rsp], 0
mov eax, dword ptr [rsp + 4]
mov eax, dword ptr [rsp + 8]
mov eax, dword ptr [rsp]
jmp .LBB0_1
playground::ready_bench:
push rax
call playground::iter
ud2
On nightly:
playground::iter: # @playground::iter
# %bb.0:
sub rsp, 16
.LBB0_1: # =>This Inner Loop Header: Depth=1
mov byte ptr [rsp + 8], 0
movzx eax, byte ptr [rsp]
movzx eax, byte ptr [rsp + 1]
movzx eax, byte ptr [rsp + 2]
movzx eax, byte ptr [rsp + 3]
movzx eax, byte ptr [rsp + 4]
movzx eax, byte ptr [rsp + 5]
movzx eax, byte ptr [rsp + 6]
movzx eax, byte ptr [rsp + 7]
movzx eax, byte ptr [rsp + 9]
movzx eax, byte ptr [rsp + 10]
movzx eax, byte ptr [rsp + 11]
movzx eax, byte ptr [rsp + 8]
jmp .LBB0_1
# -- End function
playground::ready_bench: # @playground::ready_bench
# %bb.0:
push rax
call playground::iter
ud2
# -- End function
Hm, I guess this is not really a bug then.
The actual LLVM type did change to have an alignment of 1:
; stable
%"std::future::GenFuture<ready_bench::{{closure}}::{{closure}}>" = type { [0 x i32], %"ready_bench::{{closure}}::{{closure}}", [0 x i32] }
%"ready_bench::{{closure}}::{{closure}}" = type { [0 x i32], i32, [2 x i32] }
; nightly
%"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>" = type { [0 x i32], %"ready_bench::{{closure}}::{{closure}}", [0 x i32] }
%"ready_bench::{{closure}}::{{closure}}" = type { [8 x i8], i8, [3 x i8] }
One odd thing is that the volatile load in the LLVM IR is passed align 8, so I'm wondering why that isn't used to make the load more efficient here.
; playground::iter
; Function Attrs: noinline noreturn nounwind nonlazybind uwtable
define internal fastcc void @_ZN10playground4iter17h1130617e37f08519E() unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
%_3 = alloca %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>", align 8
%_3.0.sroa_cast = bitcast %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>"* %_3 to i8*
%_3.8.sroa_idx = getelementptr inbounds %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>", %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>"* %_3, i64 0, i32 1, i32 1
br label %bb1
bb1: ; preds = %bb1, %start
call void @llvm.lifetime.start.p0i8(i64 12, i8* nonnull %_3.0.sroa_cast)
store i8 0, i8* %_3.8.sroa_idx, align 8, !alias.scope !2
%_3.0. = load volatile %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>", %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>"* %_3, align 8, !alias.scope !8, !noalias !11
call void @llvm.lifetime.end.p0i8(i64 12, i8* nonnull %_3.0.sroa_cast)
br label %bb1
}
@jonas-schievink
Isn't x86 generally less efficient when dealing with <32 bit words?
Typically, CPUs are fastest at operating on integers of their native integer size. On x86, 32-bit arithmetic can be twice as fast compared to operations on 16-bit operands.
Cf: http://www.idryman.org/blog/2012/11/21/integer-promotion/
In C, this potential cause of performance regression cannot happen because of integer promotion.
If so, couldn't Rust support integer promotion? At the very least as an optional flag, in order to enforce performance guarantees?
In the context of this issue, toggling such a flag on would make it possible to test the hypothesis and immediately confirm or rule out the root cause of the performance regression.
The PR that caused this intentionally did the opposite of integer promotion (narrowing a u32 to a u8), just not in a way that's visible to users. Generally rustc/LLVM is free to do integer promotion or narrowing whenever it sees fit, as long as it doesn't change the observed behavior of the program.
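Conceptually, the narrowing works like this (a hypothetical sketch of "whatever fits best", not rustc's actual code): pick the smallest unsigned integer type that can represent every variant index.

fn discriminant_repr(num_variants: u128) -> &'static str {
    if num_variants <= 1 << 8 {
        "u8" // almost always the case for generators
    } else if num_variants <= 1 << 16 {
        "u16"
    } else if num_variants <= 1 << 32 {
        "u32"
    } else {
        "u64"
    }
}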
I haven't completely nailed down this issue yet as I'm not an LLVM expert, but it seems like it's caused by LLVM using only the LLVM type for alignment decisions, not the explicit alignment of the load (presumably this doesn't always happen for loads or it would have been caught earlier? Are only volatile loads affected?).
I have no idea if this is really relevant to the conversation, but LLVM is introducing a new alignment type that could solve some of these issues: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
We discussed this in the wg-async-foundations meeting and several people wanted to close as won't fix. Without evidence that this causes performance regressions in real-world code, it may not be worth pursuing further. We also don't think it's worth making generators bigger or changing their alignment to improve a microbenchmark.
That said, it does seem like LLVM could be generating better assembly, so pinging those folks for more input:
@rustbot ping llvm
Hey LLVM ICE-breakers! This bug has been identified as a good
"LLVM ICE-breaking candidate". In case it's useful, here are some
[instructions] for tackling these sorts of bugs. Maybe take a look?
Thanks! <3
cc @comex @cuviper @DutchGhost @hanna-kruppe @hdhoang @heyrutvik @JOE1994 @jryans @mmilenko @nagisa @nikic @Noah-Kennedy @SiavoshZarrasvand @spastorino @vertexclique @vgxbj
For the "reproduction", the issue isn't the alignment; it's the way the example reimplements black_box
in terms of a volatile load of an entire struct, as an attempt to make it work on stable. The actual std::hint::black_box
is not implemented this way; if you change the example to use it, it will only work on nightly, but it doesn't show the wasteful assembly.
Longer explanation:
LLVM implements volatile loads/stores of structs as volatile loads/stores of their fields; if the struct consists of N fields, it will always emit N separate load/store instructions. (Without volatile it will optimize them into something nicer.) Note that LLVM does this regardless of the size of the struct, as long as the LLVM IR has a load of struct type (load volatile %AlignedBytes, %AlignedBytes* %src). But if the struct is small enough (say, 8 u8s), then rustc will instead produce LLVM IR containing a load of integer type (load volatile i64, i64* %1), so you get a single machine load instruction even though it's volatile. Interestingly, Clang does the same thing for C code. I wonder why.
Regardless, the result is unpredictable, nonintuitive, and undocumented, in both Rust and C. Luckily, nobody actually uses volatile loads/stores of entire structs in cases where it matters, like accessing hardware MMIO registers: the registers may or may not be represented in a struct, but the actual loads and stores are done to individual fields.
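For example, typical MMIO code does per-field volatile accesses, so each access lowers to a single machine load (a hypothetical register block for illustration, not from the thread):

use std::ptr;

#[repr(C)]
struct UartRegs {
    data: u32,
    status: u32,
}

// One volatile load of a single field -> one load instruction, with no
// dependence on how LLVM lowers whole-struct volatile loads.
unsafe fn read_status(regs: *const UartRegs) -> u32 {
    ptr::read_volatile(ptr::addr_of!((*regs).status))
}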
LLVM implements volatile loads/stores of structs as volatile loads/stores of their fields; if the struct consists of N fields, it will always emit N separate load/store instructions.
Ah, I see, that explains the emitted code, thanks! It also explains what was observed in https://github.com/rust-lang/rust/issues/66595 I suppose.
Well, I don't think there's much of a point in leaving this issue open then, so closing.
To sum up: am I right that the performance regression happens only when all of these conditions are met?
1) A custom black_box based on volatile is used (instead of std::hint::black_box).
2) The future is created, but neither polled nor awaited.
3) The future is tiny.