See https://github.com/rust-lang/rust/pull/69033#issuecomment-605357808; @kpp reports a 200%+ regression on some futures microbenchmarks, e.g.:
future::ready/async_combinators
time: [1.5699 ns 1.5743 ns 1.5793 ns]
change: [+279.32% +282.34% +285.20%] (p = 0.00 < 0.05)
Performance has regressed.
The benchmarks are at https://github.com/kpp/futures-async-combinators/.
I bisected this to 57e1da59cd0761330b4ea8d47b16340a78eeafa9, which most likely implicates #69837.
Can reproduce. The regression comes from changing the discriminant type from u32 to u8 (or whatever fits best, which is almost always u8); exposing the niche alone does not regress performance much.
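For intuition, here is a minimal illustration of niche reuse in general (my own example, not the generator case itself): Option<NonZeroU32> needs no separate discriminant byte at all, because the forbidden zero value encodes None.

use std::mem::size_of;
use std::num::NonZeroU32;

fn main() {
    // Niche optimization: Option<NonZeroU32> reuses the forbidden zero
    // value as the None encoding, so no discriminant byte is added.
    assert_eq!(size_of::<Option<NonZeroU32>>(), size_of::<NonZeroU32>());
}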
Isn't x86 generally less efficient when dealing with words narrower than 32 bits? In that case we might want to consider doing this optimization conditionally, since it can be a somewhat significant RAM saving on smaller microcontroller targets. I could imagine using -C opt-level=s to decide this, but I don't think there's any precedent for doing something like that at the moment.
I think the slowdown is likely some sort of bug in our codegen or LLVM perhaps?
This is a single iteration on nightly (unrolled 8 times in the actual benchmark code):
0.00 : 18a260: mov BYTE PTR [rsp+0x8],0x0
0.80 : 18a265: movzx ecx,BYTE PTR [rsp]
0.99 : 18a269: movzx ecx,BYTE PTR [rsp+0x1]
1.09 : 18a26e: movzx ecx,BYTE PTR [rsp+0x2]
0.87 : 18a273: movzx ecx,BYTE PTR [rsp+0x3]
0.97 : 18a278: movzx ecx,BYTE PTR [rsp+0x4]
0.83 : 18a27d: movzx ecx,BYTE PTR [rsp+0x5]
1.32 : 18a282: movzx ecx,BYTE PTR [rsp+0x6]
0.87 : 18a287: movzx ecx,BYTE PTR [rsp+0x7]
0.84 : 18a28c: movzx ecx,BYTE PTR [rsp+0x9]
0.86 : 18a291: movzx ecx,BYTE PTR [rsp+0xa]
1.01 : 18a296: movzx ecx,BYTE PTR [rsp+0xb]
1.04 : 18a29b: movzx ecx,BYTE PTR [rsp+0x8]
On the good commit, it looks like this (and is only unrolled 4 times):
17.76 : 183f62: mov DWORD PTR [rsp],0x0
0.00 : 183f69: mov edx,DWORD PTR [rsp+0x4]
3.26 : 183f6d: mov edx,DWORD PTR [rsp+0x8]
0.00 : 183f71: mov edx,DWORD PTR [rsp]
Trying to reduce this down, I came up with this example:
use std::future::Future;

pub async fn foo(v: i32) -> i32 { v }

macro_rules! wrap {
    ($v:expr) => {
        async move { ($v).await }
    };
}

pub fn bar(v: i32) -> impl Future<Output = i32> {
    wrap!(wrap!(wrap!(wrap!(wrap!(wrap!(foo(v)))))))
}
In godbolt, the nightly has an offset of 52 -- and that increases by 8 with each additional wrap! -- whereas stable stays constant at 4, the size of i32 (presumably). I think that's the bug here, though I'm not sure how the discriminant size managed to affect that. Maybe previously we were getting split into two separate LLVM "words" via SROA, and now we aren't since we fit into a single register, or something like that? That could plausibly lead to problems here; my assumption is that the benchmark is loading many more words from the stack on nightly because the "size" of the future has increased drastically. OTOH, size_of_val etc. do not report a size increase -- so I'm not sure why the offset is increasing. But I suspect it's for the same reason we're seeing more stack loads.
Nightly:
example::foo:
mov eax, edi
ret
example::bar:
mov rax, rdi
mov dword ptr [rdi], esi
mov byte ptr [rdi + 52], 0
ret
and on stable:
example::foo:
mov eax, edi
ret
example::bar:
mov rax, rdi
mov dword ptr [rdi], esi
mov dword ptr [rdi + 4], 0
ret
Thanks, that's helpful!
The dword vs. byte moves can be explained by the alignment changing from 4 (due to using u32 as the discriminant) to 1 (since we now use u8 as the discriminant), as long as nothing with a larger alignment is contained.
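As a toy illustration of that effect (my own sketch, not the generator layout itself): a fieldless enum's explicit discriminant repr sets its alignment, which is the same effect the PR has on generators whose fields are all 1-aligned.

use std::mem::{align_of, size_of};

// The discriminant's integer type sets the enum's alignment when no
// field demands more.
#[repr(u32)]
enum WideTag { A, B }   // 4-byte tag -> size 4, align 4

#[repr(u8)]
enum NarrowTag { A, B } // 1-byte tag -> size 1, align 1

fn main() {
    assert_eq!((size_of::<WideTag>(), align_of::<WideTag>()), (4, 4));
    assert_eq!((size_of::<NarrowTag>(), align_of::<NarrowTag>()), (1, 1));
}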
In godbolt, the nightly has an offset of 52 -- and that increases by 8 with each additional wrap! -- whereas stable stays constant at 4, the size of i32 (presumably).
Not entirely certain, but I think this is just automatic field reordering doing its thing. The discriminant is now a u8 with alignment 1, so it gets moved to the very end of the memory layout to not waste space with padding. Previously it was a u32 with alignment 4, just like all the other fields in the generator, so reordering anything there doesn't make sense.
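A sketch of that reordering tendency on a plain struct (hypothetical field names; this uses std::mem::offset_of!, stable in recent Rust, and the repr(Rust) half only illustrates typical behavior since that layout is unspecified):

use std::mem::offset_of;

#[repr(C)]
struct Ordered { tag: u8, payload: [u32; 13] }   // declaration order kept

struct Reordered { tag: u8, payload: [u32; 13] } // compiler may reorder

fn main() {
    // repr(C): tag at offset 0, payload pushed to offset 4 by padding.
    assert_eq!(offset_of!(Ordered, tag), 0);
    assert_eq!(offset_of!(Ordered, payload), 4);
    // repr(Rust): in practice the payload comes first and the 1-aligned
    // tag lands at the end (offset 52 here), avoiding interior padding.
    println!("tag at {}", offset_of!(Reordered, tag));
    println!("payload at {}", offset_of!(Reordered, payload));
}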
If I take your example and replace i32 with bool, the size of foo goes from 8 to 2 bytes (and its alignment from 4 to 1), while bar goes from 56 to 14 bytes (its alignment also goes from 4 to 1).
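A quick way to check such numbers (a sketch with a single level of wrapping for brevity; the exact sizes depend on the compiler version):

use std::future::Future;
use std::mem::{align_of_val, size_of_val};

async fn foo(v: bool) -> bool { v }

fn bar(v: bool) -> impl Future<Output = bool> {
    // one level of wrapping; each extra level grows the future
    async move { foo(v).await }
}

fn main() {
    let f = foo(true);
    let b = bar(true);
    println!("foo: size {}, align {}", size_of_val(&f), align_of_val(&f));
    println!("bar: size {}, align {}", size_of_val(&b), align_of_val(&b));
}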
So far this looks like a fairly expected outcome of my PR; it's pretty well-known that smaller alignment can generate less efficient code (OTOH, apparently modern x86 chips can do unaligned moves just as fast as aligned ones, so maybe the byte-wise moves should not be emitted, but in the end this is completely up to LLVM). Since, AFAICT, it should only affect futures that do not store any data with larger alignment in them, the impact should be pretty minimal in practice, so we can also consider closing this without a fix.
(this does make me wish that we had better layout debugging capabilities in rustc though)
this does make me wish that we had better layout debugging capabilities in rustc
A #[rustc_layout(debug)] attribute was added in #69901 (edit: heh, just noticed you're the first comment on it ^^). With the help of type_alias_impl_trait, rustc can output some layout information for bar's future: playground
Ah, nice to see that land, that's certainly a step in the right direction
@jonas-schievink a quick test to see if this is indeed an alignment problem is to stick a #[repr(align(4))]-equivalent onto the generator.
That does not seem to do anything, hmm... Maybe I'm doing it wrong though. Would be useful to get an MCVE that exhibits the inefficient codegen.
I've just checked the benchmark code, and it does not actually benchmark anything except copying the futures around. They are never polled because they are never spawned onto an executor.
The benchmark results would make sense here if the alignment is really the cause.
How so?
executor::block_on(async {
    let mut group = c.benchmark_group("future::ready");
    group.bench_function("futures", |b| {
        b.iter(move || async {
            black_box(futures::future::ready(42)).await
        })
    });
});
We've got block_on and .await.
Enabling Criterion's real_blackbox feature significantly speeds up the future::ready benchmarks. The default implementation does more copying. With the feature enabled, the benchmarks now take <1 ns, so yeah, they were really just benchmarking the memcpy in black_box.
@kpp block_on only blocks on the outer async block, which does not await anything. The futures created by the closure passed to iter are just dropped without being spawned or awaited (and creating and dropping them is what is actually getting benchmarked here). You can test this by putting a panic!() inside the inner async blocks. It is never reached.
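A minimal way to confirm that (my own sketch, using futures::executor):

use futures::executor;

fn main() {
    executor::block_on(async {
        // Created and dropped, but never polled: the panic never fires.
        let fut = async { panic!("unreachable: never polled") };
        drop(fut);
    });
    println!("finished without panicking");
}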
Oops. Never felt more embarrassed.
Jonas, does this sound better to you?
fn bench_ready(c: &mut Criterion) {
    let mut group = c.benchmark_group("future::ready");
    group.bench_function("futures", |b| {
        b.iter(|| {
            executor::block_on(async {
                black_box(futures::future::ready(42)).await;
            })
        })
    });
}
I'd wrap the block_on in a black_box to make sure the entire thing gets evaluated (and not just the expression producing the future).
black_box(futures::future::ready(42).await); ?
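A sketch of the fully wrapped version (assuming Criterion's black_box and futures::executor, as used elsewhere in the thread):

use criterion::{black_box, Criterion};
use futures::executor;

fn bench_ready(c: &mut Criterion) {
    let mut group = c.benchmark_group("future::ready");
    group.bench_function("futures", |b| {
        b.iter(|| {
            // black_box around block_on keeps the whole poll-to-completion,
            // not just the construction of the future.
            black_box(executor::block_on(async {
                black_box(futures::future::ready(42)).await
            }))
        })
    });
    group.finish();
}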
use std::{mem, ptr};
use std::future::Future;

async fn ready<T>(t: T) -> T {
    t
}

fn black_box<T>(dummy: T) -> T {
    unsafe {
        let ret = ptr::read_volatile(&dummy);
        mem::forget(dummy);
        ret
    }
}

#[inline(never)]
fn iter<O, R>(mut routine: R)
where
    R: FnMut() -> O,
{
    loop {
        black_box(routine());
    }
}

pub unsafe fn ready_bench() {
    iter(move || async {
        black_box(ready(42)).await
    });
}
On stable:
playground::iter:
sub rsp, 16
.LBB0_1:
mov dword ptr [rsp], 0
mov eax, dword ptr [rsp + 4]
mov eax, dword ptr [rsp + 8]
mov eax, dword ptr [rsp]
jmp .LBB0_1
playground::ready_bench:
push rax
call playground::iter
ud2
On nightly:
playground::iter: # @playground::iter
# %bb.0:
sub rsp, 16
.LBB0_1: # =>This Inner Loop Header: Depth=1
mov byte ptr [rsp + 8], 0
movzx eax, byte ptr [rsp]
movzx eax, byte ptr [rsp + 1]
movzx eax, byte ptr [rsp + 2]
movzx eax, byte ptr [rsp + 3]
movzx eax, byte ptr [rsp + 4]
movzx eax, byte ptr [rsp + 5]
movzx eax, byte ptr [rsp + 6]
movzx eax, byte ptr [rsp + 7]
movzx eax, byte ptr [rsp + 9]
movzx eax, byte ptr [rsp + 10]
movzx eax, byte ptr [rsp + 11]
movzx eax, byte ptr [rsp + 8]
jmp .LBB0_1
# -- End function
playground::ready_bench: # @playground::ready_bench
# %bb.0:
push rax
call playground::iter
ud2
# -- End function
Hm, I guess this is not really a bug then.
The actual LLVM type did change to have an alignment of 1:
; stable
%"std::future::GenFuture<ready_bench::{{closure}}::{{closure}}>" = type { [0 x i32], %"ready_bench::{{closure}}::{{closure}}", [0 x i32] }
%"ready_bench::{{closure}}::{{closure}}" = type { [0 x i32], i32, [2 x i32] }
; nightly
%"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>" = type { [0 x i32], %"ready_bench::{{closure}}::{{closure}}", [0 x i32] }
%"ready_bench::{{closure}}::{{closure}}" = type { [8 x i8], i8, [3 x i8] }
One odd thing is that the volatile load in the LLVM IR is passed align 8, so I'm wondering why that isn't used to make the load more efficient here.
; playground::iter
; Function Attrs: noinline noreturn nounwind nonlazybind uwtable
define internal fastcc void @_ZN10playground4iter17h1130617e37f08519E() unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
%_3 = alloca %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>", align 8
%_3.0.sroa_cast = bitcast %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>"* %_3 to i8*
%_3.8.sroa_idx = getelementptr inbounds %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>", %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>"* %_3, i64 0, i32 1, i32 1
br label %bb1
bb1: ; preds = %bb1, %start
call void @llvm.lifetime.start.p0i8(i64 12, i8* nonnull %_3.0.sroa_cast)
store i8 0, i8* %_3.8.sroa_idx, align 8, !alias.scope !2
%_3.0. = load volatile %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>", %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>"* %_3, align 8, !alias.scope !8, !noalias !11
call void @llvm.lifetime.end.p0i8(i64 12, i8* nonnull %_3.0.sroa_cast)
br label %bb1
}
@jonas-schievink
Isn't x86 generally less efficient when dealing with <32 bit words?
Typically, CPUs are fastest at operating on integers of their native integer size. On x86, 32-bit arithmetic can be twice as fast compared to operations on 16-bit operands.
Cf: http://www.idryman.org/blog/2012/11/21/integer-promotion/
In C, this potential cause of performance regression cannot happen because of integer promotion.
If so, couldn't Rust support integer promotion? At the very least as an optional flag, in order to enforce performance guarantees?
In the context of this issue, toggling such a flag on would make it possible to test the hypothesis and immediately confirm or rule out the root cause of the performance regression.
The PR that caused this intentionally did the opposite of integer promotion (narrowing a u32 to a u8), just not in a way that's visible to users. Generally rustc/LLVM is free to do integer promotion or narrowing whenever it sees fit, as long as it doesn't change the observed behavior of the program.
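Conceptually, the narrowing works like this (a hypothetical sketch of "whatever fits best", not rustc's actual code): pick the smallest unsigned integer type that can represent every variant index.

fn discriminant_repr(num_variants: u128) -> &'static str {
    if num_variants <= 1 << 8 {
        "u8" // almost always the case for generators
    } else if num_variants <= 1 << 16 {
        "u16"
    } else if num_variants <= 1 << 32 {
        "u32"
    } else {
        "u64"
    }
}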
I haven't completely nailed down this issue yet as I'm not an LLVM expert, but it seems like it's caused by LLVM using only the LLVM type for alignment decisions, not the explicit alignment of the load (presumably this doesn't always happen for loads or it would have been caught earlier? Are only volatile loads affected?).
I have no idea if this is really relevant to the conversation, but LLVM is introducing a new alignment type that could solve some of these issues: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
We discussed this in the wg-async-foundations meeting and several people wanted to close as won't fix. Without evidence that this causes performance regressions in real-world code, it may not be worth pursuing further. We also don't think it's worth making generators bigger or changing their alignment to improve a microbenchmark.
That said, it does seem like LLVM could be generating better assembly, so pinging those folks for more input:
@rustbot ping llvm
Hey LLVM ICE-breakers! This bug has been identified as a good
"LLVM ICE-breaking candidate". In case it's useful, here are some
[instructions] for tackling these sorts of bugs. Maybe take a look?
Thanks! <3
cc @comex @cuviper @DutchGhost @hanna-kruppe @hdhoang @heyrutvik @JOE1994 @jryans @mmilenko @nagisa @nikic @Noah-Kennedy @SiavoshZarrasvand @spastorino @vertexclique @vgxbj
For the "reproduction", the issue isn't the alignment; it's the way the example reimplements black_box
in terms of a volatile load of an entire struct, as an attempt to make it work on stable. The actual std::hint::black_box
is not implemented this way; if you change the example to use it, it will only work on nightly, but it doesn't show the wasteful assembly.
Longer explanation:
LLVM implements volatile loads/stores of structs as volatile loads/stores of their fields; if the struct consists of N fields, it will always emit N separate load/store instructions. (Without volatile it will optimize them into something nicer.) Note that LLVM does this regardless of the size of the struct, as long as the LLVM IR has a load of struct type (load volatile %AlignedBytes, %AlignedBytes* %src). But if the struct is small enough (say, 8 u8s), then rustc will instead produce LLVM IR containing a load of integer type (load volatile i64, i64* %1), so you get a single machine load instruction even though it's volatile. Interestingly, Clang does the same thing for C code. I wonder why.
Regardless, the result is unpredictable, nonintuitive, and undocumented, in both Rust and C. Luckily, nobody actually uses volatile loads/stores of entire structs in cases where it matters, like accessing hardware MMIO registers: the registers may or may not be represented in a struct, but the actual loads and stores are done to individual fields.
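For example, typical MMIO code does per-field volatile accesses, so each access lowers to a single machine load (a hypothetical register block for illustration, not from the thread):

use std::ptr;

#[repr(C)]
struct UartRegs {
    data: u32,
    status: u32,
}

// One volatile load of a single field -> one load instruction, with no
// dependence on how LLVM lowers whole-struct volatile loads.
unsafe fn read_status(regs: *const UartRegs) -> u32 {
    ptr::read_volatile(ptr::addr_of!((*regs).status))
}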
LLVM implements volatile loads/stores of structs as volatile loads/stores of their fields; if the struct consists of N fields, it will always emit N separate load/store instructions.
Ah, I see, that explains the emitted code, thanks! It also explains what was observed in https://github.com/rust-lang/rust/issues/66595 I suppose.
Well, I don't think there's much of a point in leaving this issue open then, so closing.
To sum up: am I right that the performance regression happens only when all of these conditions are met?
1) A custom black_box based on volatile is used (instead of std::hint::black_box).
2) The future is created, but neither polled nor awaited.
3) The future is tiny.