Rust: Large by-value Copy arguments to functions not converted to by-reference leading to performance loss

Created on 11 Jul 2018  ยท  10Comments  ยท  Source: rust-lang/rust

https://godbolt.org/g/YodFQR

I have a big Copy type:

#[derive(Clone, Copy)]
struct HugeCopyType {
    arr: [i64; 1024],
}

I'm passing it to a function by value, and the function does not modify the argument which means it should be equivalent to passing it by reference:

fn process(arg: HugeCopyType) {
    let sum = arg.arr.iter().sum::<i64>();
    println!("{}", sum);
}

I'm calling the function from a loop:

let hct = HugeCopyType {
    arr: [0; 1024],
};

for _ in 0..512 {
    process(hct);
}

I expected Rust to be smart about this and convert the argument to be passed by reference, however it seems like that is not the case and it ends up being memcpy'd over and over inside the loop (see godbolt link above).

Changing the argument to pass by reference manually gets rid of the memcpys.

Rust: 1.26.0 (stable) and also beta and nightly.

Most helpful comment

We already have a lint in clippy that tells you to change small &T arguments to T if T: Copy. We could totally add the opposite for large T and suggest taking &T. This makes sure you do get the optimization for the relevant cases without causing a lot of compiler optimization headaches.

All 10 comments

Since the function takes the argument by value it can make arbitrary mutations to the value. Passing by reference would cause those mutations to be reflected in the calling function and subsequent invocations of process. The function is being passed a pointer, just a pointer to a fresh copy of the value.

The point is, the function doesn't modify the value (it doesn't even take mut arg), but the copy is still not optimized away.

Whether the function actually mutates the argument really really shouldn't not affect its ABI. It's probably possible: we don't guarantee ABI stability for "Rust" ABI, shims can fix the problems with function pointers / vtables, and we could stick a "needs copy of this argument?" flag into metadata. But it's not desirable: it's a ton of work and really messes with incremental compilation and is just overall a huge mess.

There is a way to make the ABI the same whether or not the function mutates the value (unconditionally take pointer to the caller's original data, make a copy locally if you want to mutate it) but that has far-reaching repercussions and I don't know for sure whether it's actually correct. I also think it's not an unequivocal win: while it would avoid copies in this case, it adds extra copies compared to the status quo if the value being passed is a temporary rather than a copy of a longer-lived allocation.

There's also the issue that passing a pointer to the caller's actual local variable would be observable through the address of the parameter (consider foo(arg: HugeCopy, by_ref: &HugeCopy) and foo does ptr::eq(&arg, by_ref)). I'm not sure whether that is actually a deal breaker but it seems quite unfortunate.

I completely understand the appeal of having a clear ABI boundary between function calls, but it shouldn't get in the way of _zero-cost abstractions_, Rust's core principle. Ownership, borrowing and Copy semantics are the abstractions imposed on us by Rust, and Rust should try to keep them as "zero-cost" as possible.

Functions and function calls are abstractions too. _This is a hot path_. Normally, the function will get completely _inlined_ (doesn't that affect ABI and incremental compilation..?); in this case it's impossible or undesirable, but we should still benefit from as many compiler optimizations as possible. (I'm thinking about it more in terms of "inlining" the info whether the function needs its argument memcopied, if we can't inline the whole function.)

On another note, I really don't think huge Copy types are even a good idea. Copy _means_ a small primitive value that is so cheap to .clone() we shouldn't need to do it explicitly. An int or a reference, yes, a huge buffer that needs to be memcopied around, no. Maybe rustc/clippy could at least lint on implicit memcopies of huge Copy types, if they don't already. After all, making expensive operations _visibly expensive_ is another of Rust's principles.

I agree with the points @bugaevc makes.

The Copy docs currently say:

Generally speaking, if your type can implement Copy, it should.

Which is the advise I've been following thus far. It also makes sense to pass Copy types by value everywhere, since it's less restrictive than passing by reference in the general case, and because for small types (up to a couple of pointers in size) it can indeed lead to better performance due to not having to dereference extra pointers.

However, with the current state of affairs it looks like I have to _guess_ whether I should pass a Copy type by reference or by value. Should I pass a type containing three ints by value? Four ints? Actually, the answer depends on lots of other context like the target architecture (number of registers), number and sizes of other parameters to every given function and so on. In this case even a lint doesn't really help, which is why I'm expecting the compiler to figure out the best way of passing a value in every given situation on its own.

I don't have very detailed opinions on this (and even if I had, I don't have the time right now to write a novel), but some points briefly:

  • Zero-cost abstractions doesn't mean any source code whatsoever will result in the machine code of your wildest dreams, just that you get the operation you asked for as efficiently as that operation can be. There is a related general principle that rustc should make it easier to write more efficient code and recognize inefficient code, but this is a multi-faceted issue. For example, it means codegen should be somewhat predictable as well -- under this proposal, a small innocent-looking change to the callee's code can suddenly lead to extra multi-KB memcpys. That isn't helpful in writing high performance code either.
  • While inlining is related in that it leaks implementation details into the code generated for a caller, it's localized to one callsite rather than globally affecting the signature of the function. This means, for example, that we currently only have to recompile codegen units that contain the body of the contained function (either the one codegen unit that the function canonically lives in, or one that the body has been copied into because of an #[inline] attribute or the like). Under this proposal, all codegen units that mention the changed function at all need to be recompiled if the "needs to copy arg?" flag of a function changes, even those where inlining the function wouldn't be possible in the first place.
  • The author of a function can always guarantee predictable cheap argument passing (i.e., not dependent on internals of the function or compiler heuristics) by taking a reference, and I'd say this would be recommended practice for most functions that accept very large and read-only Copy values, regardless of the outcome of this issue.

We already have a lint in clippy that tells you to change small &T arguments to T if T: Copy. We could totally add the opposite for large T and suggest taking &T. This makes sure you do get the optimization for the relevant cases without causing a lot of compiler optimization headaches.

We already have a lint in clippy that tells you to change small &T arguments to T if T: Copy. We could totally add the opposite for large T and suggest taking &T. This makes sure you do get the optimization for the relevant cases without causing a lot of compiler optimization headaches.

That's great! How do you decide if the type is small enough, though? Since AFAIU it depends on the platform and other circumstances.

How do you decide if the type is small enough, though? Since AFAIU it depends on the platform and other circumstances.

We do check the platform :D and hope for the best. If you have any improvement suggestions we're all ears.

I'll go ahead and close this in favor of the discussed clippy lint. Feel free to reopen if you think this issue is valuable. This would need far-reaching changes to incremental compilation and how the compiler works at the ABI level though.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

dtolnay picture dtolnay  ยท  3Comments

mcarton picture mcarton  ยท  3Comments

modsec picture modsec  ยท  3Comments

zhendongsu picture zhendongsu  ยท  3Comments

pedrohjordao picture pedrohjordao  ยท  3Comments