The test crate is not stable, and I'm not aware of plans to stabilize it soon. It's probably better to spend some time experimenting on crates.io using harness = false in Cargo.toml (see https://github.com/rust-lang/cargo/issues/2305).
I've extracted the test crate to eventually publish it separately and am removing usage of unstable features one by one. One of them is asm, used in:
/// A function that is opaque to the optimizer, to allow benchmarks to
/// pretend to use outputs to assist in avoiding dead-code
/// elimination.
///
/// This function is a no-op, and does not even read from `dummy`.
#[cfg(not(all(target_os = "nacl", target_arch = "le32")))]
pub fn black_box<T>(dummy: T) -> T {
// we need to "use" the argument in some way LLVM can't
// introspect.
unsafe { asm!("" : : "r"(&dummy)) }
dummy
}
#[cfg(all(target_os = "nacl", target_arch = "le32"))]
#[inline(never)]
pub fn black_box<T>(dummy: T) -> T {
dummy
}
It's an important part of benchmarking. Since asm! is not stable and also unlikely to be stable soon, I'd like to have black_box stabilized in std, as a building block for external test and benchmarking harnesses.
I鈥檓 not sure what module it would go into. Maybe std::time?
std::mem might make more sense as a location, but this seems like a reasonable thing to do regardless.
I like the idea too, especially if it lets folks experiment with benchmarking in an external crate. std::mem sounds betterish than std::time I think.
Is the documentation of black_box presented here something we can guarantee? What happens on nacl?
Is the documentation of black_box presented here something we can guarantee?
Probably not. E.g. a quick scan of the LLVM inline-asm docs suggests they never actually promise that no optimisations will be performed across it.
What happens on nacl?
I imagine black_box has absolutely no effect. I guess something like the following might work better (at some cost), although it will likely fail with LTO.
#[inline(never)]
fn not_cross_crate(x: *mut u8) {}
pub fn black_box<T>(dummy: T) -> T {
not_cross_crate(&dummy as *const _ as *mut u8);
dummy
}
I don't think that we should stabilize black_box as it's currently "just a hack that appears to work". We don't really have any guarantees on behalf of LLVM, nor do we really know what it does; it just empirically seems to not optimize things away. A solution like @huonw mentioned of a non-cross-crate function (e.g. something that forces codegen) would unfortunately have runtime overhead (which in theory this doesn't).
Should we file an issue on LLVM about them providing some kind of black box with this guarantee? Between benchmarks and timing attacks it's not like we're the only ones dealing with this problem.
Perhaps, yeah, but it may not be too productive to do so without clearly figuring out what we want to actually do here. For example:
The point of black_box is to behave as an "optimization barrier" for a variable. It should do two things:
- On entry to black_box, it should force the computation of its argument. The optimizer has to assume black_box depends on its argument and anything it might possibly point to;
- On exit from black_box, it should assume nothing about its return value. The optimizer has to assume black_box might have produced its return value from thin air. In particular, even if the argument to black_box was &T, the optimizer has to assume black_box might have returned a different &T, pointing to different contents.

Since it's a compiler barrier, a good place to look is the Linux kernel. What we want is a mixture of barrier_data and OPTIMIZER_HIDE_VAR. That is, what we want is the equivalent of the following inline assembly statement in GCC's syntax (Rust's syntax is slightly different):
__asm__ __volatile__ ("" : "=r" (var) : "0" (var) : "memory")
This statement produces no assembly (and the input and output registers are the same), but the optimizer is forced to assume:
- the value of var might have been changed arbitrarily;
- the statement (which might have received a pointer such as &mut T or similar) might unpredictably modify whatever it points to.

It's not a no-op, and probably a naive idea, but would copying the value with ptr::read_volatile() and/or ptr::write_volatile() have the intended effect?
It definitely prevented the elision of memset() for zeroing a buffer on-drop in this case, though I used it to read the first byte in the buffer instead of the Vec object on-stack.
I would expect reading/rewriting the first byte of the value on-stack to be safe and of negligible cost while inducing the desired effect. It can be a no-op for ZSTs since they don't have any state to make assumptions about.
Addendum: this seems to produce the desired effect of preventing elision of memset(), even without #[inline(never)]: https://is.gd/T6Vbb4
Version that covers ZSTs, branch is optimized out: https://is.gd/ApKyY7
@abonander Just for exploration and for future reference, the read_volatile thing is not enough.
use std::ptr;
fn main() {
let buf = (Vec::<u8>::with_capacity(128), Vec::<u8>::with_capacity(64));
black_box(buf);
}
fn black_box<T>(val: T) -> T {
let _ = unsafe { ptr::read_volatile(&val as *const T as *const u8) };
val
}
The allocation of the second vector will be optimized out, so it's using detailed information that only the first element of the tuple is needed. Using test::black_box, none of the vecs are optimized out, like we want.
@bluss Copying the whole value, however, does prevent the elision of the second allocation: https://is.gd/6Oc0xR
This also doesn't need to be special-cased. However, because it does force a copy of the full value it can be slow for larger types.
@abonander Thanks, that does work well so far
Standardizing a black_box function with the behaviour @cesarb described would be very useful in a bunch of different contexts beyond just benchmarking, and it's much smaller than, say, stabilizing inline asm.
As an alternative, what happens if we pass a pointer to the value to an extern "C" function defined somewhere else, maybe in the stdlib or in some C code? LLVM can't introspect that, right, and must assume it's going to be used? But I'm guessing LTO can introspect that and optimize it out unless it's in a dylib, but LTO would probably also optimize out the current impl so I don't know.
One workaround for the LTO issue is to use the same inline asm trick (inside the .c file), for instance https://github.com/cesarb/clear_on_drop/blob/master/src/hide.c
Since the LLVM clobbers-in-asm-block behaviour is what's being relied on anyways, it seems cleaner and simpler for Rust to provide a black_box function instead of wrapping the inline asm in a .c file.
I did that on my clear_on_drop experiment. The trick is to use the inline asm inside the C function, since inline asm is already stable on both gcc and clang:
https://github.com/cesarb/clear_on_drop/blob/master/src/hide.rs
https://github.com/cesarb/clear_on_drop/blob/master/src/hide.c
(Pay no attention to the "fallback" attempt using atomics in that file; I've recently found out that the optimizer can still see through it. Only the inline asm and the external C function are guaranteed to work. Also, the external C function was supposed to be using c_void, as can be seen going back a few revisions, but unfortunately c_void is missing on no_std.)
@alexcrichton has already summed up the issue with the inline-asm trick above:
[...] it's currently "just a hack that appears to work". We don't really have any guarantees on behalf of LLVM nor do we really know what it does, it just empirically seems to not optimize things away.
I guess we could just stabilize it with a codegen test that ensures the semantics we want to guarantee?
As an additional datapoint, it appears a volatile copy of a pointer to the value also prevents optimization: https://play.rust-lang.org/?gist=953e86d736b4225ddb1d1548c679cf65&version=nightly
It's not a no-op but it's O(1) to the size of T which is effectively the same. I don't know if this covers every angle of black_box but it has the additional benefit of working as-is on stable. I guess it forces LLVM to assume that the pointer might have been sent over shared memory or something.
A volatile read of the pointer appears to have the same effect (double-&) and also looks a little cleaner: https://play.rust-lang.org/?gist=4890326faa607b8185db20e714f52986&version=nightly
One can reason about the inline asm trick, as long as LLVM obeys the following basic premise: "the inline assembly instructions are opaque". That is, the compiler and optimizer know nothing about the inline assembly instruction(s), other than what is explicit in the template (inputs, outputs, clobbers).
Of course, that means that the precise input and output specifications are very important, since all the guarantees are in them. For instance, the assembly in the first comment above:
asm!("" : : "r"(&dummy))
It puts the address of dummy in a register, therefore it's guaranteed that this address is computed before that point. I'm not enough of an inline assembly wizard to be sure what this means about the contents of that address, however; I don't think this forces dummy to be computed before this point.
And since that template has no outputs, it doesn't guarantee anything about the return value. Even if it's computed before the inline asm, it might get computed again, or even inlined.
Now compare with my example above (hide.rs):
asm!("" : "=*m" (ptr) : "*0" (ptr));
The value "flows through" the inline asm, so it can't be inlined; I'm also passing the whole value to the inline assembly ("*m"), instead of just its address ("r"), so the compiler would be forced to compute it before the inline asm, and also to assume that the inline asm has modified it.
So yes, the current implementation might be "just a hack that appears to work", but with some effort choosing the correct modifiers, that doesn't have to be the case (modulo LLVM bugs, of course).
@cesarb nitpick: in Rust you want &mut T because, in the absence of UnsafeCell, we guarantee that &T cannot be mutated through, and that could be taken advantage of later.
While it does a little extra stuff when invoked in isolation (spilling %rax to the stack and then overwriting it), the volatile-ops-on-pointer approach produces assembly that is near-indistinguishable from the inline-asm approach when inlined.
I think that whether Rust provides a black_box function with defined semantics is much more important than the details of how it happens to be implemented, whether using an inline asm block or any other method.
The black_box function can be a black box; what's important is that it's in one place and has defined semantics that people can rely on (i.e., "it is a bug if this optimization barrier stops working").
I don't know if it helps much but read_volatile and write_volatile seem to have relatively well defined semantics, at least if you consider "whatever C11 specifies" to mean 'defined'. Inline asm seems to be more like "whatever LLVM does, specified or not".
I have a PR for the Benchmarking RFC https://github.com/Manishearth/rfcs/pull/1 which adds:
- mem::black_box(x): x is used by "something" with side-effects and as a consequence cannot be optimized away. The expression producing x can still be optimized by the compiler.
- mem::clobber(): at the clobber, all memory is read by "something" with side-effects, and thus any memory modifications by previous code must have been committed before the clobber. This forces the compiler to flush all pending writes to memory that might otherwise have been optimized away.
The PR contains an example of how to use these to benchmark Vec::push.
@gnzlbg mem::clobber reminds me of sync::atomic::compiler_fence, although I'm not sure what interactions it has with the situations you're interested in (or if it's even possible to flush all writes).
@eddyb @kennytm mentioned the same thing in the PR comments. I answered here: https://github.com/Manishearth/rfcs/pull/1#issuecomment-362226760
IIUC (which I am not sure I do), compiler_fence synchronizes reads/writes across multiple threads by preventing reordering instructions across "fences" of different strengths depending on the Ordering. If the compiler can prove that no other thread reads the result of a write, it still can optimize a write away, what it can't do is reorder some statements across the fence, for example.
With clobber the current thread does a volatile read of all memory and performs a side-effecting operation based on it. The compiler can still reorder across the clobber, but all pending writes must execute before it. Also, because this all happens within a single thread no memory fences / synchronization instructions need to be emitted (e.g. data-races can't happen). Does this make sense?
But LLVM can still reorder writes to e.g. stack variables, based on escape/alias analysis, right?
Since they can never alias the volatile read without UB.
But LLVM can still reorder writes to e.g. stack variables, based on escape/alias analysis, right?
Right. The PR to the RFC spells out more precisely what I meant by clobbering memory:
Clobber is specified to guarantee that writes through pointers that have previously been black_boxed are flushed to memory. So for example in the case of benchmarking Vec::push, we first need to black_box Vec's pointer and then after pushing we flush the writes through that pointer:
fn bench_vec_push_back(bench: Bencher) -> BenchResult {
let n = 100_000_000;
let mut v = Vec::with_capacity(n);
bench.iter_n(n, || {
// Allow vector data to be clobbered:
mem::black_box(v.as_ptr());
v.push(42_u8);
// Forces writes through `v.as_ptr()` to be written back to memory,
// that is, 42 is written back to memory:
mem::clobber();
})
}
The Google Benchmark library that these functions are based on specifies them exactly like this, and its docs particularly insist that clobber only works for variables that have previously been black_boxed (it calls its black_box function DoNotOptimize).
Clobber is specified to guarantee that writes through pointers that have previously been black_boxed are flushed to memory.
Ah, this makes a lot more sense, thanks!
Also, because this all happens within a single thread no memory fences / synchronization instructions need to be emitted (e.g. data-races can't happen).
I linked compiler_fence and not fence specifically because of this.
Oh and I think black_box on the pointer means LLVM has to assume the pointer might have escaped to another thread, so I'm not sure we need additional machinery here.
I think that the difference between black_box and clobber is confusing and very error-prone. Perhaps the name black_box is itself too vague and a better name should be chosen?
IMO, the crux of the problem is that an entire computation may be omitted because of optimisation guarantees. So, IMO, there are two possible functions to aid this:
/// The compiler forces `x` to have a known value. This prevents optimisations across the boundaries of this function, although this does not guarantee that `x` is evaluated if the result is ultimately unused.
fn value_fence<T>(x: T) -> T;
/// The compiler forces `x` to have a known value, ensuring that all the computations required to do so are executed. After this, the value is dropped.
fn evaluate_and_drop<T>(x: T);
Note that the former is basically a read_volatile and forget, while the latter appears to be a write_volatile and forget.
To clarify exactly:
fn value_fence<T>(x: T) -> T {
let y = unsafe { (&x as *const T).read_volatile() };
std::mem::forget(x);
y
}
fn evaluate_and_drop<T>(x: T) {
let mut y: T = unsafe { std::mem::uninitialized() };
unsafe { (&mut y as *mut T).write_volatile(x); }
drop(y); // not necessary but for clarity
}
I misremembered the API for write_volatile and no forget is required.
@clarcharr Could you show how to benchmark Vec::push with them? That would help me understand them better. I posted the Vec::push example here: https://github.com/rust-lang/rfcs/issues/1484#issuecomment-362244927
Is it possible to use the inline asm approach, which is a no-op, instead of a volatile read, which isn't?
@gnzlbg I'd probably do a vec.push(); evaluate_and_drop(vec.last())
@hdevalence I don't think that any approach will really be a no-op, but considering how the value will already be in cache when the function runs, performance penalty will only be a few cycles max.
So I've tried @clarcharr suggestion here: https://godbolt.org/g/HvZJZW and the three alternatives (test::black_box, clobber, and ptr::write_volatile) all produce slightly different code:
- test::black_box: 28 asm instructions
- ptr::write_volatile: 27 asm instructions
- clobber: 26 asm instructions

Maybe someone with more asm experience than me could look at their disassemblies (shown in godbolt) and comment on what they are exactly doing. Running these through cargo bench, they all report the exact same timings for Vec::push (on my machine 3ns per iter +/- 0).
I've let this settle a bit, but I wanted to add two things:
First, ptr::{read,write}_volatile are already available in std, that is, one can already use them when writing benchmarks. The benchmarking docs do not mention that these might be useful for this purpose, but in my opinion that is a documentation issue.
The purpose of test::black_box (and similar functions, like clobber) is to disable compiler optimizations. I think that functions with this purpose should not generate any code.
That would rule out value_fence and evaluate_and_drop, at least with their proposed implementation, but both test::black_box, and google_benchmark::{black_box, clobber} pass this litmus test.
So I think it would be important to agree if this is a property that we want to preserve. Should functions whose intent is to just disable compiler optimizations be allowed to generate any code when, for example, used in isolation?
As mentioned, I think that ruling out value_fence and evaluate_and_drop is not a downside, because they can be implemented in stable rust already without issues. OTOH the other functions being discussed require inline assembly tailored to LLVM, so we would currently need to expose them via std to make them available on stable and also to make sure that they work if we ever get another backend like Cranelift (Google Benchmark has implementations for MSVC, GCC, and other compilers, and these are tailored to the inline assembly functionality of each compiler, which support different features).
I'm still a proponent of using ptr::write_volatile(); the version that performs the volatile write on a pointer to the data is O(1) to the size of the data itself, meaning its overhead is constant across all invocations and thus can be factored out of performance comparisons. I haven't done enough testing to be sure that it guarantees everything we want out of black_box() but it seems like not too bad of a place to start.
I'm still a proponent of using ptr::write_volatile(); the version that performs the volatile write on a pointer to the data is O(1) to the size of the data itself, meaning its overhead is constant across all invocations and thus can be factored out of performance comparisons.
I agree with this statement. For completeness it is worth mentioning that a consequence of doing this is that the code being measured in the benchmarks using ptr::write_volatile() is not the same code that will run in your application. The assembly tests I showed above seem to corroborate this statement.
The Google Benchmark intrinsics' intent is to _allow_ benchmarking the exact same code that the compiler would have emitted in your real application. They might or might not succeed at this (it is hard, and it is not only up to them but also up to the benchmark writer, who must be extra careful), but that is at least what they attempt to enable. Since doing this is hard, these might not be the only tools people use while writing benchmarks. One cool thing about test::black_box is that it is extremely simple to use.
I've opened a PR for this here: #2360
Closing in favor of https://github.com/rust-lang/rfcs/pull/2360.