Expose LLVM's "element wise atomic memory intrinsics". In particular:
llvm.memcpy.element.unordered.atomic as atomic_element_unordered_copy_memory_nonoverlappingllvm.memmove.element.unordered.atomic as
atomic_element_unordered_copy_memoryllvm.memset.element.unordered.atomic as atomic_element_unordered_set_memoryExpose these through functions in the std::ptr module. Each function is implemented by the equivalent intrinsic or by a loop of relaxed atomic operations if the intrinsic is not available on the target platform (TODO: Given this platform-specific behavior, can this also be exposed in core::ptr?)
copy_nonoverlapping_atomic_unordered, backed by atomic_element_unordered_copy_memory_nonoverlappingcopy_atomic_unordered, backed by atomic_element_unordered_copy_memorywrite_atomic_unordered, backed by atomic_element_unordered_set_memoryPreviously discussed on the internals forum here.
Folks with the authority to approve this: Let me know if this is OK to move forward; I'd like to post it to This Week in Rust's CFP.
@alexcrichton Do you have thoughts on this? Want to make sure I get the go-ahead before I send somebody on TWiR's CFP down a dead end.
I suspect there'll be opinions on stabilization, but seems fine to add as unstable at least to me!
OK cool, thanks!
I'd like to take a stab at implementing this.
Awesome, thanks @tmccombs ! This will unblock a lot of important work :)
@alexcrichton On the subject of exposing in core::ptr: Is the ban on platform-specific logic in core a general one, or just for API-visible elements? If it's the latter, then we should be able to implement the facade in core::ptr as well.
Currently it's largely just a "we strongly discorage platform-specific APIs" in libcore, there's already platform-specific pieces to the implementation, especially some math-related pieces on MSVC
I saw there was some discussions around use cases on #59155 so I thought I'd bring up mine (and perhaps you can all cross-check me that this would be the proper solution).
I maintain a crate for zeroizing memory (for e.g. wiping cryptographic keys/plaintexts) which presently relies on undefined behavior - namely mixed volatile / non-volatile memory accesses (in conjunction with compiler and hardware fences):
https://github.com/iqlusioninc/crates/tree/develop/zeroize
I'm presently (ab)using core::ptr::write_volatile (and relying on undefined LLVM behavior / implementation details) in conjunction with core::atomic::fence.
What I'd really like is an atomic memset-style operation, i.e. llvm.memset.element.unordered.atomic, which can write repeated copies of a value atomically. The closest thing that presently exists is volatile_set_memory, however that comes with undefined behavior when volatile and non-volatile operations are mixed.
undefined behavior when volatile and non-volatile operations are mixed.
I don't know why you think that mixing volatile and non-volatile accesses is UB. I see no issue with that per se.
That said, atomic memset might still be useful.
It is explicitly documented as UB:
https://doc.rust-lang.org/core/ptr/fn.write_volatile.html
Memory accessed with
read_volatileorwrite_volatileshould not be accessed with non-volatile operations.
Here is a thread on the subject of using volatile writes for zeroing sensitive memory:
https://internals.rust-lang.org/t/volatile-and-sensitive-memory/3188
That sentences does not say "UB" or does it say "must not". The section on when this function is UB does not mention mixed accesses.
AFAIK, it mostly refers to "normal" usage of volatile, namely interaction with IO registers: you don't want to mix volatile and non-volatile accesses on those. But I do not know of any form of UB there. @rkruppe, do you?
EDIT: Opened https://github.com/rust-lang/rust/pull/60972
Here is a thread on the subject of using volatile writes for zeroing sensitive memory:
Note that that thread does not reflect our current understanding of volatile accesses and the issues around that. See https://github.com/rust-lang/unsafe-code-guidelines/issues/33 for the latest discussion. I'd love to amend that thread but it got auto-closed :(
@RalfJung interesting (and also lengthy)! I'll have to read through it later.
Discrepancies between volatile and atomic aside, there's still no stable API for volatile memset either (only the unstable core::intrinsics::volatile_set_memory). It'd be nice if there were something along the lines of core::ptr::set_volatile which provides a stable wrapper around this intrinsic.
This is also particularly useful to implement a seqlock. Currently all implementations read unprotected data on the reader side with or without volatile, which definitely depends on LLVM's implementation to synchronize with previous load-acquire semantic
More details can be read at this C++ proposal.
This is also particularly useful to implement a seqlock. Currently all implementations read unprotected data on the reader side with or without volatile, which definitely depends on LLVM's implementation to synchronize with previous load-acquire semantics.
It depends on LLVM semantics ensuring that read-write races are not UB, and instead yield undef/poison on the read side. But that is something LLVM guarantees (but Rust does not, neither do C/C++). I don't see any connection to release-acquire synchronization.
More details can be read at this C++ proposal.
That's interesting, thanks! I am surprised though that they don't even mention the LLVM semantics, which are an alternative approach to solve the issue with SeqLocks. Seems like relevant related work.
I suspect there'll be opinions on stabilization, but seems fine to add as unstable at least to me!
Considering how much focus there is on Wasm, it would be great for LLVM "unordered" loads and stores to have a path to stabilization in order to have stable-Rust facilities for implementing host services for multi-threaded Wasm. (Since Wasm code is untrusted, if Wasm thread A calls into a host service and designates a range of Wasm heap for reading/writing, the Rust code implementing the host service can't be optimized with the assumption of freedom from data races in that memory region, because Wasm thread B could access the region without adhering to the Rust/C11/C++11 rules.)
Wait a second.
Actually, unordered loads and stores do not resolve your problem, if you have no way to guarantee that the adversary code is interacting with memory using unordered operations. As far as I can tell from the LLVM manual, unordered loads from memory that is written using non-unordered stores is still UB.
It could be argued that no sane compiler optimization or known hardware behavior could ever turn this UB into something dangerous in practice. But well... UB is UB, and we never know what the future holds.
So if you want to strictly avoid UB, the correct thing to do in your case is to tweak the WASM compiler so that all reads and writes to the WASM heap that may be used to interact with the host are atomic. They may be unordered, they may be relaxed, it doesn't really matter from an UB-correctness point of view (though it may matter from a performance point of view). But they _must_ be atomic.
Without these intrinsics, is there a supported way to read or write shared memory that is being concurrently and arbitrarily mutated by another processor? If the other processor is executing code in a different security boundary, then there may be no way to prevent it from mutating the memory with an arbitrary mix of atomic, non-atomic, and unaligned accesses. But clearly this works at the processor level, because this is the foundation for most OS and virtualization communcation models.
Right now, for example, Google's crosvm VMM uses an (arbitrary?) mix of ptr::{write_volatile, copy, write_bytes} while accessing memory that is shared between its process and a VM. It seems to me that at least some of this is UB today, and therefore future LLVM optimizations could break this code, but I wouldn't know how to fix this with what Rust offers today (especially while retaining good performance).
The requested intrinsics would seem to be what crosvm, and software like it, needs.
Without these intrinsics, is there a supported way to read or write shared memory that is being concurrently and arbitrarily mutated by another processor?
Yes, use atomic accesses.
A data race of an atomic and a non-atomic access in one program is UB if at least one of them is a write; but if another separately compiled program races with your atomic accesses, the only side that has to expect problems is that side. In some way, the data race is UB "only on the non-atomic side", which is not a distinction that usually makes any sense because UB is a whole-program property, but it does if one side is encapsulated and compiled separately.
Mind you, this is informal. You are outside of the scope of any formal model I am aware of.
Without these intrinsics, is there a supported way to read or write shared memory that is being concurrently and arbitrarily mutated by another processor?
Yes, use atomic accesses.
Over in the SharedArrayBuffer issue that motivates my comments here, I was told by a SpiderMonkey developer that C++11/C11 atomics are not a valid answer.
In some way, the data race is UB "only on the non-atomic side", which is not a distinction that usually makes any sense because UB is a whole-program property, but it does if one side is encapsulated and compiled separately.
Is there advice for programmers of virtual machines for how to use atomics safely with a non-conforming execution on the other processor such that LLVM devs agree that the advice may be safely relied upon?
The point seems to be
gcc is quite explicit about this in its documentation, saying that its __atomic intrinsics may not execute fences if gcc thinks they're not required.
Indeed compilers are allowed to optimize atomic accesses to non-atomic accesses if they can prove that that does not introduce a race or otherwise change behavior.
The problem we are having here is that the C/C++ memory model is inherently a "whole-program spec". The question we are asking though only makes any sense when considering "separate compilation". We want some kind of modular notion of correctness when linked with arbitrary code -- and worse, with the additional constraint that we have run-time mechanisms in place that ensure that the arbitrary code does not cause UB.
However, I rest my case that I think using atomic instructions will work for this. GCC can only remove synchronization when it knows all accesses done to a location. In case of a VM, I don't know how memory gets allocated and handed to the program inside the VM, but surely it happens in a way that the compiler must consider the pointer "escaped"? Even though the spec does not consider separate compilation, that's what compilers do in practice -- they have to compile a compilation unit in a way that works fine with any conforming other compilation unit. Hence they cannot remove fences from atomic accesses to locations that have "escaped".
But I will admit that this is all based on a somewhat idealized understanding of the VM and compilers, the details of which I am not extremely familiar with. I just cannot imagine a way to implement a compiler that supports separate compilation of concurrent code, such that using atomics for this use-case does not have the intended semantics.
One issue with these intrinsics is that they use LLVM's "unordered" atomic memory ordering, which Rust currently does not have, and I have some reservations to adding it.
Are there "relaxed" versions of these intrinsics that we could use instead?
Are there "relaxed" versions of these intrinsics that we could use instead?
Not sure. Anecdotally, I benchmarked a loop of relaxed atomic loads (loading from a slice of AtomicU8s using the Relaxed ordering), and found it to be ~10x slower than a loop of unsychronized loads (just accessing the elements of a slice of u8). I haven't benchmarked the unordered atomic memcpy intrinsics, but I suspect they'll be significantly faster.
For my use case in particular, speed is paramount, which is why I'm interested in these intrinsics. That said, I completely understand your concern with introducing "unordered" into our memory model. Perhaps a middle-road path would be to add these as unstable with a big warning in their doc comments saying "we will probably never stabilize these because of memory model issues"? That way we could still use them for the time being while we let the various discussions about the semantics of inter-process shared memory play out, and hopefully switch to the "official" solution once such a solution exists.
Not sure. Anecdotally, I benchmarked a loop of relaxed atomic loads (loading from a slice of AtomicU8s using the Relaxed ordering), and found it to be ~10x slower than a loop of unsychronized loads (just accessing the elements of a slice of u8). I haven't benchmarked the unordered atomic memcpy intrinsics, but I suspect they'll be significantly faster.
My current theory for why relaxed is so much slower is that LLVM is really bad at optimizing it.
add these as unstable
I am not opposed to any unstable experimentation. I am okay with giving people enough rope to hang themselves, at least on nightly. ;)
I am surprised though that this to be enough for you to actually use it.
@joshlf The cause of this particular problem is known. LLVM's optimizer is currently unable to fuse together neighboring relaxed atomic loads and stores, so a loop of AtomicU8 operations will translate into actual byte operations in hardware assembly, which is stupidly slow.
According to @comex, that's an optimizer bug, and the relaxed atomic semantics should actually allow for such code to be optimized into e.g. an efficient rep movq or SSE2 memcpy loop on x86_64.
inter-process shared memory play
Note that relaxed vs unordered has no effect on inter-process shared memory (between distrusting parties). General consensus is (to my knowledge) still that this does and will require volatile -- specifically, "atomic volatile", which LLVM has but Rust does not expose.
inter-process shared memory play
Note that relaxed vs unordered has no effect on inter-process shared memory (between distrusting parties). General consensus is (to my knowledge) still that this does and will require
volatile-- specifically, "atomic volatile", which LLVM has but Rust does not expose.
IIUC from the discussion in https://github.com/rust-lang/unsafe-code-guidelines/issues/152, normal volatile may be sufficient if combined with a freeze operation (which doesn't yet exist, of course)?
IIUC from the discussion in rust-lang/unsafe-code-guidelines#152, normal volatile may be sufficient if combined with a freeze operation (which doesn't yet exist, of course)?
That is true if we adapt LLVM's handling of read-write races. So, we would be deviating from C++ concurrency there as well. This is slightly better studied than LLVM's handling of write-write races, but that doesn't say much (as the latter is entirely unstudied to my knowledge).
Most helpful comment
Yes, use atomic accesses.
A data race of an atomic and a non-atomic access in one program is UB if at least one of them is a write; but if another separately compiled program races with your atomic accesses, the only side that has to expect problems is that side. In some way, the data race is UB "only on the non-atomic side", which is not a distinction that usually makes any sense because UB is a whole-program property, but it does if one side is encapsulated and compiled separately.
Mind you, this is informal. You are outside of the scope of any formal model I am aware of.