This is still blocked on some reasonable way to talk about platform-specificity in a more granular way than what we currently support.
Is there any reason why `AtomicU32` would be hard to stabilize? We need a way to store multi-word bitflags in an atomic thingy. `AtomicUsize` won't work here since it's too large on 64-bit.
What do you mean by "too large"?
Let's say I have a 64-bit bucket of flags. I need two atomic u32s to represent it. I can't use two usizes because that's too large on 64-bit.

Alternatively, let's say I have a 32-bit bucket of flags. This is on a struct where size matters. I need a `u32`, not something larger on 64-bit.
I guess I could `cfg()` it and abstract over it. That solves the first problem but not the second.
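To make the "two 32-bit halves" idea concrete, here is a minimal sketch (the `Flags64` type and its methods are hypothetical, purely for illustration; `AtomicU32` was unstable at the time of this comment but has since been stabilized):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical example: a 64-bit bucket of flags represented as two
// 32-bit atomic halves, so the in-memory size stays 8 bytes even on
// 64-bit targets (two `AtomicUsize`s would be 16 bytes there).
// Note each half is individually atomic; the pair as a whole is not.
pub struct Flags64 {
    lo: AtomicU32,
    hi: AtomicU32,
}

impl Flags64 {
    pub const fn new() -> Self {
        Flags64 { lo: AtomicU32::new(0), hi: AtomicU32::new(0) }
    }

    // Atomically set a single flag bit (0..64).
    pub fn set(&self, bit: u32) {
        assert!(bit < 64);
        let (word, shift) = if bit < 32 { (&self.lo, bit) } else { (&self.hi, bit - 32) };
        word.fetch_or(1 << shift, Ordering::SeqCst);
    }

    // Test a single flag bit (0..64).
    pub fn get(&self, bit: u32) -> bool {
        assert!(bit < 64);
        let (word, shift) = if bit < 32 { (&self.lo, bit) } else { (&self.hi, bit - 32) };
        word.load(Ordering::SeqCst) & (1 << shift) != 0
    }
}

fn main() {
    let f = Flags64::new();
    f.set(3);
    f.set(40);
    assert!(f.get(3) && f.get(40) && !f.get(5));
    // The whole bucket stays 8 bytes, addressing the size concern above.
    assert_eq!(std::mem::size_of::<Flags64>(), 8);
    println!("ok");
}
```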
@Manishearth 16-bit-usize targets would give you 16-bit atomic width. 100% savings over 32-bit systems! (and potential loss of data)
There isn't really much problem stabilising any of these IMO, but you'll still have to cfg on `target_has_atomic = "32"` in your code.
16-bit-usize targets would give you 16-bit atomic width.
ugh, these exist, right.
What do I have to do to get these stabilized? I'm okay with the cfgs.
I'm also okay with just u8 being stabilized, as long as I can pack it (not sure if this is possible with atomics)
What do I have to do to get these stabilized?
Any answer on this? cc @alexcrichton
The RFC looks like it's been completely implemented. IMO stabilizing it with `target_has_atomic` is fine, and we can add further granularity if we need in new RFCs.
@Manishearth state hasn't changed from before. The libs team is essentially waiting for a conclusion on the scenarios story before stabilizing.
That discussion has stalled. Last I checked it seemed to mostly have come to consensus on the general idea? Who are we waiting on for a conclusion here? What would the ETA on this be? How can I help?
It also seems like that's mostly orthogonal to this. We can stabilize this using cfgs, and add scenarios later. They seem to be pretty compatible.
(After discussing in IRC, it seems like there have been a lot of recent discussions about this, and the libs team is currently moving on pushing this forward.)
Has there been any progress on this since?
Just FYI: Per [1], Gecko's Quantum CSS project is no longer blocked on this issue.
We can stabilise `AtomicU8` and `AtomicI8` the same way we have stabilised `AtomicBool` – by using `AtomicIsize`/`AtomicUsize` operations everywhere.
FWIW, my integer-atomics crate is a stopgap solution for everyone who is stuck on stable for now. However, it emulates the operations using `AtomicUsize` cmpxchg loops, so performance is bad compared to genuine atomic operations.
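For reference, the general shape of a cmpxchg-loop emulation looks like this sketch (a generic retry loop written against `AtomicUsize`; this is illustrative, not the actual `integer-atomics` code):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Emulate `fetch_add` with an explicit compare-exchange loop: read the
// current value, compute the new one, and retry if another thread raced
// in between. This is lock-free, since a failed CAS means some other
// thread's operation succeeded.
fn fetch_add_via_cas(cell: &AtomicUsize, n: usize) -> usize {
    let mut cur = cell.load(Ordering::Relaxed);
    loop {
        let new = cur.wrapping_add(n);
        match cell.compare_exchange_weak(cur, new, Ordering::SeqCst, Ordering::Relaxed) {
            Ok(prev) => return prev, // success: return the previous value
            Err(actual) => cur = actual, // lost a race; retry with the fresh value
        }
    }
}

fn main() {
    let cell = AtomicUsize::new(10);
    assert_eq!(fetch_add_via_cas(&cell, 5), 10);
    assert_eq!(cell.load(Ordering::SeqCst), 15);
    println!("ok");
}
```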
Is there still a good reason preventing this from being stabilized?
Portability is still the concern I believe.
What C++ does here is provide the atomics on a best-effort basis, and those that aren't supported by the target are emulated using a mutex or similar.
The alternative is to provide the atomic types only on the architectures where they are actually available. I prefer this approach since it allows third-party crates to provide a portable solution on stable, but doesn't require us to do so in `core`.
It also seems that progress on scenarios is dead?
I think scenarios became https://github.com/rust-lang/rfcs/pull/1868, so these would be `#[cfg]`'d but they'd be part of the mainstream configs, so most people could use them without getting warnings or doing anything special?
Is anyone working on finishing the implementation of this RFC? If not, I would like to give it a try and would be looking for a mentor.
The atomic types are currently already guarded by `cfg(target_has_atomic)`, so there shouldn't be any trouble integrating this with a portability lint in the future.
I was mostly interested in the parts of the original RFC that are not implemented yet, like the 128-bit atomic types. I wanted to have an atomic pair of usizes, and can't really do so portably without those types.
See #39590 and #38959
I think the `ATOMIC_*_INIT` constants of the not-yet-stabilized atomics should be removed before stabilization; they're a crutch from a time without stable const functions.
Can we get an update, what are the current blockers for stabilization?
This is something I'm quite eager for, and if there's any work that could be shared, I would be happy to take some on to get this out the door.
AFAIK the main blocker here for stabilization is that these are not platform-agnostic types. Almost everything in the standard library is available on all platforms in one way or another, and these would be the first additions to libstd in a non-platform-specific location that aren't available on some "upper tier" platforms.
We in the standard library do not have a story for how to provide access to these types while maintaining the platform portability guarantees of the standard library. The "portability lint" was supposed to unblock this in theory. Effort has stalled out on both that and this, however.
I believe there aren't any technical issues blocking this, only issues around how we expose these APIs in libstd and where we expose them.
AFAIK the main blocker here for stabilization is that these are not platform-agnostic types. Almost everything in the standard library is available on all platforms in one way or another, and these would be the first additions to libstd in a non-platform-specific location that aren't available on some "upper tier" platforms.
Could you elaborate on what exactly the problem is here? We already have the unsigned atomic integer types, and AFAICT the other types of the RFC can be implemented for all platforms as "thin" wrappers over those, e.g., the `AtomicIX` types can be implemented on top of the `AtomicUX` types, the `AtomicBool` type on top of `AtomicU8`, and the pointer types on top of `AtomicUsize`.
The `AtomicUX` types are not stable.
@gnzlbg currently the existence of `AtomicU8` is a promise that the platform actually has instructions which operate on just one byte, as opposed to implementing some form of emulation and/or fallback in libstd. In that sense not all architectures have all the types (notably `AtomicU64` is lacking on a few).
Maybe we could move some of these to `std::arch`?
`AtomicBool` is already stable, so we already promise we can support one-byte atomic operations; there's no reason we can't stabilise `AtomicU8` and `AtomicI8` now. 16-bit atomics could be stabilised as well because `AtomicUsize` is stable and `usize` is at least 16 bits. I would even argue that we can stabilise 32-bit atomics, because it's only 16-bit platforms that might not support them, and those probably have bigger portability concerns than whether or not `AtomicU32` and `AtomicI32` are available.
@gnzlbg it's true! That didn't exist when this issue was created, but it seems like a reasonable-ish place to me to put these types if literally the only thing blocking them is the portability part (which I think is the state of play right now)
@ollie27 the part about `AtomicBool` could be considered an accidental regression from #33579, because `target_has_atomic = "ptr"` isn't necessarily guaranteed to be the same as `target_has_atomic = "8"`, although it may be for most of the platforms we have today. `AtomicBool` specifically has been around since 1.0, so our hands are tied there, but we can possibly make more proactive decisions about future types.
because `target_has_atomic = "ptr"` isn't necessarily guaranteed to be the same as `target_has_atomic = "8"`

Why wouldn't it be possible to implement `AtomicBool` or `AtomicU8` in terms of `AtomicUsize`?
because target_has_atomic = "ptr" isn't necessarily guaranteed to be the same as target_has_atomic = "8",
A smaller atomic operation can always be synthesized from a larger one using a compare-exchange loop: modify a subset of the larger word, and use compare-exchange to attempt to commit the results, loop until it succeeds.
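As an illustration of that masked compare-exchange technique (kept sound here by operating on a full `AtomicU32` that we own, rather than widening a `u8` access beyond its allocation, which is the hazard debated below), a sketch that atomically replaces one byte lane of a 32-bit word:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Atomically store `byte` into byte lane `lane` (0..4, little-endian
// numbering) of a 32-bit word, leaving the neighboring lanes untouched.
// If another thread changes any lane between our load and the
// compare-exchange, the CAS fails and we retry with the fresh value —
// so concurrent updates to neighboring lanes are never lost.
fn store_byte(word: &AtomicU32, lane: u32, byte: u8) {
    assert!(lane < 4);
    let shift = lane * 8;
    let mask = 0xffu32 << shift;
    let mut cur = word.load(Ordering::Relaxed);
    loop {
        let new = (cur & !mask) | ((byte as u32) << shift);
        match word.compare_exchange_weak(cur, new, Ordering::SeqCst, Ordering::Relaxed) {
            Ok(_) => return,
            Err(actual) => cur = actual,
        }
    }
}

fn main() {
    let word = AtomicU32::new(0xAABB_CCDD);
    store_byte(&word, 1, 0x11); // replace the 0xCC lane
    assert_eq!(word.load(Ordering::SeqCst), 0xAABB_11DD);
    println!("ok");
}
```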
@Amanieu I wonder how that works for an `[AtomicU8; N]`. On a target that only has 32-bit atomic operations, can the `AtomicU8`s be 8 bits wide, or do they have to be larger for that to work? If they are 8 bits wide, how does this work? I mean, if you are applying 32-bit-wide atomic operations on the `u8`s, you would be doing operations on neighboring bytes, might end up reading out of bounds, etc. How is this avoided?
@Amanieu Sounds like https://docs.rs/integer-atomics?
The trick seems wildly unsound at first sight (because we're touching memory falling outside the small atomic value), but... is actually not?
@stjepang Yes, that's exactly what I am talking about.
@gnzlbg This is "safe" in practice since a) the 32-bit atomic operation is 4-byte aligned and therefore can't cross a page boundary, and b) the compare-exchange loop guarantees that the neighboring bytes are never modified by the operation: if they were then the compare-exchange would fail and the loop will retry with the new value for the neighboring bytes.
@Amanieu That analysis ignores the compiler; LLVM alias analysis heavily relies on assumptions like `getelementptr inbounds` and "all accesses are inbounds". Feel free to do this in inline assembly, but if it goes through the optimizer I'd say that's extremely risky.
I'd say that's extremely risky.
It's not merely risky, it is UB; this code should fail to run in miri (see https://github.com/rust-rfcs/unsafe-code-guidelines/issues/2).
@RalfJung This transformation is actually done inside LLVM, in the backend passes where it lowers atomic intrinsics to arch-specific opcodes.
@Amanieu I think what @RalfJung means is that the optimization passes that run before the backend are allowed to assume that reads/writes out-of-bounds never happen in the input IR and can therefore optimize under this assumption.
Just because generating machine code that reads/writes out-of-bounds is ok for some targets does not imply that writing Rust code that does (or passing LLVM IR that does) is also ok.
Reads out-of-bounds are undefined behavior in Rust, C, C++, and AFAIK LLVM IR as well, so this is very unlikely to ever be ok (but see the corresponding issue: https://github.com/rust-rfcs/unsafe-code-guidelines/issues/2). Whether it is ok to do so via inline assembly, I don't know, but as long as you clobber everything that the inline assembly modifies (including the memory out-of-bounds), that would probably be ok.
But anyways, my point is that any platform that supports one size of atomic operation will automatically support all smaller sizes (even if we need to write custom implementations in inline asm). The only limit is the maximum size that a platform can atomically CAS.
This transformation is actually done inside LLVM inside in the backend passes where it lowers atomic intrinsics to arch-specific opcodes.
That changes nothing. Backend passes have different rules for UB, and they do not do the aggressive alias analysis that is performed during the main optimization phase.
Another way to say this is that backend passes operate on a different language, even if it shares the syntax with "main" LLVM IR. Just because something is allowed in assembly or "low-level LLVM IR" doesn't mean it is allowed in "normal LLVM IR", and the rules of the latter are relevant for Rust. Not doing this properly will lead to miscompilations.
@ollie27 @Amanieu yes, it's definitely possible (as the discussion has found) to implement atomics this way. The libs team has historically, however, decided to define the existence of `AtomicU8` as "there are native instructions for this". Emulation layers are left for crates.io (like `integer-atomics`).
As to whether such a strategy is even sound, I'll leave that to others!
@alexcrichton I remember the libs team decision being that standard library atomic types would be guaranteed to be lock-free, rather than simply "there are native instructions for this". The lock-free property is important since it guarantees that atomic operations can be safely used to communicate with a signal/interrupt handler.
A non-lock-free implementation of atomics would use a spinlock or mutex to emulate atomicity. However the algorithm that I described above (CAS loop which modifies a subset of a word) is lock-free since it does not block while waiting for other threads. Effectively it just uses a slightly longer instruction sequence than you might expect.
It is a bit late to back down on this decision since we already guaranteed the availability of atomics on older ARM architectures (pre-ARMv6) using this exact algorithm: https://github.com/rust-lang-nursery/compiler-builtins/blob/master/src/arm_linux.rs
Hm, I may be misremembering the libs team decision! I know for sure we want to guarantee lock-free, but I would personally like to also guarantee hardware/platform support. The pre-ARMv6 strategy you pointed out was accepted because it's Linux-specific, and we may have been overly zealous to accept non-u32-aligned ones there. AFAIK it's not stable other than `AtomicBool`, so I think we can still end up dropping most of those (and maybe on that one platform switch `AtomicBool` to word-sized).
I'd prefer to ideally not try to corner ourselves into a decision with historical debt, but rather figure out where we want to go and then work backwards from there. I would personally only like to expose `AtomicUXX` types for a platform if they're actually supported natively one way or another (either via hardware instructions or OS-level support like pre-ARMv6). We can always add more after that, and I think it covers the vast majority of use cases. In the meantime `integer-atomics` can fill any necessary gaps.
I guess this is a difference of perspective: I consider a CAS loop to be "using native hardware instructions" since it uses the native atomic CAS instruction, just with a larger word size. In fact many atomic operations are lowered this way: for example on ARM, a `fetch_add` is lowered into an `ldrex`/`strex` loop which retries if the destination cache line has been modified by another thread.
Anyways, I feel that we are going off topic here. The main point of this issue is that we would like to have integer atomic types available on stable rust. There is a vague concern about the portability, but no concrete proposals to address this.
I personally feel that the current system of exposing `#[cfg(target_has_atomic)]` to users is sufficient to address this portability concern. If a crate doesn't compile because `AtomicU64` is missing on some platform, the cause should be obvious. We could even (later) add a better error message when attempting to import a nonexistent `AtomicU64`.
Placing integer atomic types in `std::arch` is strictly worse than what we have now. These types belong in `std::sync::atomic` alongside `AtomicUsize` and `AtomicBool`.
IIRC our primary concern was not wanting to end up with `AtomicU64` implemented via `Mutex<u64>` on platforms without 64-bit atomics.
I personally feel that the current system of exposing `#[cfg(target_has_atomic)]` to users is sufficient to address this portability concern.
The objective of the portability lint is to help people be aware upfront of the portability of the code they're writing. With only `#[cfg(target_has_atomic)]`, a crate author has to proactively cfg off their APIs and probably test against targets without those atomics to avoid inevitably breaking the less common case.
@Amanieu the historical portability precedent set by libstd is that all modules are available everywhere except those explicitly whitelisted as "may require some #[cfg] trickery". For example `std::os` is explicitly defined as "this may require some cfgs", and so is `std::arch`. In that sense the current instantiation of atomic integer types doesn't fit into this story because they're not in a module clearly marked "may require cfgs".
The concrete proposal was to move all these types to the `arch` modules. I'm realizing now, though, that this doesn't work on ARM because target features are required to enable atomic types, not necessarily an architecture-specific type. (Same with WebAssembly and x86: atomic types, or some at least, are only available with certain target features enabled.)

This is the main blocker today: these types violate our existing portability story. The solution to this was the venerable portability lint, but that's still quite a ways off. If we want to short-circuit the portability lint then we need to develop a stabilization strategy that works in tandem with std's portability policy.
Then why not simply add `std::sync::atomic` to the list of modules that require #[cfg] trickery? This seems like the best solution at this point.

This doesn't block the portability lint: the lint can simply check that you have wrapped your atomic usage in `#[cfg(target_has_atomic)]`.
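A sketch of what that cfg-gating looks like for a crate author (the `Counter` type is hypothetical; the `target_has_atomic = "N"` cfg was unstable at the time of this thread and was stabilized later, in Rust 1.60):

```rust
use std::sync::atomic::Ordering;

// Only compile the 64-bit-atomic-based counter on targets that actually
// support 64-bit atomics; on other targets the type simply doesn't exist.
#[cfg(target_has_atomic = "64")]
pub struct Counter(std::sync::atomic::AtomicU64);

#[cfg(target_has_atomic = "64")]
impl Counter {
    pub const fn new() -> Self {
        Counter(std::sync::atomic::AtomicU64::new(0))
    }

    // Returns the previous value, like `fetch_add`.
    pub fn bump(&self) -> u64 {
        self.0.fetch_add(1, Ordering::Relaxed)
    }
}

fn main() {
    // Downstream code (or the portability lint) can check the same cfg
    // and provide a fallback on targets without 64-bit atomics.
    #[cfg(target_has_atomic = "64")]
    {
        let c = Counter::new();
        assert_eq!(c.bump(), 0);
        assert_eq!(c.bump(), 1);
    }
    println!("ok");
}
```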
I don't know if this applies to anyone else, but as a user, I'm primarily interested in `AtomicU32`/`AtomicI32`, because there are lots of APIs that involve 32-bit atomic values on 64-bit platforms. If every platform has 32-bit CAS and therefore can support these types acceptably, couldn't they be stabilized immediately? :-)

If every platform also has smaller CAS, or if it's deemed acceptable to synthesize smaller atomics using "oversize" CAS, the smaller atomics could be stabilized immediately as well.

Basically it seems like the only truly non-portable case might be `AtomicI64`/`AtomicU64`, so perhaps only those types really need to wait for the portability lint to be sorted out. And since all the platforms I care about are 64-bit (I'll never run my proprietary code on 32-bit), I won't miss them because I can just use `AtomicUsize`/`AtomicIsize` instead.
It's one possible solution, yeah, to simply say the types in `std::sync::atomic` are not platform-agnostic.
I wanted to get a grasp on the concrete portability story we're talking about, and I wasn't aware of any analysis done here recently, so I've run the compiler over a bunch of targets to see what instructions the various sized `swap`s generate, and got this table:
| Target | `AtomicU8::swap` | `AtomicU16::swap` | `AtomicU32::swap` | `AtomicU64::swap` |
|--------|------------------|-------------------|-------------------|-------------------|
| `x86_64-unknown-linux-gnu` | `xchgb` | `xchgw` | `xchgl` | `xchgq` |
| `x86_64-apple-darwin` | `xchgb` | `xchgw` | `xchgl` | `xchgq` |
| `i686-unknown-linux-gnu` | `xchgb` | `xchgw` | `xchgl` | `cmpxchg8b` |
| `i586-unknown-linux-gnu` | `xchgb` | `xchgw` | `xchgl` | `cmpxchg8b` |
| `arm-unknown-linux-gnueabi` | `ldrexb` | `ldrexh` | `ldrex` | `ldrexd` |
| `arm-unknown-linux-gnueabihf` | `ldrexb` | `ldrexh` | `ldrex` | `ldrexd` |
| `armv7-unknown-linux-gnueabihf` | `ldrexb` | `ldrexh` | `ldrex` | `ldrexd` |
| `mips-unknown-linux-gnu` | `ll`/`sc` (BUG) | `ll`/`sc` (BUG) | `ll`/`sc` | N/A |
| `mips64-unknown-linux-gnuabi64` | `ll`/`sc` (BUG) | `ll`/`sc` (BUG) | `ll`/`sc` | `lld`/`scd` |
| `powerpc-unknown-linux-gnu` | ?? (BUG?) | ?? (BUG?) | `lwarx`/`stwcx` | N/A |
| `powerpc64-unknown-linux-gnu` | ?? (BUG?) | ?? (BUG?) | `lwarx`/`stwcx` | `ldarx`/`stdcx` |
| `aarch64-unknown-linux-gnu` | `ldxrb` | `ldxrh` | `ldxr` | `ldxr` |
| `thumbv6m-none-eabi` | (no swap) | (no swap) | (no swap) | N/A |
| `thumbv7m-none-eabi` | `ldm` | `ldm` | `ldmda` | N/A |
| `thumbv7em-none-eabi` | `ldrexb` | `ldrexh` | `ldrex` | N/A |

The points of note are:

- `thumbv6m-none-eabi` has no `swap` at all; IIRC it's just load/store and other more simplistic operations.
- All types look to be available on all other platforms tested. This doesn't cover architectures like s390x, sparc, wasm, probably some arm variant, etc.
For targets that are "practically up there in their level of support" that's pretty bleak...
I think I would personally push back against simply saying the types are stable as-is today. We have no precedent for these sorts of types with varying support across platforms (at least of this prominence and this level of support) being in libstd without a clear warning about portability.
The portability problem has already been gotten wrong with SIMD, which has tons and tons of warnings about how platform-specific it is, and this represents yet another portability hazard if it's in such a prominent place as `std::sync::atomic`.
I'm ok with the solution of moving these types to `std::arch` myself, however, as that has clear warnings about portability and is, I feel, the best we can do at this time.
@willmo empirically it looks like `AtomicU32` is indeed supported everywhere I tested, at least!
Regarding your question about ARM architectures: armv5te and thumbv6 targets don't support atomics, except that armv5te emulates them with Linux kernel support.
I disagree with your "(BUG)" comments: the ll/sc loop is the standard way of performing sub-word atomic operations on those platforms. It is just a more complicated version of the ll/sc loop used for word-sized atomic operations.
In short there are really only 3 categories for targets:

- targets that support the full set of atomic operations,
- targets that support atomics only up to a certain size (e.g. no 64-bit atomics), and
- targets with no atomic support at all.

With the last category, we already have a stabilized precedent for variations in support for std::sync::atomic: thumbv6 doesn't support atomics at all. Also I feel that moving atomic types to `std::arch::$arch` will actually make code less portable. Code using integer atomics will now have to be specialized for every architecture:

```rust
#[cfg(target_arch = "x86")]
use std::arch::x86::atomic::AtomicU32;
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::atomic::AtomicU32;
// Oops, now this code will only work on x86 despite the fact that it would work
// just as well on ARM, PowerPC, MIPS, etc.
```
And even then, there is still varying atomic support within an architecture. This is particularly true for ARM, but it is also the case on x86_64: `AtomicU128` is supported on all x86_64 chips except the earliest ones from AMD.
In conclusion, I don't feel that moving atomic types to `std::arch` actually solves any problems, and instead introduces new ones. I feel that the current (unstable) situation of having atomic types conditionally available depending on the target is the best approach to take. Look at it this way: if a crate is found not to compile on some architecture due to missing atomic support, an issue will be opened on GitHub and the problem will be quickly solved.
Oh sorry yeah by "BUG" I meant that it didn't follow what I assumed to be our contract, that we only provide atomic types which match exactly with the architecture in question, excluding the fact that any smaller atomic operations can be implemented in terms of larger ones. It's fine for that to be a separable question, I don't mean for it to get in the way.
It's true that atomics on ARM are sort of odd! I'm not sure what to really do about that. That being said most of the platforms that don't have atomics are pretty low down on the platform support tiers, so we could relegate them to "unresolved questions" like targets without floats rather than having them block other designs.
It's true that moving these into `std::arch` would mean special code per architecture. A crate on crates.io, however, could reexport a portable interface which does all the multiplexing and has documented fallbacks or options for what fallbacks should do on unsupported platforms.
I personally disagree that `std::arch` doesn't solve any problems, but I do agree that it creates an ergonomic barrier to using the types. I feel it clearly signals that these operations aren't 100% portable, as most of the rest of the standard library already is. These are already somewhat niche types, so the ergonomics aren't, I think, as important as for `AtomicUsize` and friends.
Using `std::arch`, in my mind, is basically entirely centered around:
From this discussion I conclude that there are basically three tiers of platform support for any atomic operation: native support, loop emulation, and mutex emulation.
Two directly conflicting goals are a) portability and b) protecting the programmer from a potential performance footgun (using emulated atomics over a more efficient solution).
I think these very different goals deserve different treatment: while an application targeted at a machine that only offers 32-bit atomics is almost always better off using 32-bit atomics instead of loop-emulated 16-bit atomics (potentially wasting some memory, though), I'd much rather have a library fall back to loop emulation than get a compile error. The slight performance impact is just not worth the despair of having to dive into a foreign codebase in order to fix portability errors.
Mutex emulation is less obvious considering that it can slow down an application by several orders of magnitude. But I would argue that from a portability standpoint ("I'm trying to use someone else's code on my platform") even this is generally acceptable.
Similar to how I don't get linter warnings for dependency crates, this is how I think the compiler should react to different kinds of atomic calls:
| Compile type | Native support | Loop emulation | Mutex emulation |
|--------------|----------------|----------------|-----------------|
| Local        | fine           | warning        | warning         |
| Dependency   | fine           | fine           | warning         |
This gives me a heads up if a crate I'm depending on is going to be severely slower while avoiding accidental performance problems from my own code.
I generally agree with your sentiment about tiers of atomic support. But note that some popular RISC platforms, such as 32-bit ARM, require all atomics to be implemented using load-linked/store-conditional loops. I would be wary of linting on those, so it feels to me that such a lint should be allow-by-default.
Even on x86, some atomic operations are implemented using a `cmpxchg` loop, e.g. `fetch_and`. Implementing atomic operations using a loop is completely normal and should be treated the same way as native support.
The performance isn't actually the reason why we make a distinction between so called "lock-free" atomics and ones emulated with a mutex. If you are using atomics to share data between main code and a signal/interrupt handler then you must use lock-free atomics, otherwise your code may deadlock. This can happen if the interrupt happens while the mutex used for atomic emulation is locked.
Maybe a bit of extra terminology could help here: when atomics are implemented using CAS or LL/SC loops, they are lock-free but not wait-free.
In simple terms, lock-freedom means no single thread can block every other thread, provided (1) it only interacts with them via atomics and (2) the atomics are not used in a loop to implement a higher-level lock.
Wait-freedom, in contrast, means that a thread cannot be infinitely delayed by other threads hammering the atomic in a loop. This is obviously not true of an atomic operation that is implemented via a CAS or LL/SC loop.
Since we cannot provide wait-free atomics on some platforms, we may want to clarify in the documentation that atomics are only guaranteed to be lock-free, not wait-free.
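The standard library later exposed this retry-loop shape directly as `fetch_update` (stabilized in Rust 1.45, after this discussion), which makes the lock-free-but-not-wait-free character explicit: the closure may run several times if other threads race. A small illustration:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    let x = AtomicUsize::new(7);

    // `fetch_update` is a compare-exchange loop: load the value, apply
    // the closure, try to CAS the result in, and retry on contention.
    // It is lock-free (some thread always makes progress) but not
    // wait-free (an unlucky thread can keep losing the race).
    let prev = x
        .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |v| Some(v * 2))
        .unwrap();
    assert_eq!(prev, 7);
    assert_eq!(x.load(Ordering::SeqCst), 14);
    println!("ok");
}
```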
If `#[cfg(accessible(...))]` is added, I think it will cover all use cases of `#[cfg(target_has_atomic = "x")]`, as users can then check for the presence of the atomic types directly.
One use case I could think of, if smaller atomics are emulated and therefore always present but `target_has_atomic = "x"` is only true when there is native support, is using it to check for wait-freeness.
But from Amanieu's comment above, that would only tell you whether `store()` is wait-free, so people will need to check that the specific operations they need are wait-free.
Now, `accessible(...)` hasn't even passed the RFC stage yet, but it seems like a much cleaner solution than exposing a `target_has_xxx` attribute for each kind of target feature.
Stabilizing the types without either `accessible(...)` or `target_has_atomic = "x"` being available would still be useful, as people can use `target_arch = "x"`, which, while less portable, offers much stronger guarantees.
I think that documenting that atomics are at least lock-free and possibly wait-free should be enough to move forward. An optional documentation feature that would make this pretty damn perfect: also documenting which platforms/atomic sizes are wait-free.
In general, I completely agree with @Amanieu that CAS loop (AKA lock-free but not wait-free) atomics would be considered "native atomic support" by basically everyone. But going above-and-beyond by documenting this should settle any concerns.
I'm nominating this for discussion at the next libs triage meeting, but to try to make progress on this discussion I'd like to separate out a few points. If others have thoughts on these (or other points), please let me know!
- The `AtomicXYY` types, if they exist in libstd, guarantee that they are lock-free.
- The `AtomicXYY` types do not guarantee that they are wait-free, and I think this is where we are today.
- It's unclear (to me at least) what to do about "emulation" of small-size atomics using larger-sized atomics, for example emulating `AtomicU8` with `AtomicU32` operations. It's also unclear whether it's even allowed to implement `AtomicU8` in terms of `AtomicU32` (in terms of LLVM guarantees and whatnot); this may only be a valid thing for LLVM's backend code generator to generate.
- Where do these types live? Some in `std::sync::atomic` and others in `std::arch`? All in `std::sync::atomic`?

The main two options for placement of these types are:

- All types are in `std::arch` and exposed as they're available. This likely wouldn't stabilize the `target_has_atomic` cfg directive.
- All types are in `std::sync::atomic`, and the `target_has_atomic` cfg directive is also stabilized.

The second question is whether or not to stabilize the emulated atomics, and that's just a question of whether the APIs are stabilized or not.
The `AtomicXYY` types do not guarantee they are wait-free, and I think this is where we are today
This is fine, keep in mind that C++11 atomics (which we are based on) makes no guarantees about wait-freedom either.
- It's unclear (to me at least) what to do about "emulation" of small size atomics using larger-sized atomics. For example emulating `AtomicU8` with `AtomicU32` operations.
I think it's fine to defer this issue. Currently none of the built-in targets make use of the `min-atomic-width` attribute. This was only added in #38579 to support the out-of-tree OR1K target, and even then it should be possible to implement emulation for those in compiler-builtins. In any case, this doesn't block stabilization, as you mentioned at the end.
Ok we've discussed this in a recent @rust-lang/libs triage, and the conclusion was that the proposal to stabilize all these types as-is is probably the way to go. The stabilization would be coupled with documentation updates indicating that these aren't as portable as, say, `Add for u8`, but they're available on most platforms. Additionally, it was concluded that stabilizing smaller-size atomics for platforms that only have larger-size atomics was fine to do.
I believe this is generally the trend of this thread anyway, so I'm going to open a dedicated thread and FCP this for stable
Ok for those following along here, I've opened a formal proposal for stabilization at https://github.com/rust-lang/rust/issues/56753, feedback of course is always welcome!
smaller-size atomics for platforms that only have larger-size atomics
Just to be sure: the encoding of smaller-sized atomics in terms of larger-sized atomics is done by LLVM, as part of LLVM's lowering to machine-specific IR or so? I maintain that doing this at the level of Rust, MIR or LLVM IR is illegal because of potential out-of-bounds accesses, and we shouldn't do it.
@RalfJung correct, that's what convinced me personally that we can't do this on crates.io, which means if we want it at all we need it in the standard library (via LLVM intrinsics). I think we want it, so I'm convinced to put it into libstd :)
@RalfJung This lowering is done either within LLVM, or through a function in compiler_builtins.
The latter is currently only used on armv5te-unknown-linux-gnu, and uses this code. It could be argued that this is UB, since `intrinsics::atomic_load_unordered` could be used to read out-of-bounds data; however, this is guaranteed not to fault because it doesn't cross a page boundary.
@alexcrichton makes sense!
@Amanieu
It could be argued that this is UB
And the argument would be correct :)
is guaranteed not to fault because it doesn't cross a page boundary.
And as in the last N cases we have had this argument (and as I am sure you are aware, but not everybody else might be), that doesn't change anything about this being UB when we are talking about code written in Rust, MIR or LLVM IR. ;) (I am beginning to feel sorry for being so annoying about this, but LLVM is way too smart and getting smarter every day, so I am actively worried that such arguments will blow in our face some day.)
Is this a pattern supported/intended by LLVM? Is there advice from the LLVM devs for how to do this?
Is there any chance of LLVM ever inlining those compiler-builtins functions? Actually even having them in the same translation unit could be enough to cause problems, because LLVM could infer attributes on the functions to propagate information about what they do out to use sites.
One safer alternative would be to use inline assembly to implement such operations, that would most likely exclude any way for LLVM to notice that there are out-of-bounds accesses. But I am not sure if that's an option here.
The code is more-or-less based on the GCC implementation, which gets away with a normal atomic load.
Would changing the load to a volatile atomic load help in this case?
To add to what @RalfJung is saying, @Amanieu: just because it works at the hardware layer doesn't mean it isn't UB in LLVM's IR. For example this function:
define i8 @bar() {
start:
%a = alloca i8
store i8 0, i8* %a
%b = call i8 @foo(i8* %a)
ret i8 %b
}
define internal i8 @foo(i8*) {
start:
%b = getelementptr i8, i8* %0, i32 1
%a = load i8, i8* %b
ret i8 %a
}
is sort of a simplistic view, but it's guaranteed to never fault because the out-of-bounds load will just load some byte of the return address on the call stack or something weird like that. When optimized, however, it yields:
define i8 @bar() local_unnamed_addr #0 {
start:
ret i8 undef
}
(showing that this is undefined behavior)
LLVM can't automatically deduce that all instances of this pattern are undefined behavior; in isolation foo
optimizes just fine. That's why compiler-builtins happens to work: we're forcing LLVM to have less knowledge about the inputs, so it just-so-happens that it can't deduce that undefined behavior is happening.
All that's just to say that I think @RalfJung is totally correct here: a crates.io-based implementation of smaller-sized atomics in terms of larger-sized atomics is just a segfault waiting to happen. LLVM may not even detect that it's UB today, but it's definitely UB at the LLVM IR layer (and probably the Rust layer) to read out of bounds on objects. Why exactly it's UB or what exactly happens is always up for grabs, which is why it works most of the time, but this is fundamentally why we need LLVM's backend to do the lowering: the IR passes need to see that we're just modifying/loading one byte, not the bytes around it.
The operations that @Amanieu wants to perform cannot be performed by a programming language generating LLVM-IR directly. Inline assembly appears to be the only way to perform these right now, so we could still expose them _I think_ (@RalfJung? I don't know whether compiler-builtins would work too).
In the meantime, I think it would be better to open an issue in the LLVM bugzilla about this, explaining why these operations are useful, why the LLVM-IR generated for them has undefined behavior, and how that requires us to use inline assembly (or modify compiler-builtins) instead. We should ask: what should we do? Should we use inline assembly / our own compiler built-ins ? Will LLVM expose intrinsics to allow these safely? etc.
It might be worth mentioning that this is not the only situation in which we need to perform reads out-of-bounds (see https://github.com/rust-rfcs/unsafe-code-guidelines/issues/2).
Would changing the load to a volatile atomic load help in this case?
No. Volatile reads in practice have some positive effects on racy reads (but LLVM may change those rules any time as we are relying on de-facto behavior here). It doesn't change anything about the requirement that accesses must be in-bounds.
The proper way to fix this is (as @gnzlbg mentioned) to add an attribute to LLVM that can be set on reads/writes and that indicates that the access may be partially out-of-bounds. Then we need a matching intrinsic in Rust, and methods such as read_out_of_bounds and write_out_of_bounds on pointers. Considering we need this for concurrency, we'd also need to think about how to expose atomic out-of-bounds accesses in Rust. Anything else (anything just arguing based on page boundaries but not informing LLVM) will remain a hack. Given that this seems to be a useful pattern, I absolutely think we should lobby for LLVM to add such an attribute!
just because it works at the hardware layer doesn't mean it isn't UB in LLVM's IR. For example this function
Thanks for the example, I'll link to this when such discussions come up again in the future. :)
That's why compiler-builtins happens to work, we're forcing LLVM to have less knowledge about the inputs so it just-so-happens that it can't deduce that undefined behavior is happening.
That sounds way less confident than I had hoped...
When does compiler-builtins get linked with the real program? Is there a chance that LTO might inline compiler-builtins functions (which then would mean LLVM could deduce the UB)?
When does compiler-builtins get linked with the real program? Is there a chance that LTO might inline compiler-builtins functions (which then would mean LLVM could deduce the UB)?
As rtlib calls are only inserted at the SelectionDAG layer, while LTO still operates on LLVM IR, I don't believe there is any possibility of these getting inlined.
@nikic is correct, we explicitly don't LTO compiler-builtins as well (it's a very special crate). In that sense there's no worry for inlining compiler-builtins intrinsics.
Okay. I can live with that. We should keep it in mind though for the future, if/when compiler-builtins treatment ever changes.
So, yeah, I agree we should go forward with such "emulated" small-int atomics implemented via LLVM lowering or compiler-builtins.
For those following this thread, the stabilization proposal is now in FCP
Can I ask a question (just curious)? It seems that constants like ATOMIC_I64_INIT are marked as stable since 1.34 and deprecated since 1.34 at the same time. Why stabilize something that is deprecated? It may be just my opinion, but I think that getting a new stable feature that is deprecated from the beginning is rather strange...
Nice catch! I think we should just remove those constants.
That's convincing to me, @macpp -- opened https://github.com/rust-lang/rust/issues/58089 to track it.
This is listed as the tracking issue for cfg_target_has_atomic, which is still unstable. Should this be reopened?
Yep - reopened.
Removing T-Libs since this is a pure language feature.
Is there any progress on this? Can anyone explain what cfg_target_has_atomic is blocked on? I.e. what are the questions we need to resolve before stabilizing?
AtomicU32 was stabilized for 1.34.0
I have one objection to the way target_has_atomic = "cas" works. I would prefer if we split this into two separate cfgs:
- target_has_atomic = 8/16/32/64/128: This indicates the largest width that the target can atomically CAS (which implies support for all atomic operations).
- target_has_atomic_load_store = 8/16/32/64/128: This indicates the largest width that the target can support loading or storing atomically (but may not support CAS).
(bikeshed: maybe a slightly shorter name target_has_atomic_ldst)
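A rough sketch of how code might consume these two cfgs (note that target_has_atomic_load_store is only the proposal above, not an existing cfg; this fragment is illustrative, not prescriptive):

```rust
// Hypothetical usage: prefer full atomics where CAS exists, and fall back
// to load/store-only code where only those are available.
// `target_has_atomic_load_store` is the cfg proposed above, not a real one.
#[cfg(target_has_atomic = "64")]
use core::sync::atomic::AtomicU64;

#[cfg(all(not(target_has_atomic = "64"), target_has_atomic_load_store = "64"))]
mod load_store_only {
    // code restricted to atomic load/store would go here
}
```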
Is CAS the only operation that we'd need to call out that way (e.g. are there any platforms we care about that have atomic load/store but not swap)?
It seems like we should be able to stabilize target_has_atomic itself though with @Amanieu's definition.
thumbv6 has load, store, but no swap or cas
Does thumbv6 have any kind of read-modify-write instruction? Maybe presence or absence of atomic RMW instructions could be the right discrimination criterion...
No, thumbv6 has nothing of the sort. Perhaps a better name would be #[cfg(target_has_atomic = "rmw")], but that still doesn't really capture the swap operation.
Why? Swap reads the old value, replaces it with the new one, and writes that in a single atomic transaction, so it is RMW in my book.
cc #65214
Why? Swap reads the old value, replaces it with the new one, and writes that in a single atomic transaction, so it is RMW in my book.
Yeah, fair point.
Does thumbv6 have any kind of read-modify-write instruction? Maybe presence or absence of atomic RMW instructions could be the right discrimination criterion...
I think a CAS cfg is correct because all the other RMW operations can be implemented with it, but having one RMW operation like swap doesn't allow you to implement the rest. So for targets that just have swap, fetch_add, etc., but not CAS, we might need more cfgs, but I don't think it would add enough value to be worth it.
Good point! I think we can agree on the following conclusion:

If we find hardware which supports e.g. test-and-set but not CAS, then we may want to support it as well with a finer-grained cfg (e.g. target_has_atomic_test_and_set), because test-and-set is all you need to implement a mutex, and a mutex is all you need to emulate any other atomic instruction _in a blocking manner_. Since Rust atomics are guaranteed to be at least lock-free, this substitution cannot be done silently by std and must be performed manually on the user's side. Therefore, it is not transparent and must be exposed by a cfg, if and when the situation arises.

All this is conditional on the existence of hardware which has some atomic RMW instructions, but none with infinite consensus number. I'm not personally aware of any, but embedded chips and legacy hardware are full of surprises so it's best to keep that door open at the syntax level.

I believe that the syntax proposed by @Amanieu (target_has_atomic vs target_has_atomic_load_store) does so, therefore I'm happy with it.
If we find hardware which supports e.g. test-and-set but not CAS, then we may want to support it as well with a finer-grained cfg (e.g. target_has_atomic_test_and_set), because test-and-set is all you need to implement a mutex
However, "normal" sequentially consistent loads and stores are also sufficient to implement a Mutex using Dekker's algorithm (for 2 CPUs) or Peterson's algorithm (for any number of CPUs). Now I wonder, how does that fit in?
All this is conditional on the existence of hardware which has some atomic RMW instructions, but none with infinite consensus number. I'm not personally aware of any, but embedded chips and legacy hardware are full of surprises so it's best to keep that door open at the syntax level.
Some old ARM chips (~ARMv5) only have an atomic SWP instruction and nothing else. However, neither GCC nor LLVM actually use this instruction for atomics, so atomics are unsupported on these architectures.
IMO we should follow the same general policy: only support atomic operations if all of them are supported (which essentially boils down to whether CAS is supported since you can use it to emulate the others).
Having access to limited atomic operations might still be useful for some niche applications (eg. on ARM7TDMI, which is still somewhat widespread), so I think it would be unfortunate if these use cases are prevented by a matter of policy.
If we find hardware which supports e.g. test-and-set but not CAS, then we may want support it as well with a finer-grained cfg (e.g. target_has_atomic_test_and_set), because test-and-set is all you need to implement a mutex
However, "normal" sequentially consistent loads and stores are also sufficient to implement a Mutex using Dekker's algorithm (for 2 CPUs) or Peterson's algorithm (for any number of CPUs). Now I wonder, how does that fit in?
Given that, if a hardware architecture cares so little about concurrency that it does not even expose a test-and-set instruction, it is unlikely to provide the required memory barriers for SeqCst ordering, I don't think that these algorithms are applicable outside of very constrained embedded scenarios where the target hardware is exactly known and hardware portability is not desired at all.
Even thumbv6 has fully working loads and stores, despite it not having anything more sophisticated than that (no swap, CAS, or anything else). These are still sufficient for implementing things like SPSC queues.
Thumbv6 is also used in multicore processors, often alongside a more powerful Cortex-M3/M4 core (which is thumbv7 and does have CAS, etc.). This means that implementing a Mutex using one of the algorithms Ralf linked above might actually make sense on these MCUs. Manufacturers of these MCUs also provide peripherals that provide synchronization primitives, but these are often specific to the MCU family and don't exist on others.
@jonas-schievink By "fully working loads and stores", do you mean that if appropriate memory barriers are inserted in the right place, it is possible to ensure that if CPU core 1 writes to a first memory location, and CPU core 2 writes to a different memory location, all CPU cores in the system that read from both locations will see these two writes occurring in the same order?
The reason I'm asking is that I have recently heard about a paper by concurrency researchers that I would tend to trust, which claimed that this guarantee, which is pretty much the defining characteristic of SeqCst atomic memory ordering on atomic ops, could actually not be fully provided on an architecture as mainstream and concurrency-oriented as POWER.
To me, this suggests that SeqCst load/store guarantees are hard to provide at the hardware level, and that it is totally possible that less concurrency-focused hardware cannot support those language-level semantics at all either.
(OTOH, SeqCst fence semantics are usually easier to implement in hardware via "full fences", and it's probably possible to reimplement many algorithms that were initially expressed using SeqCst loads and stores using these fences)
do you mean that if appropriate memory barriers are inserted in the right place, it is possible to ensure that if CPU core 1 writes to a first memory location, and CPU core 2 writes to a different memory location, all CPU cores in the system that read from both locations will see these two writes occurring in the same order?
I mean that thumbv6 allows using any Ordering, including SeqCst, in store and load operations without issue. If the implementation of that turns out to be incorrect, then that's a bug in LLVM. Unfortunately Rust's memory model doesn't exist, and Rust's Ordering is only documented to be "the same as LLVM's".
That's fine since thumbv6 has memory barrier instructions even though it doesn't have atomics.
Rust's memory model doesn't exist, and Rust's Ordering is only documented to be "the same as LLVM's".
For concurrency, Rust documents that it uses C11's memory model. So "same as LLVM's" would actually be factually wrong, where is that quote from?
@HadrienG2 that to me sounds like a mostly separate issue from the one discussed here... even specifying SeqCst accesses is actually really hard, it turns out, when the accesses are mixed with release/acquire/relaxed accesses to the same location. I have not heard about this POWER bug but it seems plausible. But I'd consider this a hardware or spec bug; the C11 memory model specifies what we expect correct programs to behave like, and it was developed with plenty of input from hardware people.
But SeqCst is the odd one out here. Release/acquire loads/stores are almost always all you need, and with SC fences added to the mix you can represent almost anything.
For concurrency, Rust documents that it uses C11's memory model. So "same as LLVM's" would actually be factually wrong, where is that quote from?
Interesting! It's from the API docs of Ordering:
Rust's memory orderings are the same as LLVM's.
@RalfJung Yeah, I agree that we should probably take this SeqCst digression elsewhere. What was IMO most important here is that there are architectures which provide sufficient building blocks for writing mutexes (and therefore blocking synchronization), without providing enough building blocks for supporting Rust's full lock-free atomics operation vocabulary, and we may want to keep the door open for supporting those.
Is there a reason that AtomicU128 is still unavailable on stable?
The names of the atomic integers within core::sync::atomic have been stably present since 1.34, but gated behind the still-unstable cfg(target_has_atomic) flag. This means that crates that import these symbols will fail to compile when built on a new target, as shown in mystor/radium#3.
I am faking the compiler’s atomic awareness in a build script for radium (mystor/radium#4), which is a weaker emulation of the target profile information in rustc_target.
Since the most obvious problem with code portability across targets is now the presence of symbols in core::sync::atomic, what would it take to stabilize cfg(target_has_atomic = "width") for consumption solely to indicate whether a symbol exists? I am not well-versed enough in the specifics of what sort of behavior we want to commit to supporting, so I do not want to commit to any other behavior than being able to determine whether a given AtomicX type exists.
Ideally, I would like to be able to stably use the following
#[cfg(target_has_atomic = "8")]
use core::sync::atomic::{AtomicBool, AtomicI8, AtomicU8};
#[cfg(target_has_atomic = "16")]
use core::sync::atomic::{AtomicI16, AtomicU16};
#[cfg(target_has_atomic = "32")]
use core::sync::atomic::{AtomicI32, AtomicU32};
#[cfg(target_has_atomic = "64")]
use core::sync::atomic::{AtomicI64, AtomicU64};
#[cfg(target_has_atomic = "ptr")]
use core::sync::atomic::{AtomicPtr, AtomicIsize, AtomicUsize};
to guard whether symbols can even be imported, regardless of the behavior of those symbols.
How can we work towards stabilizing some form of cfg detection for symbol availability?
@myrrlyn https://github.com/rust-lang/rust/issues/64797 might also help with that
@rustbot prioritize
Explanation for humans: This has been sitting idle for a long time and really makes the standard library look incomplete. The documentation at https://doc.rust-lang.org/std/sync/atomic/index.html#portability isn't even accurate: the list of targets there is not comprehensive, and it doesn't explain that it is methods that are missing, not just types. The suggested workaround of #[cfg(target_arch)] does not even work, because thumbv6m is not a valid value for target_arch.
An alternative to this would be to simply make the atomics api available on all supported targets, otherwise this really should be implemented soon.
We discussed this in the T-lang meeting today, and though no specific objections to stabilization were discussed, I did have the concern that this feels like it overlaps 1:1 with what cfg(accessible) would provide, in which case this is not particularly useful and seems like not something we should stabilize.
We did not reach any firm conclusions, but my feeling was that we were interested in putting time into cfg(accessible) rather than into this issue in particular.
That said, if I was wrong in the assumption that this is a 1:1 with cfg(accessible), then we should discuss further.
Removing nomination for now.
The reason I'm asking is that I have recently heard about a paper by concurrency researchers that I would tend to trust, which claimed that this guarantee, which is pretty much the defining characteristic of SeqCst atomic memory ordering on atomic ops, could actually not be fully provided on an architecture as mainstream and concurrency-oriented as POWER.
@HadrienG2 do you have a reference to this paper?
@tavianator There you go: http://plv.mpi-sws.org/scfix/paper.pdf
Has there been any progress on this since?