This is still blocked on some reasonable way to talk about platform-specificity in a more granular way than what we currently support.
Is there any reason why `AtomicU32` would be hard to stabilize? We need a way to store multi-word bitflags in an atomic thingy. `AtomicUsize` won't work here since it's too large on 64-bit.
What do you mean by "too large"?
Let's say I have a 64-bit bucket of flags. I need two atomic u32s to represent it. I can't use two usizes because that's too large on 64-bit.

Alternatively, let's say I have a 32-bit bucket of flags. This is on a struct where size matters. I need a `u32`, not something larger on 64-bit.
I guess I could `cfg()` it and abstract over it. That solves the first problem but not the second.
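To make the "two 32-bit halves" idea concrete, here is a minimal sketch (the `Flags64` type and its methods are hypothetical, purely for illustration; `AtomicU32` was unstable at the time of this comment but has since been stabilized):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical example: a 64-bit bucket of flags represented as two
// 32-bit atomic halves, so the in-memory size stays 8 bytes even on
// 64-bit targets (two `AtomicUsize`s would be 16 bytes there).
// Note each half is individually atomic; the pair as a whole is not.
pub struct Flags64 {
    lo: AtomicU32,
    hi: AtomicU32,
}

impl Flags64 {
    pub const fn new() -> Self {
        Flags64 { lo: AtomicU32::new(0), hi: AtomicU32::new(0) }
    }

    // Atomically set a single flag bit (0..64).
    pub fn set(&self, bit: u32) {
        assert!(bit < 64);
        let (word, shift) = if bit < 32 { (&self.lo, bit) } else { (&self.hi, bit - 32) };
        word.fetch_or(1 << shift, Ordering::SeqCst);
    }

    // Test a single flag bit (0..64).
    pub fn get(&self, bit: u32) -> bool {
        assert!(bit < 64);
        let (word, shift) = if bit < 32 { (&self.lo, bit) } else { (&self.hi, bit - 32) };
        word.load(Ordering::SeqCst) & (1 << shift) != 0
    }
}

fn main() {
    let f = Flags64::new();
    f.set(3);
    f.set(40);
    assert!(f.get(3) && f.get(40) && !f.get(5));
    // The whole bucket stays 8 bytes, addressing the size concern above.
    assert_eq!(std::mem::size_of::<Flags64>(), 8);
    println!("ok");
}
```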
@Manishearth 16-bit-usize targets would give you 16-bit atomic width. 100% savings over 32-bit systems! (and potential loss of data)
There isn't really much problem stabilising any of these IMO, but you'll still have to cfg on `target_has_atomic = "32"` in your code.
16-bit-usize targets would give you 16-bit atomic width.
ugh, these exist, right.
What do I have to do to get these stabilized? I'm okay with the cfgs.
I'm also okay with just u8 being stabilized, as long as I can pack it (not sure if this is possible with atomics)
What do I have to do to get these stabilized?
Any answer on this? cc @alexcrichton
The RFC looks like it's been completely implemented. IMO stabilizing it with `target_has_atomic` is fine, and we can add further granularity if we need in new RFCs.
@Manishearth state hasn't changed from before. The libs team is essentially waiting for a conclusion on the scenarios story before stabilizing.
That discussion has stalled. Last I checked it seemed to mostly have come to consensus on the general idea? Who are we waiting on for a conclusion here? What would the ETA on this be? How can I help?
It also seems like that's mostly orthogonal to this. We can stabilize this using cfgs, and add scenarios later. They seem to be pretty compatible.
(After discussing in IRC, it seems like there have been a lot of recent discussions about this, and the libs team is currently moving on pushing this forward.)
Has there been any progress on this since?
Just FYI: Per [1], Gecko's Quantum CSS project is no longer blocked on this issue.
We can stabilise `AtomicU8` and `AtomicI8` the same way we have stabilised `AtomicBool` – by using `AtomicIsize`/`AtomicUsize` operations everywhere.
FWIW, my integer-atomics crate is a stopgap solution for everyone who is stuck on stable for now. However, it emulates the operations using `AtomicUsize` cmpxchg loops, so performance is bad compared to genuine atomic operations.
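For reference, the general shape of a cmpxchg-loop emulation looks like this sketch (a generic retry loop written against `AtomicUsize`; this is illustrative, not the actual `integer-atomics` code):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Emulate `fetch_add` with an explicit compare-exchange loop: read the
// current value, compute the new one, and retry if another thread raced
// in between. This is lock-free, since a failed CAS means some other
// thread's operation succeeded.
fn fetch_add_via_cas(cell: &AtomicUsize, n: usize) -> usize {
    let mut cur = cell.load(Ordering::Relaxed);
    loop {
        let new = cur.wrapping_add(n);
        match cell.compare_exchange_weak(cur, new, Ordering::SeqCst, Ordering::Relaxed) {
            Ok(prev) => return prev, // success: return the previous value
            Err(actual) => cur = actual, // lost a race; retry with the fresh value
        }
    }
}

fn main() {
    let cell = AtomicUsize::new(10);
    assert_eq!(fetch_add_via_cas(&cell, 5), 10);
    assert_eq!(cell.load(Ordering::SeqCst), 15);
    println!("ok");
}
```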
Is there still a good reason preventing this from being stabilized?
Portability is still the concern I believe.
What C++ does here is provide the atomics on a best-effort basis, and those that aren't supported by the target are emulated using a mutex or similar.
The alternative is to provide the atomic types only on the architectures where they are actually available. I prefer this approach since it allows third-party crates to provide a portable solution on stable, but doesn't require us to do so in `core`.
It also seems that progress on scenarios is dead?
I think scenarios became https://github.com/rust-lang/rfcs/pull/1868, so these would be `#[cfg]`'d but they'd be part of the mainstream configs, so most people could use them without getting warnings or doing anything special?
Is anyone working on finishing the implementation of this RFC? If not, I would like to give it a try and would be looking for a mentor.
The atomic types are currently already guarded by `cfg(target_has_atomic)`, so there shouldn't be any trouble integrating this with a portability lint in the future.
I was mostly interested in the parts of the original RFC that are not implemented yet, like the 128-bit atomic types. I wanted to have an atomic pair of usizes, and can't really do so portably without those types.
See #39590 and #38959
I think the `ATOMIC_*_INIT` constants of the not-yet-stabilized atomics should be removed before stabilization; they're a crutch from a time without stable const functions.
Can we get an update, what are the current blockers for stabilization?
This is something I'm quite eager for, and if there's any work that could be shared, I would be happy to take some on to get this out the door.
AFAIK the main blocker here for stabilization is that these are not platform-agnostic types. Almost everything in the standard library is available on all platforms in one way or another, and these would be the first additions to libstd in a non-platform-specific location that aren't available on some "upper tier" platforms.
We in the standard library do not have a story for how to provide access to these types while maintaining the platform portability guarantees of the standard library. The "portability lint" was supposed to unblock this in theory. Effort has stalled out on both that and this, however.
I believe there aren't any technical issues blocking this, only issues around how we expose these APIs in libstd and where we expose them.
AFAIK the main blocker here for stabilization is that these are not platform-agnostic types. Almost everything in the standard library is available on all platforms in one way or another, and these would be the first additions to libstd in a non-platform-specific location that aren't available on some "upper tier" platforms.
Could you elaborate on what exactly the problem is here? We already have the unsigned atomic integer types, and AFAICT the other types of the RFC can be implemented for all platforms as "thin" wrappers over those, e.g., the `AtomicIX` types can be implemented on top of the `AtomicUX` types, the `AtomicBool` type on top of `AtomicU8`, and the pointer types on top of `AtomicUsize`.
The `AtomicUX` types are not stable.
@gnzlbg currently the existence of `AtomicU8` is a promise that the platform actually has instructions which operate on just one byte, as opposed to implementing some form of emulation and/or fallback in libstd. In that sense not all architectures have all the types (notably `AtomicU64` is lacking on a few).
Maybe we could move some of these to `std::arch`?
`AtomicBool` is already stable, so we already promise we can support one-byte atomic operations; there's no reason we can't stabilise `AtomicU8` and `AtomicI8` now. 16-bit atomics could be stabilised as well because `AtomicUsize` is stable and `usize` is at least 16 bits. I would even argue that we can stabilise 32-bit atomics, because it's only 16-bit platforms that might not support them, and those probably have bigger portability concerns than whether or not `AtomicU32` and `AtomicI32` are available.
@gnzlbg it's true! That didn't exist when this issue was created, but it seems like a reasonable-ish place to me to put these types if literally the only thing blocking them is the portability part (which I think is the state of play right now)
@ollie27 the part about `AtomicBool` could be considered an accidental regression from #33579, because `target_has_atomic = "ptr"` isn't necessarily guaranteed to be the same as `target_has_atomic = "8"`, although it may be for most of the platforms we have today. `AtomicBool` specifically has been around since 1.0, so our hands are tied there, but we can possibly make more proactive decisions about future types.
because `target_has_atomic = "ptr"` isn't necessarily guaranteed to be the same as `target_has_atomic = "8"`

Why wouldn't it be possible to implement `AtomicBool` or `AtomicU8` in terms of `AtomicUsize`?
because target_has_atomic = "ptr" isn't necessarily guaranteed to be the same as target_has_atomic = "8",
A smaller atomic operation can always be synthesized from a larger one using a compare-exchange loop: modify a subset of the larger word, and use compare-exchange to attempt to commit the results, loop until it succeeds.
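As an illustration of that masked compare-exchange technique (kept sound here by operating on a full `AtomicU32` that we own, rather than widening a `u8` access beyond its allocation, which is the hazard debated below), a sketch that atomically replaces one byte lane of a 32-bit word:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Atomically store `byte` into byte lane `lane` (0..4, little-endian
// numbering) of a 32-bit word, leaving the neighboring lanes untouched.
// If another thread changes any lane between our load and the
// compare-exchange, the CAS fails and we retry with the fresh value —
// so concurrent updates to neighboring lanes are never lost.
fn store_byte(word: &AtomicU32, lane: u32, byte: u8) {
    assert!(lane < 4);
    let shift = lane * 8;
    let mask = 0xffu32 << shift;
    let mut cur = word.load(Ordering::Relaxed);
    loop {
        let new = (cur & !mask) | ((byte as u32) << shift);
        match word.compare_exchange_weak(cur, new, Ordering::SeqCst, Ordering::Relaxed) {
            Ok(_) => return,
            Err(actual) => cur = actual,
        }
    }
}

fn main() {
    let word = AtomicU32::new(0xAABB_CCDD);
    store_byte(&word, 1, 0x11); // replace the 0xCC lane
    assert_eq!(word.load(Ordering::SeqCst), 0xAABB_11DD);
    println!("ok");
}
```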
@Amanieu I wonder how that works for an `[AtomicU8; N]`. On a target that only has 32-bit atomic operations, can the `AtomicU8`s be 8 bits wide, or do they have to be larger for that to work? If they are 8 bits wide, how does this work? I mean, if you are applying 32-bit-wide atomic operations on the `u8`s, you would be doing operations on neighboring bytes, might end up reading out of bounds, etc. How is this avoided?
@Amanieu Sounds like https://docs.rs/integer-atomics?
The trick seems wildly unsound at first sight (because we're touching memory falling outside the small atomic value), but... is actually not?
@stjepang Yes, that's exactly what I am talking about.
@gnzlbg This is "safe" in practice since a) the 32-bit atomic operation is 4-byte aligned and therefore can't cross a page boundary, and b) the compare-exchange loop guarantees that the neighboring bytes are never modified by the operation: if they were then the compare-exchange would fail and the loop will retry with the new value for the neighboring bytes.
@Amanieu That analysis ignores the compiler; LLVM alias analysis heavily relies on assumptions like `getelementptr inbounds` and "all accesses are inbounds". Feel free to do this in inline assembly, but if it goes through the optimizer I'd say that's extremely risky.
I'd say that's extremely risky.
It's not merely risky, it is UB; this code should fail to run in miri (see https://github.com/rust-rfcs/unsafe-code-guidelines/issues/2).
@RalfJung This transformation is actually done inside LLVM, in the backend passes where it lowers atomic intrinsics to arch-specific opcodes.
@Amanieu I think what @RalfJung means is that the optimization passes that run before the backend are allowed to assume that reads/writes out-of-bounds never happen in the input IR and can therefore optimize under this assumption.
Just because generating machine code that reads/writes out-of-bounds is ok for some targets does not imply that writing Rust code that does (or passing LLVM IR that does) is also ok.
Reads out-of-bounds are undefined behavior in Rust, C, C++, and AFAIK LLVM IR as well, so this is very unlikely to ever be ok (but see the corresponding issue: https://github.com/rust-rfcs/unsafe-code-guidelines/issues/2). Whether it is ok to do so via inline assembly, I don't know, but as long as you clobber everything that the inline assembly modifies (including the memory out-of-bounds), that would probably be ok.
But anyways, my point is that any platform that supports one size of atomic operation will automatically support all smaller sizes (even if we need to write custom implementations in inline asm). The only limit is the maximum size that a platform can atomically CAS.
This transformation is actually done inside LLVM inside in the backend passes where it lowers atomic intrinsics to arch-specific opcodes.
That changes nothing. Backend passes have different rules for UB, and they do not do the aggressive alias analysis that is performed during the main optimization phase.
Another way to say this is that backend passes operate on a different language, even if it shares the syntax with "main" LLVM IR. Just because something is allowed in assembly or "low-level LLVM IR" doesn't mean it is allowed in "normal LLVM IR", and the rules of the latter are relevant for Rust. Not doing this properly will lead to miscompilations.
@ollie27 @Amanieu yes, it's definitely possible (as the discussion has found) to implement atomics this way. The libs team has historically, however, decided to define the existence of `AtomicU8` as "there are native instructions for this". Emulation layers are left for crates.io (like `integer-atomics`).
As to whether such a strategy is even sound, I'll leave that to others!
@alexcrichton I remember the libs team decision being that standard library atomic types would be guaranteed to be lock-free, rather than simply "there are native instructions for this". The lock-free property is important since it guarantees that atomic operations can be safely used to communicate with a signal/interrupt handler.
A non-lock-free implementation of atomics would use a spinlock or mutex to emulate atomicity. However the algorithm that I described above (CAS loop which modifies a subset of a word) is lock-free since it does not block while waiting for other threads. Effectively it just uses a slightly longer instruction sequence than you might expect.
It is a bit late to back down on this decision since we already guaranteed the availability of atomics on older ARM architectures (pre-ARMv6) using this exact algorithm: https://github.com/rust-lang-nursery/compiler-builtins/blob/master/src/arm_linux.rs
Hm, I may be misremembering the libs team decision! I know for sure we want to guarantee lock-free, but I would personally like to also guarantee hardware/platform support. The pre-ARMv6 strategy you pointed out was accepted because it's Linux-specific, and we may have been overly zealous to accept non-u32-aligned ones there. AFAIK it's not stable other than `AtomicBool`, so I think we can still end up dropping most of those (and maybe on that one platform switch `AtomicBool` to word-sized).
I'd prefer to ideally not try to corner ourselves into a decision with historical debt, but rather figure out where we want to go and then work backwards from there. I would personally only like to expose `AtomicUXX` types for a platform if they're actually supported natively one way or another (either via hardware instructions or OS-level support like pre-ARMv6). We can always add more after that, and I think it covers the vast majority of use cases. In the meantime `integer-atomics` can fill any necessary gaps.
I guess this is a difference of perspective: I consider a CAS loop to be "using native hardware instructions" since it uses the native atomic CAS instruction, just with a larger word size. In fact many atomic operations are lowered this way: for example on ARM, a `fetch_add` is lowered into an `ldrex`/`strex` loop which retries if the destination cache line has been modified by another thread.
Anyways, I feel that we are going off topic here. The main point of this issue is that we would like to have integer atomic types available on stable rust. There is a vague concern about the portability, but no concrete proposals to address this.
I personally feel that the current system of exposing `#[cfg(target_has_atomic)]` to users is sufficient to address this portability concern. If a crate doesn't compile because `AtomicU64` is missing on some platform, the cause should be obvious. We could even (later) add a better error message when attempting to import a nonexistent `AtomicU64`.
Placing integer atomic types in `std::arch` is strictly worse than what we have now. These types belong in `std::sync::atomic` alongside `AtomicUsize` and `AtomicBool`.
IIRC our primary concern was not wanting to end up with `AtomicU64` implemented via `Mutex<u64>` on platforms without 64-bit atomics.
I personally feel that the current system of exposing `#[cfg(target_has_atomic)]` to users is sufficient to address this portability concern.
The objective of the portability lint is to help people be aware upfront of the portability of the code they're writing. With only `#[cfg(target_has_atomic)]`, a crate author has to proactively cfg off their APIs and probably test against targets without those atomics to avoid inevitably breaking the less common case.
@Amanieu the historical portability precedent set by libstd is that all modules are available everywhere except those explicitly whitelisted as "may require some #[cfg] trickery". For example `std::os` is explicitly defined as "this may require some cfgs", and so is `std::arch`. In that sense the current instantiation of atomic integer types doesn't fit into this story because they're not in a module clearly marked "may require cfgs".
The concrete proposal was to move all these types to the `arch` modules. I'm realizing now, though, that this doesn't work on ARM because target features are required to enable atomic types, not necessarily an architecture-specific type. (Same with WebAssembly and x86: atomic types, or some at least, are only available with certain target features enabled.)

This is the main blocker today: these types violate our existing portability story. The solution to this was the venerable portability lint, but that's still quite a ways off. If we want to short-circuit the portability lint then we need to develop a stabilization strategy that works in tandem with std's portability policy.
Then why not simply add `std::sync::atomic` to the list of modules that require #[cfg] trickery? This seems like the best solution at this point.

This doesn't block the portability lint: the lint can simply check that you have wrapped your atomic usage in `#[cfg(target_has_atomic)]`.
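A sketch of what that cfg-gating looks like for a crate author (the `Counter` type is hypothetical; the `target_has_atomic = "N"` cfg was unstable at the time of this thread and was stabilized later, in Rust 1.60):

```rust
use std::sync::atomic::Ordering;

// Only compile the 64-bit-atomic-based counter on targets that actually
// support 64-bit atomics; on other targets the type simply doesn't exist.
#[cfg(target_has_atomic = "64")]
pub struct Counter(std::sync::atomic::AtomicU64);

#[cfg(target_has_atomic = "64")]
impl Counter {
    pub const fn new() -> Self {
        Counter(std::sync::atomic::AtomicU64::new(0))
    }

    // Returns the previous value, like `fetch_add`.
    pub fn bump(&self) -> u64 {
        self.0.fetch_add(1, Ordering::Relaxed)
    }
}

fn main() {
    // Downstream code (or the portability lint) can check the same cfg
    // and provide a fallback on targets without 64-bit atomics.
    #[cfg(target_has_atomic = "64")]
    {
        let c = Counter::new();
        assert_eq!(c.bump(), 0);
        assert_eq!(c.bump(), 1);
    }
    println!("ok");
}
```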
I don't know if this applies to anyone else, but as a user, I'm primarily interested in `AtomicU32`/`AtomicI32`, because there are lots of APIs that involve 32-bit atomic values on 64-bit platforms. If every platform has 32-bit CAS and therefore can support these types acceptably, couldn't they be stabilized immediately? :-)

If every platform also has smaller CAS, or if it's deemed acceptable to synthesize smaller atomics using "oversize" CAS, the smaller atomics could be stabilized immediately as well.

Basically it seems like the only truly non-portable case might be `AtomicI64`/`AtomicU64`, so perhaps only those types really need to wait for the portability lint to be sorted out. And since all the platforms I care about are 64-bit (I'll never run my proprietary code on 32-bit), I won't miss them because I can just use `AtomicUsize`/`AtomicIsize` instead.
It's one possible solution, yeah, to simply say the types in `std::sync::atomic` are not platform-agnostic.
I wanted to get a grasp on the concrete portability story we're talking about, and I wasn't aware of any analysis done here recently, so I've run the compiler over a bunch of targets to see what instructions the various sized `swap`s generate, and got this table:
| Target | `AtomicU8::swap` | `AtomicU16::swap` | `AtomicU32::swap` | `AtomicU64::swap` |
|--------|------------------|-------------------|-------------------|-------------------|
| `x86_64-unknown-linux-gnu` | `xchgb` | `xchgw` | `xchgl` | `xchgq` |
| `x86_64-apple-darwin` | `xchgb` | `xchgw` | `xchgl` | `xchgq` |
| `i686-unknown-linux-gnu` | `xchgb` | `xchgw` | `xchgl` | `cmpxchg8b` |
| `i586-unknown-linux-gnu` | `xchgb` | `xchgw` | `xchgl` | `cmpxchg8b` |
| `arm-unknown-linux-gnueabi` | `ldrexb` | `ldrexh` | `ldrex` | `ldrexd` |
| `arm-unknown-linux-gnueabihf` | `ldrexb` | `ldrexh` | `ldrex` | `ldrexd` |
| `armv7-unknown-linux-gnueabihf` | `ldrexb` | `ldrexh` | `ldrex` | `ldrexd` |
| `mips-unknown-linux-gnu` | `ll`/`sc` (BUG) | `ll`/`sc` (BUG) | `ll`/`sc` | N/A |
| `mips64-unknown-linux-gnuabi64` | `ll`/`sc` (BUG) | `ll`/`sc` (BUG) | `ll`/`sc` | `lld`/`scd` |
| `powerpc-unknown-linux-gnu` | ?? (BUG?) | ?? (BUG?) | `lwarx`/`stwcx` | N/A |
| `powerpc64-unknown-linux-gnu` | ?? (BUG?) | ?? (BUG?) | `lwarx`/`stwcx` | `ldarx`/`stdcx` |
| `aarch64-unknown-linux-gnu` | `ldxrb` | `ldxrh` | `ldxr` | `ldxr` |
| `thumbv6m-none-eabi` | (no swap) | (no swap) | (no swap) | N/A |
| `thumbv7m-none-eabi` | `ldm` | `ldm` | `ldmda` | N/A |
| `thumbv7em-none-eabi` | `ldrexb` | `ldrexh` | `ldrex` | N/A |

The points of note are:

- `thumbv6m-none-eabi` has no `swap` at all; IIRC it's just load/store and other more simplistic operations.
- All types look to be available on all other platforms tested. This doesn't cover architectures like s390x, sparc, wasm, probably some arm variant, etc.
For targets that are "practically up there in their level of support" that's pretty bleak...
I think I would personally push back against simply saying the types are stable as-is today. We have no precedent for these sorts of types with varying support across platforms (at least of this prominence and this level of support) being in libstd without a clear warning about portability.
The portability problem has already been gotten wrong with SIMD, which has tons and tons of warnings about how platform-specific it is, and this represents yet another portability hazard if it's in such a prominent place as `std::sync::atomic`.
I'm ok with the solution of moving these types to `std::arch` myself, however, as that has clear warnings about portability and is, I feel, the best we can do at this time.
@willmo empirically it looks like `AtomicU32` is indeed supported everywhere I tested, at least!
Regarding your question about ARM architectures: armv5te and thumbv6 targets don't support atomics, except that armv5te emulates them with Linux kernel support.
I disagree with your "(BUG)" comments: the ll/sc loop is the standard way of performing sub-word atomic operations on those platforms. It is just a more complicated version of the ll/sc loop used for word-sized atomic operations.
In short there are really only 3 categories for targets:

- targets that support the full set of atomic operations,
- targets that support atomics only up to a certain size (e.g. no 64-bit atomics), and
- targets with no atomic support at all.

With the last category, we already have a stabilized precedent for variations in support for std::sync::atomic: thumbv6 doesn't support atomics at all. Also I feel that moving atomic types to `std::arch::$arch` will actually make code less portable. Code using integer atomics will now have to be specialized for every architecture:

```rust
#[cfg(target_arch = "x86")]
use std::arch::x86::atomic::AtomicU32;
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::atomic::AtomicU32;
// Oops, now this code will only work on x86 despite the fact that it would work
// just as well on ARM, PowerPC, MIPS, etc.
```
And even then, there is still varying atomic support within an architecture. This is particularly true for ARM, but it is also the case on x86_64: `AtomicU128` is supported on all x86_64 chips except the earliest ones from AMD.
In conclusion, I don't feel that moving atomic types to `std::arch` actually solves any problems, and instead introduces new ones. I feel that the current (unstable) situation of having atomic types conditionally available depending on the target is the best approach to take. Look at it this way: if a crate is found not to compile on some architecture due to missing atomic support, an issue will be opened on GitHub and the problem will be quickly solved.
Oh sorry yeah by "BUG" I meant that it didn't follow what I assumed to be our contract, that we only provide atomic types which match exactly with the architecture in question, excluding the fact that any smaller atomic operations can be implemented in terms of larger ones. It's fine for that to be a separable question, I don't mean for it to get in the way.
It's true that atomics on ARM are sort of odd! I'm not sure what to really do about that. That being said most of the platforms that don't have atomics are pretty low down on the platform support tiers, so we could relegate them to "unresolved questions" like targets without floats rather than having them block other designs.
It's true that moving these into `std::arch` would mean special code per architecture. A crate on crates.io, however, could reexport a portable interface which does all the multiplexing and has documented fallbacks or options for what fallbacks should do on unsupported platforms.
I personally disagree that `std::arch` doesn't solve any problems, but I do agree that it creates an ergonomic barrier to using the types. I feel it clearly signals that these operations aren't 100% portable, as most of the rest of the standard library already is. These are already somewhat niche types, so the ergonomics aren't, I think, as important as for `AtomicUsize` and friends.
Using `std::arch`, in my mind, is basically entirely centered around:
From this discussion I conclude that there are basically three tiers of platform support for any atomic operation: native support, loop emulation, and mutex emulation.
Two directly conflicting goals are a) portability and b) protecting the programmer from a potential performance footgun (using emulated atomics over a more efficient solution).
I think these very different goals deserve different treatment: while an application targeted at a machine that only offers 32-bit atomics is almost always better off using 32-bit atomics instead of loop-emulated 16-bit atomics (potentially wasting some memory, though), I'd much rather have a library fall back to loop emulation than get a compile error. The slight performance impact is just not worth the despair of having to dive into a foreign codebase in order to fix portability errors.
Mutex emulation is less obvious considering that it can slow down an application by several orders of magnitude. But I would argue that from a portability standpoint ("I'm trying to use someone else's code on my platform") even this is generally acceptable.
Similar to how I don't get linter warnings for dependency crates, this is how I think the compiler should react to different kinds of atomic calls:
| Compile type | Native support | Loop emulation | Mutex emulation |
|--------------|----------------|----------------|-----------------|
| Local        | fine           | warning        | warning         |
| Dependency   | fine           | fine           | warning         |
This gives me a heads up if a crate I'm depending on is going to be severely slower while avoiding accidental performance problems from my own code.
I generally agree with your sentiment about tiers of atomic support. But note that some popular RISC platforms, such as 32-bit ARM, require all atomics to be implemented using load-linked/store-conditional loops. I would be wary of linting on those, so it feels to me that such a lint should be allow-by-default.
Even on x86, some atomic operations are implemented using a `cmpxchg` loop, e.g. `fetch_and`. Implementing atomic operations using a loop is completely normal and should be treated the same way as native support.
The performance isn't actually the reason why we make a distinction between so called "lock-free" atomics and ones emulated with a mutex. If you are using atomics to share data between main code and a signal/interrupt handler then you must use lock-free atomics, otherwise your code may deadlock. This can happen if the interrupt happens while the mutex used for atomic emulation is locked.
Maybe a bit of extra terminology could help here: when atomics are implemented using CAS or LL/SC loops, they are lock-free but not wait-free.
In simple terms, lock-freedom means no single thread can block every other thread, provided (1) it only interacts with them via atomics and (2) the atomics are not used in a loop to implement a higher-level lock.
Wait-freedom, in contrast, means that a thread cannot be infinitely delayed by other threads hammering the atomic in a loop. This is obviously not true of an atomic operation that is implemented via a CAS or LL/SC loop.
Since we cannot provide wait-free atomics on some platforms, we may want to clarify in the documentation that atomics are only guaranteed to be lock-free, not wait-free.
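The standard library later exposed this retry-loop shape directly as `fetch_update` (stabilized in Rust 1.45, after this discussion), which makes the lock-free-but-not-wait-free character explicit: the closure may run several times if other threads race. A small illustration:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    let x = AtomicUsize::new(7);

    // `fetch_update` is a compare-exchange loop: load the value, apply
    // the closure, try to CAS the result in, and retry on contention.
    // It is lock-free (some thread always makes progress) but not
    // wait-free (an unlucky thread can keep losing the race).
    let prev = x
        .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |v| Some(v * 2))
        .unwrap();
    assert_eq!(prev, 7);
    assert_eq!(x.load(Ordering::SeqCst), 14);
    println!("ok");
}
```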
If `#[cfg(accessible(...))]` is added, I think it will cover all use cases of `#[cfg(target_has_atomic = "x")]`, as users can then check for the presence of the atomic types directly.
One use case I could think of, if smaller atomics are emulated and therefore always present but `target_has_atomic = "x"` is only true when there is native support, is using it to check for wait-freeness.
But from Amanieu's comment above, that would only tell you whether `store()` is wait-free, so people will need to check that the specific operations they need are wait-free.
Now, `accessible(...)` hasn't even passed the RFC stage yet, but it seems like a much cleaner solution than exposing a `target_has_xxx` attribute for each kind of target feature.
Stabilizing the types without either `accessible(...)` or `target_has_atomic = "x"` being available would still be useful, as people can use `target_arch = "x"`, which, while less portable, offers much stronger guarantees.
I think that documenting that atomics are at least lock-free and possibly wait-free should be enough to move forward. An optional documentation feature that would make this pretty damn perfect: also documenting which platforms/atomic sizes are wait-free.
In general, I completely agree with @Amanieu that CAS loop (AKA lock-free but not wait-free) atomics would be considered "native atomic support" by basically everyone. But going above-and-beyond by documenting this should settle any concerns.
I'm nominating this for discussion at the next libs triage meeting, but to try to make progress on this discussion I'd like to separate out a few points. If others have thoughts on these (or other points), please let me know!
- The `AtomicXYY` types, if they exist in libstd, guarantee that they are lock-free.
- The `AtomicXYY` types do not guarantee that they are wait-free, and I think this is where we are today.
- It's unclear (to me at least) what to do about "emulation" of small-size atomics using larger-sized atomics, for example emulating `AtomicU8` with `AtomicU32` operations. It's also unclear whether it's even allowed to implement `AtomicU8` in terms of `AtomicU32` (in terms of LLVM guarantees and whatnot); this may only be a valid thing for LLVM's backend code generator to generate.
- Where do these types live? Some in `std::sync::atomic` and others in `std::arch`? All in `std::sync::atomic`?

The main two options for placement of these types are:

- All types are in `std::arch` and exposed as they're available. This likely wouldn't stabilize the `target_has_atomic` cfg directive.
- All types are in `std::sync::atomic`, and the `target_has_atomic` cfg directive is also stabilized.

The second question is whether or not to stabilize the emulated atomics, and that's just a question of whether the APIs are stabilized or not.
The `AtomicXYY` types do not guarantee they are wait-free, and I think this is where we are today
This is fine, keep in mind that C++11 atomics (which we are based on) makes no guarantees about wait-freedom either.
- It's unclear (to me at least) what to do about "emulation" of small size atomics using larger-sized atomics. For example emulating `AtomicU8` with `AtomicU32` operations.
I think it's fine to defer this issue. Currently none of the built-in targets make use of the `min-atomic-width` attribute. This was only added in #38579 to support the out-of-tree OR1K target, and even then it should be possible to implement emulation for those in compiler-builtins. In any case, this doesn't block stabilization, as you mentioned at the end.
Ok we've discussed this in a recent @rust-lang/libs triage, and the conclusion was that the proposal to stabilize all these types as-is is probably the way to go. The stabilization would be coupled with documentation updates indicating that these aren't as portable as, say, `Add for u8`, but they're available on most platforms. Additionally, it was concluded that stabilizing smaller-size atomics for platforms that only have larger-size atomics was fine to do.
I believe this is generally the trend of this thread anyway, so I'm going to open a dedicated thread and FCP this for stable
Ok for those following along here, I've opened a formal proposal for stabilization at https://github.com/rust-lang/rust/issues/56753, feedback of course is always welcome!
smaller-size atomics for platforms that only have larger-size atomics
Just to be sure: the encoding of smaller-sized atomics in terms of larger-sized atomics is done by LLVM, as part of LLVM's lowering to machine-specific IR or so? I maintain that doing this at the level of Rust, MIR or LLVM IR is illegal because of potential out-of-bounds accesses, and we shouldn't do it.
@RalfJung correct, that's what convinced me personally that we can't do this on crates.io, which means if we want it at all we need it in the standard library (via LLVM intrinsics). I think we want it, so I'm convinced to put it into libstd :)
@RalfJung This lowering is done either within LLVM, or through a function in compiler_builtins.
The latter is currently only used on armv5te-unknown-linux-gnu, and uses this code. It could be argued that this is UB, since `intrinsics::atomic_load_unordered` could be used to read out-of-bounds data; however, this is guaranteed not to fault because it doesn't cross a page boundary.
@alexcrichton makes sense!
@Amanieu
It could be argued that this is UB
And the argument would be correct :)
is guaranteed not to fault because it doesn't cross a page boundary.
And as in the last N cases we have had this argument (and as I am sure you are aware, but not everybody else might be), that doesn't change anything about this being UB when we are talking about code written in Rust, MIR or LLVM IR. ;) (I am beginning to feel sorry for being so annoying about this, but LLVM is way too smart and getting smarter every day, so I am actively worried that such arguments will blow in our face some day.)
Is this a pattern supported/intended by LLVM? Is there advice from the LLVM devs for how to do this?
Is there any chance of LLVM ever inlining those compiler-builtins functions? Actually even having them in the same translation unit could be enough to cause problems, because LLVM could infer attributes on the functions to propagate information about what they do out to use sites.
One safer alternative would be to use inline assembly to implement such operations, that would most likely exclude any way for LLVM to notice that there are out-of-bounds accesses. But I am not sure if that's an option here.
The code is more-or-less based on the GCC implementation, which gets away with a normal atomic load.
Would changing the load to a volatile atomic load help in this case?
To add to what @RalfJung is saying, @Amanieu: just because it works at the hardware layer doesn't mean it isn't UB in LLVM's IR. For example this function:
define i8 @bar() {
start:
%a = alloca i8
store i8 0, i8* %a
%b = call i8 @foo(i8* %a)
ret i8 %b
}
define internal i8 @foo(i8*) {
start:
%b = getelementptr i8, i8* %0, i32 1
%a = load i8, i8* %b
ret i8 %a
}
is sort of a simplistic view, but it's guaranteed to never fault because the out-of-bounds load will just load some byte of the return address on the call stack or something weird like that. When optimized, however, it yields:
define i8 @bar() local_unnamed_addr #0 {
start:
ret i8 undef
}
(showing that this is undefined behavior)
LLVM can't automatically deduce that all instances of this pattern are undefined behavior; in isolation foo
optimizes just fine. That's why compiler-builtins happens to work: we're forcing LLVM to have less knowledge about the inputs, so it just-so-happens that it can't deduce that undefined behavior is happening.
All that's just to say that I think @RalfJung is totally correct here: a crates.io-based implementation of smaller-sized atomics in terms of larger-sized atomics is just a segfault waiting to happen. LLVM may not even detect that it's UB today, but it's definitely UB at the LLVM IR layer (and probably the Rust layer) to read out of bounds on objects. Why exactly it's UB or what exactly happens is always up for grabs, which is why it works most of the time, but this is fundamentally why we need LLVM's backend to do the lowering: the IR passes need to see that we're just modifying/loading one byte, not the bytes around it.
The operations that @Amanieu wants to perform cannot be performed by a programming language generating LLVM-IR directly. Inline assembly appears to be the only way to perform these right now, so we could still expose them _I think_ (@RalfJung? I don't know whether compiler-builtins would work too).
In the meantime, I think it would be better to open an issue in the LLVM bugzilla about this, explaining why these operations are useful, why the LLVM-IR generated for them has undefined behavior, and how that requires us to use inline assembly (or modify compiler-builtins) instead. We should ask: what should we do? Should we use inline assembly / our own compiler built-ins ? Will LLVM expose intrinsics to allow these safely? etc.
It might be worth mentioning that this is not the only situation in which we need to perform reads out-of-bounds (see https://github.com/rust-rfcs/unsafe-code-guidelines/issues/2).
Would changing the load to a volatile atomic load help in this case?
No. Volatile reads in practice have some positive effects on racy reads (but LLVM may change those rules any time as we are relying on de-facto behavior here). It doesn't change anything about the requirement that accesses must be in-bounds.
The proper way to fix this is (as @gnzlbg mentioned) to add an attribute to LLVM that can be set on reads/writes and that indicates that the access may be partially out-of-bounds. Then we need a matching intrinsic in Rust, and methods such as read_out_of_bounds and write_out_of_bounds on pointers. Considering we need this for concurrency, we'd also need to think about how to expose atomic out-of-bounds accesses in Rust. Anything else (anything just arguing based on page boundaries but not informing LLVM) will remain a hack. Given that this seems to be a useful pattern, I absolutely think we should lobby for LLVM to add such an attribute!
just because it works at the hardware layer doesn't mean it isn't UB in LLVM's IR. For example this function
Thanks for the example, I'll link to this when such discussions come up again in the future. :)
That's why compiler-builtins happens to work, we're forcing LLVM to have less knowledge about the inputs so it just-so-happens that it can't deduce that undefined behavior is happening.
That sounds way less confident than I had hoped...
When does compiler-builtins get linked with the real program? Is there a chance that LTO might inline compiler-builtins functions (which then would mean LLVM could deduce the UB)?
When does compiler-builtins get linked with the real program? Is there a chance that LTO might inline compiler-builtins functions (which then would mean LLVM could deduce the UB)?
As rtlib calls are only inserted at the SelectionDAG layer, while LTO still operates on LLVM IR, I don't believe there is any possibility of these getting inlined.
@nikic is correct, we explicitly don't LTO compiler-builtins as well (it's a very special crate). In that sense there's no worry for inlining compiler-builtins intrinsics.
Okay. I can live with that. We should keep it in mind though for the future, if/when compiler-builtins treatment ever changes.
So, yeah, I agree we should go forward with such "emulated" small-int atomics implemented via LLVM lowering or compiler-builtins.
For those following this thread, the stabilization proposal is now in FCP
Can I ask a question (just curious)? It seems that constants like ATOMIC_I64_INIT are marked as stable since 1.34 and deprecated since 1.34 at the same time. Why stabilize something that is deprecated? It may be just my opinion, but I think that getting a new stable feature that is deprecated from the beginning is rather strange...
Nice catch! I think we should just remove those constants.
That's convincing to me, @macpp -- opened https://github.com/rust-lang/rust/issues/58089 to track it.
This is listed as the tracking issue for cfg_target_has_atomic, which is still unstable. Should this be reopened?
Yep - reopened.
Removing T-Libs since this is a pure language feature.
Is there any progress on this? Can anyone explain what cfg_target_has_atomic is blocked on? I.e. what are the questions we need to resolve before stabilizing?
AtomicU32 was stabilized for 1.34.0
I have one objection to the way target_has_atomic = "cas" works. I would prefer if we split this into two separate cfgs:
- target_has_atomic = 8/16/32/64/128: This indicates the largest width that the target can atomically CAS (which implies support for all atomic operations).
- target_has_atomic_load_store = 8/16/32/64/128: This indicates the largest width that the target can support loading or storing atomically (but may not support CAS).
(bikeshed: maybe a slightly shorter name target_has_atomic_ldst)
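A rough sketch of how code might consume these two cfgs (note that target_has_atomic_load_store is only the proposal above, not an existing cfg; this fragment is illustrative, not prescriptive):

```rust
// Hypothetical usage: prefer full atomics where CAS exists, and fall back
// to load/store-only code where only those are available.
// `target_has_atomic_load_store` is the cfg proposed above, not a real one.
#[cfg(target_has_atomic = "64")]
use core::sync::atomic::AtomicU64;

#[cfg(all(not(target_has_atomic = "64"), target_has_atomic_load_store = "64"))]
mod load_store_only {
    // code restricted to atomic load/store would go here
}
```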
Is CAS the only operation that we'd need to call out that way (e.g. are there any platforms we care about that have atomic load/store but not swap)?
It seems like we should be able to stabilize target_has_atomic itself though with @Amanieu's definition.
thumbv6 has load, store, but no swap or cas
Does thumbv6 have any kind of read-modify-write instruction? Maybe presence or absence of atomic RMW instructions could be the right discrimination criterion...
No, thumbv6 has nothing of the sort. Perhaps a better name would be #[cfg(target_has_atomic = "rmw")], but that still doesn't really capture the swap operation.
Why? Swap reads the old value, replaces it with the new one, and writes that in a single atomic transaction, so it is RMW in my book.
cc #65214
Why? Swap reads the old value, replaces it with the new one, and writes that in a single atomic transaction, so it is RMW in my book.
Yeah, fair point.
Does thumbv6 have any kind of read-modify-write instruction? Maybe presence or absence of atomic RMW instructions could be the right discrimination criterion...
I think a CAS cfg is correct because all the other RMW operations can be implemented with it, but having one RMW operation like swap doesn't allow you to implement the rest. So for targets that just have swap, fetch_add, etc., but not CAS, we might need more cfgs, but I don't think it would add enough value to be worth it.
Good point! I think we can agree on the following conclusion:

If we find hardware which supports e.g. test-and-set but not CAS, then we may want to support it as well with a finer-grained cfg (e.g. target_has_atomic_test_and_set), because test-and-set is all you need to implement a mutex, and a mutex is all you need to emulate any other atomic instruction _in a blocking manner_. Since Rust atomics are guaranteed to be at least lock-free, this substitution cannot be done silently by std and must be performed manually on the user's side. Therefore, it is not transparent and must be exposed by a cfg, if and when the situation arises.

All this is conditional on the existence of hardware which has some atomic RMW instructions, but none with infinite consensus number. I'm not personally aware of any, but embedded chips and legacy hardware are full of surprises so it's best to keep that door open at the syntax level.

I believe that the syntax proposed by @Amanieu (target_has_atomic vs target_has_atomic_load_store) does so, therefore I'm happy with it.
If we find hardware which supports e.g. test-and-set but not CAS, then we may want to support it as well with a finer-grained cfg (e.g. target_has_atomic_test_and_set), because test-and-set is all you need to implement a mutex
However, "normal" sequentially consistent loads and stores are also sufficient to implement a Mutex using Dekker's algorithm (for 2 CPUs) or Peterson's algorithm (for any number of CPUs). Now I wonder, how does that fit in?
All this is conditional on the existence of hardware which has some atomic RMW instructions, but none with infinite consensus number. I'm not personally aware of any, but embedded chips and legacy hardware are full of surprises so it's best to keep that door open at the syntax level.
Some old ARM chips (~ARMv5) only have an atomic SWP instruction and nothing else. However, neither GCC nor LLVM actually use this instruction for atomics, so atomics are unsupported on these architectures.
IMO we should follow the same general policy: only support atomic operations if all of them are supported (which essentially boils down to whether CAS is supported since you can use it to emulate the others).
Having access to limited atomic operations might still be useful for some niche applications (eg. on ARM7TDMI, which is still somewhat widespread), so I think it would be unfortunate if these use cases are prevented by a matter of policy.
If we find hardware which supports e.g. test-and-set but not CAS, then we may want support it as well with a finer-grained cfg (e.g. target_has_atomic_test_and_set), because test-and-set is all you need to implement a mutex
However, "normal" sequentially consistent loads and stores are also sufficient to implement a Mutex using Dekker's algorithm (for 2 CPUs) or Peterson's algorithm (for any number of CPUs). Now I wonder, how does that fit in?
Given that, if a hardware architecture cares so little about concurrency that it does not even expose a test-and-set instruction, it is unlikely to provide the required memory barriers for SeqCst ordering, I don't think that these algorithms are applicable outside of very constrained embedded scenarios where the target hardware is exactly known and hardware portability is not desired at all.
Even thumbv6 has fully working loads and stores, despite it not having anything more sophisticated than that (no swap, CAS, or anything else). These are still sufficient for implementing things like SPSC queues.
Thumbv6 is also used in multicore processors, often alongside a more powerful Cortex-M3/M4 core (which is thumbv7 and does have CAS, etc.). This means that implementing a Mutex using one of the algorithms Ralf linked above might actually make sense on these MCUs. Manufacturers of these MCUs also provide peripherals that provide synchronization primitives, but these are often specific to the MCU family and don't exist on others.
@jonas-schievink By "fully working loads and stores", do you mean that if appropriate memory barriers are inserted in the right place, it is possible to ensure that if CPU core 1 writes to a first memory location, and CPU core 2 writes to a different memory location, all CPU cores in the system that read from both locations will see these two writes occurring in the same order?
The reason I'm asking is that I have recently heard about a paper by concurrency researchers that I would tend to trust, which claimed that this guarantee, which is pretty much the defining characteristic of SeqCst atomic memory ordering on atomic ops, could actually not be fully provided on an architecture as mainstream and concurrency-oriented as POWER.
To me, this suggests that SeqCst load/store guarantees are hard to provide at the hardware level, and that it is totally possible that less concurrency-focused hardware cannot support those language-level semantics at all either.
(OTOH, SeqCst fence semantics are usually easier to implement in hardware via "full fences", and it's probably possible to reimplement many algorithms that were initially expressed using SeqCst loads and stores using these fences)
do you mean that if appropriate memory barriers are inserted in the right place, it is possible to ensure that if CPU core 1 writes to a first memory location, and CPU core 2 writes to a different memory location, all CPU cores in the system that read from both locations will see these two writes occurring in the same order?
I mean that thumbv6 allows using any Ordering, including SeqCst, in store and load operations without issue. If the implementation of that turns out to be incorrect, then that's a bug in LLVM. Unfortunately Rust's memory model doesn't exist, and Rust's Ordering is only documented to be "the same as LLVM's".
That's fine since thumbv6 has memory barrier instructions even though it doesn't have atomics.
Rust's memory model doesn't exist, and Rust's Ordering is only documented to be "the same as LLVM's".
For concurrency, Rust documents that it uses C11's memory model. So "same as LLVM's" would actually be factually wrong, where is that quote from?
@HadrienG2 that to me sounds like a mostly separate issue from the one discussed here... even specifying SeqCst accesses is actually really hard, it turns out, when the accesses are mixed with release/acquire/relaxed accesses to the same location. I have not heard about this POWER bug but it seems plausible. But I'd consider this a hardware or spec bug; the C11 memory model specifies what we expect correct programs to behave like, and it was developed with plenty of input from hardware people.
But SeqCst is the odd one out here. Release/acquire loads/stores are almost always all you need, and with SC fences added to the mix you can represent almost anything.
For concurrency, Rust documents that it uses C11's memory model. So "same as LLVM's" would actually be factually wrong, where is that quote from?
Interesting! It's from the API docs of Ordering:
Rust's memory orderings are the same as LLVM's.
@RalfJung Yeah, I agree that we should probably take this SeqCst digression elsewhere. What was IMO most important here is that there are architectures which provide sufficient building blocks for writing mutexes (and therefore blocking synchronization), without providing enough building blocks for supporting Rust's full lock-free atomics operation vocabulary, and we may want to keep the door open for supporting those.
Is there a reason that AtomicU128 is still unavailable on stable?
The names of the atomic integers within core::sync::atomic have been stably present since 1.34, but gated behind the still-unstable cfg(target_has_atomic) flag. This means that crates that import these symbols will fail to compile when built on a new target, as shown in mystor/radium#3.
I am faking the compiler’s atomic awareness in a build script for radium (mystor/radium#4), which is a weaker emulation of the target profile information in rustc_target.
Since the most obvious problem with code portability across targets is now the presence of symbols in core::sync::atomic, what would it take to stabilize cfg(target_has_atomic = "width") for consumption solely to indicate whether a symbol exists? I am not well-versed enough in the specifics of what sort of behavior we want to commit to supporting, so I do not want to commit to any other behavior than being able to determine whether a given AtomicX type exists.
Ideally, I would like to be able to stably use the following
#[cfg(target_has_atomic = "8")]
use core::sync::atomic::{AtomicBool, AtomicI8, AtomicU8};
#[cfg(target_has_atomic = "16")]
use core::sync::atomic::{AtomicI16, AtomicU16};
#[cfg(target_has_atomic = "32")]
use core::sync::atomic::{AtomicI32, AtomicU32};
#[cfg(target_has_atomic = "64")]
use core::sync::atomic::{AtomicI64, AtomicU64};
#[cfg(target_has_atomic = "ptr")]
use core::sync::atomic::{AtomicPtr, AtomicIsize, AtomicUsize};
to guard whether symbols can even be imported, regardless of the behavior of those symbols.
How can we work towards stabilizing some form of cfg detection for symbol availability?
@myrrlyn https://github.com/rust-lang/rust/issues/64797 might also help with that
@rustbot prioritize
Explanation for humans: This has been sitting idle for a long time and really makes the standard library look incomplete. The documentation at https://doc.rust-lang.org/std/sync/atomic/index.html#portability isn't even accurate: the list of targets there is not comprehensive, and it doesn't explain that it is methods that are missing, not just types. The suggested workaround of #[cfg(target_arch)] does not even work, because thumbv6m is not a valid value for target_arch.
An alternative to this would be to simply make the atomics api available on all supported targets, otherwise this really should be implemented soon.
We discussed this in the T-lang meeting today, and though no specific objections to stabilization were discussed, I did have the concern that this feels like it overlaps 1:1 with what cfg(accessible) would provide, in which case this is not particularly useful and seems like not something we should stabilize.
We did not reach any firm conclusions, but my feeling was that we were interested in putting time into cfg(accessible) rather than into this issue in particular.
That said, if I was wrong in the assumption that this is a 1:1 with cfg(accessible), then we should discuss further.
Removing nomination for now.
The reason I'm asking is that I have recently heard about a paper by concurrency researchers that I would tend to trust, which claimed that this guarantee, which is pretty much the defining characteristic of SeqCst atomic memory ordering on atomic ops, could actually not be fully provided on an architecture as mainstream and concurrency-oriented as POWER.
@HadrienG2 do you have a reference to this paper?
@tavianator There you go: http://plv.mpi-sws.org/scfix/paper.pdf
Has there been any progress on this since?