Runtime: API Proposal: Add Interlocked ops w/ explicit memoryOrder

Created on 7 May 2018 · 61 comments · Source: dotnet/runtime

While trying to stabilize the thread pool for linux-arm64 during the 2.1 release effort, it became apparent that the safest approach was to assume existing code relies on an interlocked operation providing a full barrier, enforcing sequential consistency at least with respect to the operations before and after it.

While this approach is likely to guarantee functional correctness in most legacy code, it comes at a significant cost on weakly ordered machines. It is also rare that an Interlocked operation actually needs to guarantee sequential consistency.

This proposal adds a MemoryOrder parameter to each atomic interlocked operation.

The proposal deliberately does not give the MemoryOrder parameter a default of MemoryOrder.SequentiallyConsistent, because those APIs already exist and can be presumed to continue to exist in order to support .NET Standard 2.1 and earlier.

```c#
namespace System.Threading
{
    public enum MemoryOrder
    {
        SequentiallyConsistent,
        AcquireRelease,
        Release,
        Acquire,
        Consume,
        Relaxed
    }

    public static partial class Interlocked
    {
        public static int Add(ref int location1, int value, MemoryOrder memoryOrder);
        public static long Add(ref long location1, long value, MemoryOrder memoryOrder);
        public static double CompareExchange(ref double location1, double value, double comparand, MemoryOrder memoryOrderSuccess, MemoryOrder memoryOrderFail);
        public static int CompareExchange(ref int location1, int value, int comparand, MemoryOrder memoryOrderSuccess, MemoryOrder memoryOrderFail);
        public static long CompareExchange(ref long location1, long value, long comparand, MemoryOrder memoryOrderSuccess, MemoryOrder memoryOrderFail);
        public static IntPtr CompareExchange(ref IntPtr location1, IntPtr value, IntPtr comparand, MemoryOrder memoryOrderSuccess, MemoryOrder memoryOrderFail);
        public static object CompareExchange(ref object location1, object value, object comparand, MemoryOrder memoryOrderSuccess, MemoryOrder memoryOrderFail);
        public static float CompareExchange(ref float location1, float value, float comparand, MemoryOrder memoryOrderSuccess, MemoryOrder memoryOrderFail);
        public static T CompareExchange<T>(ref T location1, T value, T comparand, MemoryOrder memoryOrderSuccess, MemoryOrder memoryOrderFail);
        public static int Decrement(ref int location, MemoryOrder memoryOrder);
        public static long Decrement(ref long location, MemoryOrder memoryOrder);
        public static double Exchange(ref double location1, double value, MemoryOrder memoryOrder);
        public static int Exchange(ref int location1, int value, MemoryOrder memoryOrder);
        public static long Exchange(ref long location1, long value, MemoryOrder memoryOrder);
        public static IntPtr Exchange(ref IntPtr location1, IntPtr value, MemoryOrder memoryOrder);
        public static object Exchange(ref object location1, object value, MemoryOrder memoryOrder);
        public static float Exchange(ref float location1, float value, MemoryOrder memoryOrder);
        public static T Exchange<T>(ref T location1, T value, MemoryOrder memoryOrder) where T : class;
        public static int Increment(ref int location, MemoryOrder memoryOrder);
        public static long Increment(ref long location, MemoryOrder memoryOrder);
        public static void MemoryBarrier(MemoryOrder memoryOrder);
    }
}
```
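For illustration, a call site using the proposed overloads might look like this (a sketch of intended usage; the class and call sites are invented for the example, not part of the proposal):

```c#
using System.Threading;

class RequestStats
{
    private int _count;

    public void Record()
    {
        // A shared statistics counter needs atomicity but no ordering with
        // respect to surrounding operations, so Relaxed would suffice here.
        Interlocked.Increment(ref _count, MemoryOrder.Relaxed);
    }

    // The existing parameterless overloads keep today's conservative,
    // sequentially consistent behavior.
    public void RecordConservative() => Interlocked.Increment(ref _count);
}
```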
Labels: api-needs-work, area-System.Threading


All 61 comments

@kouvel @stephentoub @CarolEidt @eerhardt @RussKeldorph @jkotas

MemoryOrder was copied from dotnet/runtime#17975

Created based on discussion in https://github.com/dotnet/coreclr/pull/17567#issuecomment-381341865 and the surrounding thread

C++11 atomics seem to have a separate memory order in compare_exchange for the success case and the failure case. I think it would be good to separate them when we're being explicit about the ordering anyway.
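For example, in a typical compare-and-swap retry loop only the successful exchange needs to publish anything; a failed attempt merely re-reads the current value. A sketch using the proposed two-order overload (the method and the chosen orderings are illustrative, not from the proposal text):

```c#
static void InterlockedMax(ref int maxSeen, int sample)
{
    int observed = Volatile.Read(ref maxSeen);
    while (sample > observed)
    {
        // Success publishes the new maximum with Release semantics;
        // a failed attempt needs no ordering, so Relaxed is enough.
        int original = Interlocked.CompareExchange(
            ref maxSeen, sample, observed,
            MemoryOrder.Release, MemoryOrder.Relaxed);
        if (original == observed)
            break;
        observed = original;
    }
}
```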

Related proposal: Atomic<T>, which has memory order: https://github.com/dotnet/corefx/issues/10481

Please do not pollute this moderately-high-level API; create a separate one (potentially in a nested class).

A new overload will come up in intellisense and lead developers astray.

Cutting-edge low-level gurus who DO understand the differences between those options would do fine with typing a few more characters.

Are all of the proposed values in MemoryOrder actually useful? I don't actually have an opinion, I'm just asking to make sure we're not blindly copying from C++.

Please do not pollute this moderately-high-level API; create a separate one (potentially in a nested class).

This overload logically belongs with interlocked operations. The arguments are already implicitly present. This allows code to be explicit. It documents the required ordering. I do not believe it is polluting.

A new overload will come up in intellisense and lead developers astray.

The fact that it comes up in intellisense is a good thing. Developers should make a deliberate decision about what is intended by each Interlocked operation in terms of memory ordering. I would argue hiding this leads them astray.

Cutting-edge low-level gurus who DO understand the differences between those options would do fine with typing a few more characters.

If a developer is not willing to think about memory ordering, they should probably be using higher-level abstractions (locking and/or thread-safe queues...).

Are all of the proposed values in MemoryOrder actually useful? I don't actually have an opinion, I'm just asking to make sure we're not blindly copying from C++.

All the orderings are useful.

I think it would be good to separate them when we're being explicit about the ordering anyway.

Done

@sdmaclea If a developer is not willing to think about memory ordering, they should probably be using higher-level abstractions (locking and/or thread-safe queues...)

That is: (a) demonstrably not true, (b) uncooperative to the community already using the platform, and (c) shows a narrow focus on your specific use case, to the exclusion of platform strategic goals.

(a) the existing API has been used for 18+ years without excessive fine-grained detail.

The original PR rationale explicitly mentions the current conservative model. That was enough for 18 years; lots of people built solid software expecting a conservative model that is simple and easy to reason about.

Your suggestion that the current conservative model is somehow flawed, and that Interlocked.* must only be used with full knowledge of more subtle, error-prone models, is incorrect.

(b) the existing C# developer community consists of much more than performance-focused tech folks. For a huge *majority* of C# developers, the conservative memory model of Interlocked.* is as low a level as they will ever need to get. This API change, sitting alongside the current already-low-level API, will add noticeable risk for that many people, with absolutely zero upside to them.

(c) the fact that *you* understand fine-grained memory model differences harms your ability to recognise the risk here. It's as if I made speaking Ukrainian mandatory for coding in C#: I mean, it's sooooo easy for me, so everybody should learn it too.


To demonstrate my core point, show the ArrayList.SyncRoot property getter code (from .NET Framework) to a C# developer and see how many of them will be able to guess the right memory order to use instead of the conservative one.

Suggesting this API instead:

```c#
Interlocked.Ordered.Increment(ref this.count, Interlocked.Ordered.MemoryOrder.Relaxed);
```

Requiring people to type that extra .Ordered keeps it sitting logically together with the existing API, but makes it impossible to stumble into hot water without looking.

@mihailik No disrespect for your opinion was meant. Since you stated your opinion, it was necessary to state my opposing opinion.

The original PR rationale explicitly mentions the current conservative model. That was enough for 18 years; lots of people built solid software expecting a conservative model that is simple and easy to reason about.

It was probably enough because of the platforms it was running on. It seems in most cases the strongest memory order is not actually necessary.

For the API it seems to me like an overload is the natural thing to do and is also what is done elsewhere for more fine-grained control, so it's consistent. It also allows people to discover the overload and think about what they need, and there are options for those who want to keep it simple or don't need to optimize to that degree.

I agree with @kouvel that an overload is the natural thing to do. The fact that ordering has not been a necessary feature of existing APIs doesn't imply that it is not required - especially as we move to support weaker memory models and systems with increasing levels of parallelism.
I also think that having the overload without the ordering parameter use the conservative ordering should make it easier for developers who are unsure whether they require ordering - when in doubt, the simpler overload is probably what will be chosen.

FYI, I've just come across the RelaSharp project by @nicknash, which has implemented something very similar to this API; see its Generic Interface (it also does a lot more, the whole library looks very cool!)

It may be useful to take a look at how RelaSharp has done things, to see what can be learnt?

That library seems to be a simulator/checker, not an actual implementation of interlocked operations with specific memory ordering.

You can't quite implement these without runtime support; the existing ones are intrinsics after all. There's also the pesky case of MemoryOrder.Consume - that may deserve a separate discussion as it requires special JIT support.

There's also the pesky case of MemoryOrder.Consume - that may deserve a separate discussion as it requires special JIT support.

I think it's good to include it in MemoryOrder, for consistency and "future proofing", but I presume that we would go the route of the C++ compilers and implement it as Acquire at least for now.

@sdmaclea Don't you need Interlocked.MemoryBarrier(MemoryOrder memoryOrder) as well? Currently it generates dmb ish, but I gather from the ARM64 documentation that there's also dmb ishld, which could be used for MemoryOrder.Acquire. There's dmb ishst too, but that doesn't seem to map to any of the existing orderings, as it serializes only stores.

```c#
public enum MemoryOrder { Relaxed, Consume, Acquire, Release, AcquireRelease, SequentiallyConsistent }
```
Maybe it would be better to reverse the order of this enum's members to better suggest that SequentiallyConsistent is the default.

People promoting these overloads and people who would be pained by them are very disconnected. @sdmaclea, while you may not disrespect those people intentionally, you eject their needs from consideration too easily.

For the majority of existing C# code using Interlocked.*, these overloads add a noticeable maintenance burden. Pushing them into a nested class instead avoids the risk at no extra cost to those who code low-level algorithms.

Consider ArrayList.SyncRoot, which has existed in its current form since .NET 1.1 and is replicated verbatim in both Mono and .NET Core's List:

```c#
public virtual Object SyncRoot {
    get {
        if (_syncRoot == null) {
            System.Threading.Interlocked.CompareExchange(ref _syncRoot, new Object(), null);
        }
        return _syncRoot;
    }
}
```

What is the best `MemoryOrder` to use? How do you validate that? What share of C# developers can confidently pick the right one?

With your new overloads, what will happen is (as @kouvel said) `It also allows people to discover the overload`: a good professional C# developer with no C++ baggage will try to improve **this kind of code** for ARM64.

The downside risk is enormous, as anybody with non-trivial multithreaded debugging experience knows. A race condition due to incorrect memory ordering tends to manifest rarely, under narrow environmental conditions, making it extremely hard to diagnose and debug. People will read an article, pick an option, test it thoroughly on their machine, and 2 months later corrupt data and sink heaps of resources into tracking down the root cause.

That's the downside for an average business C# developer.

Now the upside is (potentially) faster code. In certain conditions on certain hardware.

I appreciate these memory modes may be familiar to C++ devs, and may be very cool and awesome, but they are unbounded downside risk for the majority of C# developers.

That is why the feature can be very cool and useful, but should not advertise ultra-low-level options to normal-level C# developers for a quick try.

I have suggested a syntax that achieves exactly that goal. Sure, it's eight more characters to type for performance/hardware-minded folks, but that's only fair. Keep the knives away from children.

```c#
// current API
Interlocked.Increment(ref this.count);

// proposed overload
Interlocked.Increment(
  ref this.count,
  MemoryOrder.Relaxed);

// improved API
Interlocked.Ordered.Increment(
  ref this.count,
  Interlocked.Ordered.MemoryOrder.Relaxed);
```

Do you see how it would still participate in code completion and be easy to type, yet be far more explicit and self-documenting in any code review? That's intentional.

For the higher-level reasoning about this kind of tough API cases, please refer to the famous Pit of Success idea. API should guide people towards best practice, not just serve as a broad menu of options.

@mihailik, I hear your concerns, but I also agree with everyone who's suggested these be overloads. It is very common in .NET for overloads to provide additional arguments that allow someone who knows what they're doing to express more control over the behavior of the operation. Even something as simple as int.Parse, for example, does this. There's the basic int.Parse(string), but if you know what you're doing you can provide a NumberStyles as an argument to tweak how the parsing operation behaves. This is no different. The simple overload of Interlocked.Increment, for example, provides the general behavior we all know and use today, but if you know what you're doing and want more control, you can opt in to explicitly providing additional arguments to tweak the behavior. I actually think your proposal would lead to more confusion and a higher likelihood that developers who "shouldn't" be using the options would do so.
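(For instance, here is the existing simple/expert split on int.Parse; both overloads ship today:)

```c#
using System.Globalization;

int a = int.Parse("123");                        // simple overload, common case
int b = int.Parse("7F", NumberStyles.HexNumber); // expert overload, opt-in control
```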

@stephentoub the difference is that an incorrect NumberStyles is easy to notice, easy to debug, and trivial to test.

An incorrect MemoryOrder is extremely hard to notice, hard to test, and endless torture to debug.

BTW, I have no attachment to the specific syntax, as long as the concern is taken into account.

BTW, would you mind using ArrayList.SyncRoot as a straw man here? Suggest an improvement with the new MemoryOrder and let's see how hard it is to reason about.

I put it to you that a general C# developer will not be able to code-review such a change.

Incorrect MemoryOrder is extremely hard to notice, test and endless torture to debug.

It's trivial to spot in a code review or in code where it shouldn't be. And if you want to outlaw it in a codebase, it's trivial to do so with a Roslyn analyzer. It also takes very explicit action to use: if a developer doesn't know what it means, why would they explicitly type ", MemoryOrder.Acquire"? And if they believe they know what it means and are desiring to use it, then it doesn't matter how it's exposed, they're going to try to use it.

BTW would you mind using ArrayList.SyncRoot as a straw man here: suggest an improvement with new MemoryOrder and let's see how hard it is to reason about it.

I do not understand the comparison. SyncRoot is very high level as far as synchronization goes; if you're using that, you're not going to be replacing it with any usage of Interlocked, regardless of this new MemoryOrder argument. If you are replacing it, you better know what you're doing, going from lock-based code to lock-free code, again regardless of this whole proposal.

Don't you need Interlocked.MemoryBarrier(MemoryOrder memoryOrder) as well?

Using the memory order enumeration for MemoryBarrier didn't make any sense to me, so I punted.

For Interlocked.MemoryBarrier, something like this makes more sense to me.

```c#
Interlocked.MemoryBarrier(bool orderLoads, bool orderStores = true)
```

System.Threading.Volatile needs memory ordering overloads too.

In my mind there are two options:

```c#
// Option 1: a MemoryOrder parameter
Volatile.Read(*, MemoryOrder order)
Volatile.Read<T>(*, MemoryOrder order)
Volatile.Write(*, MemoryOrder order)
Volatile.Write<T>(*, MemoryOrder order)

// Option 2: dedicated unordered methods
Volatile.ReadUnordered(*)
Volatile.ReadUnordered<T>(*)
Volatile.WriteUnordered(*)
Volatile.WriteUnordered<T>(*)
```

I like the flexibility and extensibility of the first set, but not the complexity.

Perhaps an alternative would be to create several use-case-specific enumerations:

  • AtomicMemoryOrder
  • VolatileMemoryOrder (or ReadMemoryOrder & WriteMemoryOrder)
  • BarrierMemoryOrder

This would allow the different use cases to be separated and unsupported/nonsensical cases easily detected by the compiler.
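A minimal sketch of what that split might look like (the member subsets per enum are assumptions, chosen so that nonsensical combinations simply cannot be expressed):

```c#
public enum AtomicMemoryOrder  { Relaxed, Acquire, Release, AcquireRelease, SequentiallyConsistent }
public enum ReadMemoryOrder    { Relaxed, Acquire, SequentiallyConsistent }  // no Release on loads
public enum WriteMemoryOrder   { Relaxed, Release, SequentiallyConsistent }  // no Acquire on stores
public enum BarrierMemoryOrder { Acquire, Release, AcquireRelease, SequentiallyConsistent }
```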

For the memory order on volatile, is it just to allow preventing compiler optimizations without an actual barrier (which I think is missing currently), or is there something else?

For the memory order on volatile, is it just to allow preventing compiler optimizations without an actual barrier (which I think is missing currently), or is there something else?

That was my initial concern.

There are potentially cases where a memory barrier before/after/both is needed. I was thinking we would need to fuse them in the API to get them to be atomic. However, since these will eventually map to intrinsics, I think the JIT can fuse the memory barriers into Volatile operations in Lowering just fine.

So I think we just need to add Volatile unordered support.

Let me look at the arm64 and arm32 instruction sets and make sure my assessment above is correct. I suspect I am wrong at least with respect to some of the newer instruction set extensions.

For preventing compiler optimizations (maybe this should be a different review?) how about adding acquire/release versions of compiler-only barriers separately from volatile so that it doesn't have to be tied to a load/store?

Oh right, volatile also prevents tearing; maybe memory order on volatile would suffice for now.

Incorrect MemoryOrder is extremely hard to notice, test and endless torture to debug.

@stephentoub:

It's trivial to spot in a code review or in code where it shouldn't be.

How?? Can you please be more specific?

Imagine ArrayList.SyncRoot was optimised for better performance:

```c#
public virtual Object SyncRoot {
    get {
        if (_syncRoot == null) {
            System.Threading.Interlocked.CompareExchange<Object>(
                ref _syncRoot, new Object(), null,
                MemoryOrder.Acquire, MemoryOrder.AcquireRelease);
        }
        return _syncRoot;
    }
}
```

Are you seriously saying validity here is 'trivial to spot'???

In 17+ years of writing C# code on the business side, I've known barely 3 people equipped to get this fine detail right, and I wouldn't bet any one of them could do it in under 10 minutes of mental algebra.

Are there any other advocates for business-level developers involved in BCL API design? I'm seriously baffled why the case is argued completely one-sidedly from the perf/hardware side. Please, please find someone with field experience away from the compiler/JIT/GC. Your low-level glasses are skewing the picture for you, and you're going to harm the platform and the community.

Again, here's a very clear example above. Anybody wanna chip in and 'trivially spot' the right MemoryOrder here in ArrayList.SyncRoot? Measure the time it took you to work out this 'trivial' answer, and have a guess how long it will take a business developer.

Please don't turn C# into another C++; consider the people out there. We have our day jobs, and they have nothing to do with squeezing an odd nanosecond out of a cutting-edge ARM64 instruction set.

Hope that helps!

@mihailik When the above patch is reviewed, you will see:

```diff
-                ref _syncRoot, new Object(), null);
+                ref _syncRoot, new Object(), null,
+                MemoryOrder.Acquire, MemoryOrder.AcquireRelease);
```

You are suggesting this is better:

```diff
-            System.Threading.Interlocked.CompareExchange<Object>(
-                ref _syncRoot, new Object(), null);
+            System.Threading.Interlocked.Ordering.CompareExchange<Object>(
+                ref _syncRoot, new Object(), null,
+                MemoryOrder.Acquire, MemoryOrder.AcquireRelease);
```

It is not apparent to me how that changes the reviewer's job.

In fact it seems to mislead: it implies the original was not ordered and we are adding a more restrictive ordering, when in fact the opposite is true.

How?? Can you please be more specific?

The code changes to include MemoryOrder. That's easy to see. I'm not suggesting it's easy to validate that the right ordering is being used, but it's trivial to see that an ordering is being specified, and the moment one is, that should set off alarm bells.

In 17+ years of writing C# code on the business side, I've known barely 3 people equipped to get this fine detail right, and I wouldn't bet any one of them could do it in under 10 minutes of mental algebra.

And I would be skeptical of any Interlocked-based code written by such developers; it's not their area of expertise, nor should it be, and they shouldn't need to use this. I thought you were suggesting replacing usage of SyncRoot (which would mean replacing locks) with Interlocked. If you're talking about implementing SyncRoot, and such a developer were to need to do so (keeping in mind that it's a legacy pattern and folks really shouldn't be implementing it now anyway), I'd want to see them write:
```C#
public virtual Object SyncRoot => this;
```

following the common-case guidance on MSDN for returning the current instance.  If that's not appropriate, I'd want to see them just return an initialized instance without any threading impact:
```C#
public virtual Object SyncRoot { get; } = new object();
```

If that allocation really needs to be delayed to avoid costs associated with this instance being constructed, I'd want to see them use LazyInitializer, e.g.
```C#
private object _syncRoot;
public virtual Object SyncRoot => LazyInitializer.EnsureInitialized(ref _syncRoot);
```
and let the framework take care of all threading optimizations for them. Nowhere here do they need to be concerned with Interlocked. And if they really did, I'd want to see them use the simple overloads that don't take MemoryOrder. If they specified a MemoryOrder, that fact would be trivial to see in a code review, and I'd flag it and say "don't do that".

Please, please find someone with field experience away from the compiler/JIT/GC. Your low-level glasses are skewing the picture for you, and you're going to harm the platform and the community.

Please do not cast aspersions or denigrate others in this community. Such behavior will not be tolerated. (And for what it's worth, I have years of field experience away from compiler/JIT/GC.)

maybe this should be a different review?

That was my original intention. I will open an API review for Volatile separately.

Don't you need Interlocked.MemoryBarrier(MemoryOrder memoryOrder) as well?

Using the memory order enumeration for MemoryBarrier didn't make any sense to me, so I punted.
For Interlocked.MemoryBarrier, something like this makes more sense to me.
Interlocked.MemoryBarrier(bool orderLoads, bool orderStores = true)

Looking at x86/x64, I see three fences mentioned: MFENCE, LFENCE, and SFENCE. They look like they correspond to the arm/arm64 dmb ish, dmb ishld, and dmb ishst respectively.

Looking at C11 & C++11, atomic_thread_fence appears to accept all memory order options, but it is noted that the implementation may be more restrictive than the requested ordering, which is consistent with my survey above.

Therefore Interlocked.MemoryBarrier(MemoryOrder memoryOrder) seems reasonable. I'll add it above.
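Based on the instruction survey above, the mapping might look roughly like this (a sketch of the correspondence discussed in this thread, not a specification):

```c#
// Hypothetical lowerings for the proposed overload:
Interlocked.MemoryBarrier(MemoryOrder.SequentiallyConsistent); // arm64: dmb ish,   x86/x64: MFENCE
Interlocked.MemoryBarrier(MemoryOrder.Acquire);                // arm64: dmb ishld, x86/x64: LFENCE
// A store-only fence (dmb ishst / SFENCE) has no direct MemoryOrder
// equivalent, as noted earlier in the thread.
```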

@stephentoub for the history of this property, and why it is implemented this way, please reach out to Brad Abrams or @KrzysztofCwalina

@sdmaclea given the strategic approach @stephentoub described, there is no merit in this feature. Please refer to this paragraph:

And if they really did, I'd want to see them use the simple overloads that don't take MemoryOrder. If they specified a MemoryOrder, that fact would be trivial to see in a code review, and I'd flag it and say "don't do that".

1) Your reasons for introducing the feature are based on the claim that the existing API has a significant cost on weakly ordered machines; that is not justified with any evidence, nor do you provide any use case to verify the claim.

2) Coming from a thorough review of a specific use case in the very core of the BCL itself, @stephentoub suggests MemoryOrder generally has no place there: "don't do that" is an exact quote, and so is "use the simple overloads".

3) We've already established it's "not [...] easy to validate that the right ordering is being used".

Considering the upside is based completely on gut feeling (see 1), the downside is very real (see 3), and the strategic advice is not to use this even at the BCL level (see 2), the facts are strongly against it.

Also, @sdmaclea, you haven't shown any code examples where MemoryOrder would add value. Perhaps if you did, doubts would be resolved. The fact that you haven't suggests the niche for it is extremely narrow.

given the strategic approach @stephentoub described, there is no merit in this feature

Huh? That's not at all what I said!

"don't do that" is an exact quote,

For that specific developer audience you mentioned and that I was responding about. Not in general. Please do not twist my words.

and the strategic advice is not to use this even at the BCL level

Not true. We would definitely use this in the BCL / runtime in specific places, in particular to help with performance on the very platforms @sdmaclea is helping with.

Your reasons for introducing the feature are based on the claim that the existing API has a significant cost on weakly ordered machines; that is not justified with any evidence, nor do you provide any use case to verify the claim.

There is a decade of experience that says this is critical to performance. That is why it was incorporated into the C++11 standard.

This is a low-level feature being added to allow platform/framework developers to carefully optimize functionality. It will be useful for many .NET Core internals, including the ThreadPool, concurrent queues, and locking. It is also anticipated that it will be required by other frameworks such as ASP.NET Core.

Coming from a thorough review of a specific use case in the very core of the BCL itself, @stephentoub suggests MemoryOrder generally has no place there: "don't do that" is an exact quote, and so is "use the simple overloads".

You are taking the statement out of context. The key phrase you dropped was "by such developers". This feature is not intended for those developers. Even the BCL developers should not be using this feature unless warranted. It needs to be used in very specific use cases to create safer abstractions for general use.

You'll also note he said roughly "Use LazyInitializer.EnsureInitialized(ref _syncRoot); and let the framework take care of all threading optimizations for them."

The implementation of LazyInitializer.EnsureInitialized(ref _syncRoot); is a place where this may be useful.

As you note, this is difficult to reason about. Programmers should stay away from the 'Pit of Despair'. The 'Pit of Success' is created by people fighting their way out of the 'Pit of Despair' and creating abstractions so other people no longer have to. C# allows unsafe; it is generally discouraged, but it is used in internals to create safer, higher-level abstractions. This is the same thing. Simple rule: "Don't deliberately jump into the Pit of Despair."

@mihailik

This is no longer a productive exchange. I believe our opinions are all perfectly clear.

This is a critical feature that is needed by framework experts.

Weakly ordered systems, largely represented by ARM, are a very significant part of the worldwide market. They are not going away. C# needs to be competitive in those markets too.

The BCL API review board can certainly see and take your opinion into account. They are the decision makers.

@sdmaclea, I do have one JIT-related question: presumably there's no problem here in having the MemoryOrder parameters on the overloads and still having them be recognized as JIT intrinsics with direct mappings to the corresponding asm, right? Certainly there would be some additional overhead if the MemoryOrder arguments were supplied in a way where the JIT couldn't see their values statically, in which case the implementation would need to branch to the appropriate implementation, but for cases where the JIT could see their values, I would hope these would all simply compile down to the appropriate asm instruction(s) and nothing more.

Also, from a perf perspective and to help the API review, I do think it would be helpful @sdmaclea if you could share a simple benchmark on ARM highlighting just how impactful these choices can be.

I do have one JIT-related question: presumably there's no problem here in having the MemoryOrder parameters on the overloads and still having them be recognized as JIT intrinsics, right?

I haven't really figured it out yet, but I think that would be the intent. I think it should be doable.

I do have one JIT-related question: presumably there's no problem here in having the MemoryOrder parameters on the overloads and still having them be recognized as JIT intrinsics, right?

I don't see any reason that this would be different from the hw intrinsics that take an immediate operand. Similar to that case, we'd presumably fall back to an IL implementation with a switch over the ordering parameter in the case that it was not called with an immediate (e.g. in the reflection case).
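A sketch of that fallback shape (hypothetical; each case passes a constant, so the JIT can expand the recursive call as an intrinsic even when the original argument was not a constant):

```c#
[Intrinsic]
public static int Add(ref int location1, int value, MemoryOrder memoryOrder)
{
    // This body is reached only when the JIT could not see memoryOrder as an
    // immediate (e.g. via reflection). Every branch below passes a literal,
    // which the JIT is assumed to recognize and expand to the corresponding
    // instruction sequence, so the recursion terminates.
    switch (memoryOrder)
    {
        case MemoryOrder.Relaxed:
            return Add(ref location1, value, MemoryOrder.Relaxed);
        case MemoryOrder.Acquire:
            return Add(ref location1, value, MemoryOrder.Acquire);
        case MemoryOrder.Release:
            return Add(ref location1, value, MemoryOrder.Release);
        case MemoryOrder.AcquireRelease:
            return Add(ref location1, value, MemoryOrder.AcquireRelease);
        default:
            return Add(ref location1, value, MemoryOrder.SequentiallyConsistent);
    }
}
```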

@caroleidt One of my thoughts was that this could be implemented with HW intrinsics.

Also, from a perf perspective and to help the API review, I do think it would be helpful @sdmaclea if you could share a simple benchmark on ARM highlighting just how impactful these choices can be.

Getting definitive absolute perf impact numbers is difficult. There are several complicating factors:

  • It depends on the microarchitectural implementation. A53, A57, A72, A73, A9, Falkor, and ThunderX2 will all perform very differently.
  • It depends on the nature of the running code: memory bandwidth, the ratio of ordered to unordered memory accesses, the frequency of barriers, the number of threads.

We are collaborating with ARM to create public benchmarks, but the benchmark suite is incomplete: https://github.com/ARM-software/synchronization-benchmarks

We have internally studied the Linux kernel, NGINX, MySQL, MariaDB, Java... Anecdotally, we found barriers and atomics are very important. They have a very significant impact on workload scalability.

We published the MariaDB optimization at a high level in this paper: https://mariadb.org/wp-content/uploads/2018/02/M18-Conference-2018_final-002.pdf

That paper shows that in some cases MariaDB (a MySQL fork) is 2x faster than MySQL. That is due to more theoretically correct ordering and atomics; memory ordering changes were part of that gain.

This JIRA, "InnoDB rw-locks: optimize memory barriers" (https://jira.mariadb.org/browse/MDEV-14529), notes a performance improvement specifically from relaxing barriers from sequentially consistent to acquire: "I see 3-5% gain in mysqlslap benchmark for doing complex queries and 1-2% benefit in sysbench oltp workloads."

This is a real world database example. Other cases could be much more important.

I would expect greater benefit from change from "Sequentially Consistent" to "Relaxed" for use cases like AtomicAdd for shared performance statistics.

I would expect greater benefit from change from "Sequentially Consistent" to "Relaxed" for use cases like AtomicAdd for shared performance statistics.

We drafted a quick microbenchmark to look at this. In this worst case, the atomic add with SequentiallyConsistent took more than 2x as long as Relaxed on the one platform we tested.
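The benchmark itself is not public; here is a sketch of its likely shape using the proposed overload (iteration and thread counts are illustrative):

```c#
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class CounterBench
{
    private static int _counter;

    static void Main()
    {
        const int Iterations = 10_000_000;
        var sw = Stopwatch.StartNew();
        Parallel.For(0, 4, _ =>
        {
            for (int i = 0; i < Iterations; i++)
            {
                // Swap Relaxed for SequentiallyConsistent to measure the cost
                // of the full barrier on a weakly ordered machine.
                Interlocked.Increment(ref _counter, MemoryOrder.Relaxed);
            }
        });
        sw.Stop();
        Console.WriteLine($"{_counter} increments in {sw.ElapsedMilliseconds} ms");
    }
}
```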

@sdmaclea @stephentoub sorry I've dropped out of the conversation for a while.

Your suggestions for a specific use case (a very trivial, well-known, defined and clear use case too) would not pass code review. Your usage of Lazy allocates too much to be used in a class as commonly used as List or ArrayList. A cursory glance by an experienced C# developer would show that.

The fact that you went ahead and suggested those kinds of changes to the SyncRoot getter tells me that you have not spent any effort considering this use case. Lacking any other use cases to start with, no effort whatsoever was spent on C# use-case considerations.

Neither when the PR was suggested, nor when asked on the PR, nor when one such use case (a relevant, simple and clear one too!) was given to you.



Unfortunately, there was no good faith discussion of the issues I have raised. It's been unhelpful, in fact: all concerns are brushed off with no actual consideration. An attitude from the get-go such as 'if a developer is not willing to think about memory ordering, they should probably...' is unnecessarily hostile to external feedback.

I feel discouraged from giving further feedback to the CLR team. If the BCL API review board, or any of you, or any of your managers wish to mend it, I'll be glad to participate. In the meantime, I'm unsubscribing from the thread. Thanks!

Your usage of Lazy allocates too much

I didn't suggest using Lazy. I suggested using LazyInitializer.EnsureInitialized. It does not allocate too much. You might want to review the implementation to confirm for yourself:
https://github.com/dotnet/coreclr/blob/b12c344020ba4cc5bccff377c8922f5434aa293e/src/System.Private.CoreLib/shared/System/Threading/LazyInitializer.cs#L50-L51
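(Paraphrasing the linked lines: the fast path is a single volatile read, and initialization falls back to an Interlocked.CompareExchange, so no Lazy<T> object is involved. A sketch, not the verbatim source:)

```c#
public static T EnsureInitialized<T>(ref T target) where T : class =>
    Volatile.Read(ref target) ?? EnsureInitializedCore(ref target);
```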

you have not spent any effort considering this use case

I'm sorry you feel that way. I disagree. I believe I've spent a lot of time considering these use cases and your arguments. Just because I don't agree with you doesn't mean I haven't considered your position.

there was no good faith discussion of the issues I have raised

Again, I'm sorry you feel that way. I very much disagree.

Quick notes.

  • My knee-jerk reaction is that these APIs feel a bit heavy-handed and are likely over the top for the vast majority of Interlocked users, which is already a small set.
  • Granted, these are all overloads, so it doesn't really extend the concept count.
  • We should review these APIs with codegen folks in the room. On ARM, the performance gains certainly look interesting.

We should add this to our discussion about intrinsics, even though these are, strictly speaking, abstractions.

I'll also ping the GC team separately to follow up on the question I raised re: how these APIs interact with the card table.

@GrabYourPitchforks These should be orthogonal to the write barriers required by the GC on heap allocations, heap reference writes, etc.

@sdmaclea For the most part I agree, but my concern was regarding the Exchange<T> and CompareExchange<T> APIs in particular. If the GC's going to force a particular memory ordering when updating the card table, then it could affect which memory orderings are valid for those two specific operations.

That may turn out to be an implementation detail, where the pointer/reference forms must use a more restrictive memory ordering even when the user requests a weaker one. ExchangePtr and CompareExchangePtr are already handled differently, so your concern is very valid.

@stephentoub @CarolEidt Regarding the case where these methods are called with a non-immediate value for MemoryOrder, would it be feasible to say that they should just generate SequentiallyConsistent instead of switching on the non-immediate value at runtime? For example:

```c#
public static int Increment(ref int location, MemoryOrder mo) => Increment(ref location);

public void MyMethod(ref int foo)
{
    // Passing an immediate; the JIT can generate appropriate assembly.
    Interlocked.Increment(ref foo, MemoryOrder.Relaxed);

    // Passing a non-immediate; this is treated as a normal method call instead of an intrinsic,
    // so in the end it just turns into a standard Increment(ref int) call.
    MemoryOrder mo = (MemoryOrder)(new Random().Next());
    Interlocked.Increment(ref foo, mo);
}
```

I wonder if this would simplify the logic a bit. For reference, there was discussion of having a code analyzer that would flag non-immediates passed to some of the hardware intrinsic APIs. (See https://github.com/dotnet/coreclr/issues/15795#issuecomment-356431086.) That may be useful here as well.


Are we sure the enum members are sensibly named? The names do not make a lot of sense to us, but that might just be domain knowledge. Also, @GrabYourPitchforks just noticed that Release is being deprecated. Before approving I'd like to get confirmation on the names.

@terrajobst I see no indication that Release is deprecated. The page https://en.cppreference.com/w/cpp/atomic/memory_order could lead one to believe that, since the enum defining the memory order is changed to an enum class plus constexpr constants in C++20.

@sdmaclea Got it, I misinterpreted the _(until C++20)_ marker as applying specifically to memory_order_release instead of to the entire memory_order typedef. That's my mistake.

would it be feasible to say that they should just generate SequentiallyConsistent instead of switching on the non-immediate value at runtime? ... I wonder if this would simplify the logic a bit.

I don't see how this would simplify the logic. The existing intrinsic support in the JIT makes heavy use of the "non-immediate value falls back to a recursive case which the JIT expands" approach. It works and is relatively straightforward.

Are we sure the enum members are sensibly named? The names do not make a lot of sense to us

They make sense to me - and I believe they are consistent with general usage, not just in C++, but in memory model discussions.

I don't see how this would simplify the logic. The existing intrinsic support in the JIT makes heavy use of the "non-immediate value falls back to a recursive case which the JIT expands" approach. It works and is relatively straightforward.

What I meant is that the implementation could look like this:

```c#
[Intrinsic]
public static int Method(..., MemoryOrder order) => Method(..., MemoryOrder.SequentiallyConsistent);
```

You'd still have the recursive call, but this basically turns into "if the JIT can't determine the literal value of the MemoryOrder parameter, it says screw it and treats the call site as if it were sequentially consistent." That avoids us having to write a switch statement inside the method implementations.

I do not think MemoryOrder.Consume should necessarily be included; it does not seem to be well-defined in C++, and the standard recommends not using it:

memory_order::consume: a load operation performs a consume operation on the affected memory location. [Note: Prefer memory_order::acquire, which provides stronger guarantees than memory_order::consume. Implementations have found it infeasible to provide performance better than that of memory_order::acquire. Specification revisions are under consideration. -- end note]

The C++ standards committee has also had problems defining memory_order::relaxed. Hans Boehm talks about both memory_order::relaxed and memory_order::consume here (he calls memory_order::consume a failed experiment): https://www.youtube.com/watch?v=M15UKpNlpeM&feature=youtu.be&t=1283
