__Edit__ by @ronawho March 2019:
The summary of this issue is that atomic reads (especially concurrent reads) are extremely slow under CHPL_ATOMICS=intrinsics. The slowdown is mainly due to portability concerns: there is no read intrinsic, so for portability to ARM and other architectures we implement read with a CAS operation, which kills performance. Rather than trying to special-case read for x86 and probably getting it wrong, we've decided to spend our effort making cstdlib atomics the default in https://github.com/chapel-lang/chapel/issues/11029
cstdlib atomics are now the default for the gcc, clang, and llvm backends, and will eventually be for the intel and cce backends as well. We're leaving this issue open until either llvm is our primary backend or we have support for intel and cce, but we have no current plans to improve performance for the intrinsics backend. Please ping us if there is a platform where cstdlib atomics aren't available and you suffer from poor intrinsics performance.
Original issue:
CHPL_ATOMICS=intrinsics currently performs load and store operations using expensive compare-and-swap primitives, which adds significant overhead to loads and stores on shared locations and ignores the memory ordering specified by the user.
MWE
```chapel
proc load(ref ptr : atomic int) : int {
  return ptr.read();
}

use Time;

const n = 1024 * 1024 * 1024;
var timer = new Timer();
timer.start();

var y : atomic int;
for 1..n do load(y);
writeln("Serial Atomics: ", timer.elapsed());

timer.clear();
forall 1..n do load(y);
writeln("Parallel Atomics: ", timer.elapsed());
```
Output - CHPL_ATOMICS=intrinsics:

```
Serial Atomics: 6.36694
Parallel Atomics: 23.2151
```

Output - CHPL_ATOMICS=cstdlib:

```
Serial Atomics: 0.350935
Parallel Atomics: 0.120661
```
Note as well that since atomic loads are implemented using a Compare-and-Swap operation, a load is as expensive as any other read-modify-write operation.
MWE - RMW vs Read
```chapel
proc load(ref ptr : atomic int) : int {
  return ptr.read();
}

proc add(ref ptr : atomic int) {
  ptr.add(1);
}

use Time;

const n = 1024 * 1024 * 1024;
var timer = new Timer();
timer.start();

var y : atomic int;
for 1..n do load(y);
writeln("Read Atomics: ", timer.elapsed());

timer.clear();
for 1..n do add(y);
writeln("Increment Atomics: ", timer.elapsed());
```
Output - CHPL_ATOMICS=intrinsics:

```
Read Atomics: 6.51129
Increment Atomics: 6.49777
```

Output - CHPL_ATOMICS=cstdlib:

```
Read Atomics: 0.351615
Increment Atomics: 6.23698
```
Note that 'read' is roughly 20x slower with intrinsics than with cstdlib, whereas an increment costs about the same under both.
I will note that this is a gating issue for developing future non-blocking algorithms, and that it will impact the performance of certain spinlock implementations, such as the Test&Test&Set lock. That lock achieves significant performance improvements over a normal Test&Set lock by spinning on plain reads, which avoids cache-line invalidation; that optimization is impossible with intrinsics, since a read itself invalidates the cache line (see the sketch below).
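For concreteness, here is a minimal C sketch of the two lock variants (illustrative only, using GCC `__sync` builtins, not Chapel runtime code):

```c
/* Illustrative sketch: why a CAS-based read hurts a Test&Test&Set lock. */
typedef volatile int spinlock_t;

/* Test&Set: every attempt is a read-modify-write, so every spinning
 * core repeatedly invalidates the cache line holding the lock. */
static void tas_lock(spinlock_t *l) {
  while (__sync_lock_test_and_set(l, 1))
    ;  /* spin */
}

/* Test&Test&Set: spin on a plain read, which stays in the local cache
 * until the holder's unlock invalidates it; only then attempt the RMW.
 * If reads themselves are implemented with CAS (as under
 * CHPL_ATOMICS=intrinsics), this inner read loop degenerates into
 * Test&Set behavior and the optimization is lost. */
static void ttas_lock(spinlock_t *l) {
  for (;;) {
    while (*l)
      ;  /* spin on the locally cached copy; no bus traffic while held */
    if (!__sync_lock_test_and_set(l, 1))
      return;
  }
}

static void spin_unlock(spinlock_t *l) {
  __sync_lock_release(l);  /* store 0 with release semantics */
}
```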
Edit:
Pasting this from issue #10998 as I am closing it to attach information here...
"What is interesting is Serial vs Parallel Reads - instrincs, PrgEnv-cray vs Serial vs Parallel Reads - cstdlib, PrgEnv-gnu" which showed that the former is 1755x slower than the latter.
@LouisJenkinsCS is this a duplicate of https://github.com/chapel-lang/chapel/issues/10995, or is it adding something new that I'm not seeing?
Made this the issue I want to keep.
Copying from https://github.com/chapel-lang/chapel/issues/10995#issuecomment-418946201:
Note that this is likely only for load/store, other operations under intrinsics should be as fast as cstdlib.
For history, chapel-lang/chapel#4491 changed atomic load from a fence+read to __sync_val_compare_and_swap, and chapel-lang/chapel#4259 changed atomic store from a store+fence to a CAS loop. We can probably optimize load/store for 64-bit platforms.
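For illustration, roughly what those two changes amount to, sketched with the GCC builtins named above (the wrapper function names are mine, not the runtime's):

```c
#include <stdint.h>

/* Pre-#4491 load: full fence followed by a plain read. */
static int64_t load_fence_read(volatile int64_t *obj) {
  __sync_synchronize();
  return *obj;
}

/* Post-#4491 load: a CAS that "swaps" 0 for 0 just to fetch the old
 * value atomically; every read now costs a full read-modify-write. */
static int64_t load_cas(volatile int64_t *obj) {
  return __sync_val_compare_and_swap(obj, 0, 0);
}

/* Pre-#4259 store: plain write followed by a full fence. */
static void store_write_fence(volatile int64_t *obj, int64_t val) {
  *obj = val;
  __sync_synchronize();
}

/* Post-#4259 store: CAS loop retried until the swap succeeds. */
static void store_cas_loop(volatile int64_t *obj, int64_t val) {
  int64_t old;
  do {
    old = *obj;
  } while (!__sync_bool_compare_and_swap(obj, old, val));
}
```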
My suggestion for detecting whether the user has a 64-bit machine would be to use a simple preprocessor macro such as what is described here, and in the cases where the user has an unknown architecture, assume they are on a 32-bit machine. This way we cover the popular cases, and when users complain we can add special checks for their architectures. I think we can cover 99% of users in the first attempt and then optimize based on that.
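A hypothetical sketch of that suggestion (the macro name and the exact set of predefined macros to test are assumptions, not anything currently in the runtime):

```c
/* Default to the conservative 32-bit path unless a macro known to
 * indicate a 64-bit target is present. */
#if defined(__x86_64__) || defined(__amd64__) || defined(_M_X64) || \
    defined(__aarch64__) || defined(__LP64__) || defined(_WIN64)
#define CHPL_ASSUME_64BIT 1   /* a fast load/store path is a candidate */
#else
#define CHPL_ASSUME_64BIT 0   /* unknown arch: fall back to CAS/locks */
#endif
```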
If this is true, even spinlocks implemented in Chapel will be significantly impacted.
Note that we did not see any regressions from chapel-lang/chapel#4259 or chapel-lang/chapel#4491, and we use spinlocks pretty frequently. That's not to say regressions are impossible, but we did not see any at the time.
You can also see some of the regressions we see with CHPL_ATOMICS=cstdlib from some gcc 8.1 results we gathered back in July here.
You won't see any regression for a Test&Set spinlock, but you would for a Test&Test&Set lock, which gets its significant performance improvement from avoiding excess cache-line invalidation.
Honestly I'm interested in seeing if there is any performance boost from running a nightly with cstdlib for any benchmark that performs an atomic load and store. Are there any that do?
This is _super_ frustrating to me, as it violates all the assumptions I have made about the performance of my low-level code thus far, and very likely undermined the performance results gathered from my previous benchmarks.
You make it sound as though someone is deliberately trying to annoy you or hurt your code's performance with the code we've written. I can guarantee that this is not the case.
I emphasized how frustrating it is, not because I believe this is deliberate sabotage of performance, but to show that I see it almost as a gating issue for developing certain HPC algorithms in the future, in particular the kinds that employ non-blocking algorithms. I.e., this isn't just me reporting a bug I came across; I am actually affected by this and I hope it gets resolved soon-ish.
I edited the description to remove that part, @bradcray, and instead left a note at the end addressing why I am affected by this bug.
I have a question though: can't we just force users on 32-bit systems (the minority of people that still use 32-bit computers, like the 0.001% running on a 32-bit virtual machine) to use CHPL_ATOMICS=locks instead for the time being?
I think the next steps here are:
@LouisJenkinsCS note that we're busy with the current release. I suspect we won't get to this for at least a few weeks.
I'm satisfied that the issue is gaining attention. I'll help by contributing tests... once I sign the CCLA.
The reason for the current implementation of intrinsics-based atomics is that any tiny difference in architecture seems to need it. Even some x86_64 hardware fails atomics if we don't implement the intrinsics-based atomics this way. (Skylake is one known example.)
The real solution long-term, which we have always intended, is to switch to CHPL_ATOMICS=cstdlib as it becomes better supported by more and more compilers. That allows each compiler to optimize as appropriate for its own target. (See #11029 .)
I'd like to either close this or move it to the icebox in favor of #11029. @LouisJenkinsCS, can you remind me: Is it the case that when you set CHPL_ATOMICS=cstdlib you get the results you'd hoped for? If so, I think we should close this since it's not blocking you and we don't know of a portable solution to the issue.
@dmk42 , @ronawho : Just to explore this a bit more, what would be the level of effort required to create lists of systems where we do / don't need this code path? Is it only dependent on the chip, or also on the compiler?
Yes, cstdlib does provide the performance I had hoped for. However, I'm not sure this issue has been 'resolved' per se, although I don't mind it being moved to the icebox. I'd like to keep this open, as it's possible others may come across a similar problem and should see that the issue has not been resolved (for CHPL_ATOMICS=intrinsics, at least). Unless there is an actual documented notice/warning to users somewhere in the chapel-lang documentation, in which case I'd be willing to close it, since the problem would be more visible.
Offline, I just caught up with @ronawho on this issue and agree that we shouldn't close it before understanding better whether his SIZE_ALIGN_TYPE approach would help improve the performance of reads/writes on intrinsics. Putting a performance warning in the documentation seems like a reasonable alternative to me if we can't come up with a plan that will resolve the performance difference.
Did I miss something? SIZE_ALIGN_TYPE approach?
@bradcray - The use of CHPL_ATOMICS=cstdlib depends only on the compiler and the header files. The hard part is clang, where they don't keep consistent version numbers so we can't just consult one of our compiler version macros to determine if the compiler is new enough. On top of that, clang delegates a lot of the work to the operating system's header files, which means a new version of clang running on an old OS version is still not capable of supporting cstdlib atomics.
Everywhere else, we should be able to test our compiler version macros to see if it is supported.
@LouisJenkinsCS - To make 64-bit atomics work on 32-bit machines, I had to create a SIZE_ALIGN_TYPE macro in runtime/include/atomics/intrinsics/chpl-atomics.h and use it to align 64-bit atomics on a 64-bit boundary. It sounds like @ronawho is thinking of extending this to more types.
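A sketch of the idea, assuming the `__attribute__ ((aligned (...)))` form discussed just below (the typedef name here is illustrative, not the runtime's):

```c
#include <stdint.h>

/* Make each atomic's storage alignment match its size, so a 64-bit
 * atomic lands on an 8-byte boundary and can never straddle a
 * cache-line boundary. */
#define SIZE_ALIGN_TYPE(t) t __attribute__ ((aligned (sizeof(t))))

typedef SIZE_ALIGN_TYPE(int64_t) atomic_int64_sketch;  /* 8-byte aligned */
```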
So from what I can see, __attribute__ ((aligned (sizeof(t)))) will always be __attribute__ ((aligned (64))), which makes all 64-bit integers 64 bytes in size (the typical cache-line size, though not universal). That might have the benefit of eliminating false sharing on most architectures, but it has the downside of increasing the memory required for each atomic int(64) by 8x. Plus it would still degrade performance on concurrent reads. Correct me if I am wrong.
sizeof(t) returns a size in bytes, and the aligned attribute takes a size in bytes, so there is no multiplication by 8 here.
Ah, my apologies, you're right, it is 8-byte aligned. How would this improve performance of reads/writes like @bradcray said?
Edit: Outside of the cases where you have a 64-bit integer split across two cache-lines.
For the 64-bit case, it keeps the atomic object from straddling a cache-line boundary, which it turns out is an enormous performance hit, causing some of our tests to take so long that they time out.
It sounds like Elliot is seeing smaller types misaligned as well. @ronawho ?
In either case, I'd still like to advocate for #11029, no matter what it takes, even if it's only for a limited number of architectures and a warning/notice is attached in the 'chapel-lang' docs.
re performance: My suggestions have nothing to do with split cache-lines
From https://github.com/chapel-lang/chapel/issues/10996#issuecomment-419246574
"""specialize read/write for 64-bit x86. May be as simple as using naked load/store with fences and using the SIZE_ALIGN_TYPE macro for all types, but we'll have to put some thought into this."""
The reason for the current implementation of intrinsics-based atomics is that any tiny difference in architecture seems to need it. Even some x86_64 hardware fails atomics if we don't implement the intrinsics-based atomics this way. (Skylake is one known example.)
I believe the reason we saw issues is because we weren't making the types size aligned and that if we aligned them, we could implement read/write as a naked load/store with fences like we used to.
That'd be exciting! I hope it works!
Oh, that's worth investigating, but the compiler already aligns the smaller types on most architectures, so I doubt that's the problem.
Note that when we switched from load/store + fences to __sync for read/write in #4259 and #4491 that we weren't even aligning the 64-bit types (which was done in https://github.com/chapel-lang/chapel/pull/5915)
performance... """specialize read/write for 64-bit x86. May be as simple as using naked load/store with fences and using the SIZE_ALIGN_TYPE macro for all types, but we'll have to put some thought into this."""
I had done some performance studies on this, and I could have sworn that I posted them on one of these issues, but I'm not finding it now so I'll repost here. Even if we go back to load/store + fence (where fence is usually __sync_synchronize), performance probably won't be on par with the cstdlib backend for loads/stores. We'd need to implement a compiler fence instead of a full memory fence, and I'm not sure how portable we can make that (and I get even less sure about correctness and ensuring sequential consistency then)
For one of the MWEs:
```chapel
proc load(ref ptr : atomic int) : int {
  return ptr.read();
}

use Time;

const n = 1024 * 1024 * 1024;
var timer = new Timer();
timer.start();

var y : atomic int;
for 1..n do load(y);
writeln("Serial Atomics: ", timer.elapsed());

timer.clear();
forall 1..n do load(y);
writeln("Parallel Atomics: ", timer.elapsed());
```
Here's what I see with gnu on a 28-core cray:
| atomics implementation | serial read | parallel read |
| ---------------------------- | ------------ | -------------- |
| master intrinsics | 5.80s | 15.20s |
| master cstdlib | 0.31s | 0.02s |
| fence + load intrinsics | 11.10s | 0.43s |
| comp-fence + load intrinsics | 0.32s | 0.02s |
Where fence + load is `__sync_synchronize(); return *obj;` and compiler-fence + load is `__asm__ __volatile__("":::"memory"); return *obj;`
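Spelled out as standalone functions (the bodies are verbatim from above; the wrapper names and comments are mine):

```c
#include <stdint.h>

/* "fence + load": full hardware memory barrier, then a plain read. */
static int64_t read_fence(volatile int64_t *obj) {
  __sync_synchronize();
  return *obj;
}

/* "comp-fence + load": compiler-only barrier, then a plain read.
 * The empty asm with a "memory" clobber stops the compiler from
 * reordering or caching the load but emits no fence instruction,
 * which is why it matches cstdlib performance in the table above. */
static int64_t read_compiler_fence(volatile int64_t *obj) {
  __asm__ __volatile__("" ::: "memory");
  return *obj;
}
```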
IMO the priority should be trying to make CHPL_ATOMICS=cstdlib the default, because all of this should be much safer/more portable and because the backend may actually do something with more relaxed memory orders (whereas they are ignored by the intrinsics implementation). If we have time (or run into trouble making cstdlib the default again), I think it'd also be worth exploring how to make the intrinsics implementation faster, but I would want to add more stress tests and really think about what we're implementing, since these things are very tricky to get right.
:+1:
A related issue is that with C standard atomics, the compiler is allowed to impose a different alignment on an atomic object than on its non-atomic counterpart. This is why we don't need things like SIZE_ALIGN_TYPE for cstdlib atomics.
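A quick way to observe this, assuming a C11 compiler (the exact values vary by compiler and target):

```c
#include <stdatomic.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
  /* On 32-bit x86, for example, int64_t is typically 4-byte aligned
   * while _Atomic int64_t is 8-byte aligned -- the compiler imposes
   * the stricter alignment itself, so no SIZE_ALIGN_TYPE is needed. */
  printf("alignof(int64_t)         = %zu\n", (size_t)alignof(int64_t));
  printf("alignof(_Atomic int64_t) = %zu\n", (size_t)alignof(_Atomic int64_t));
  return 0;
}
```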
I still agree that cstdlib has priority, so this doesn't mean we should work on the following right away.
Just to document things that had almost disappeared from my memory, I went through and looked at the previous PRs. Piecing those together with the events that caused the need for the PRs, plus a few other things, here are the rules I remember.
1. Atomic load on ARM has to be read-modify-write. In fact, in the C committee, I'm hearing people gripe about this need.
2. Atomic store of objects smaller than 64 bits has to be read-modify-write on x86_64. (#4259 is just one example of this with a single-byte object, but it's true of anything smaller than 64 bits.) It might be possible to special-case exactly-64-bit integer stores on x86_64 with barriers.
3. In #4491 I found that a naked load of a 64-bit object on a 32-bit x86 system was not atomic. I don't remember whether this was true all the time, or whether it was sporadic. If the latter, it might be that we won't see the problem anymore now that we use SIZE_ALIGN_TYPE on 64-bit objects. In that case, we might be able to go back to barriers for loads (but only for x86 -- see point 1 above).
4. C compilers already align objects smaller than 64 bits. I don't believe SIZE_ALIGN_TYPE will make any difference for those cases because the compilers already take care of it.
We don't have to deal with any of this stuff with standard atomics.
I see now that this issue truly is a nontrivial one. How difficult is it to detect whether it's ARM or x86 as well as 32-bit vs. 64-bit, just as a best-effort heuristic for choosing whether to make it Read-Modify-Write or not?
Ideally we should be able to test the macros __arm__ and __x86_64__ but various compilers spell these differently, so we would have to research all the spellings and make sure we caught all the variations that mattered.
We can use sizeof(void *) as a proxy for the word size on the machines that matter to us today, though that might not always remain true.
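A hypothetical best-effort sketch combining both ideas (the macro names are illustrative; covering "all the spellings" is exactly the research mentioned above):

```c
/* GCC/Clang predefine __arm__/__aarch64__ and __x86_64__; MSVC spells
 * these _M_ARM/_M_ARM64 and _M_X64/_M_AMD64. Other compilers would
 * need their own investigation. */
#if defined(__arm__) || defined(__aarch64__) || \
    defined(_M_ARM) || defined(_M_ARM64)
#define CHPL_ARCH_ARM 1           /* loads must stay read-modify-write */
#elif defined(__x86_64__) || defined(__amd64__) || \
      defined(_M_X64) || defined(_M_AMD64)
#define CHPL_ARCH_X86_64 1        /* candidate for the fast path */
#endif

/* sizeof(void *) as a word-size proxy: a compile-time constant,
 * though not usable in #if conditionals. */
#define CHPL_WORD_IS_64BIT (sizeof(void *) == 8)
```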
It seems that everyone's in favor of starting by seeing whether cstdlib atomics could be made the default without losing significant performance, so I vote that we park this issue for now in favor of issue #11029. But it also sounds as though there are things we could do to tune intrinsics for some common cases (64-bit, x86, e.g.) if we weren't yet able/prepared to make cstdlib the default. Sound OK?
As an update to this issue, over on issue #11029, @ronawho has been pushing pretty hard on making cstdlib atomics the default in cases that we're able to. The current intention is to leave this issue open as a nod to the performance gap that still exists for intrinsics in some cases, particularly since we can't make cstdlib the default for all compilers as of yet.