(PR available: #6717 )...
Below, I go over some details that are present in the current implementation
This is the common case, but certainly not a novel concept in and of itself nor to
Chapel itself, however it is sufficient for most cases where the number of locales
fit within 16 bits. It takes advantage of the fact that all known architectures and
operating systems only use 48 bits of the virtual address space, meaning we are free
to embed the 16 bits of the locale id in the upper 16 bits. This means for 65K locales,
we get the benefit of using atomics for a single 64-bit atomic operation, making
atomics extremely fast.
This could be a new design pattern really, but the idea, while novel in its application,
is simple in theory. We maintain a table of objects, one for each locale, of which
we use the index and locale id as a descriptor (a pseudo-pointer). When we want to
atomically operate using a class instance, we first hash the class instance (it's 64-bit address)
and perform an insertIfNotPresent operation on the target locale, and then embed
the locale id in the upper 32 bits, and the index into the descriptor table into the
lower 32-bits. Compression is the most expensive part, as it involves doing the above
every time (although see improvements for a huge improvement), but decompression
is cheap in that it can easily directly index into the descriptor table.
I compare GlobalAtomicObject (using descriptor tables) performance against 'Sync Var',
the all-mighty sync variable, which currently gets a handicap by just acquiring the lock,
perform some arbitrary cheap operation, then immediately release it.
This is the 'baseline' which we must beat or else the implementation is useless.
The next is Atomic on a 64-bit value, which is the 'ceiling', and can be seen as how
much performance we can get for normal pointer compression versus keeping tables like this.
It has two benchmarks: 'single' which is a single atomic operation, and 'LLSC' which is
a pseudo Load-Linked Store-Conditional operation. LLSC looks like such...
var x = atomicVar.read();
atomicVar.compareExchangeStrong(x, f(x));
There is no retrying, just a one-and-done atomic operation (which I count as one).
In 'Global Atomic', we use similar atomic operations as 'Atomic' does, although
a few things need to be considered. While for 'Single', it performs a write
(because a read is not interesting since it is a basic lookup in a locale's table),
for data in local memory, which shows the ideal. Now 'LLSC' can perform operations for
data that is remote, meaning that each time, its possible that during compression
of the data, we may need to jump to other locales. It shows a more average case, but
not the worse case when data to compareExchangeStrong is both remote. Currently,
'LLSC' for GlobalAtomicObject demonstrates 3 atomic operations: the read which
is a simple lookup, and compression of both variables passed to compareExchangeStrong
which one is local, the other remote.
It should be noted again that with pointer compression and a smaller number of nodes,
performance is excellent. However, when one has more than the maximum allowed, it is
still manageable and better than a sync variable is. It should also be emphasized that
the runtime of GlobalAtomicObject with pointer compression is equivalent to Atomic.
I present two graphs: one with respect to runtime, and another in terms of raw operations
per second.

This shows the runtime of the benchmark. As can be seen, raw atomic operations grossly
beats anything else, although in terms of atomic LL/SC, it does lag enough to demonstrate
the power of global atomics. A single global atomic, while is magnitudes slower than a
single 64-bit atomic, is only 2x slower than the LLSC.
Imagine if we had attempted to implement some MCAS operation to update both locations using
descriptors. At best we would succeed within 4 LL/SC operations without obstruction
(2 to set descriptors, 2 to update), of which a single Global Atomic variable would beat.
Of course, in reality, the operation would require multiple retries, in the hundreds to
thousands with real workloads that placed contention on it. This shows that, in terms of
global atomics, this is the best option currently, as a global atomic is guaranteed to succeed in one operation.

This shows the raw operations per second. Notice that I removed 64-bit atomics entirely as they
shadowed over everything else to the point that the performance boost wouldn't be noticeable.
As can be seen however, both a single global atomic operation, and even the LL/SC combo
operation with a potential remote reference (meaning an added on statement) beats a normal
sync variable. In every case, GlobalAtomicObject wins outright, but performance penalties from
remote references are very apparent and should be avoided at all costs.
Sample usage of GlobalAtomicObject can be seen below (and yes, it does work).
class C { var c : int; }
var a = new C(1);
var b = new C(2);
var x = new GlobalAtomicObject(C); // atomic C
var result : bool;
x.write(a);
result = x.compareExchange(a, b);
assert(result);
assert(x.read() == b);
// Note that you can call 'delete' on the object while having it still be present
// in the descriptor table. This may be okay for most cases where memory is not
// an issue since reused addresses will hash to the same node and that node will
// already point to the valid object the user is attempting to insert. However if
// memory is an issue, one can call '_delete' to ensure that the node itself is removed
// and can be recycled for later use.
delete a;
delete b;
// Is Safe because we only use the pointer itself. However, in cases where 'a' and 'b'
// can be reused by the same GlobalAtomicObject, then this could cause undefined behavior
// where another concurrent task adds the same pointer. In those cases, you should call
// '_delete' before 'delete'.
x._delete(a);
x._delete(b);
// As well, when GlobalAtomicObject goes out of scope, all nodes it had in use also
// get deleted...
GlobalAtomicObject(C) can easily become atomic C. If the language were to be
modified, it wouldn't be too surprising to see the above become the below (which
of course lacks the excessive comments), which is an exemplary ideal that @mppf
insisted on:
class C { ... }
proc test() {
var a = new C(1);
var b = new C(2);
var x: atomic C;
var result:bool;
x.write(a);
result = x.compareExchange(a, b);
assert(result);
assert(x.read() == b);
delete a;
delete b;
}
Where delete can call some hook which also removes it from the table if it has not yet been
reclaimed. I believe this can become a reality with little effort.
Below, I go over some potential improvements that can be made by others who are up to it.
It may be possible to combine normal pointer compression for the lower locales that fit
within 16 bits. Combining both means we allow better performance on average, even for
cases when they have more than 16 bits worth of locales, any locales with an id
less than 2^16 get the power of atomics, while the ones greater than or equal
to 2^16 get to use the much-slower-but-faster-than-sync-variables descriptors.
See next idea for how to differentiate.
A theoretical model where we use a more dynamic type of compression which attempts
to uses the lowest 32 bits for the virtual address, and divide the upper 32 bits
into zone bits and locale bits. A 'zone' is a base 64-bit offset identified by
zone id, which is allocated when a new unique id is found. With optimizations
such as caching the zone mappings for other locales (I.E: First time a unique upper 32
bits are found, check if we have it cached, if not found jump to the owning locale
and add it if not already, and update what we have cached, etc.)
The benefit here is that we get to keep using 64-bit pointer compression, and the amount
of unique zones that can be kept track of depend on the number of bits needed to keep
track of the locale id.
"What happens when we run out of zones?" - We use the descriptor
tables again. We can just search the unique zones for the target locale first, if the
zone isn't found and we can't create a new one, we just use descriptors.
"How do we differentiate between Zone Compression, Pointer Compression, and Descriptor
Tables?" - Its possible to use the lowest 2 bits. This means lowering the potential
maximum segments for descriptor tables, but it allows a much more dynamic compression
strategy.
This is a change that isn't theoretical but requires changes with the language itself.
Majority of the overhead of descriptor tables is hashing the object, which requires jumping
to the owning locales, but if the descriptors were cached then repeated accesses would
be just about as fast as atomics. If it were cached, the benefits of recycling memory
to be used in atomics would be tenfold (or hundredfold, as the benchmarks show).
The table utilizes a simplistic skip list implementation on a segmented memory pool.
The memory pool allows indexing by allowing an easy translation from a 32 bit
index into a segment and segment data index. It utilizes a simple bitmap (BigInteger).
The implementation of the memory pool allows wait-free reads (hence indexing is fine)
but requires synchronization to recycle/reuse memory. The skip list used to manage
memory also requires synchronization, and a lock-free/wait-free implementation requires
proper memory management such as Hazard Pointers which requires thread-local or task-local
storage. Hence, synchronization is needed and we use a simple local sync variable.
If someone managed to make everything concurrent, it could improve performance further
by 5 - 10x at least.
Edit:
Removed Introduction to avoid further derailment of discussion.
In any budding language, especially one that boasts to be a productive
concurrent/parallel language, atomics on all types are a necessity.
This statement strikes me as being a naive oversimplification. Is an atomic distributed array a necessity for any language that supports distributed arrays? I don't think so, and am not even sure what that would mean.
Is an atomic distributed array a necessity for any language that supports distributed arrays?
The emphasis would be on productivity. "What can I do if X were atomic compared to if it were not". If X is a distributed array, what do we gain? Each element in a distributed array can be updated atomically as is, so perhaps not. The distributed array could be made atomic by allowing concurrent access while resizing, which is a huge plus and would further enable productivity, sure. Would I say that this is a necessity to allowing the user to be more productive? Arguably. However, that's not the case here.
Since X is a class instance, the amount of productivity gained by making it atomic is limited to the stretch of the imagination. You can:
1) Implement more complex algorithms and data structures, like scalable wait-free algorithms
2) Allows any local algorithm to be converted into a distributed algorithm
3) Literally becomes a universal construct that allows us to make any algorithm become distributed (RCU like in the case of atomic distributed arrays, STM, garbage collection, what-have-you)
Is the above a necessity? Absolutely. However unlike atomic distributed arrays, I have presented an actual solution which works and scales.
Edit:
For more clarity: Atomics for Class Instances is a necessity, but let me expand on the universal construct bit for a moment. If you have any type of data, you can create a field in an class instance. That's nothing special, but what it allows is: any type of data, whether an int, a string, a record or some distributed array, lets call it data, which can be a field in some wrapper class instance. If you wish to atomically update it, imagine the follow:
var globalData : atomic Obj;
// Read...
var current = globalData.read();
// Copy...
var modified = clone(current);
modified.field = data;
// Update...
globalData.compareAndSwap(current, modified);
You have potential RCU. Of course the next actual difficulty would be memory reclamation, but the point is that it would have been the same whether the algorithm was distributed or not.
Furthermore, if it sounded like I was bashing Chapel, I'm not, I wish to see it improve and as such I am offering suggestions to its problems.
I think you're misinterpreting my comment. I don't have any objections to supporting atomic classes in Chapel, and have long believed that we should do so. My objection was with the opening blanket statement which I read as "any new language that has type X must necessarily have atomic type X." And I simply don't believe that to be true (and used "distributed array" as my example X). As a result of my reaction to the statement (+ time constraints), I rolled my eyes and stopped reading there rather than being intrigued and continuing on. Then eventually I came back to comment on my reaction.
To me, a more believable, less absolutist, and less objectionable statement would be something like "Any budding language, especially one that strives to be a productive concurrent/parallel language, should consider for each supported type T whether there would also be value in supporting an atomic variant of T." This statement seems far harder to argue with to me.
Understood, it makes sense that it made a broad generalization, but I didn't expect it to have the kind of effect which would cause people to skip over it from the first sentence. I revised it so that it didn't generalize to any type, but to a specific type of data.
Thanks!
To make sure I'm understanding the following:
let me expand on the universal construct bit for a moment. If you have any type of data, you can create a field in an class instance. That's nothing special, but what it allows is: any type of data, whether an int, a string, a record or some distributed array, lets call it data, which can be a field in some wrapper class instance.
Are you saying that the original statement was intended to convey "Supporting atomic classes in a language effectively adds support for arbitrary atomic types as well because given any type T we can create a class C_T with a field of that type and now by declaring an atomic C_T, we effectively have an atomic T"?
Are you saying that the original statement was intended to convey "Supporting atomic classes in a language effectively adds support for arbitrary atomic types as well because given any type T we can create a class C_T with a field of that type and now by declaring an atomic C_T, we effectively have an atomic T"?
Technically, it can yes. The semantics would change (requires RCU or similar updates), performance may drop significantly, but it is possible to emulate atomic T with atomic C_T with effort (even if it is a lot). The point I was trying to make was that it necessary in that it made such things possible (even if they weren't optimal), but that was before you informed me the issue was because my generalization (of which I may have done again).
@gbtitus and @ronawho are planning to look at this proposal. (and me too, but I've already looked at it enough to be comfortable).
@mppf
I had another idea in terms of how it can be implemented into the language itself. What if you had one global table (not coupled to a GlobalAtomicObject, but initialized first thing at runtime) that was managed across all nodes. Management of the objects (adding, propagating changes across nodes, removing them when the object is disposed of with delete, etc) is up to you (and I'm sure you guys can think up much better solutions), but this allows us to use the whole 64 bits. Embedding the 64-bit descriptor number (index into the table) into the object is still a valid optimization here, and we allow 2^64 potential objects. The compiler can properly instrument read, write, etc.
Boom, you've got your global descriptor table. The low level details would be beyond me though.
@LouisJenkinsCS - I'm not sure what to interpret from your other idea other than that we could come up with a completely different solution from what you created, but I already knew that. But I'm probably missing something. What would be the advantage of having one global descriptor table?
@LouisJenkinsCS: For some reason I'm feeling like one of the edits to the original proposal removed a lot of the context. Basically the need is for AMOs on global pointers which conceptually are larger than 64 bits (and we don't have 128-bit network AMOs), and the specific operations we need are just Load, Store, Swap, and Compare-and-Swap? Is that correct?
@mppf
Allow me to contrast having one single global descriptor table to having one coupled to GlobalAtomicObject (or each atomic Obj).
1) It makes caching the descriptor possible in O(1) space, as it can be added directly to the object's header the first time it is used in any atomic operation (via checking if its cached descriptor is set). If you have a unique table coupled to each atomic Obj, then the cached descriptor is specific to each instance of atomic Obj.
2) There can be one table to rule them all, in that the table could hold 2^64 arbitrary wide_ptr_t data, and have it store any kind of object. Currently, you'd need one descriptor table per atomic object, and even if you had one table per runtime type, a lot of space will be wasted. This can be done in just one extra level of indirection.
3) atomic Obj can become just an arbitrary descriptor index.
Edit:
To backup whether this is more than enough: as you stated, an object must take up at least 32 bytes, so we're talking about 590.29581 exabytes worth of data before space becomes an issue.
@gbtitus Yes, originally I had stated that in the removed introduction, and yes we only need those operations.
@LouisJenkinsCS - I think using a single global table is interesting as an improvement. Instead of asking us to solve all the problems - can you describe what operations should happen when to support such a thing. What would you need the compiler to do? What does the runtime do? I would expect as we integrate more with the compiler/runtime you'll need help to do so.
However I think we should view it as an improvement/optimization and try to get a 1st version merged. I.e. I think we should discuss the global version later and in a different issue. For the purposes of this issue, I think it would nice to simply say why it is harder (what are the challenges?).
I feel like I must be missing something obvious. One solution would seem to be to implement those AMOs by simply doing an on-stmt to the node where the AMO needs to be done and then doing it as a 128-bit processor AMO (either directly if there is CPU support, or indirectly using the x86 128-bit compare/exchange instruction CMPXCHG16B). In the current implementation this would take 2 network round trips per AMO, all things considered. What are the advantages of the proposed solutions over doing that?
@gbtitus
The issue would be atomically updating them. What happens when you have another atomic operation begin while another is ongoing? (Misread what you meant by doing an on-statement, yes that would work if there was hardware support). In terms of the current implementation, what it does is allow us to have an immutable pseudo-reference to a wide pointer that we know won't change and can be represented by a normal 64-bit integer. I'm not sure how CMPXCHG16B is implemented, can you fill me in on the specifics? Does it do a double LL/SC that causes many operations to retry if they fail, or does it atomically operate on whole 128-bit data words? Regardless however, not all hardware have support for such operations and this would be a software implementation to fallback on.
@mppf
With the current version, we can do the following...
1) Implement atomic C as new GlobalAtomicObject(C).
2) Add primitives such as __primitive("get cached descriptor", this, obj) and __primitive("set cached descriptor", this, obj), which can potentially map the current GlobalAtomicObject to the object's descriptor.
Currently, that's all that would be needed to get a nice efficient implementation going.
CMPXCHG16B is a 16-byte atomic compare-exchange. The LOCK prefix is needed in multicore situations. It's true that this operation is not necessarily present on all architectures -- it's not implemented on early AMD x86_64 and there may well be other processor architectures that don't have it at all. On those one would need to do something else.
However, I'd argue that the "something else" should not do any network ops. In other words, no matter what actual mechanism we use for the pointer updates, it should only involve local operations, because network communication is very costly. I think this rules out a Descriptor Table-based solution, because in the general case that has to do remote references.
That still leaves a lot of maneuvering room for implementation, though. For one thing, we might be able to go a bit further with straight pointer compression. If the pointers in question are aligned to more than a single-byte boundary then there are low-order address bits which must be zero and therefore do not have to be stored in the compressed form of a pointer. This would give us more node bits in the compressed form.
The bluntest possible tool might be something like maintaining, on every node, a local array of sync vars index by node, and protecting all updates to global pointers with a given locale value by locking the corresponding local sync. (I did say it was a blunt tool!) Speaking of which, in your comparison with sync var protection at the top, where (which node) were the sync vars involved, and was there one per global pointer, or just one for everything, or something else? Would it be possible for you to post your scaling comparison code(s)?
For some reason I'm feeling like one of the edits to the original proposal removed a lot of the context.
I also got confused by trying to read the evolving document -- started last week, couldn't find my place this week.
CMPXCHG16B is a 16-byte atomic compare-exchange.
Is there a similar atomic operation to perform reads, writes, and exchanges for 16 bytes? If it is anything similar to a Double-Word Compare-And-Swap, then wouldn't its purpose be to update two arbitrary 64-bit locations? While each location may be updated atomically, if you had to read each 64-bit value separately, then couldn't another atomic CMPXCHG16B operation could change its value mid-read?
However, I'd argue that the "something else" should not do any network ops.
Local to what? If the data refers to something remote and another remote node can atomically set it, then network operations are unavoidable. Communications across nodes are costly, but they do scale. If bounded, they are a constant overhead that can be managed.
I think this rules out a Descriptor Table-based solution, because in the general case that has to do remote references.
Not necessarily. Again, by caching the descriptor, this means that the only remote operation occurs the first time the object is used. Therefore, this remote operation merely occurs once. The other operation is using network atomics to set the 64-bit atomic field, which is also unavoidable. A network atomic operation would beat using an on statement any day (or at least in Chapel it does, I don't know how well optimized you can make it in the runtime).
If the pointers in question are aligned to more than a single-byte boundary then there are low-order address bits which must be zero and therefore do not have to be stored in the compressed form of a pointer. This would give us more node bits in the compressed form.
That's an interesting approach, definitely one that can be used in the meantime, but relatively soon you would run out of bits.
Would it be possible for you to post your scaling comparison code(s)?
I'm going to respond to the parts of this separately, since I suspect some will be dealt with and disappear rapidly while others lead on to further conversation ...
CMPXCHG16B is a 16-byte atomic compare-exchange.
Is there a similar atomic operation to perform reads, writes, and exchanges for 16 bytes? If it is anything similar to a Double-Word Compare-And-Swap, then wouldn't its purpose be to update two arbitrary 64-bit locations? While each location may be updated atomically, if you had to read each 64-bit value separately, then couldn't another atomic CMPXCHG16B operation could change its value mid-read?
No, it's an octoword compare-exchange, that is, it operates on an single aligned sequence of 16 bytes, not on two separate 8-byte quadwords.
@gbtitus - I think it'd be reasonable to do an on with a CMPXCHG16B if CPU support is available. I view the approach @LouisJenkinsCS is trying here as the fall-back - in case 128-bit processor atomics are not available. Additionally I can't predict if 64-bit compare-and-swap but extra work on creating/destroying a instance for the atomic object would be faster for a given application.
A separate concern is that some algorithms might want to use a generation number in the other 8 bytes, so CMPXCHG16B could protect against the ABA problem.
@LouisJenkinsCS - I'm pretty sure CMPXCHG16B only updates contiguous 16 bytes.
@gbtitus - I'm not so sure the descriptor table results in more communication. @LouisJenkinsCS - please correct me if I am misunderstanding the strategy here.
In particular a typical pattern with compare-and-swap for objects is something like this:
e.g. updating a single-linked list in a lock-free manner.
Since the creation of the object occurs on the same place doing the compare-and-swap, the descriptor table that is updated is the one that is local to that node. Updating the descriptor table gives a 64-bit identifier for the class instance, which is then CAS'd in.
Of course, if another node goes to use the 64-bit identifier (say by reading the atomic class instance and then dereferencing it), the (remote) descriptor table will need to be consulted to get the wide pointer corresponding to that identifier.
But, if a node is dereferencing a remote class instance, there will be communication anyway. (Unless some caching strategy is used, say --cache-remote, but that could also cache the remote descriptor table).
However, I'd argue that the "something else" should not do any network ops.
Local to what? If the data refers to something remote and another remote node can atomically set it, then network operations are unavoidable. Communications across nodes are costly, but they do scale. If bounded, they are a constant overhead that can be managed.
Local to the place where the memory is being updated. This does get confusing, because we have two notions of locality. One is the node where the pointer points, that is, the node part of the global pointer. The other is the node where the global pointer resides, that is, where it lies in memory. In this case I'm talking about the latter. (I got confused about this myself in another part of the conversation, which I'll address in a few minutes.) I'm saying that when we update a memory location containing a global pointer, it should be sufficient to just do an on-stmt to the locale where that memory location is and then change it using only local operations, no network communication. And I think you'd say that a Descriptor Table-based solution actually obeys this rule because even though the table access conceptually involves network communication, in practice by caching table entries we can avoid the network.
I think this rules out a Descriptor Table-based solution, because in the general case that has to do remote references.
Not necessarily. Again, by caching the descriptor, this means that the only remote operation occurs the first time the object is used. Therefore, this remote operation merely occurs once. ...
Got it, I hadn't snapped to this being the purpose of the caching.
... A network atomic operation would beat using an on statement any day (or at least in Chapel it does, I don't know how well optimized you can make it in the runtime).
Yeah, an executeOn (the runtime moral equivalent of a Chapel on-stmt) can't even get close to a network AMO except in very specialized circumstances, and certainly can't ever match or beat it.
One thing I'd need to cleared up here is this: Are we talking about the current implementation, or the ideal implementation?
In both cases... @mppf That is correct.
If the pointers in question are aligned to more than a single-byte boundary then there are low-order address bits which must be zero and therefore do not have to be stored in the compressed form of a pointer. This would give us more node bits in the compressed form.
That's an interesting approach, definitely one that can be used in the meantime, but relatively soon you would run out of bits.
Sure, depending on the definition of "soon", heh. But keep in mind that you only have to use this technique as long as you don't have an 16-byte compare-exchange AMO. If that's getting more common faster than large-scale systems are threatening to grow over 64k nodes (so straight pointer compression can be used), then it's the way to go. Plus it's not actually 64k-node systems we have to worry about, rather it's 64k-node distributed Chapel programs, and other scalability constraints may prevent that for a while anyway. So many engineering tradeoffs ...
I said:
The bluntest possible tool might be something like maintaining, on every node, a local array of sync vars index by node, and protecting all updates to global pointers with a given locale value by locking the corresponding local sync. ...
This is nonsense. I confused the node in the global pointer with the node on which the global pointer was being updated. In general if I'm updating a field containing a global pointer, the node numbers in the old and new values of that field are not the same and I can't do what I described. So, never mind this idea!
keep in mind that you only have to use this technique as long as you don't have an 16-byte compare-exchange AMO
Okay, this brings up my previous question. Is there a way to atomically read/write/exchange 16-byte data? Atomic updates of any class instance globally I believe would be significant. I've looked around for anything like read16b, write16b, etc, and can't find anything. My question is: if you try to read a wide pointer and someone else does a CMPXCHG16B, would it read what was there before CMPXCHG16B finished, after, or in between?
Okay, this brings up my previous question. Is there a way to atomically read/write/exchange 16-byte data? Atomic updates of any class instance globally I believe would be significant. I've looked around for anything like
read16b,write16b, etc, and can't find anything. My question is: if you try to read a wide pointer and someone else does aCMPXCHG16B, would it read what was there beforeCMPXCHG16Bfinished, after, or in between?
I think you'd have to emulate read/write/exchange by means of CMPXCHG16B itself.
I think you'd have to emulate read/write/exchange by means of CMPXCHG16B itself.
That definitely is something I would like to test. (Note I am assuming that it is implemented in the hardware in such a way that resembles a normal compare-and-swap on 8 bytes) That means that you have a lot of intra-node traffic being done (and inter-node traffic with respect to NUMA nodes) just to access data in that manner. If you have a large amount of traffic (say each task from each node in a cluster accessing one atomic field, which isn't too uncommon for distributed data structures), then having all of those on statements would result in a massive performance loss. Even normally wait-free atomic operations like write and exchange just become lock-free. Add to that the fact that you have so many remote tasks absolutely overwhelming the node on which the memory lies on, and I think perhaps descriptor tables at the idealized 'everything-cached' solution may be a viable alternative, in that it avoids the 'CAS retry problem' bottleneck.
However, if the x86 LOCK prefix provides mutual exclusion at a hardware level with no need to retry, then it boils down to acquiring a (very low-level) spinlock for all operations, right?
Would it be possible for you to post your scaling comparison code(s)?
The sync code seems rather unfair, since you're using a program-wide sync var to single-thread everything. Probably a production implementation would do something that scaled better, such as having a number of sync vars on each locale, say a multiple of the core count so as to reduce contention, and then for the update, doing an on-stmt to the locale where the field lived, somehow using the field address to select a sync var, and then using that sync var to protect the update. This would scale. I'd believe you if you argued that it wouldn't scale as well as solutions using network AMOs could. Nevertheless, when comparing to an alternative technique you don't like, it's always more convincing to show that alternative in the best possible light, not the worst. If you could do this and rerun, that would be great to see.
Also, for the code where you're using CompareExchange, I think you need an additional inner loop. The code at this line, for example, doesn't necessarily replace the value x with the value x + 1 in the variable atomicValue. It only does so if atomicValue actually contains the value x at the time the operation is done. I think instead you'd need something more like this:
const x = atomicValue.read();
while !atomicValue.compareExchangeStrong(x, x + 1) do chpl_task_yield();
Basically, this runs until comparison succeeds and the exchange occurs. (The yield is there to avoid starvation, if some other task needs to run in order to create the situation that allows you to make progress.)
The sync code seems rather unfair, since you're using a program-wide sync var to single-thread everything.
You are definitely right, the syncOp benchmark uses a very naive implementation, but I also gave it a handicap as well. I was going to compare and contrast the overhead of a read, write, exchange, and compareExchange for sync, but any performance difference would be negligible as 99% of the overhead is from acquiring a remote lock. Ignore what is going inside of the lock, just the fact that a lock is acquired to do something.
Probably a production implementation would do something that scaled better, such as having a number of sync vars on each locale, say a multiple of the core count so as to reduce contention, and then for the update, doing an on-stmt to the locale where the field lived, somehow using the field address to select a sync var, and then using that sync var to protect the update.
The field is located on one locale, so in this case fine-grained locking won't work. As well, all examples here are rather simplistic in nature, but you do raise a valid point on performing an on statement on the node that owns it, which would definitely be better than remotely acquiring it, its just that the lock in this case is protecting one piece of data, and cannot be made any more granular than that.
Also, for the code where you're using CompareExchange, I think you need an additional inner loop. The code at this line, for example, doesn't necessarily replace the value x with the value x + 1 in the variable atomicValue. It only does so if atomicValue actually contains the value x at the time the operation is done.
See the 'Performance' section in the original post. I chose to avoid adding contention for CAS retrying because I know that a CAS retry loop won't scale. In fact, I'd argue that a sync variable can scale better than a 'CAS Retry AMO` can due to the fact that the contention would cause a significant decline in performance as we add more tasks.
Clarification
In summary: I will rerun the benchmark with LL/SC being a retry-loop, and have syncOp use an on statement.
@gbtitus
Actually, perhaps I could construct a better benchmark, one in which we fairly compare both sync and GlobalAtomicObject and leave normal atomic out in general (as it has to do with atomic operations using class instances). Would you agree to that as a better benchmark? I'm thinking of constructing a list... It goes like this...
var head : GlobalAtomicObject(node);
var tail : GlobalAtomicObject(node);
var ourNode = new node();
var tailNode = tail.exchange(ourNode);
if tailNode == nil {
head.write(ourNode);
} else {
tailNode.next = ourNode;
}
var lock$ : sync bool;
var tail : node;
var head : node;
lock$ = true;
var n = new node();
if tail == nil {
tail = n;
head = n;
} else {
tail.next = n;
tail = n;
}
lock$ = false;
In this case its fair in that they do roughly the same (although the GlobalAtomicObject isn't guaranteed to have everything be visible immediately).
NumLocales | Global Atomic List | Remote Sync List | On-Stmt Sync List
------------- | ---------- | -------------------- | --------
1 | 0.111723 | 1.10026 | N/A
4 | 13.1698 | 106.642 | 76.1085
Any more and it would take far too long to be worth it. In this case, GlobalAtomicObject wins, but that's because the algorithm for the list wait-free and doesn't care about when nodes become visible, but again its meant to demonstrate raw atomic operations here. Code here. I'd like to emphasize again that I only recommend usage of global atomics for things that are wait-free or bounded in terms of # of retrying. Of course, if need be one can employ the "fast-path slow-path methodology -
slidepaper"
Edit:
The performance boost in GlobalAtomicObject is likely due to switching from using Chapel arrays to tuples.. which is actually fascinating!
I think you'd have to emulate read/write/exchange by means of CMPXCHG16B itself.
That definitely is something I would like to test. ... That means that you have a lot of intra-node traffic being done (and inter-node traffic with respect to NUMA nodes) just to access data in that manner. If you have a large amount of traffic (say each task from each node in a cluster accessing one atomic field, which isn't too uncommon for distributed data structures), then having all of those
onstatements would result in a massive performance loss.
Just a quick note in case there's been confusion: it wouldn't necessarily be required that we do full Chapel-level on-stmts to implement this stuff using CMPXCHG16B. It could be done with direct runtime support for a low-level equivalent ExecuteOn which would have much lower overheads (and wouldn't need a full-blown Chapel task on the remote side). So, I think it's fair to interpret any use of the term "on-stmt" in here as referring to some kind of remote execution but not necessarily a full Chapel on. That said, the Active Message handling in support of the remote execution would still be required, and that's the largest cost difference between a network AMO, and a processor AMO initiated remotely.
Even normally wait-free atomic operations like
writeandexchangejust become lock-free. Add to that the fact that you have so many remote tasks absolutely overwhelming the node on which the memory lies on, and I think perhaps descriptor tables at the idealized 'everything-cached' solution may be a viable alternative, in that it avoids the 'CAS retry problem' bottleneck.
"Absolutely overwhelming" seems hyperbolic. Here's why. I believe that Read, Write, Exchange, and in most usage even Compare/Exchange itself (e.g. when used for pointer swinging) can be implemented using remote execution of a processor AMO CMPXCHG such that the CAS retries are done locally on the remote node, not over the network. Now, either the field we're updating is contended or it is not. If it is contended then even if we arrange (by a descriptor table, say) to only need 64-bit CMPXCHG we'd still better not do our CAS retries over the network because network refs are 1000x slower, recall. So if the update is contended we want to use remote execution and processor atomics even if we do have 16-byte network AMOs. If the update is not contended, on the other hand, then because remote execution is definitely slower than network AMOs (probably no better than 1/4 as fast), a solution based on 8-byte network AMOs, such as compressed pointers or descriptor tables, will definitely outperform one based on remotely initiated 16-byte processor AMOs. Nevertheless, the node won't be "overwhelmed", not by an un-contended update. Un-contended updates can't overwhelm anything.
(Just to cover all the bases: if remote execution of processor atomics is 4x slower than network atomics then granted there is a range of rates for inbound un-contended updates such that network atomics can do them at full speed while remote execution and processor atomics will be throttled. But such update rates can't be reached without really serious workload imbalance, for example all the nodes pounding on one of them, and I'd argue that this is a special case we need not design for.)
However, if the x86
LOCKprefix provides mutual exclusion at a hardware level with no need to retry, then it boils down to acquiring a (very low-level) spinlock for all operations, right?
You need the LOCK prefix for any x86 CMPXCHG atomic op instruction if you want it to be safe in a multicore setting. As far as I can tell the CMPXCHG16B instruction works exactly like every other x86 atomic op in this regard. There's nothing special about it.
If the update is not contended, on the other hand, then because remote execution is definitely slower than network AMOs (probably no better than 1/4 as fast), a solution based on 8-byte network AMOs, such as compressed pointers or descriptor tables, will definitely outperform one based on remotely initiated 16-byte processor AMOs.
You've convinced me that in most cases your approach to using hardware atomic operations would be the best general solution, but it has me wondering if the user should have an option to pragma that a certain field or local variable is not a point of major contention and perhaps use an approach that is 4x faster. Just an idea.
CMPXCHG16B instruction works exactly like every other x86 atomic op
Gotcha, I definitely should have done some more research first. That further dissolves any doubt I have.
All in all I think at this point I might favor something based on compressed pointers, myself. Even when we have CMPXCHG16B the remote execution needed to use it is no better than 4x slower than network AMOs and in real-world use may be much worse than that. The compressed pointer solution wins out over the descriptor table solution for me due to avoiding (as I understand it) the need to do anything at all when a class instance is created or destroyed, and the complication of caching table entries. In fact, really all it does is change our wide pointer representation from this (as expressed in C):
struct {
uint64_t node;
uint64_t ptr; // cast to type* to use
};
to this:
struct {
unsigned int node : 16;
unsigned int ptr : 48; // cast to type* to use
}
Granted that we're limited to 64k nodes with this technique, which would make it architecturally our tightest constraint. That is, it would be the place where we were closest to not being able to reflect the whole architecture. But we could apply it only to atomic class instances as opposed to all class instances if we wanted to and as I think I mentioned before, we're still probably some way off from being able to effectively scale a Chapel program to this size anyway.
@gbtitus
Could we at least strike a compromise: compressed pointers built into the language, but GlobalAtomicObject available as a potential module for those who wish to use more than that? Again, it does work, it may as well be offered, right? That way if there is enough demand, perhaps later the idea of integrating it into the runtime can arise.
... wondering if the user should have an option to
pragmathat a certain field or local variable is not a point of major contention and perhaps use an approach that is 4x faster.
Agreed -- I think the language could benefit from a notation that a particular atomic var (field or otherwise) is expected to be contended or not. This is useful information both for programmers maintaining the code later, and for the compiler/module/runtime which together arrange to implement the atomic var. I'm not particularly a fan of pragmas, myself. Maybe we could say contended atomic instead of plain atomic. :-)
I'm just catching up on this issue and I have a few things to say.
I think the discussion conflates two issues:
a. using a network AMO in a CAS-loop (at all, and in particular for remote AMOs)
b. the descriptor table approach vs using compression/CMPXCHG16B
on statement around the CASon statement around the CASAnd then, is it possible to write the benchmarks in a way that makes a configurable amount of contention?
Then I'd like to see comparison of these approaches:
sync var in place of the atomic object in the most naive way)Then, we can gather evidence to suggest if Greg is right about which algorithms to prefer in high contention vs low contention.
Lastly, it might be that pointer compression + CMPXCHG16B is enough; we might be able to reasonably assume that a system will support one or the other.
Also, I think an important future step (not necessarily for this issue/PR) is to investigate how one can solve the ABA problem for lock/wait-free multilocale programming in Chapel.
I still have a few responses to write on this issue, but none of them are big and I expect we're winding down toward conclusion.
In the meantime, a much more fundamental question occurred to me last night long after I'd left work: do we really expect that people will write the kind of pointer-swinging code in Chapel that they do in, say, C? I'm not asking whether they'll want to write lock-free or wait-free algorithms; certainly they will. What I'm asking is: when they write those algorithms in Chapel will they do so in a C-like way by pointer-swinging class instance refs, or should we expect (and even encourage) them to do it in a more abstract way using other data structure representations?
Could we at least strike a compromise: compressed pointers built into the language, but
GlobalAtomicObjectavailable as a potential module for those who wish to use more than that? Again, it does work, it may as well be offered, right? That way if there is enough demand, perhaps later the idea of integrating it into the runtime can arise.
I'm not averse to this but I think it's more a question for some other reviewer (looking your way, @mppf). My expertise is in mostly lower in the software stack.
@mppf
another plausible example that uses a CAS loop
I can do the next best thing. I'll try to globalize a current (well-known) data structure to test how they handle massive amounts of contention: Michael Scott's Lock-Free Queue. The queue's performance drops as you add more threads due to excessive contention placed on the head and tail, so at this point it comes down to 'How much'. I can do it based on runtime to test it. However, it should be noted again that we still have some issues with a lack-of-caching, but I can probably remedy this by allocating the queue nodes and caching their descriptors ahead of time. This could be very interesting... (I'll leave the sync variable out though).
Forgot to ask: how would I test the cmpxchg16b thing from Chapel? Or would it have to be called from C?
@LouisJenkinsCS:
You'd have to do it from C and even then, probably only gcc supports it and that not very well. I went looking at that stuff and it's sort of half-supported. There's an __int128 built-in type and the atomic compare-and-swap intrinsics can work on it, but you have to throw a compiler option (-mcx16) to enable that. So I'm not sure about the practicality of depending on it for production code right now.
Hm... @mppf Any suggestions on how to proceed on testing this? Is there a way to have chapel build with -mcx16 flag added?
I thought cmpxchg16b would just amount to some assembly code snippet (in actual assembly or assembly-in-C). I wouldn't rely on GCC intrinsics for it at this stage. Actually it's used in qthreads and gasnet - see https://github.com/chapel-lang/chapel/blob/master/third-party/qthread/qthread-src/src/threadqueues/nottingham_threadqueues.c#L163 and https://github.com/chapel-lang/chapel/blob/master/third-party/gasnet/gasnet-src/gasnet_atomic_bits.h#L904 - perhaps if we ask nicely it one of these could be pulled out into its own header file.
Anyway I do think the experimentation I described is worthwhile as we decide what needs to be directly supported in the language.
Gotcha, I'll try to have it and the results by Friday
Here's one more little bit of possibly useful information. I've been estimating that remotely initiated processor atomics probably can't perform much better than 1/4 the speed of network atomics. It turns out this is really easy to test. On a Cray XC with comm=ugni, we can toggle the ra-atomics test between doing network AMOs and doing remotely initiated processor AMOs just by whether or not we have a hugepage module loaded. In order to use network AMOs we have to register memory with the NIC and in order to do that we have to put the heap on hugepages. No hugepages, no network AMOs. And when we do use remotely initiated processor AMOs it's through a fairly lightweight runtime-internal mechanism, not a Chapel on-stmt. So this is a pretty good analogy to the situations we've been discussing. (In this case the AMO is a XOR rather than a CMPXCHG, but I doubt that matters.) It turns out that network AMOs are 30x faster than remotely initiated processor AMOs for this test, not just 4x or 5x faster. Now granted my 4x was a floor and I haven't dug into why the difference is so much greater than I expected (I have more ideas than time) but I thought I should mention this result.
Interesting... while it still doesn't leave me with much room for doubt that remote AMO is better for contention-heavy CAS-retry loop, it does give me hope for descriptor tables. I wonder though, does that 30x increase or decrease as you add more nodes?
Yes, definitely any CAS-retry spin loop should be done on the node where the target datum resides rather than over the network. What's tricky is that if you have an appropriate network AMO and don't need to spin then you should definitely use the network AMO, because remote execution is so much slower. Thus knowing whether contention will occur is really important.
The 30x increases as I add nodes: roughly it's 19x on 2 nodes, 30x on 4 nodes, 34x on 8, and 37x on 16 in a set of runs I just did. I think this just reflects that as the node count rises, a greater proportion of network AMOs have to traverse the network. ra-atomics has a block-distributed target array that spans all the nodes, and parallel tasks on all cores and nodes do atomic XOR updates to randomly selected elements of that array as fast as they can. On XC all these updates are done via network AMOs through the NIC. If the update happens to hit an array element local to the node then there is no network traversal, but we nevertheless do have to use a NIC AMO instead of a processor one. This is because the Aries AMO cache is not coherent with the processor cache(s) and thus, for a given memory location that is an AMO target, all AMOs to that location must be either done with the processor (even if the location is remote) or done with the NIC (even if the location is local).
@mppf @gbtitus
In such short notice, I'm afraid I can't do too much right now, and I'll have to devote the rest of my time for catching up on my project (I'll be falling behind at this rate, to be honest). The below shows three things: 'GDTWrites' which is a simple write using the descriptor tables (no cache, meaning worst case when using local data), GDTReads which is a simple read using the descriptor tables (no cache), and 128bit-Read which is an emulated read using the only 128-bit hardware instruction available: cmpxchg16b. It is emulated in as an efficient manner as possible:
inline void read128bit(void *srcvp, void *dstvp) {
uint128_t *src = srcvp;
uint128_t with_val = *src;
uint128_t cmp_val;
uint128_t *cmp = &cmp_val;
uint128_t *with = &with_val;
char result;
__asm__ __volatile__ ("lock; cmpxchg16b (%6);"
"setz %7; "
: "=a" (cmp->lo),
"=d" (cmp->hi)
: "0" (cmp->lo),
"1" (cmp->hi),
"b" (with->lo),
"c" (with->hi),
"r" (src),
"m" (result)
: "cc", "memory");
*(uint128_t *)dstvp = cmp_val;
}
It is as optimized as you can get: regardless of if a CAS fails, it returns what is read at that location in cmp, which is why I have return that, so it is still a wait-free operation in that it only performs 1 CAS operation, and it is the only way to safely read all 128-bits atomically.
The results of the benchmark are surprising. The descriptor table won outright, by far, and as I mentioned before, it scaled. Remember this benchmark is in terms of raw time. As well, the benchmark consisted of many rapid operations in a 'for-each-task-on-each-locale' kind of way.

However, @mppf I promised to deliver on more benchmarks, and I'll try to do so at the earliest time possible, potentially over the weekend I'll be able to do so, but if not then hopefully sometime within next week. Again, I need to catch up on my project first and foremost.
My thoughts on why the Descriptor Table won (so far) is the following...
1) Assuming an equal balance of work from each node, since compression/decompression takes place on the node that owns the object, the computational work load is also balanced. This is why the hardware approach drops in terms of performance because all nodes must jump to a single node. Also as well, in this benchmark work that is added is created locally, so there's no need to jump to another node. Unfortunately my explanation isn't anywhere near as in-depth as @gbtitus but hopefully this can suffice.
2) Reads for the table can occur with a simple lookup, no need to grab locks to add to the table, just a pure lookup. This can be made true for writes as well if its expanded on.
I'll add more if I think of anything else to add here.
Okay, I am going to be doing this incrementally, as this is the easiest way to do this. I also restricted it to 8 nodes 32 nodes instead of 64 to save time. What I'm going to do is show results of all operations; read, write, and compareExchange (with and without local on statements). Again, this is important in that both GlobalAtomicObject and CMPXCHG16B handle a write atomic operation differently from a read atomic operation.

Revisited read again, this time also comparing to pointer compression. GlobalAtomicObject is right in the middle and scales well enough.
GlobalAtomicObject shows how much time it takes to read a descriptor from a locale's table (likely remote). PtrCompression shows how long it takes for a normal network AMO + Pointer Decompression. CMPXCHG16B shows how long it takes to jump to the owning locale and perform an atomic read operation.
This one shows atomic 'Write' operations, which vary a lot more.

As can be seen, the overhead of writing for descriptor tables is massive when the descriptor is not cached, but its more of a constant. It shows that CMPXCHG16B is much slower than anything else. As before, GlobalAtomicObject is in between both PtrCompression and CMPXCHG16B.
Would like to note one crucial flaw in the descriptor table... this can't be solved from caching alone as its unavoidable. The flaw is that once the table begins filling up, adding elements to it becomes significantly slower. I'm assuming the issue is with my skip list implementation (in my defense it was my first time ever implementing, and it was based on the 'Skip List Cookbook' one). In reality once its cached on first use, there's no need to be have the skiplist, In the current implementation, performance is so abysmal it couldn't even finish at 1 node within 5 minutes (it starts very fast, then slows to a crawl).
Huge oversight, but still plenty of potential here. I'll try to compose a benchmark reusing memory for the time being.
Edit: I'm wondering if a potential bottleneck could be BigInteger.scan0() since I use that heavily to find the next free bit in the memory pool...
Okay, after a very long hiatus, I believe I can finally make some more progress on this again.
One change I'm going to be making is stripping the skip-list, as searching for it tends to be the limiting factor, and becomes irrelevant later once caching is implemented. So instead, I'm going to regress to the original design of having to 'register' each object in the table, which then returns a tuple of both the object and the index as the descriptor. So...
extern type wide_ptr_t;
type descriptor_t = (wide_ptr_t, int(64));
proc register(obj:objType) : descriptor_t { }
In the original document for the idea, here, the performance for it was phenomenal, but that was because it did not have to worry about hashing objects to find their descriptors. This can still become a reality if caching it performed well enough. I'd like to explore this when I have the time.
Note: The final version likely will not require the user to 'register' anything and just requires proper tracking of the descriptor by runtime. As I'm short on time, this chapel-side version is the best I can do at the moment.
Also as well, I plan to privatize tables across all nodes in the cluster, and use my RCU (trademark-aside) abstraction for allowing updates concurrently. I'll find a decent way to propagate such updates, but overall it will favor reusing memory over rapid creation and discarding of new memory to be registered.
Okay, I think I have a way in which I can handle managing the descriptor table. There will be only one descriptor table, wherein each cell is capable of holding any wide_ptr_t, or rather any 128-bit value. In fact, this can be generalized to allow any size, even variable-sized, data depending on the way I implement it.
First and foremost, I'll need to clarify that the RCU and the DistVector implementation (which I had shown to @mppf in an email with @e-kayrakli) likely must come first to allow for concurrent reads while resizing the tables. The DistVector would serve as, essentially, a parallel-safe resizing distributed array that Chapel current lacks. The theory behind it is that we just create a static region of memory, and then reuse that memory in a new instance (added bonus of not needing to copy data over to the new data structure, just the pointer to the data the original instance used), and then use RCU to install that new instance. In other words, we add more indirection, which proves the adage correct that "All problems in computer science can be solved by another level of indirection".
After we have DistVector, then I can utilize that as the backbone for everything. I'm going to be using a global counter to keep track of the descriptor, although unfortunately this counter needs to be placed on a single node. Luckily this scales well enough, as my data structures have proven. When registering an object, we can simply increment the counter (if we cannot recycle a descriptor) and create a descriptor based on that. If we use this approach, privatization and resizing handled by DistVector, that essentially handles majority of the problem.
Problem: How do we recycle descriptors? After the data is no longer used, what happens to the slot that is no longer in use?
Potential Solution: Maintaining a global resizable 'free-list' using DistVector could work, or we could make use of DistDeque as a FIFO/LIFO way of reusing the descriptors. As in, when a descriptor is no longer used, we could first 'pop' or 'dequeue' first and utilize that directly in our table.
@gbtitus Any thoughts on this?
Update:
I have implemented the DistVector (seen here), although it is suffering from poor performance due to lack of time. The DistVector properly incorporates the proposed RCU model into it's base, and as such allows concurrent reads during writes, although again the data structure is much slower (in the test at least; in the test we intermingle reads and writes, but writes are 1000000x slower than reads, so we can't reliably benchmark reads here...). I need to perform a lot of profiling, as I'm sure I'm suffering from communication from something I thought I'd privatized, and I'm using resizable domains over a much more cache-coherent statically allocated buffer (IIRC resizable domains will allocate new space, copy everything over, and delete the old; this is a huge bottleneck and contradictory to my design of 'reusing the old memory and linking with the new'). I can predict that this model will be a success, and can even scale, as mentioned before, given some extra time. Unfortunately, I no longer have that time, as I must catch back up on my studies.
I'll try to get back to this after I graduate (December 16th), so await some promising results!
Fleshed out the idea more here: https://github.com/LouisJenkinsCS/Chapel-Atomic-Objects/issues/1#issue-278711689
Implemented array w/ RCU, tested on single node (doing more testing in distributed setting later), seems promising. Array only allows dynamic allocation and indexing, but for purposes of descriptor table is all I need.

I can't tell what the units on the chart are. Is higher better? Can you explain why RCU array is between Sync Array and Array?
Sure, but first let me add the results for 16 nodes...

I'll update this in a bit describing it (I actually started a blog recently, so I'll write it up there). Also, higher is better, its in terms of operations per second.
I have the early parts describing the RCU algorithm, but it isn't complete yet, keep that in mind. I'll continue updating it and leave another notification when I'm done.
@LouisJenkinsCS - I havn't finished reading that yet but it reminded me of a problem we're facing. We have associative domains and arrays, but if you add to the domain, it invalidates array references. That has the side effect that parallel operations on the array can't work. This comes up with say making a histogram in parallel when you don't know the keys beforehand. I.e. I'd like to be able to write:
var D:domain(int);
var Count:[D] int;
forall key in input {
D += key;
Count[key] += 1;
}
but in fact this program suffers from a race condition, because appending to the domain could cause the array to resize and then invalidate the reference to Count[key] used by another task in the forall.
Interesting example, and looking at it RCU would be able to solve the problem (although I'd hate to make it out to be a kind of solve-all panacea). In the case of allowing indexing to be reads and appending to be a write, it would work albeit extremely slowly. Each iteration would be a full-blown write, so in the end you're better off doing this serially, or better yet doing some kind of map-reduce operation where each task spawns its own map and you combine them all at the end. RCU is for situations where we have 99.9% reads.
Edit:
Actually, originally I planned to allow insertions into the data structure in a way that allowed writes to be relatively cheap (only synchronization on a single sync-var needed) so long as we allocated enough memory ahead of time. I removed it because it wouldn't suit my purposes and it slowed down the reads by 3x. Still, it is possible to achieve reasonable performance that would best coarse-grained synchronization, but it would still reduce overall performance.
Speaking of mappings... I actually had an idea for a distributed hash table based on my own published work I did for the Go programming language. It can actually be perfectly remodeled in a way that fits the distributed aspect (a bit too perfectly, truth be told), and would allow not only concurrent updates, but would scale so well I'm sure it'd put the default sparse distributions to shame (specific data structures definitely would outperform a general data structure at the specific task they are designed for after all)
@mppf
It is done (for now), I'm not sure if I'll go back and add more visual demonstrations (likely I will) but the textual parts are mostly done (likely with many typos).
Also, one more update, after removing 'sync array' as it causes runtime crashes (coarse-grained synchronization won't work at all then, which is even more interesting), I got it to run with 32 nodes...
DistVector is the array, so it wins by quite a bit 馃
Also, I see what you meant when you said it can be applied to chpl-privatization.c, as yeah the whole idea of allocating it in chunks and performing some RCU update that reuses the same memory (similar to my own array) should work marvelously. I'd do this myself, but I'm 100% certain how to test and verify the results after I finish.
For chpl-privatization.c - I think that having a RCU capability in our standard modules would be a big deal, and if you can get the RCU in by fixing leaks in chpl-privatization, that'd be a good intermediate point (for PRs). I think we can help (myself and Elliot, probably) with verifying any update to chpl-privatization.c.
After making corrections to the benchmark, I see that there was extra communications I didn't account for before, but the good news is that the distributed vector is still scaling extremely well (like its consistently 1/3 the performance of the unsynchronized distributed Chapel array), which is extremely good... Like, 300M Op/Sec vs 900M Op/Sec good!
Now, after seeing that QSBR-based RCU yields literally zero-overhead, I realize that right now if I could implement that Chapel-side this would likely bring the data structure actually ahead of (or at least neck-and-neck) with the normal Chapel array, while offering concurrent resizing & indexing.
Update:
I finally got the synchronized array to finish in time by lowering the number of operations per task to 1024 (down from 1M), the results are interesting and exciting.

Edit: Changed labels to be more accurate
So I was thinking about the true applicability and generality of the descriptor table... why stop at wide-pointers? In fact, its funny but the table could be used to allow atomic instances for structures as well. Essentially the descriptor acts as a layer of indirection similar to how a pointer does, so it would be possible to store at each slot a struct or record. In particular, for GlobalAtomicObjects it would be storing a struct/record, a wide_ptr_t. Its interesting actually... gone are wide pointers, replaced by descriptors.