Chapel: GlobalAtomicObject - Scalable Atomic Operations using Descriptor Tables

Created on 12 Jul 2017 · 75Comments · Source: chapel-lang/chapel

Global Atomic Object

(PR available: #6717 )...

Details

Below, I go over some details that are present in the current implementation

Pointer Compression

This is the common case, but certainly not a novel concept in and of itself nor to
Chapel itself, however it is sufficient for most cases where the number of locales
fit within 16 bits. It takes advantage of the fact that all known architectures and
operating systems only use 48 bits of the virtual address space, meaning we are free
to embed the 16 bits of the locale id in the upper 16 bits. This means for 65K locales,
we get the benefit of using atomics for a single 64-bit atomic operation, making
atomics extremely fast.

Descriptor Tables

This could be a new design pattern really, but the idea, while novel in its application,
is simple in theory. We maintain a table of objects, one for each locale, of which
we use the index and locale id as a descriptor (a pseudo-pointer). When we want to
atomically operate using a class instance, we first hash the class instance (it's 64-bit address)
and perform an insertIfNotPresent operation on the target locale, and then embed
the locale id in the upper 32 bits, and the index into the descriptor table into the
lower 32-bits. Compression is the most expensive part, as it involves doing the above
every time (although see improvements for a huge improvement), but decompression
is cheap in that it can easily directly index into the descriptor table.

Performance

I compare GlobalAtomicObject (using descriptor tables) performance against 'Sync Var',
the all-mighty sync variable, which currently gets a handicap by just acquiring the lock,
perform some arbitrary cheap operation, then immediately release it.
This is the 'baseline' which we must beat or else the implementation is useless.
The next is Atomic on a 64-bit value, which is the 'ceiling', and can be seen as how
much performance we can get for normal pointer compression versus keeping tables like this.
It has two benchmarks: 'single' which is a single atomic operation, and 'LLSC' which is
a pseudo Load-Linked Store-Conditional operation. LLSC looks like such...

var x = atomicVar.read();
atomicVar.compareExchangeStrong(x, f(x));

There is no retrying, just a one-and-done atomic operation (which I count as one).
In 'Global Atomic', we use similar atomic operations as 'Atomic' does, although
a few things need to be considered. While for 'Single', it performs a write
(because a read is not interesting since it is a basic lookup in a locale's table),
for data in local memory, which shows the ideal. Now 'LLSC' can perform operations for
data that is remote, meaning that each time, its possible that during compression
of the data, we may need to jump to other locales. It shows a more average case, but
not the worse case when data to compareExchangeStrong is both remote. Currently,
'LLSC' for GlobalAtomicObject demonstrates 3 atomic operations: the read which
is a simple lookup, and compression of both variables passed to compareExchangeStrong
which one is local, the other remote.

It should be noted again that with pointer compression and a smaller number of nodes,
performance is excellent. However, when one has more than the maximum allowed, it is
still manageable and better than a sync variable is. It should also be emphasized that
the runtime of GlobalAtomicObject with pointer compression is equivalent to Atomic.

I present two graphs: one with respect to runtime, and another in terms of raw operations
per second.

Time

This shows the runtime of the benchmark. As can be seen, raw atomic operations grossly
beats anything else, although in terms of atomic LL/SC, it does lag enough to demonstrate
the power of global atomics. A single global atomic, while is magnitudes slower than a
single 64-bit atomic, is only 2x slower than the LLSC.

Imagine if we had attempted to implement some MCAS operation to update both locations using
descriptors. At best we would succeed within 4 LL/SC operations without obstruction
(2 to set descriptors, 2 to update), of which a single Global Atomic variable would beat.
Of course, in reality, the operation would require multiple retries, in the hundreds to
thousands with real workloads that placed contention on it. This shows that, in terms of
global atomics, this is the best option currently, as a global atomic is guaranteed to succeed in one operation.

Operations Per Second

This shows the raw operations per second. Notice that I removed 64-bit atomics entirely as they
shadowed over everything else to the point that the performance boost wouldn't be noticeable.
As can be seen however, both a single global atomic operation, and even the LL/SC combo
operation with a potential remote reference (meaning an added on statement) beats a normal
sync variable. In every case, GlobalAtomicObject wins outright, but performance penalties from
remote references are very apparent and should be avoided at all costs.

Usage

Sample usage of GlobalAtomicObject can be seen below (and yes, it does work).

class C { var c : int; }
var a = new C(1);
var b = new C(2);
var x = new GlobalAtomicObject(C); // atomic C
var result : bool;

x.write(a);
result = x.compareExchange(a, b);
assert(result);
assert(x.read() == b);

// Note that you can call 'delete' on the object while having it still be present
// in the descriptor table. This may be okay for most cases where memory is not
// an issue since reused addresses will hash to the same node and that node will
// already point to the valid object the user is attempting to insert. However if
// memory is an issue, one can call '_delete' to ensure that the node itself is removed
// and can be recycled for later use.
delete a;
delete b;

// Is Safe because we only use the pointer itself. However, in cases where 'a' and 'b'
// can be reused by the same GlobalAtomicObject, then this could cause undefined behavior
// where another concurrent task adds the same pointer. In those cases, you should call
// '_delete' before 'delete'.
x._delete(a);
x._delete(b);

// As well, when GlobalAtomicObject goes out of scope, all nodes it had in use also
// get deleted...

GlobalAtomicObject(C) can easily become atomic C. If the language were to be
modified, it wouldn't be too surprising to see the above become the below (which
of course lacks the excessive comments), which is an exemplary ideal that @mppf
insisted on:

class C { ... }
proc test() {
   var a = new C(1);
   var b = new C(2);
   var x: atomic C;
   var result:bool;
   x.write(a);
   result = x.compareExchange(a, b);
   assert(result);
   assert(x.read() == b);
   delete a;
   delete b;
}

Where delete can call some hook which also removes it from the table if it has not yet been
reclaimed. I believe this can become a reality with little effort.

Improvements

Below, I go over some potential improvements that can be made by others who are up to it.

Pointer Compression + Descriptor Tables

It may be possible to combine normal pointer compression for the lower locales that fit
within 16 bits. Combining both means we allow better performance on average, even for
cases when they have more than 16 bits worth of locales, any locales with an id
less than 2^16 get the power of atomics, while the ones greater than or equal
to 2^16 get to use the much-slower-but-faster-than-sync-variables descriptors.

See next idea for how to differentiate.

Zone Compression

A theoretical model where we use a more dynamic type of compression which attempts
to uses the lowest 32 bits for the virtual address, and divide the upper 32 bits
into zone bits and locale bits. A 'zone' is a base 64-bit offset identified by
zone id, which is allocated when a new unique id is found. With optimizations
such as caching the zone mappings for other locales (I.E: First time a unique upper 32
bits are found, check if we have it cached, if not found jump to the owning locale
and add it if not already, and update what we have cached, etc.)

The benefit here is that we get to keep using 64-bit pointer compression, and the amount
of unique zones that can be kept track of depend on the number of bits needed to keep
track of the locale id.

"What happens when we run out of zones?" - We use the descriptor
tables again. We can just search the unique zones for the target locale first, if the
zone isn't found and we can't create a new one, we just use descriptors.

"How do we differentiate between Zone Compression, Pointer Compression, and Descriptor
Tables?" - Its possible to use the lowest 2 bits. This means lowering the potential
maximum segments for descriptor tables, but it allows a much more dynamic compression
strategy.

Caching Descriptors into Objects

This is a change that isn't theoretical but requires changes with the language itself.
Majority of the overhead of descriptor tables is hashing the object, which requires jumping
to the owning locales, but if the descriptors were cached then repeated accesses would
be just about as fast as atomics. If it were cached, the benefits of recycling memory
to be used in atomics would be tenfold (or hundredfold, as the benchmarks show).

Concurrent-Safe Implementation

The table utilizes a simplistic skip list implementation on a segmented memory pool.
The memory pool allows indexing by allowing an easy translation from a 32 bit
index into a segment and segment data index. It utilizes a simple bitmap (BigInteger).
The implementation of the memory pool allows wait-free reads (hence indexing is fine)
but requires synchronization to recycle/reuse memory. The skip list used to manage
memory also requires synchronization, and a lock-free/wait-free implementation requires
proper memory management such as Hazard Pointers which requires thread-local or task-local
storage. Hence, synchronization is needed and we use a simple local sync variable.

If someone managed to make everything concurrent, it could improve performance further
by 5 - 10x at least.

Edit:

Removed Introduction to avoid further derailment of discussion.

Language Libraries / Modules Needs Design Review Design

Source

LouisJenkinsCS

All 75 comments

In any budding language, especially one that boasts to be a productive
concurrent/parallel language, atomics on all types are a necessity.

This statement strikes me as being a naive oversimplification. Is an atomic distributed array a necessity for any language that supports distributed arrays? I don't think so, and am not even sure what that would mean.

bradcray on 13 Jul 2017

👍1

Is an atomic distributed array a necessity for any language that supports distributed arrays?

The emphasis would be on productivity. "What can I do if X were atomic compared to if it were not". If X is a distributed array, what do we gain? Each element in a distributed array can be updated atomically as is, so perhaps not. The distributed array could be made atomic by allowing concurrent access while resizing, which is a huge plus and would further enable productivity, sure. Would I say that this is a necessity to allowing the user to be more productive? Arguably. However, that's not the case here.

Since X is a class instance, the amount of productivity gained by making it atomic is limited to the stretch of the imagination. You can:

1) Implement more complex algorithms and data structures, like scalable wait-free algorithms
2) Allows any local algorithm to be converted into a distributed algorithm
3) Literally becomes a universal construct that allows us to make any algorithm become distributed (RCU like in the case of atomic distributed arrays, STM, garbage collection, what-have-you)

Is the above a necessity? Absolutely. However unlike atomic distributed arrays, I have presented an actual solution which works and scales.

Edit:

For more clarity: Atomics for Class Instances is a necessity, but let me expand on the universal construct bit for a moment. If you have any type of data, you can create a field in an class instance. That's nothing special, but what it allows is: any type of data, whether an int, a string, a record or some distributed array, lets call it data, which can be a field in some wrapper class instance. If you wish to atomically update it, imagine the follow:

var globalData : atomic Obj;

// Read...
var current = globalData.read();

// Copy...
var modified = clone(current);
modified.field = data;

// Update...
globalData.compareAndSwap(current, modified);

You have potential RCU. Of course the next actual difficulty would be memory reclamation, but the point is that it would have been the same whether the algorithm was distributed or not.

Furthermore, if it sounded like I was bashing Chapel, I'm not, I wish to see it improve and as such I am offering suggestions to its problems.

LouisJenkinsCS on 13 Jul 2017

I think you're misinterpreting my comment. I don't have any objections to supporting atomic classes in Chapel, and have long believed that we should do so. My objection was with the opening blanket statement which I read as "any new language that has type X must necessarily have atomic type X." And I simply don't believe that to be true (and used "distributed array" as my example X). As a result of my reaction to the statement (+ time constraints), I rolled my eyes and stopped reading there rather than being intrigued and continuing on. Then eventually I came back to comment on my reaction.

To me, a more believable, less absolutist, and less objectionable statement would be something like "Any budding language, especially one that strives to be a productive concurrent/parallel language, should consider for each supported type T whether there would also be value in supporting an atomic variant of T." This statement seems far harder to argue with to me.

bradcray on 13 Jul 2017

Understood, it makes sense that it made a broad generalization, but I didn't expect it to have the kind of effect which would cause people to skip over it from the first sentence. I revised it so that it didn't generalize to any type, but to a specific type of data.

LouisJenkinsCS on 13 Jul 2017

Thanks!

To make sure I'm understanding the following:

let me expand on the universal construct bit for a moment. If you have any type of data, you can create a field in an class instance. That's nothing special, but what it allows is: any type of data, whether an int, a string, a record or some distributed array, lets call it data, which can be a field in some wrapper class instance.

Are you saying that the original statement was intended to convey "Supporting atomic classes in a language effectively adds support for arbitrary atomic types as well because given any type T we can create a class C_T with a field of that type and now by declaring an atomic C_T, we effectively have an atomic T"?

bradcray on 13 Jul 2017

Are you saying that the original statement was intended to convey "Supporting atomic classes in a language effectively adds support for arbitrary atomic types as well because given any type T we can create a class C_T with a field of that type and now by declaring an atomic C_T, we effectively have an atomic T"?

Technically, it can yes. The semantics would change (requires RCU or similar updates), performance may drop significantly, but it is possible to emulate atomic T with atomic C_T with effort (even if it is a lot). The point I was trying to make was that it necessary in that it made such things possible (even if they weren't optimal), but that was before you informed me the issue was because my generalization (of which I may have done again).

LouisJenkinsCS on 13 Jul 2017

👍1

@gbtitus and @ronawho are planning to look at this proposal. (and me too, but I've already looked at it enough to be comfortable).

mppf on 17 Jul 2017

@mppf

I had another idea in terms of how it can be implemented into the language itself. What if you had one global table (not coupled to a GlobalAtomicObject, but initialized first thing at runtime) that was managed across all nodes. Management of the objects (adding, propagating changes across nodes, removing them when the object is disposed of with delete, etc) is up to you (and I'm sure you guys can think up much better solutions), but this allows us to use the whole 64 bits. Embedding the 64-bit descriptor number (index into the table) into the object is still a valid optimization here, and we allow 2^64 potential objects. The compiler can properly instrument read, write, etc.

Boom, you've got your global descriptor table. The low level details would be beyond me though.

LouisJenkinsCS on 18 Jul 2017

@LouisJenkinsCS - I'm not sure what to interpret from your other idea other than that we could come up with a completely different solution from what you created, but I already knew that. But I'm probably missing something. What would be the advantage of having one global descriptor table?

mppf on 18 Jul 2017

@LouisJenkinsCS: For some reason I'm feeling like one of the edits to the original proposal removed a lot of the context. Basically the need is for AMOs on global pointers which conceptually are larger than 64 bits (and we don't have 128-bit network AMOs), and the specific operations we need are just Load, Store, Swap, and Compare-and-Swap? Is that correct?

gbtitus on 18 Jul 2017

@mppf

Allow me to contrast having one single global descriptor table to having one coupled to GlobalAtomicObject (or each atomic Obj).

1) It makes caching the descriptor possible in O(1) space, as it can be added directly to the object's header the first time it is used in any atomic operation (via checking if its cached descriptor is set). If you have a unique table coupled to each atomic Obj, then the cached descriptor is specific to each instance of atomic Obj.
2) There can be one table to rule them all, in that the table could hold 2^64 arbitrary wide_ptr_t data, and have it store any kind of object. Currently, you'd need one descriptor table per atomic object, and even if you had one table per runtime type, a lot of space will be wasted. This can be done in just one extra level of indirection.
3) atomic Obj can become just an arbitrary descriptor index.

Edit:

To backup whether this is more than enough: as you stated, an object must take up at least 32 bytes, so we're talking about 590.29581 exabytes worth of data before space becomes an issue.

LouisJenkinsCS on 18 Jul 2017

@gbtitus Yes, originally I had stated that in the removed introduction, and yes we only need those operations.

LouisJenkinsCS on 18 Jul 2017

@LouisJenkinsCS - I think using a single global table is interesting as an improvement. Instead of asking us to solve all the problems - can you describe what operations should happen when to support such a thing. What would you need the compiler to do? What does the runtime do? I would expect as we integrate more with the compiler/runtime you'll need help to do so.

However I think we should view it as an improvement/optimization and try to get a 1st version merged. I.e. I think we should discuss the global version later and in a different issue. For the purposes of this issue, I think it would nice to simply say why it is harder (what are the challenges?).

mppf on 18 Jul 2017

I feel like I must be missing something obvious. One solution would seem to be to implement those AMOs by simply doing an on-stmt to the node where the AMO needs to be done and then doing it as a 128-bit processor AMO (either directly if there is CPU support, or indirectly using the x86 128-bit compare/exchange instruction CMPXCHG16B). In the current implementation this would take 2 network round trips per AMO, all things considered. What are the advantages of the proposed solutions over doing that?

gbtitus on 18 Jul 2017

@gbtitus

~~The issue would be atomically updating them. What happens when you have another atomic operation begin while another is ongoing?~~ (Misread what you meant by doing an on-statement, yes that would work if there was hardware support). In terms of the current implementation, what it does is allow us to have an immutable pseudo-reference to a wide pointer that we know won't change and can be represented by a normal 64-bit integer. I'm not sure how CMPXCHG16B is implemented, can you fill me in on the specifics? Does it do a double LL/SC that causes many operations to retry if they fail, or does it atomically operate on whole 128-bit data words? Regardless however, not all hardware have support for such operations and this would be a software implementation to fallback on.

LouisJenkinsCS on 18 Jul 2017

@mppf

With the current version, we can do the following...

1) Implement atomic C as new GlobalAtomicObject(C).
2) Add primitives such as __primitive("get cached descriptor", this, obj) and __primitive("set cached descriptor", this, obj), which can potentially map the current GlobalAtomicObject to the object's descriptor.

Currently, that's all that would be needed to get a nice efficient implementation going.

LouisJenkinsCS on 18 Jul 2017

CMPXCHG16B is a 16-byte atomic compare-exchange. The LOCK prefix is needed in multicore situations. It's true that this operation is not necessarily present on all architectures -- it's not implemented on early AMD x86_64 and there may well be other processor architectures that don't have it at all. On those one would need to do something else.

However, I'd argue that the "something else" should not do any network ops. In other words, no matter what actual mechanism we use for the pointer updates, it should only involve local operations, because network communication is very costly. I think this rules out a Descriptor Table-based solution, because in the general case that has to do remote references.

That still leaves a lot of maneuvering room for implementation, though. For one thing, we might be able to go a bit further with straight pointer compression. If the pointers in question are aligned to more than a single-byte boundary then there are low-order address bits which must be zero and therefore do not have to be stored in the compressed form of a pointer. This would give us more node bits in the compressed form.

The bluntest possible tool might be something like maintaining, on every node, a local array of sync vars index by node, and protecting all updates to global pointers with a given locale value by locking the corresponding local sync. (I did say it was a blunt tool!) Speaking of which, in your comparison with sync var protection at the top, where (which node) were the sync vars involved, and was there one per global pointer, or just one for everything, or something else? Would it be possible for you to post your scaling comparison code(s)?

gbtitus on 18 Jul 2017

For some reason I'm feeling like one of the edits to the original proposal removed a lot of the context.

I also got confused by trying to read the evolving document -- started last week, couldn't find my place this week.

bradcray on 18 Jul 2017

CMPXCHG16B is a 16-byte atomic compare-exchange.

Is there a similar atomic operation to perform reads, writes, and exchanges for 16 bytes? If it is anything similar to a Double-Word Compare-And-Swap, then wouldn't its purpose be to update two arbitrary 64-bit locations? While each location may be updated atomically, if you had to read each 64-bit value separately, then couldn't another atomic CMPXCHG16B operation could change its value mid-read?

However, I'd argue that the "something else" should not do any network ops.

Local to what? If the data refers to something remote and another remote node can atomically set it, then network operations are unavoidable. Communications across nodes are costly, but they do scale. If bounded, they are a constant overhead that can be managed.

I think this rules out a Descriptor Table-based solution, because in the general case that has to do remote references.

Not necessarily. Again, by caching the descriptor, this means that the only remote operation occurs the first time the object is used. Therefore, this remote operation merely occurs once. The other operation is using network atomics to set the 64-bit atomic field, which is also unavoidable. A network atomic operation would beat using an on statement any day (or at least in Chapel it does, I don't know how well optimized you can make it in the runtime).

If the pointers in question are aligned to more than a single-byte boundary then there are low-order address bits which must be zero and therefore do not have to be stored in the compressed form of a pointer. This would give us more node bits in the compressed form.

That's an interesting approach, definitely one that can be used in the meantime, but relatively soon you would run out of bits.

Would it be possible for you to post your scaling comparison code(s)?

Benchmark.

LouisJenkinsCS on 18 Jul 2017

I'm going to respond to the parts of this separately, since I suspect some will be dealt with and disappear rapidly while others lead on to further conversation ...

CMPXCHG16B is a 16-byte atomic compare-exchange.

Is there a similar atomic operation to perform reads, writes, and exchanges for 16 bytes? If it is anything similar to a Double-Word Compare-And-Swap, then wouldn't its purpose be to update two arbitrary 64-bit locations? While each location may be updated atomically, if you had to read each 64-bit value separately, then couldn't another atomic CMPXCHG16B operation could change its value mid-read?

No, it's an octoword compare-exchange, that is, it operates on an single aligned sequence of 16 bytes, not on two separate 8-byte quadwords.

gbtitus on 18 Jul 2017

@gbtitus - I think it'd be reasonable to do an on with a CMPXCHG16B if CPU support is available. I view the approach @LouisJenkinsCS is trying here as the fall-back - in case 128-bit processor atomics are not available. Additionally I can't predict if 64-bit compare-and-swap but extra work on creating/destroying a instance for the atomic object would be faster for a given application.

A separate concern is that some algorithms might want to use a generation number in the other 8 bytes, so CMPXCHG16B could protect against the ABA problem.

@LouisJenkinsCS - I'm pretty sure CMPXCHG16B only updates contiguous 16 bytes.

mppf on 18 Jul 2017

@gbtitus - I'm not so sure the descriptor table results in more communication. @LouisJenkinsCS - please correct me if I am misunderstanding the strategy here.

In particular a typical pattern with compare-and-swap for objects is something like this:

make a new object
fill in some fields
compare-and-swap it in to place in an atomic

e.g. updating a single-linked list in a lock-free manner.

Since the creation of the object occurs on the same place doing the compare-and-swap, the descriptor table that is updated is the one that is local to that node. Updating the descriptor table gives a 64-bit identifier for the class instance, which is then CAS'd in.

Of course, if another node goes to use the 64-bit identifier (say by reading the atomic class instance and then dereferencing it), the (remote) descriptor table will need to be consulted to get the wide pointer corresponding to that identifier.

But, if a node is dereferencing a remote class instance, there will be communication anyway. (Unless some caching strategy is used, say --cache-remote, but that could also cache the remote descriptor table).

mppf on 18 Jul 2017

However, I'd argue that the "something else" should not do any network ops.

Local to what? If the data refers to something remote and another remote node can atomically set it, then network operations are unavoidable. Communications across nodes are costly, but they do scale. If bounded, they are a constant overhead that can be managed.

Local to the place where the memory is being updated. This does get confusing, because we have two notions of locality. One is the node where the pointer points, that is, the node part of the global pointer. The other is the node where the global pointer resides, that is, where it lies in memory. In this case I'm talking about the latter. (I got confused about this myself in another part of the conversation, which I'll address in a few minutes.) I'm saying that when we update a memory location containing a global pointer, it should be sufficient to just do an on-stmt to the locale where that memory location is and then change it using only local operations, no network communication. And I think you'd say that a Descriptor Table-based solution actually obeys this rule because even though the table access conceptually involves network communication, in practice by caching table entries we can avoid the network.

gbtitus on 18 Jul 2017

I think this rules out a Descriptor Table-based solution, because in the general case that has to do remote references.

Not necessarily. Again, by caching the descriptor, this means that the only remote operation occurs the first time the object is used. Therefore, this remote operation merely occurs once. ...

Got it, I hadn't snapped to this being the purpose of the caching.

... A network atomic operation would beat using an on statement any day (or at least in Chapel it does, I don't know how well optimized you can make it in the runtime).

Yeah, an executeOn (the runtime moral equivalent of a Chapel on-stmt) can't even get close to a network AMO except in very specialized circumstances, and certainly can't ever match or beat it.

gbtitus on 18 Jul 2017

One thing I'd need to cleared up here is this: Are we talking about the current implementation, or the ideal implementation?

In both cases... @mppf That is correct.

LouisJenkinsCS on 18 Jul 2017

If the pointers in question are aligned to more than a single-byte boundary then there are low-order address bits which must be zero and therefore do not have to be stored in the compressed form of a pointer. This would give us more node bits in the compressed form.

That's an interesting approach, definitely one that can be used in the meantime, but relatively soon you would run out of bits.

Sure, depending on the definition of "soon", heh. But keep in mind that you only have to use this technique as long as you don't have an 16-byte compare-exchange AMO. If that's getting more common faster than large-scale systems are threatening to grow over 64k nodes (so straight pointer compression can be used), then it's the way to go. Plus it's not actually 64k-node systems we have to worry about, rather it's 64k-node distributed Chapel programs, and other scalability constraints may prevent that for a while anyway. So many engineering tradeoffs ...

gbtitus on 18 Jul 2017

I said:

The bluntest possible tool might be something like maintaining, on every node, a local array of sync vars index by node, and protecting all updates to global pointers with a given locale value by locking the corresponding local sync. ...

This is nonsense. I confused the node in the global pointer with the node on which the global pointer was being updated. In general if I'm updating a field containing a global pointer, the node numbers in the old and new values of that field are not the same and I can't do what I described. So, never mind this idea!

gbtitus on 18 Jul 2017

keep in mind that you only have to use this technique as long as you don't have an 16-byte compare-exchange AMO

Okay, this brings up my previous question. Is there a way to atomically read/write/exchange 16-byte data? Atomic updates of any class instance globally I believe would be significant. I've looked around for anything like read16b, write16b, etc, and can't find anything. My question is: if you try to read a wide pointer and someone else does a CMPXCHG16B, would it read what was there before CMPXCHG16B finished, after, or in between?

LouisJenkinsCS on 18 Jul 2017

Okay, this brings up my previous question. Is there a way to atomically read/write/exchange 16-byte data? Atomic updates of any class instance globally I believe would be significant. I've looked around for anything like read16b, write16b, etc, and can't find anything. My question is: if you try to read a wide pointer and someone else does a CMPXCHG16B, would it read what was there before CMPXCHG16B finished, after, or in between?

I think you'd have to emulate read/write/exchange by means of CMPXCHG16B itself.

gbtitus on 18 Jul 2017

I think you'd have to emulate read/write/exchange by means of CMPXCHG16B itself.

That definitely is something I would like to test. (Note I am assuming that it is implemented in the hardware in such a way that resembles a normal compare-and-swap on 8 bytes) That means that you have a lot of intra-node traffic being done (and inter-node traffic with respect to NUMA nodes) just to access data in that manner. If you have a large amount of traffic (say each task from each node in a cluster accessing one atomic field, which isn't too uncommon for distributed data structures), then having all of those on statements would result in a massive performance loss. Even normally wait-free atomic operations like write and exchange just become lock-free. Add to that the fact that you have so many remote tasks absolutely overwhelming the node on which the memory lies on, and I think perhaps descriptor tables at the idealized 'everything-cached' solution may be a viable alternative, in that it avoids the 'CAS retry problem' bottleneck.

However, if the x86 LOCK prefix provides mutual exclusion at a hardware level with no need to retry, then it boils down to acquiring a (very low-level) spinlock for all operations, right?

LouisJenkinsCS on 18 Jul 2017

Would it be possible for you to post your scaling comparison code(s)?

Benchmark.

The sync code seems rather unfair, since you're using a program-wide sync var to single-thread everything. Probably a production implementation would do something that scaled better, such as having a number of sync vars on each locale, say a multiple of the core count so as to reduce contention, and then for the update, doing an on-stmt to the locale where the field lived, somehow using the field address to select a sync var, and then using that sync var to protect the update. This would scale. I'd believe you if you argued that it wouldn't scale as well as solutions using network AMOs could. Nevertheless, when comparing to an alternative technique you don't like, it's always more convincing to show that alternative in the best possible light, not the worst. If you could do this and rerun, that would be great to see.

Also, for the code where you're using CompareExchange, I think you need an additional inner loop. The code at this line, for example, doesn't necessarily replace the value x with the value x + 1 in the variable atomicValue. It only does so if atomicValue actually contains the value x at the time the operation is done. I think instead you'd need something more like this:

const x = atomicValue.read();
while !atomicValue.compareExchangeStrong(x, x + 1) do chpl_task_yield();

Basically, this runs until comparison succeeds and the exchange occurs. (The yield is there to avoid starvation, if some other task needs to run in order to create the situation that allows you to make progress.)

gbtitus on 18 Jul 2017

The sync code seems rather unfair, since you're using a program-wide sync var to single-thread everything.

You are definitely right, the syncOp benchmark uses a very naive implementation, but I also gave it a handicap as well. I was going to compare and contrast the overhead of a read, write, exchange, and compareExchange for sync, but any performance difference would be negligible as 99% of the overhead is from acquiring a remote lock. Ignore what is going inside of the lock, just the fact that a lock is acquired to do something.

Probably a production implementation would do something that scaled better, such as having a number of sync vars on each locale, say a multiple of the core count so as to reduce contention, and then for the update, doing an on-stmt to the locale where the field lived, somehow using the field address to select a sync var, and then using that sync var to protect the update.

The field is located on one locale, so in this case fine-grained locking won't work. As well, all examples here are rather simplistic in nature, but you do raise a valid point on performing an on statement on the node that owns it, which would definitely be better than remotely acquiring it, its just that the lock in this case is protecting one piece of data, and cannot be made any more granular than that.

Also, for the code where you're using CompareExchange, I think you need an additional inner loop. The code at this line, for example, doesn't necessarily replace the value x with the value x + 1 in the variable atomicValue. It only does so if atomicValue actually contains the value x at the time the operation is done.

See the 'Performance' section in the original post. I chose to avoid adding contention for CAS retrying because I know that a CAS retry loop won't scale. In fact, I'd argue that a sync variable can scale better than a 'CAS Retry AMO` can due to the fact that the contention would cause a significant decline in performance as we add more tasks.

Clarification

In summary: I will rerun the benchmark with LL/SC being a retry-loop, and have syncOp use an on statement.

LouisJenkinsCS on 18 Jul 2017

@gbtitus

Actually, perhaps I could construct a better benchmark, one in which we fairly compare both sync and GlobalAtomicObject and leave normal atomic out in general (as it has to do with atomic operations using class instances). Would you agree to that as a better benchmark? I'm thinking of constructing a list... It goes like this...

Global Atomic Object List

var head : GlobalAtomicObject(node);
var tail : GlobalAtomicObject(node);
var ourNode = new node();
var tailNode = tail.exchange(ourNode);
if tailNode == nil {
   head.write(ourNode);
} else {
   tailNode.next = ourNode;
}

Sync Variable List

var lock$ : sync bool;
var tail : node;
var head : node;
lock$ = true;
var n = new node();
if tail == nil { 
   tail = n;
   head = n;
} else {
   tail.next = n;
   tail = n;
}
lock$ = false;

In this case its fair in that they do roughly the same (although the GlobalAtomicObject isn't guaranteed to have everything be visible immediately).

Results (Time:

NumLocales | Global Atomic List | Remote Sync List | On-Stmt Sync List
------------- | ---------- | -------------------- | --------
1 | 0.111723 | 1.10026 | N/A
4 | 13.1698 | 106.642 | 76.1085

Any more and it would take far too long to be worth it. In this case, GlobalAtomicObject wins, but that's because the algorithm for the list wait-free and doesn't care about when nodes become visible, but again its meant to demonstrate raw atomic operations here. Code here. I'd like to emphasize again that I only recommend usage of global atomics for things that are wait-free or bounded in terms of # of retrying. Of course, if need be one can employ the "fast-path slow-path methodology -
slide paper"

Edit:

The performance boost in GlobalAtomicObject is likely due to switching from using Chapel arrays to tuples.. which is actually fascinating!

LouisJenkinsCS on 18 Jul 2017

I think you'd have to emulate read/write/exchange by means of CMPXCHG16B itself.

That definitely is something I would like to test. ... That means that you have a lot of intra-node traffic being done (and inter-node traffic with respect to NUMA nodes) just to access data in that manner. If you have a large amount of traffic (say each task from each node in a cluster accessing one atomic field, which isn't too uncommon for distributed data structures), then having all of those on statements would result in a massive performance loss.

Just a quick note in case there's been confusion: it wouldn't necessarily be required that we do full Chapel-level on-stmts to implement this stuff using CMPXCHG16B. It could be done with direct runtime support for a low-level equivalent ExecuteOn which would have much lower overheads (and wouldn't need a full-blown Chapel task on the remote side). So, I think it's fair to interpret any use of the term "on-stmt" in here as referring to some kind of remote execution but not necessarily a full Chapel on. That said, the Active Message handling in support of the remote execution would still be required, and that's the largest cost difference between a network AMO, and a processor AMO initiated remotely.

Even normally wait-free atomic operations like write and exchange just become lock-free. Add to that the fact that you have so many remote tasks absolutely overwhelming the node on which the memory lies on, and I think perhaps descriptor tables at the idealized 'everything-cached' solution may be a viable alternative, in that it avoids the 'CAS retry problem' bottleneck.

"Absolutely overwhelming" seems hyperbolic. Here's why. I believe that Read, Write, Exchange, and in most usage even Compare/Exchange itself (e.g. when used for pointer swinging) can be implemented using remote execution of a processor AMO CMPXCHG such that the CAS retries are done locally on the remote node, not over the network. Now, either the field we're updating is contended or it is not. If it is contended then even if we arrange (by a descriptor table, say) to only need 64-bit CMPXCHG we'd still better not do our CAS retries over the network because network refs are 1000x slower, recall. So if the update is contended we want to use remote execution and processor atomics even if we do have 16-byte network AMOs. If the update is not contended, on the other hand, then because remote execution is definitely slower than network AMOs (probably no better than 1/4 as fast), a solution based on 8-byte network AMOs, such as compressed pointers or descriptor tables, will definitely outperform one based on remotely initiated 16-byte processor AMOs. Nevertheless, the node won't be "overwhelmed", not by an un-contended update. Un-contended updates can't overwhelm anything.

(Just to cover all the bases: if remote execution of processor atomics is 4x slower than network atomics then granted there is a range of rates for inbound un-contended updates such that network atomics can do them at full speed while remote execution and processor atomics will be throttled. But such update rates can't be reached without really serious workload imbalance, for example all the nodes pounding on one of them, and I'd argue that this is a special case we need not design for.)

However, if the x86 LOCK prefix provides mutual exclusion at a hardware level with no need to retry, then it boils down to acquiring a (very low-level) spinlock for all operations, right?

You need the LOCK prefix for any x86 CMPXCHG atomic op instruction if you want it to be safe in a multicore setting. As far as I can tell the CMPXCHG16B instruction works exactly like every other x86 atomic op in this regard. There's nothing special about it.

gbtitus on 18 Jul 2017

If the update is not contended, on the other hand, then because remote execution is definitely slower than network AMOs (probably no better than 1/4 as fast), a solution based on 8-byte network AMOs, such as compressed pointers or descriptor tables, will definitely outperform one based on remotely initiated 16-byte processor AMOs.

You've convinced me that in most cases your approach to using hardware atomic operations would be the best general solution, but it has me wondering if the user should have an option to pragma that a certain field or local variable is not a point of major contention and perhaps use an approach that is 4x faster. Just an idea.

CMPXCHG16B instruction works exactly like every other x86 atomic op

Gotcha, I definitely should have done some more research first. That further dissolves any doubt I have.

LouisJenkinsCS on 19 Jul 2017

All in all I think at this point I might favor something based on compressed pointers, myself. Even when we have CMPXCHG16B the remote execution needed to use it is no better than 4x slower than network AMOs and in real-world use may be much worse than that. The compressed pointer solution wins out over the descriptor table solution for me due to avoiding (as I understand it) the need to do anything at all when a class instance is created or destroyed, and the complication of caching table entries. In fact, really all it does is change our wide pointer representation from this (as expressed in C):

struct {
  uint64_t node;
  uint64_t ptr;    // cast to type* to use
};

to this:

struct {
  unsigned int node : 16;
  unsigned int ptr  : 48;    // cast to type* to use
}

Granted that we're limited to 64k nodes with this technique, which would make it architecturally our tightest constraint. That is, it would be the place where we were closest to not being able to reflect the whole architecture. But we could apply it only to atomic class instances as opposed to all class instances if we wanted to and as I think I mentioned before, we're still probably some way off from being able to effectively scale a Chapel program to this size anyway.

gbtitus on 19 Jul 2017

@gbtitus

Could we at least strike a compromise: compressed pointers built into the language, but GlobalAtomicObject available as a potential module for those who wish to use more than that? Again, it does work, it may as well be offered, right? That way if there is enough demand, perhaps later the idea of integrating it into the runtime can arise.

LouisJenkinsCS on 19 Jul 2017

... wondering if the user should have an option to pragma that a certain field or local variable is not a point of major contention and perhaps use an approach that is 4x faster.

Agreed -- I think the language could benefit from a notation that a particular atomic var (field or otherwise) is expected to be contended or not. This is useful information both for programmers maintaining the code later, and for the compiler/module/runtime which together arrange to implement the atomic var. I'm not particularly a fan of pragmas, myself. Maybe we could say contended atomic instead of plain atomic. :-)

gbtitus on 19 Jul 2017

👍1

I'm just catching up on this issue and I have a few things to say.

I think the discussion conflates two issues:
a. using a network AMO in a CAS-loop (at all, and in particular for remote AMOs)
b. the descriptor table approach vs using compression/CMPXCHG16B
1. I'm not as confident as Greg is at guessing which designs here will be better. In that veign, I'd like to see some performance comparison for a some plausible benchmarks:
2. list example as you have it
3. another plausible example that uses a CAS loop
  - one variant using an on statement around the CAS
  - one variant not using an on statement around the CAS

And then, is it possible to write the benchmarks in a way that makes a configurable amount of contention?

Then I'd like to see comparison of these approaches:

the global descriptor table approach
pointer compression and 64-bit CAS
CMPXCHG16B
(if you want, using a sync var in place of the atomic object in the most naive way)

Then, we can gather evidence to suggest if Greg is right about which algorithms to prefer in high contention vs low contention.

Lastly, it might be that pointer compression + CMPXCHG16B is enough; we might be able to reasonably assume that a system will support one or the other.

mppf on 19 Jul 2017

Also, I think an important future step (not necessarily for this issue/PR) is to investigate how one can solve the ABA problem for lock/wait-free multilocale programming in Chapel.

mppf on 19 Jul 2017

I still have a few responses to write on this issue, but none of them are big and I expect we're winding down toward conclusion.

In the meantime, a much more fundamental question occurred to me last night long after I'd left work: do we really expect that people will write the kind of pointer-swinging code in Chapel that they do in, say, C? I'm not asking whether they'll want to write lock-free or wait-free algorithms; certainly they will. What I'm asking is: when they write those algorithms in Chapel will they do so in a C-like way by pointer-swinging class instance refs, or should we expect (and even encourage) them to do it in a more abstract way using other data structure representations?

gbtitus on 19 Jul 2017

Could we at least strike a compromise: compressed pointers built into the language, but GlobalAtomicObject available as a potential module for those who wish to use more than that? Again, it does work, it may as well be offered, right? That way if there is enough demand, perhaps later the idea of integrating it into the runtime can arise.

I'm not averse to this but I think it's more a question for some other reviewer (looking your way, @mppf). My expertise is in mostly lower in the software stack.

gbtitus on 19 Jul 2017

@mppf

another plausible example that uses a CAS loop

I can do the next best thing. I'll try to globalize a current (well-known) data structure to test how they handle massive amounts of contention: Michael Scott's Lock-Free Queue. The queue's performance drops as you add more threads due to excessive contention placed on the head and tail, so at this point it comes down to 'How much'. I can do it based on runtime to test it. However, it should be noted again that we still have some issues with a lack-of-caching, but I can probably remedy this by allocating the queue nodes and caching their descriptors ahead of time. This could be very interesting... (I'll leave the sync variable out though).

LouisJenkinsCS on 19 Jul 2017

Forgot to ask: how would I test the cmpxchg16b thing from Chapel? Or would it have to be called from C?

LouisJenkinsCS on 19 Jul 2017

@LouisJenkinsCS:
You'd have to do it from C and even then, probably only gcc supports it and that not very well. I went looking at that stuff and it's sort of half-supported. There's an __int128 built-in type and the atomic compare-and-swap intrinsics can work on it, but you have to throw a compiler option (-mcx16) to enable that. So I'm not sure about the practicality of depending on it for production code right now.

gbtitus on 19 Jul 2017

Hm... @mppf Any suggestions on how to proceed on testing this? Is there a way to have chapel build with -mcx16 flag added?

LouisJenkinsCS on 19 Jul 2017

I thought cmpxchg16b would just amount to some assembly code snippet (in actual assembly or assembly-in-C). I wouldn't rely on GCC intrinsics for it at this stage. Actually it's used in qthreads and gasnet - see https://github.com/chapel-lang/chapel/blob/master/third-party/qthread/qthread-src/src/threadqueues/nottingham_threadqueues.c#L163 and https://github.com/chapel-lang/chapel/blob/master/third-party/gasnet/gasnet-src/gasnet_atomic_bits.h#L904 - perhaps if we ask nicely it one of these could be pulled out into its own header file.

Anyway I do think the experimentation I described is worthwhile as we decide what needs to be directly supported in the language.

mppf on 19 Jul 2017

Gotcha, I'll try to have it and the results by Friday

LouisJenkinsCS on 19 Jul 2017

Here's one more little bit of possibly useful information. I've been estimating that remotely initiated processor atomics probably can't perform much better than 1/4 the speed of network atomics. It turns out this is really easy to test. On a Cray XC with comm=ugni, we can toggle the ra-atomics test between doing network AMOs and doing remotely initiated processor AMOs just by whether or not we have a hugepage module loaded. In order to use network AMOs we have to register memory with the NIC and in order to do that we have to put the heap on hugepages. No hugepages, no network AMOs. And when we do use remotely initiated processor AMOs it's through a fairly lightweight runtime-internal mechanism, not a Chapel on-stmt. So this is a pretty good analogy to the situations we've been discussing. (In this case the AMO is a XOR rather than a CMPXCHG, but I doubt that matters.) It turns out that network AMOs are 30x faster than remotely initiated processor AMOs for this test, not just 4x or 5x faster. Now granted my 4x was a floor and I haven't dug into why the difference is so much greater than I expected (I have more ideas than time) but I thought I should mention this result.

gbtitus on 19 Jul 2017

👍1

Interesting... while it still doesn't leave me with much room for doubt that remote AMO is better for contention-heavy CAS-retry loop, it does give me hope for descriptor tables. I wonder though, does that 30x increase or decrease as you add more nodes?

LouisJenkinsCS on 19 Jul 2017

Yes, definitely any CAS-retry spin loop should be done on the node where the target datum resides rather than over the network. What's tricky is that if you have an appropriate network AMO and don't need to spin then you should definitely use the network AMO, because remote execution is so much slower. Thus knowing whether contention will occur is really important.

The 30x increases as I add nodes: roughly it's 19x on 2 nodes, 30x on 4 nodes, 34x on 8, and 37x on 16 in a set of runs I just did. I think this just reflects that as the node count rises, a greater proportion of network AMOs have to traverse the network. ra-atomics has a block-distributed target array that spans all the nodes, and parallel tasks on all cores and nodes do atomic XOR updates to randomly selected elements of that array as fast as they can. On XC all these updates are done via network AMOs through the NIC. If the update happens to hit an array element local to the node then there is no network traversal, but we nevertheless do have to use a NIC AMO instead of a processor one. This is because the Aries AMO cache is not coherent with the processor cache(s) and thus, for a given memory location that is an AMO target, all AMOs to that location must be either done with the processor (even if the location is remote) or done with the NIC (even if the location is local).

gbtitus on 20 Jul 2017

@mppf @gbtitus

In such short notice, I'm afraid I can't do too much right now, and I'll have to devote the rest of my time for catching up on my project (I'll be falling behind at this rate, to be honest). The below shows three things: 'GDTWrites' which is a simple write using the descriptor tables (no cache, meaning worst case when using local data), GDTReads which is a simple read using the descriptor tables (no cache), and 128bit-Read which is an emulated read using the only 128-bit hardware instruction available: cmpxchg16b. It is emulated in as an efficient manner as possible:

inline void read128bit(void *srcvp, void *dstvp) {
  uint128_t *src = srcvp;
  uint128_t with_val = *src;
  uint128_t cmp_val;
  uint128_t *cmp = &cmp_val;
  uint128_t *with = &with_val;
  char result;

  __asm__ __volatile__ ("lock; cmpxchg16b (%6);"
                        "setz %7; "
                        : "=a" (cmp->lo),
                        "=d" (cmp->hi)
                        : "0" (cmp->lo),
                        "1" (cmp->hi),
                        "b" (with->lo),
                        "c" (with->hi),
                        "r" (src),
                        "m" (result)
                        : "cc", "memory");

  *(uint128_t *)dstvp = cmp_val;
}

It is as optimized as you can get: regardless of if a CAS fails, it returns what is read at that location in cmp, which is why I have return that, so it is still a wait-free operation in that it only performs 1 CAS operation, and it is the only way to safely read all 128-bits atomically.

The results of the benchmark are surprising. The descriptor table won outright, by far, and as I mentioned before, it scaled. Remember this benchmark is in terms of raw time. As well, the benchmark consisted of many rapid operations in a 'for-each-task-on-each-locale' kind of way.

However, @mppf I promised to deliver on more benchmarks, and I'll try to do so at the earliest time possible, potentially over the weekend I'll be able to do so, but if not then hopefully sometime within next week. Again, I need to catch up on my project first and foremost.

LouisJenkinsCS on 20 Jul 2017

My thoughts on why the Descriptor Table won (so far) is the following...

1) Assuming an equal balance of work from each node, since compression/decompression takes place on the node that owns the object, the computational work load is also balanced. This is why the hardware approach drops in terms of performance because all nodes must jump to a single node. Also as well, in this benchmark work that is added is created locally, so there's no need to jump to another node. Unfortunately my explanation isn't anywhere near as in-depth as @gbtitus but hopefully this can suffice.
2) Reads for the table can occur with a simple lookup, no need to grab locks to add to the table, just a pure lookup. This can be made true for writes as well if its expanded on.

I'll add more if I think of anything else to add here.

LouisJenkinsCS on 21 Jul 2017

Okay, I am going to be doing this incrementally, as this is the easiest way to do this. I also restricted it to ~~8 nodes~~ 32 nodes instead of 64 to save time. What I'm going to do is show results of all operations; read, write, and compareExchange (with and without local on statements). Again, this is important in that both GlobalAtomicObject and CMPXCHG16B handle a write atomic operation differently from a read atomic operation.

Revisited read again, this time also comparing to pointer compression. GlobalAtomicObject is right in the middle and scales well enough.

GlobalAtomicObject shows how much time it takes to read a descriptor from a locale's table (likely remote). PtrCompression shows how long it takes for a normal network AMO + Pointer Decompression. CMPXCHG16B shows how long it takes to jump to the owning locale and perform an atomic read operation.

LouisJenkinsCS on 25 Jul 2017

This one shows atomic 'Write' operations, which vary a lot more.

As can be seen, the overhead of writing for descriptor tables is massive when the descriptor is not cached, but its more of a constant. It shows that CMPXCHG16B is much slower than anything else. As before, GlobalAtomicObject is in between both PtrCompression and CMPXCHG16B.

LouisJenkinsCS on 25 Jul 2017

Would like to note one crucial flaw in the descriptor table... this can't be solved from caching alone as its unavoidable. The flaw is that once the table begins filling up, adding elements to it becomes significantly slower. I'm assuming the issue is with my skip list implementation (in my defense it was my first time ever implementing, and it was based on the 'Skip List Cookbook' one). In reality once its cached on first use, there's no need to be have the skiplist, In the current implementation, performance is so abysmal it couldn't even finish at 1 node within 5 minutes (it starts very fast, then slows to a crawl).

Huge oversight, but still plenty of potential here. I'll try to compose a benchmark reusing memory for the time being.

Edit: I'm wondering if a potential bottleneck could be BigInteger.scan0() since I use that heavily to find the next free bit in the memory pool...

LouisJenkinsCS on 25 Jul 2017

Okay, after a very long hiatus, I believe I can finally make some more progress on this again.

One change I'm going to be making is stripping the skip-list, as searching for it tends to be the limiting factor, and becomes irrelevant later once caching is implemented. So instead, I'm going to regress to the original design of having to 'register' each object in the table, which then returns a tuple of both the object and the index as the descriptor. So...

extern type wide_ptr_t;
type descriptor_t = (wide_ptr_t, int(64));

proc register(obj:objType) : descriptor_t { }

In the original document for the idea, here, the performance for it was phenomenal, but that was because it did not have to worry about hashing objects to find their descriptors. This can still become a reality if caching it performed well enough. I'd like to explore this when I have the time.

Note: The final version likely will not require the user to 'register' anything and just requires proper tracking of the descriptor by runtime. As I'm short on time, this chapel-side version is the best I can do at the moment.

LouisJenkinsCS on 30 Oct 2017

Also as well, I plan to privatize tables across all nodes in the cluster, and use my RCU (trademark-aside) abstraction for allowing updates concurrently. I'll find a decent way to propagate such updates, but overall it will favor reusing memory over rapid creation and discarding of new memory to be registered.

LouisJenkinsCS on 30 Oct 2017

Okay, I think I have a way in which I can handle managing the descriptor table. There will be only one descriptor table, wherein each cell is capable of holding any wide_ptr_t, or rather any 128-bit value. In fact, this can be generalized to allow any size, even variable-sized, data depending on the way I implement it.

First and foremost, I'll need to clarify that the RCU and the DistVector implementation (which I had shown to @mppf in an email with @e-kayrakli) likely must come first to allow for concurrent reads while resizing the tables. The DistVector would serve as, essentially, a parallel-safe resizing distributed array that Chapel current lacks. The theory behind it is that we just create a static region of memory, and then reuse that memory in a new instance (added bonus of not needing to copy data over to the new data structure, just the pointer to the data the original instance used), and then use RCU to install that new instance. In other words, we add more indirection, which proves the adage correct that "All problems in computer science can be solved by another level of indirection".

After we have DistVector, then I can utilize that as the backbone for everything. I'm going to be using a global counter to keep track of the descriptor, although unfortunately this counter needs to be placed on a single node. Luckily this scales well enough, as my data structures have proven. When registering an object, we can simply increment the counter (if we cannot recycle a descriptor) and create a descriptor based on that. If we use this approach, privatization and resizing handled by DistVector, that essentially handles majority of the problem.

Problem: How do we recycle descriptors? After the data is no longer used, what happens to the slot that is no longer in use?

Potential Solution: Maintaining a global resizable 'free-list' using DistVector could work, or we could make use of DistDeque as a FIFO/LIFO way of reusing the descriptors. As in, when a descriptor is no longer used, we could first 'pop' or 'dequeue' first and utilize that directly in our table.

@gbtitus Any thoughts on this?

LouisJenkinsCS on 31 Oct 2017

Update:

I have implemented the DistVector (seen here), although it is suffering from poor performance due to lack of time. The DistVector properly incorporates the proposed RCU model into it's base, and as such allows concurrent reads during writes, although again the data structure is much slower (in the test at least; in the test we intermingle reads and writes, but writes are 1000000x slower than reads, so we can't reliably benchmark reads here...). I need to perform a lot of profiling, as I'm sure I'm suffering from communication from something I thought I'd privatized, and I'm using resizable domains over a much more cache-coherent statically allocated buffer (IIRC resizable domains will allocate new space, copy everything over, and delete the old; this is a huge bottleneck and contradictory to my design of 'reusing the old memory and linking with the new'). I can predict that this model will be a success, and can even scale, as mentioned before, given some extra time. Unfortunately, I no longer have that time, as I must catch back up on my studies.

I'll try to get back to this after I graduate (December 16th), so await some promising results!

LouisJenkinsCS on 26 Nov 2017

Fleshed out the idea more here: https://github.com/LouisJenkinsCS/Chapel-Atomic-Objects/issues/1#issue-278711689

LouisJenkinsCS on 2 Dec 2017

Implemented array w/ RCU, tested on single node (doing more testing in distributed setting later), seems promising. Array only allows dynamic allocation and indexing, but for purposes of descriptor table is all I need.

LouisJenkinsCS on 10 Jan 2018

I can't tell what the units on the chart are. Is higher better? Can you explain why RCU array is between Sync Array and Array?

mppf on 10 Jan 2018

Sure, but first let me add the results for 16 nodes...

I'll update this in a bit describing it (I actually started a blog recently, so I'll write it up there). Also, higher is better, its in terms of operations per second.

LouisJenkinsCS on 10 Jan 2018

I have the early parts describing the RCU algorithm, but it isn't complete yet, keep that in mind. I'll continue updating it and leave another notification when I'm done.

https://louisjenkinscs.github.io/Distributed-Vector/

LouisJenkinsCS on 10 Jan 2018

@LouisJenkinsCS - I havn't finished reading that yet but it reminded me of a problem we're facing. We have associative domains and arrays, but if you add to the domain, it invalidates array references. That has the side effect that parallel operations on the array can't work. This comes up with say making a histogram in parallel when you don't know the keys beforehand. I.e. I'd like to be able to write:

  var D:domain(int);
  var Count:[D] int;
  forall key in input {
    D += key;
    Count[key] += 1;
  }

but in fact this program suffers from a race condition, because appending to the domain could cause the array to resize and then invalidate the reference to Count[key] used by another task in the forall.

mppf on 10 Jan 2018

Interesting example, and looking at it RCU would be able to solve the problem (although I'd hate to make it out to be a kind of solve-all panacea). In the case of allowing indexing to be reads and appending to be a write, it would work albeit extremely slowly. Each iteration would be a full-blown write, so in the end you're better off doing this serially, or better yet doing some kind of map-reduce operation where each task spawns its own map and you combine them all at the end. RCU is for situations where we have 99.9% reads.

Edit:

Actually, originally I planned to allow insertions into the data structure in a way that allowed writes to be relatively cheap (only synchronization on a single sync-var needed) so long as we allocated enough memory ahead of time. I removed it because it wouldn't suit my purposes and it slowed down the reads by 3x. Still, it is possible to achieve reasonable performance that would best coarse-grained synchronization, but it would still reduce overall performance.

LouisJenkinsCS on 10 Jan 2018

Speaking of mappings... I actually had an idea for a distributed hash table based on my own published work I did for the Go programming language. It can actually be perfectly remodeled in a way that fits the distributed aspect (a bit too perfectly, truth be told), and would allow not only concurrent updates, but would scale so well I'm sure it'd put the default sparse distributions to shame (specific data structures definitely would outperform a general data structure at the specific task they are designed for after all)

LouisJenkinsCS on 10 Jan 2018

@mppf

It is done (for now), I'm not sure if I'll go back and add more visual demonstrations (likely I will) but the textual parts are mostly done (likely with many typos).

LouisJenkinsCS on 10 Jan 2018

Also, one more update, after removing 'sync array' as it causes runtime crashes (coarse-grained synchronization won't work at all then, which is even more interesting), I got it to run with 32 nodes...

DistVector is the array, so it wins by quite a bit 🥇

LouisJenkinsCS on 10 Jan 2018

Also, I see what you meant when you said it can be applied to chpl-privatization.c, as yeah the whole idea of allocating it in chunks and performing some RCU update that reuses the same memory (similar to my own array) should work marvelously. I'd do this myself, but I'm 100% certain how to test and verify the results after I finish.

LouisJenkinsCS on 11 Jan 2018

For chpl-privatization.c - I think that having a RCU capability in our standard modules would be a big deal, and if you can get the RCU in by fixing leaks in chpl-privatization, that'd be a good intermediate point (for PRs). I think we can help (myself and Elliot, probably) with verifying any update to chpl-privatization.c.

mppf on 11 Jan 2018

After making corrections to the benchmark, I see that there was extra communications I didn't account for before, but the good news is that the distributed vector is still scaling extremely well (like its consistently 1/3 the performance of the unsynchronized distributed Chapel array), which is extremely good... Like, 300M Op/Sec vs 900M Op/Sec good!

Now, after seeing that QSBR-based RCU yields literally zero-overhead, I realize that right now if I could implement that Chapel-side this would likely bring the data structure actually ahead of (or at least neck-and-neck) with the normal Chapel array, while offering concurrent resizing & indexing.

LouisJenkinsCS on 16 Jan 2018

Update:

I finally got the synchronized array to finish in time by lowering the number of operations per task to 1024 (down from 1M), the results are interesting and exciting.

random_indexing_1024

Edit: Changed labels to be more accurate

LouisJenkinsCS on 21 Jan 2018

So I was thinking about the true applicability and generality of the descriptor table... why stop at wide-pointers? In fact, its funny but the table could be used to allow atomic instances for structures as well. Essentially the descriptor acts as a layer of indirection similar to how a pointer does, so it would be possible to store at each slot a struct or record. In particular, for GlobalAtomicObjects it would be storing a struct/record, a wide_ptr_t. Its interesting actually... gone are wide pointers, replaced by descriptors.

LouisJenkinsCS on 9 Feb 2018

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Should enums support `.indices` and indexing?

bradcray · 3Comments

Internal error when using isSubType on generic type

ben-albrecht · 3Comments

Build REST API layer for Chapel

buddha314 · 3Comments

Should "new initializers" support field initialization prior to super.init()?

bradcray · 3Comments

Overloading assignment (=) on class types

bradcray · 4Comments