go 🚀 - doc: define how sync/atomic interacts with memory model

Comment 1:

Why do you want them to be as weak as possible? Weak atomics are difficult to program.

dvyukov on 14 Mar 2013

👍1

Comment 2:

>We might say, for example, that an atomic.Store writing a value to a memory location
happens before an atomic.Load that reads that value from the memory location. Is that
something we want to say? If not, what do we want to say?
Yes, we want to say that.
Regarding Add/CAS, it should be formulated in in more general terms, along the lines of:
at atomic operation that stores a value (incl ADD/CAS) happens before atomic operation
that reads that value from the memory location (incl ADD/CAS).
However, this does not cover the Dekker synchronization pattern:
X = Y = 0
// goroutine 1
X = 1  // atomic
r1 = Y  // atomic
// goroutine 2
Y = 1  // atomic
r2 = X  // atomic
The rule above allows r1 == r2 == 0, however such outcome is impossible under sequential
consistency (total order).
Dekker pattern is used in tricky mutual exclusion algorithms and in safe object
reclamation schemes. On one hand it's used very infrequently, but on the other hand
there will be no way to implement it at all. That's why I am asking about "as weak as
possible".

dvyukov on 14 Mar 2013

Comment 3:

Let me amend my earlier statement: I want them to be as weak as possible
but still useful, like the current memory model is very weak compared to
what other languages have to say about the topic, but it's still useful.

rsc on 14 Mar 2013

Comment 4:

Current chan semantics are complete wrt the problem the solve.
Atomics won't be complete if they provide weak synchronization guarantees, i.e. some
problems will be unsolvable.
Moreover, sequential consistency is the simplest to specify (C/C++ complexity comes from
exactly weak atomics -- possible reorderings, data dependencies, etc).
Moreover, sequential consistency is easy to understand and explain (remember the recent
discussion and confusion about chan-based semaphores, and the Effective Go example was
incorrect for several years).

dvyukov on 15 Mar 2013

Comment 5:

I think we are using different meanings for the word weak. You have a very
precise meaning in mind. I do not. I just mean "let's not guarantee more
than we need to guarantee to make things useful for people." That's a
general goal, not a concrete proposal.
Dmitriy, if you have time, could you please make a proposal about what you
think the atomics should guarantee? A few sentences here in the issue is
fine.
Thanks.
Russ

rsc on 15 Mar 2013

Comment 6:

_Labels changed: added go1.2maybe, removed go1.1maybe._

robpike on 18 May 2013

Comment 7:

Please clarify more on your meaning of "weak".
The problem with atomic operations is that they are lower level that chans. There are
lots of practically useful things that are possible to build using atomics.
So what do you want to specify:
1. Semantics for majority of simpler use cases (say 95%), and leave the remaining cases
unspecified for now.
or 2. Semantics for all practically useful cases.
I would vote for 2, because sooner or later somebody will ask about the remaining 5% and
answer "you can rely on X guarantee, but we do not want to officially guarantee it" does
not look good. (btw we use that remaining 5% in e.g. WaitGroup).
And 2 is extremely strong, it's not weak in any possible sense of this word.

dvyukov on 30 Jul 2013

Comment 8:

I mean 1, especially if the semantics can be kept to a minumum.
No, that is not the answer. The answer is "if it is not in the memory model
you must not depend on it." If that's still true once we have defined the
semantics, we should rewrite WaitGroup.
I asked you to write a few sentences sketching the semantics you want, but
you haven't done that.

rsc on 31 Jul 2013

Comment 9:

The minimal semantics must be along the lines of:
"If an atomic operation A observes a side effect of an atomic operation B, then A
happens before B".
That's basically it.
Note that not only Load can observe the side effect. Return value from Add and
CompareAndSwap also allows in infer which side effect we observe. Read-modify-write
operations (Add, CAS) first observe side effect of a previous operation on the same var,
and then produce a new side effect. I imply that there is a total order Mv over all
atomic operations that mutate atomic variable V.
Such definition supports use cases like producer-consumer, object publication, etc.
However, such definition does not support trickier synchronization patterns. And frankly
I would not want to rewrite any existing synchronization primitives due to this. In
runtime we a dozen of such "unsupported" cases, I understand that that's different
atomics, but I just want to show that such use cases exist.
Semantics that cover all synchronization patterns would be along the lines of:
"There is a total order S over all atomic operations (that is consistent with
modification orders M of individual atomic variables, happen-before relations,
bla-bla-bla). An atomic operation A happens after all atomic operations that precede A
in S".
The trick here is that you usually can not infer S (w/o any pre-existing happens-before
relations). The only (?) cases where you can infer a useful information from S are:
1. When atomic operations A and B operate on the same var, and this makes this
definition a superset of the first definition (S is consistent with all Mv).
2. When it's enough to know that either A happens-before B or vise versa (this is true
for any pair of atomic operations due to total order S).

dvyukov on 8 Aug 2013

Comment 10:

>"If an atomic operation A observes a side effect of an atomic operation B, then A
happens before B"
s/A happens before B/B happens before A/

dvyukov on 8 Aug 2013

Comment 11:

How about this:
"Package sync/atomic provides access to individual atomic operations. These atomic
operations never happen simultaneously. That is, for any two atomic operations e1 and
e2, either e1 happens before e2 or e2 happens before e1, even if e1 and e2 operate on
different memory locations."
Is that a good idea? Is it too strong? Is it more than we need, less than we need? Is it
going to be too hard to guarantee on systems like Alpha? I don't know. But at least it
is simple and I understand what it is saying. That's different than understanding all
the implications.

rsc on 13 Aug 2013

Comment 12:

As per offline discussion, your "either e1 happens before e2 or e2 happens before e1"
definition looks good if data races are prohibited. Otherwise, racy accesses allow to
infer weird relations, e.g. that a Load happens-before a Store:
// thread 1
x = 1
atomic.Load(&whatever)
y = 1
// thread 2
if y == 1 {
  atomic.Store(&whatever2)
  println(x) // must print 1
}
This means that Load must execute release memory barrier and store -- acquire memory
barrier. Most likely this will make implementations of atomic operations costlier.

dvyukov on 14 Aug 2013

Comment 13:

Okay, maybe that's a bad definition then (I was just rephrasing yours, I
believe). It sounds like it is too strong. Are loads and stores the only
problem. Is this any better?
"""
Package sync/atomic provides access to individual atomic operations. For
any two atomic operations e1 and e2 operating on the same address:
  - if e1 is not a Load, e2 is a Load, and e2 observes the effect of e1, e1
happens before e2.
  - if e1 is a Store, e2 is not a Store, and e2 observes the effect of e1,
e1 happens before e2.
  - if neither operation is a Load or Store, either e1 happens before e2 or
e2 happens before e1.
"""

rsc on 14 Aug 2013

Comment 14:

Why don't you want to give up on data races?
We probably can ensure atomicity of word accesses in gc w/o sacrificing important
optimizations. But:
1. We can not ensure visibility guarantess, e.g. if a var is registrized in a loop, and
at this point racy accesses become almost useless.
2. Races are definitely not safe for maps and slices.
3. Most likely we can not ensure any guarantees for races in gccgo (not sure what gcc
java does here).
4. I do not see any benefits of allowing data races. Currently there is runtime cost for
calling atomic.Load instead of doing plain load. But this must be addresses by providing
better atomic operations with compiler support (if that becomes the bottleneck).
Allowing data races instead to solve this looks completely wrong.
If we prohibit data races, it would make reasoning about atomic operations much much
simpler.

dvyukov on 15 Aug 2013

Comment 15:

There are 2 litmus tests for atomic operations:
1.
// goroutine 1
data = 42
atomic.Store(&ready, 1)
// goroutine 2
if atomic.Load(&ready) {
  if data != 42 {
    panic("broken")
  }
}
2.
// goroutine 1
atomic.Store(&X, 1)
r1 = atomic.Load(&Y)
// goroutine 2
atomic.Store(&Y, 1)
r2 = atomic.Load(&X)
// afterwards
if r1 == 0 && r2 == 0 {
  panic("broken")
}
As far as I see you definition does not work for 2.

dvyukov on 15 Aug 2013

Comment 16:

For 2 to work, atomic operations (including loads and stores) must form total order.
Probably the following 2 clause definition can do:
(1) If an atomic operation A observes an effect of an atomic operation B, then B happens
before A.
(2) All atomic operations form a total order that is consistent with happens-before
relations, modification orders of individual atomic variables and intra-goroutine order
of operations.
(2) implies that values returned by atomic operations and their side effects are
dictated by the total order. I am not sure whether it's obvious or not.
Note that (2) does not introduce new happens-before relations. Even if you somehow infer
that A precedes B in total order (e.g. by using racy memory accesses), this gives you
nothing.

dvyukov on 15 Aug 2013

Comment 17:

You wrote "why don't you want to give up on data races?". That's not what I
am trying to do. I am trying to avoid making atomic.Load and atomic.Store
unnecessarily expensive.

rsc on 15 Aug 2013

Comment 18:

It is more difficult to do in presence of data races.
W/o data races the following looks OK:
"Package sync/atomic provides access to individual atomic operations. These atomic
operations never happen simultaneously. That is, for any two atomic operations e1 and
e2, either e1 happens before e2 or e2 happens before e1, even if e1 and e2 operate on
different memory locations."

dvyukov on 15 Aug 2013

Comment 19:

Okay, then define how to prohibit data races.

rsc on 15 Aug 2013

Comment 20:

It's simple. We need to replace:
--------------------------
To guarantee that a read r of a variable v observes a particular write w to v, ensure
that w is the only write r is allowed to observe. That is, r is guaranteed to observe w
if both of the following hold:
w happens before r.
Any other write to the shared variable v either happens before w or after r.
This pair of conditions is stronger than the first pair; it requires that there are no
other writes happening concurrently with w or r.
Within a single goroutine, there is no concurrency, so the two definitions are
equivalent: a read r observes the value written by the most recent write w to v. When
multiple goroutines access a shared variable v, they must use synchronization events to
establish happens-before conditions that ensure reads observe the desired writes.
The initialization of variable v with the zero value for v's type behaves as a write in
the memory model.
Reads and writes of values larger than a single machine word behave as multiple
machine-word-sized operations in an unspecified order.
--------------------------
with:
--------------------------
If there is more than one such w, the behavior is undefined.
The initialization of variable v with the zero value for v's type behaves as a write in
the memory model.
--------------------------

dvyukov on 15 Aug 2013

Comment 21:

Is this converging?

robpike on 20 Aug 2013

Comment 22:

Difficult to say. I would not hurry with this. Moving to 1.3.

_Labels changed: added go1.3, removed go1.2maybe._

dvyukov on 20 Aug 2013

Comment 23:

_Labels changed: removed go1.3._

robpike on 21 Aug 2013

Comment 24:

_Labels changed: added go1.3maybe._

rsc on 27 Nov 2013

Comment 25:

_Labels changed: added release-none, removed go1.3maybe._

rsc on 4 Dec 2013

Comment 26:

_Labels changed: added repo-main._

rsc on 4 Dec 2013

Is there an update to this?

I don't want to stick my beak in, but with the race detector recognising atomics I think there may be a danger that the semantics of this package will be determined by tools and usage. As Dmitry said the WaitGroup implementation assumes stronger guarantees than are present. I would personally suspect that almost any use of the atomic package is assuming stronger guarantees than are currently available.

fmstephe on 23 Dec 2014

Just to clarify what I said above about the tools defining the semantics for the atomic package implicitly.

https://play.golang.org/p/E8KPV7pvov

When I run 'go test -race' on a function written as main() above, the race detector points to a data-race on value.

https://play.golang.org/p/HPc-1Hwvet

When I run 'go test -race' on this version it detects no race at all. It seems that the race detector has concrete semantics attached to the atomic package, including memory ordering etc. Or have I missed something?

Apology that the examples can't run on the playground, without GOMAXPROCS > 1.

fmstephe on 24 Dec 2014

Yes, race detector gives sync/atomic concrete semantics. In terms of C/C++ memory model every atomic operation become memory_order_acq_rel.

dvyukov on 24 Dec 2014

@RLH, @aclements, and I spent a few hours rereading the discussion here and trying to understand what we might do. Certainly it's too complex for Go 1.6.

A large part of the problem seems to be that the program order happens-before and the atomic operation happens-befores interact in bad ways. It may not be a good idea to treat them as being the same kind of relationship. If give atomic operations a total order in happens-before that is the same happens-before as the per-goroutine program order (and you allow racing reads), then you end up needing to guarantee the program in Dmitriy's Comment 12 above, which imposes serious constraints on the atomic implementation but also on the compiler.

One way to avoid this is to disallow programs with races: if you have a race anywhere in your program, all bets are off for the entire program. I am uncomfortable with this, because I don't think it's terribly good guidance. If I have a race-free program and I drop in an unlocked, non-atomic counter and increment it in a few different goroutines and print it at the end of the program, I believe the damage caused to the program by the counter race should be somewhat limited. The rest of the program execution should be unaffected. Saying the whole program is now invalid does not seem tenable to me, especially in large-scale programs.

Another way to avoid this is to say that the happens-before arrow for atomics only exists from a Store to a Load observing the effect of that store (or more generally from one atomic operation to another observing the effect of the first). That would allow Dmitriy's program 2 in Comment 15 to panic(broken). That is, independent reads of independent writes would not need to be consistent with a global total order. Maybe we should allow that, but we've put barriers in even our x86 atomic implementations to disallow it, so empirically we seem not to want to allow that.

Rick raised the point that we don't want the model to be significantly weaker than the guarantees unavoidable on the x86. If the model were weaker than the x86, then people would write Go programs for the x86 and blame Go when they didn't run on other systems. (I observe that this is similar to what happened with Alpha and Itanium.) And if we boost the other systems to match the x86's guarantees then it's important not to overshoot: it's just as bad for people to be writing programs that only work correctly on ARM because we over-strengthened the atomics there.

As I mentioned above, it may be that the order of operations in a single goroutine and the order of atomic operations are simply different kinds of orders, and it is not tenable to express them with a single happens-before relation. Austin suggested that perhaps the model needs to be able to express that operations within a goroutine may appear to be in a different order to that goroutine as to another. The current model tries to do that with the idea of things happening concurrently, but perhaps the current formulation does not capture enough of the subtlety.

It may be that we need to abandon happens-before and write something more operational. I tried in the memory model to guarantee as little as possible, so that the space of options for future formulations and revisions was as large as possible. So it is plausible that another formulation could still be backwards compatible with what's there today.

It may also be that we should just continue to guarantee as little as possible. For example, I am wondering if we should say only that:

If the effect of atomic operation A is observed by atomic operation B, then A happens before B.

That allows 0,0 from Dmitriy's Comment 15 program 2 (independent reads of independent writes). Is that such a bad thing? That would imply that we don't need any barriers in atomic operations on x86, because TSO already guarantees what I just wrote. Can we safely take the x86 barriers out? What bad things would happen if we did? @dvyukov?

rsc on 7 Jan 2016

There are three practically interesting synchronization patterns wrt memory ordering (in increasing order of required synchronization):

(A) "Stop flag", no associated data, requires no memory ordering (only atomicity and visibility):

// goroutine 1
for atomic.Load(&stop) == 0 {...}

// goroutine 2
atomic.Store(&stop, 1)

(B) Producer-consumer, there is associated data so it requires acquire/release pair, but synchronization is weak and causal (no barriers on x86) (this is comment 15, program 1):

// goroutine 1 (producer)
data = 42
atomic.Store(&ready, 1)

// goroutine 2 (consumer)
if atomic.Load(&ready) != 0 {
    assert(data == 42)
}

(C) Dekker/Peterson mutual exclusion, no associated data per se (but can be combined with (B)), strong synchronization (loads and stores allow to achieve global consensus, requires barriers on x86) (this is comment 15, program 2):

// goroutine 1
atomic.Store(&X, 1)
r1 = atomic.Load(&Y)

// goroutine 2
atomic.Store(&Y, 1)
r2 = atomic.Load(&X)

// afterwards
if r1 == 0 && r2 == 0 {
  panic("broken")
}

And IRIW (independent reads of independent writes) is a completely different pattern: goroutine 1 writes X, goroutine 2 writes Y, goroutine 3 loads X and Y, goroutine 4 loads Y and X. Goroutines 3 and 4 must agree on order of stores to X and Y. This is not practically interesting.

I afraid we can't discard case (C). That's a fundamental pattern. We used it in sigqueue (with sig.mask and sig.kick being A and B); in sync.Waitgroup (with wg.counter and wg.waiters being A and B); in runtime semaphores (with addr and root.nwait being A and B); in several places in netpoll and runtime scheduler. Things will break if we take out that barrier.

I have three points regarding data races.
(1) Giving any meaningful semantics to data races will compromise performance and is close to impossible for gccgo and llgo. It is not as "you either see old value or new value". Compilers generally rely on absence of data races and there is at least one known case when gcc generates non-atomic stores for word-sized variables (stores of 64-bit immediates are done as two separate stores). Or consider that we replace a memset loop with REP STOSB, that will break any atomicity. Or consider the following code:

func foo(x int) {
  *p = x
}

you would expect this to be atomic, right?
Now add:

func bar(y int) {
  y |= *p & ^flags
  foo(y)
}

after inlining this becomes:

y |= *p & ^flags
*p = y

which can be reasonably compiled as (same 2 memory access, less register pressure):

*p &= ^flags
*p |= y

That's a real example from Linux kernel. It is basically impossible to prove that something like this won't happen to any of your stores.
Now you could say that, "value of a variable that is subject to data races becomes undefined but other variables are unaffected".
This is not necessary true as well. Consider that a racy variable is involved in index operation (do a bounds check, then evict the value due to register pressure, reload, index); or in control flow (jump tables, uh), or that compiler assumes that a variable contains the value that we've just stored into it, e.g.:

x = a
y = a

compiled as:

x = a
y = x

because there is register pressure, a is spilled or discarded, and recomputing address of a is more expensive than recomputing address of x.

Concurrent compacting GC can also have unpredictable effects in presence of data race. E.g. Sapphire can produce chatter stores in presence of races.

(2) Giving only part of guarantees (only word-sized variables) is not useful. Especially for large-scale programs. It is not that you know that you have only one data races in a large-scale program and it happens to be on a stats counter. You don't know where are the races and how many are there. If you forgot a mutex lock, you can as well have a data race on a map. Then all bets are off.

(3) Giving no guarantees to programs with races on specification level does not prohibit us from giving any guarantees as QoI attribute. In fact, I think it was a mistake for Java to mix these two levels. Lots of Java code is now sprinkled with data races. Reasoning about such code is hard. Race detection tooling is impossible. And authors are formally right, they are relying on language spec.

dvyukov on 8 Jan 2016

Thanks for the reply. I'll take some time to study it.

rsc on 8 Jan 2016

I would be careful about tying the go MM to hardware memory models - such
as x86/amd64 etc.

When reasoning about multi-threaded safety of at the source code level the
hardware memory model is usually irrelevant. Not talking about compiler
writers here.

As an example we can look at Java, whose mm is weaker than x86/amd64. The
following classical piece of code is unsafe

public class Test {
    public int constructorValue private static Test instance;
    public Test() { constructorValue=42; }
    public static Test getInstance() {
        if (instance == null) {
            synchronized (Test.class) { 
                if (instance == null) {
                    instance = new Test();
                }
          }
       } return instance; 
    }
 }

We know that it is possible to get an instance on Test where
constructorValue isn't 42. This is because when we construct the Test
object, the non-null write to instance may become visible before the write
to classValue. This is nothing new.

But, the important thing about this example is that it is broken even on
x86/amd64. The strong ordering of writes guaranteed by x86/amd64 doesn't
help us because the most dangerous actor here is the compiler who can (and
often will) perform much more aggressive reorderings than even an alpha
chip.

Of course if the compiler isn't running aggressive optimisations you may
have a situation where x86/amd64 memory model is saving people from bugs.

The compiler has far more opportunities for read/write reordering, so I
don't think hardware memory models are a good guide for what should be
presented to the application programmer.

F

On Thu, Jan 7, 2016 at 6:37 PM, Russ Cox [email protected] wrote:

@RLH https://github.com/RLH, @aclements https://github.com/aclements,
and I spent a few hours rereading the discussion here and trying to
understand what we might do. Certainly it's too complex for Go 1.6.

A large part of the problem seems to be that the program order
happens-before and the atomic operation happens-befores interact in bad
ways. It may not be a good idea to treat them as being the same kind of
relationship. If give atomic operations a total order in happens-before
that is the same happens-before as the per-goroutine program order (and you
allow racing reads), then you end up needing to guarantee the program in
Dmitriy's Comment 12 above, which imposes serious constraints on the atomic
implementation but also on the compiler.

One way to avoid this is to disallow programs with races: if you have a
race anywhere in your program, all bets are off for the entire program. I
am uncomfortable with this, because I don't think it's terribly good
guidance. If I have a race-free program and I drop in an unlocked,
non-atomic counter and increment it in a few different goroutines and print
it at the end of the program, I believe the damage caused to the program by
the counter race should be somewhat limited. The rest of the program
execution should be unaffected. Saying the whole program is now invalid
does not seem tenable to me, especially in large-scale programs.

Another way to avoid this is to say that the happens-before arrow for
atomics only exists from a Store to a Load observing the effect of that
store (or more generally from one atomic operation to another observing the
effect of the first). That would allow Dmitriy's program 2 in Comment 15 to
panic(broken). That is, independent reads of independent writes would not
need to be consistent with a global total order. Maybe we should allow
that, but we've put barriers in even our x86 atomic implementations to
disallow it, so empirically we seem not to want to allow that.

Rick raised the point that we don't want the model to be significantly
weaker than the guarantees unavoidable on the x86. If the model were weaker
than the x86, then people would write Go programs for the x86 and blame Go
when they didn't run on other systems. (I observe that this is similar to
what happened with Alpha and Itanium.) And if we boost the other systems to
match the x86's guarantees then it's important not to overshoot: it's just
as bad for people to be writing programs that only work correctly on ARM
because we over-strengthened the atomics there.

As I mentioned above, it may be that the order of operations in a single
goroutine and the order of atomic operations are simply different kinds of
orders, and it is not tenable to express them with a single happens-before
relation. Austin suggested that perhaps the model needs to be able to
express that operations within a goroutine may appear to be in a different
order to that goroutine as to another. The current model tries to do that
with the idea of things happening concurrently, but perhaps the current
formulation does not capture enough of the subtlety.

It may be that we need to abandon happens-before and write something more
operational. I tried in the memory model to guarantee as little as
possible, so that the space of options for future formulations and
revisions was as large as possible. So it is plausible that another
formulation could still be backwards compatible with what's there today.

It may also be that we should just continue to guarantee as little as
possible. For example, I am wondering if we should say only that:

If the effect of atomic operation A is observed by atomic operation B,
then A happens before B.

That allows 0,0 from Dmitriy's Comment 15 program 2 (independent reads of
independent writes). Is that such a bad thing? That would imply that we
don't need any barriers in atomic operations on x86, because TSO already
guarantees what I just wrote. Can we safely take the x86 barriers out? What
bad things would happen if we did? @dvyukov https://github.com/dvyukov?

—
Reply to this email directly or view it on GitHub
https://github.com/golang/go/issues/5045#issuecomment-169748442.

fmstephe on 8 Jan 2016

My two cents from the peanut gallery: It might be worth expanding the sync/atomic API to include explicit ordering annotations. Much of the discussion in this bug resolves around the question "what is the right guarantee for each operation in sync/atomic?" I am not convinced there exists a good answer to this question, as Dmitry's examples have shown. C++ avoids this question entirely by having each atomic operation explicitly specify its ordering requirements. That approach may seem more complex, but it has benefits:

Algorithms that use atomics are extremely subtle. Ordering annotations document exactly what the author assumes, which helps the reader reason about correctness. Hans Boehm makes a similar argument here:

[Using explicit ordering constraints] makes it harder to ignore ordering issues. Ignoring them is almost always a mistake at this level. A number of existing atomics interfaces suffer from either sloppiness in dealing with memory ordering, or from making it impossible to state some useful ordering constraints. Empirically, this has resulted in many bugs.

The runtime can use the weakest (fastest) hardware primitives available that satisfy the programmer's constraints. This benefits the program at the cost of some implementation complexity in the runtime. Note that the runtime need not be as complex as it may seem -- it's perfectly reasonable for the runtime to implement all operations using the strongest ordering (sequential consistency) by default, then later add faster implementations for a small number of weaker annotations as needed.

Sections 5 and 6 of this paper are also good reading.

tombergan on 25 Mar 2016

@rsc, have you had time to study this?

quentinmit on 10 Oct 2016

Yes, I spent a while on this last winter but didn't get a chance to write it up properly yet. The short version is that I'm fairly certain the rules will be that Go's atomics guarantee sequential consistency among the atomic variables (behave like C/C++'s seqconst atomics), and that you shouldn't mix atomic and non-atomic accesses for a given memory word.

rsc on 10 Oct 2016

👍9 ❤1

@rsc Would it be possible to push this to 1.9 Early?

carl-mastrangelo on 9 Feb 2017

👍2

Possibly, although it shouldn't matter given the statement above. If the statement above misses some important detail you need to know, please let me know what that detail is. :-)

rsc on 10 Feb 2017

If you're desperate enough to break out the atomics, you may as well be very explicit about ordering issues. So: would it be possible to leave the current sync/atomic functions out of ref/mem entirely (i. e., leave them per se at the equivalent of C relaxed atomics) and instead add explicit barrier "functions" with special semantics to sync/atomic?

This would pick up on @tombergan's explicitness suggestion, except that, from the client's perspective, not the current atomics would be annotated, but explicit/verbose order barriers added. From the client's perspective, there would be a couple of new functions:
go LoadBarrier(addrs ...interface{}) StoreBarrier(addrs ...interface{})
which don't let loads from and stores to, respectively, the specified addresses go past them.

The (B) and (C) examples of @dvyukov would then become
````go
// goroutine 1 (producer)
data = 42
atomic.StoreBarrier(&data, &ready)
atomic.Store(&ready, 1)

// goroutine 2 (consumer)
if atomic.Load(&ready) != 0 {
atomic.LoadBarrier(&data, &ready)
assert(data == 42)
}
andgo
// goroutine 1
atomic.Store(&X, 1)
atomic.StoreBarrier(&X)
atomic.LoadBarrier(&Y)
r1 = atomic.Load(&Y)

// goroutine 2
atomic.Store(&Y, 1)
atomic.StoreBarrier(&Y)
atomic.LoadBarrier(&X)
r2 = atomic.Load(&X)

// afterwards
if r1 == 0 && r2 == 0 {
panic("broken")
},
````
respectively.

From the implementor's point of view, it would have to be ensured that all goroutines sharing barriers for some memory location would observe each others' side effects in an order consistent with the program order of the barriers, and that within each goroutine, no reordering across barriers occurs for the specified locations.

This approach should be strong enough to distinguish (in C terminology) between relaxed and non-relaxed orders, plus maybe a bit more (it's late and I'd have to think about it for some time longer), and would, at the cost of some verbosity, be more flexible than @rsc's approach of specifying all atomics to be the equivalent of C sequentially consistent.

Of course, the compiler support part of this approach might be rather involved.

TheCount on 4 Apr 2017

👍2

Sorry, but no.

rsc on 5 Apr 2017

Is the memory model "static"? If any changes were made then, would it not be considered a breaking change?

Put differently, if I write code using atomics, will the behavior differ between go versions?

poy on 8 Apr 2017

👍2

@apoydence There is no intent for the behavior of the sync/atomic package to change between Go versions. However, since that behavior is not precisely defined--that is what this issue is about--I don't see how that helps you.

ianlancetaylor on 8 Apr 2017

If guaranteeing sequential consistency proves to be too expensive, should weaker atomic operations be provided in a sub-repository? Given that users of atomics are likely sensitive to performance.

Benchmark on a single producer single consumer queue. Low-level constraints corresponds to absence of sequential consistency, which breaks the second example of Comment 15.

Picture from CppCon 2014: Jeff Preshing "How Ubisoft Develops Games for Multicore - Before and After C++11"

Sinacam on 9 Apr 2017

To be honest, I don't think that people who need to worry about the slow down inherent in using sequential consistency for atomic operations are going to choose Go as their programming language.

ianlancetaylor on 11 Apr 2017

@ianlancetaylor This may be true today, but 10 years from now the CPU in your tablet will have less than 100 opcodes, 1024 hardware threads and minimal implicit consistency. At this point, almost everyone will need to worry about slowdowns inherent with unnecessary syncing.

Fortunately, this does not mean everybody has to understand some complicated memory model. But there has to be some way to provide inter-thread communication efficient in this sense to the mainstream programmer (doesn't have to be via atomics, could also be via easier-to-understand user-defined weakenings of the channel concept, etc., etc.). As of today, this is not possible within Go. If it remains impossible in Go, at some point mainstream programmers are indeed not going to choose Go as their programming language.

TheCount on 12 Apr 2017

Of course I agree that we need efficient inter-goroutine communication. That is one of the driving forces behind the development of Go in the first place. But as even you suggest, that does not imply that the right approach is to have atomic operations with relaxed memory models in the standard library.

For example, it is already the case that channel operations do not require sequential consistency.

ianlancetaylor on 12 Apr 2017

Agreed. It totally escaped me that the channel implementation does not need sequential consistency. In this light, giving sync.atomic seq-cons semantics makes perfect sense: they complement channels.

TheCount on 14 Apr 2017

Given this issue, I don't think the implementations of sync.RWMutex, sync.WaitGroup and atomic.Value are well-defined as they heavily rely on atomic operations.

mrwonko on 22 Jun 2017

The semantics of RWMutex are well-defined since they are documented in the Memory Model doc. The implementation's job is to provide those semantics, and we're confident that it does. We can't justify that assertion from the memory model alone, but that's OK. It's OK that for reasoning about that specific implementation, we know how the underlying pieces work.

You're right that sync.WaitGroup is missing from the memory model doc, but it establishes all the happens-before edges one would expect.

rsc on 22 Jun 2017

It seems that without "happens before" semantics it is very hard if not impossible to write user defined lock-free structures that have any real value...
The Java memory model defines "happens before" for all of the strong atomic operations, and this allows for advanced lock-free structures/queues/etc.

robaho on 13 Aug 2018

Quoting https://github.com/golang/go/issues/5045#issuecomment-252730563 above:

Yes, I spent a while on this last winter but didn't get a chance to write it up properly yet. The short version is that I'm fairly certain the rules will be that Go's atomics guarantee sequential consistency among the atomic variables (behave like C/C++'s seqconst atomics), and that you shouldn't mix atomic and non-atomic accesses for a given memory word.

It's been two years and I still haven't gotten a chance to write this up properly, but it's getting close to the top of my work queue. The two years of on-again, off-again discussions have essentially confirmed this approach. People already write code assuming basically these semantics - I can't imagine doing something weaker than this from the "what you can rely on" point of view. So I don't think anyone should be blocked on this.

rsc on 14 Aug 2018

👍7

At two recent go conferences I have heard very high quality talks on low level concurrency in Go.

One was about a specific lock free algorithm used by prometheus. The other looking into how Go's race detector works. Both of these talks were of a very high quality and the audience showed a strong interest in the topics presented.

This demonstrates that there is an eager audience for documenting the semantics of the atomic package.

It also clear that while low-level atomics are niche and difficult to use they are already being used in production. The prometheus library is used extensively in a lot of Go systems.

If I do a search for source code importing atomic in just the packages I have on my work laptop I get 60 hits, only 3 of those are in code we authored ourselves (and we don't use prometheus).

I only write this to provide anecdotal data which suggests that documenting the semantics of the atomic package is probably very valuable. I'm not sure if that's a helpful comment or not.

fmstephe on 31 Jan 2019

specific lock free algorithm used by prometheus

Would be interesting in watching this talk, do you know if there is a video anywhere?

nhooyr on 2 Mar 2019

@nhooyr https://docs.google.com/presentation/d/1wuNNW-g6v8qizIc_IxAGZTj-49TODKF0TYddTA1VDUo/edit#slide=id.p

jasonwatkinspdx on 9 Mar 2019

👍2

_Please take this statement with some grain of salt, since my knowledge is more theory than practice._

I think that the lock-free algorithm used in the Prometheus talk linked by @nhooyr may exhibits undefined behaviour (a data race) on non-x86 hardware, specifically ARM with multiple processors and goroutines.

https://en.wikipedia.org/wiki/Memory_ordering specifies that x86 (but not ARM) prevents reordering of loads/stores after atomic operation (sync/atomic Load/Store/Inc/etc...)

Linux seems to even provides semantics to issue atomic_*_{acquire, release}() ops : https://www.kernel.org/doc/Documentation/atomic_t.txt

I wouldn't surprised if in Linux the rationale was to provide users of atomic_* operations to not issue an (useless) memory barrier after atomic ops (useless only on platform where atomic op are already serializing)

Golang source code seems to not explicitly issue memory barrier instructions on ARM after atomic-Xadd, but my asm knowledge for ARM is zero. On x86, Golang seems to only issue "lock"-prefixed xaddl (without barriers, but locked instructions should already be synchronising on x86).

If there Is confirmation that ARM can reorder atomic with load/stores, and that Golang does not explicitly add memory barrier instructions on atomic ops, maybe the histogram.go file from Prometheus is indeed subject to data race on ARM.

This would be a hint that in the Go language are missing either
1) user-issuable memory barrier
2) atomic_*_{relaxed, acquire,release} instructions ?

Maybe those two missing from the language is for the better, since missuses can cause sporadic data race.

In all cases, maybe the Go documentation should explicit if atomic_*() ops are either :
1) undefined behaviour when used trans-goroutines, on the specific matter of happens-before relationships, i.e. "it depends on the platform" (x86 will work ; ARM will bug)
2) defined behaviour : "we explicitly issue memory barrier instructions in your back" (heavy impact ?)
3) defined behaviour if used with acquire/releases semantics (missing from Go)

On the specific subject of user-issuable memory barrier, I guess that for any specific goroutine, a lock+unlock (successive) of an useless atomic.Mutex, only accessible from the specific goroutine, uncontended, and alone on it's cache-line could maybe be a hacky way of issuing a full memory barrier in Go.

On another hand, I understand that the rationale behind Go is to not try to synchronize your coroutines yourself, and instead use sync and sync/atomic which may or may not define the happens-before relationship.

For the "histogram.go" files, I did not benchmarked alternatives, but I'm a little sceptical on the rationale behind the "we do not want a channel to synchronize tons of observations from multiple goroutines".
To me it seems that if multiples coroutines have tons of observations, which they atomic_add() themselves (the proposed solution of the talk speaker), it would incur some heavy cache-line bouncing.
I guess that atomic_*() instructions, even wait-free, are not cheap if your CPU core is not already an owner of the cache-line and if there is heavy contention.
I guess that only one goroutine in charge of those atomic_inc() could benefit from an uncontested cache-line on the histogram, and could receive all observations from one channel (or even multiple channels to prevent contention -if it can happens- on the unique channel)

For the part I call "how to get a snapshot of the histogram to Write() it elsewhere, without blocking all observations" I guess that the speaker of the talk could keep his technique of having two versions of the same data structure (a hot and a cold one), and synchronizing the swap (upon Write-to-elshewhere) and access (upon obervations) with a sync.RWMutex.

The Write-to-elsewhere would Lock()+swap+Unlock()
The observations would RLock()+atomic.Inc()+Unlock()
And the atomic.Inc() could be downgraded to downgraded to a non-atomic one if there was only one goroutine.

On a personnal preference, I would like to have access to low-level semantics from Go (atomic operations, relaxed or not, and barriers)
It would gives options to try and benchmark code using them compared to order-synchronisation primitives provided by Golang (somewhat only mutex + channels).

folays on 27 Mar 2019

@folays The purpose of this issue is to document exactly what the Go functions in the sync/atomic package do.

Golang source code seems to not explicitly issue memory barrier instructions on ARM after atomic-Xadd, but my asm knowledge for ARM is zero.

The implementation of Xadd on ARM can be found at https://golang.org/src/runtime/internal/atomic/atomic_arm.go#L41 . As you can see, it relies on Cas. Cas is implemented either by the Linux kernel, or by https://golang.org/src/runtime/internal/atomic/asm_arm.s#L7 . In both cases the appropriate DMB (data memory barrier) instruction is used.

On a personnal preference, I would like to have access to low-level semantics from Go (atomic operations, relaxed or not, and barriers)

That is a fine thing to discuss, but that discussion is not appropriate for this issue. Please use the golang-nuts group for that. Thanks.

ianlancetaylor on 27 Mar 2019

I am trying to do some optimizations using ARMv8.1 instructions but face the difficulty to choose right instructions due to missing memory ordering definition.

There might be some other issues caused by missing memory ordering definition. Potentially, it can cause misalignment between Go application developers and Go toolchain developers. And the behavior may be not portable among different architectures. For example:

// goroutine 1
atomic.Store(&X, 1)
r1 = Y
// goroutine 2
atomic.Store(&Y, 1)
r2 = X
// afterwards
if r1 == 0 && r2 == 0 {
  panic("broken")
}

Please refer to the runnable test code. https://github.com/zhangfannie/go/blob/testatomic/test.go
Above code won’t panic on x86, but may panic on arm64. Without an explicit memory ordering definition, it is hard to tell whether the application code is wrong, Golang x86 is wrong or Golang arm64 is wrong.
Can we put some explicit definition in the document? So that people can have unified understanding of the memory order. And it will be helpful for us to select the right instructions when implementing the Golang backend.
If we think current Golang implementation is correct and satisfies most use scenarios, it might be reasonable that we just refer to the most relaxed behavior among different backends for each operations. So that we can have a memory order definition without breaking existing applications.

zhangfannie on 28 Mar 2019

@zhangfannie You have a race with f1 reading Y and vice versa; the code is wrong on all platforms, even if some platform might not manage to trigger the race.

tv42 on 28 Mar 2019

👍1

r1 and r2 need to be read by atomic.load(). But with a more explicit memory model that might not be needed - depends on if the compiler understands the semantics of the atomic methods.

robaho on 28 Mar 2019

@tv42 @robaho @rsc The data race in the example code id intended. Actually, this is a modified version of Deller pattern from https://github.com/golang/go/issues/5045#issuecomment-66076296. Here what I want to say is, we do not know whether this code is wrong or not without a clear order definition.

The code is wrong if atomic.Store follows store release semantic and atomic.Load follows load acquire semantic. The load should be changed to atomic.Load.
The code can be correct if the memory order definition requires full barriers inserted before and after atomic.Store.

Having a memory model defined can always be better than guessing the behavior of the toolchain.

zhangfannie on 29 Mar 2019

Actually, since the program must be sequentially consistent in the absence of threads, the program should never panic.

robaho on 29 Mar 2019

@zhangfannie The only happens-before relationship between the last iteration r1[i] = Y[i] and main is through wg.Wait(). There are no atomic operations after that last (non-atomic) assignment. There is no happens-before relationship at all between the last r1[i] = Y[i] and any part of f2. Hence, f2 reading the last r1[i] is a race. This follows from current-day contents of https://golang.org/ref/mem . The comment you tried to link to uses atomic loads; your code does not.

tv42 on 29 Mar 2019

Even with atomic loads there is a data race, in that the only thing you can reason is that the values will never be 0 (unless they wrap around)

robaho on 29 Mar 2019

Change https://golang.org/cl/185737 mentions this issue: [RFC]sync/atomic: specify the memory order guarantee provided by Load/Store

gopherbot on 11 Jul 2019

Change https://golang.org/cl/189417 mentions this issue: sync/atomic define sync/atomic memory models

gopherbot on 8 Aug 2019

Can we just add "Go's atomics guarantee sequential consistency among the atomic variables (behave like C/C++'s seqconst atomics). You shouldn't mix atomic and non-atomic accesses for a given memory word, unless some other full memory barrier, like a Mutex, guarantees exclusive access. You shouldn't mix atomics of different memory sizes for the same address."

I think it's bad that this trivial issue has been open for 6 years. At this point there's no way to specify anything more relaxed, even if that were desirable, without breaking code in the wild in hard to detect ways.

I'll sign the contributors agreement and figure out how to issue the pull request / change request for it, if somebody on the Go team will sign off on the final wording and commit to reviewing/merging it.

eloff on 16 Nov 2019

👍1

For atomics to useful I think they have to establish a “happens before” relationship.

On Nov 16, 2019, at 9:53 AM, Daniel Eloff notifications@github.com wrote:

Can we just add "Go's atomics guarantee sequential consistency among the atomic variables (behave like C/C++'s seqconst atomics). You shouldn't mix atomic and non-atomic accesses for a given memory word, unless some other full memory barrier, like a Mutex, guarantees exclusive access. You shouldn't mix atomics of different memory sizes for the same address."

I think it's bad that it's taken 6 years to specify it, and at this point there's no way to specify anything more relaxed, even if that were desirable, without breaking code in the wild in hard to detect ways.

I'll sign the contributors agreement and figure out how to issue the pull request / change request for it, if somebody on the Go team will sign off on the final wording and commit to reviewing/merging it.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.

robaho on 16 Nov 2019

I agree with @eloff , we should let all users know that all exposed functions in sync/atomic guarantee sequential consistency, not only known in the go team(eg, https://github.com/golang/go/issues/32428#issuecomment-498719675 and https://stackoverflow.com/a/58892365/3382012), unless one day we decide to expose atomic.LoadAcq/atomic.StoreRel or alike , by that time we can add additional document for these functions though. It has been pending for 6 years, time to make a change:)

zhiqiangxu on 14 Dec 2019

@robaho I'm referring to the C++ atomics happens-before ordering when talking about consistency. The docs can either point one to the C++ docs for the happens-before wording, or copy it.

Since there's no movement on this, I'm going to sign the contrib agreement and submit a pull-request.

eloff on 22 Dec 2019

Here's a neat question that needs to be resolved: are programs allowed to use 32-bit atomic ops concurrently with 64-bit atomic ops on the same memory?

For example, is this program racy?

    var x uint64
    xa := (*[2]uint32)(unsafe.Pointer(&x))
    xl := &xa[0]
    xh := &xa[1]

    done := make(chan struct{})
    go func() {
        atomic.StoreUint64(&x, 0xbadc0ffee)
        close(done)
    }()

    x0 := atomic.LoadUint32(xl)
    x1 := atomic.LoadUint32(xh)

    <-done

My instinct is that this sort of size-mixing must not be allowed, because the implementations for 32-bit and 64-bit atomic ops may differ on some 32-bit hardware. However, the race detector does not currently flag it.

bcmills on 5 May 2020

specific lock free algorithm used by prometheus

Would be interesting in watching this talk, do you know if there is a video anywhere?

@nhooyr
GopherCon UK 2019: Björn Rabenstein - Lock-free Observations for Prometheus Histograms

https://youtu.be/VmrEG-3bWyM

yangwenmai on 10 Jul 2020

👍3

Currently, the usage examples of atomic.Value are unsound without defining the happens-before relationships between Store() and Load().

leventov on 11 Nov 2020

My team just learned about this issue and we are using atomics for years. Are there at least _any_ guarantees or is using atomics always undefined behavior in Go?

chrisprobst on 3 Dec 2020

👀2

Pedantically, when talking about things like memory models "undefined behavior" means that any random bug can happen (https://en.wikipedia.org/wiki/Undefined_behavior). That isn't the case here. Go is always going to do some approximation of the right thing.

There has been a lot of discussion on this issue. The general goal here remains https://github.com/golang/go/issues/5045#issuecomment-252730563 . But the precise details and wording remain open.

ianlancetaylor on 3 Dec 2020

👍2 ❤1

Go: doc: define how sync/atomic interacts with memory model

Most helpful comment

All 74 comments

Related issues