Investigating another duplicate (where the pipelines are firing at the correct time):
After reading memberlist code, it seems that this is what's happening:
Every X sec, memberlist.gossip() is called https://github.com/hashicorp/memberlist/blob/master/state.go#L480
Up to k random nodes are selected from the available members https://github.com/hashicorp/memberlist/blob/master/state.go#L485 (in our case, up to 3 nodes). From the documentation:
// kRandomNodes is used to select up to k random nodes, excluding any nodes where
// the filter function returns true. It is possible that less than k nodes are
// returned.
Queued messages are sent up to a certain limit (in our case, 3 times).
What I found in the logs is that peer0 gossiped its message to peer2 twice and to peer3 once, but never to peer1. After being sent 3 times, the message was removed from the broadcast queue. From that point on, peer1 could only learn about the message through a full state merge (which happens once every 60s).
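For a rough feel of how likely such a miss is, here is a back-of-the-envelope sketch. It assumes a deliberately simplified model that is not exactly what memberlist does: the message is sent exactly 3 times, and each send goes to one of the 3 other peers, chosen uniformly and independently. The function name `coverageOdds` is made up for illustration.

```go
package main

import "fmt"

// coverageOdds enumerates every possible sequence of `sends` targets drawn
// from `peers` other nodes and counts the sequences that reach all of them.
// Simplifying assumption: each send picks one target uniformly at random,
// independently of the previous sends.
func coverageOdds(peers, sends int) (covering, total int) {
	total = 1
	for i := 0; i < sends; i++ {
		total *= peers
	}
	for seq := 0; seq < total; seq++ {
		seen := make([]bool, peers)
		rest := seq
		for i := 0; i < sends; i++ {
			seen[rest%peers] = true // decode the i-th target from seq
			rest /= peers
		}
		all := true
		for _, s := range seen {
			all = all && s
		}
		if all {
			covering++
		}
	}
	return covering, total
}

func main() {
	covering, total := coverageOdds(3, 3)
	fmt.Printf("%d of %d sequences reach every peer\n", covering, total)
}
```

Under this model only 6 of the 27 equally likely sequences reach all three peers, so a miss like the one in the logs would be the common case rather than a rare one.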
The easiest solution is either to set config.GossipNodes to an arbitrarily high number, so that the probability of selecting all peers is essentially 1, or to have Alertmanagers re-gossip messages they haven't seen before, raising the probability (hopefully to 1) that all nodes see the message before cluster.peer-timeout.
I'm going to investigate both options.
Configuring config.GossipNodes to an arbitrarily high value has no effect, as there is an upper bound on the number of nodes that will be gossiped to. Here's the peer selection code in question: https://play.golang.org/p/xOTwsHKyXkh
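To illustrate why a large value cannot help, here is a minimal sketch of bounded random selection, written from the kRandomNodes doc comment quoted earlier. It is a simplified re-implementation, not memberlist's actual code: the function name `selectUpToK`, the string-based node list, and the probe bound of 3 times the node count are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// selectUpToK picks up to k distinct random nodes, skipping any node for
// which exclude returns true. Hypothetical helper, not memberlist's API.
func selectUpToK(k int, nodes []string, exclude func(string) bool) []string {
	n := len(nodes)
	var picked []string
	seen := make(map[string]bool)
	// The number of probes is bounded so the loop always terminates, even
	// when fewer than k nodes are eligible.
	for i := 0; i < 3*n && len(picked) < k; i++ {
		node := nodes[rand.Intn(n)]
		if exclude(node) || seen[node] {
			continue // excluded or already selected: this probe is wasted
		}
		seen[node] = true
		picked = append(picked, node)
	}
	return picked
}

func main() {
	peers := []string{"peer0", "peer1", "peer2", "peer3"}
	notSelf := func(name string) bool { return name == "peer0" }
	for _, k := range []int{3, 1000} {
		fmt.Printf("k=%d selected=%d\n", k, len(selectUpToK(k, peers, notSelf)))
	}
}
```

However large k is, the result can never contain more than the 3 eligible peers, and because probes can land on excluded or already selected nodes, it can contain fewer.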
So I'm adding message propagation from peers that haven't seen the message before: if a peer receives an nflog message and merges it into its internal state, it will also gossip the message. This seems like a simple solution that will also halt propagation relatively quickly, but am I missing something obvious?
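The idea can be sketched roughly as follows. This is not the actual Alertmanager code: the `peer` type, its fields, and the `receive` method are hypothetical, and the merge is reduced to a "have I seen this key" check, whereas a real merge would also compare the entries themselves.

```go
package main

import "fmt"

// peer is a hypothetical stand-in for one Alertmanager instance.
type peer struct {
	name  string
	seen  map[string]bool // nflog entries already merged into local state
	queue []string        // entries waiting to be gossiped onward
}

func newPeer(name string) *peer {
	return &peer{name: name, seen: make(map[string]bool)}
}

// receive merges an entry and queues it for re-gossip only if it was new.
// Entries the peer already knows are dropped, which is what stops the
// propagation once every peer has seen the message.
func (p *peer) receive(entry string) bool {
	if p.seen[entry] {
		return false
	}
	p.seen[entry] = true
	p.queue = append(p.queue, entry)
	return true
}

func main() {
	p2 := newPeer("peer2")
	fmt.Println(p2.receive("nflog-entry-1"), len(p2.queue)) // first sight: queued
	fmt.Println(p2.receive("nflog-entry-1"), len(p2.queue)) // repeat: ignored
}
```

Because a peer re-gossips only on first sight, each message is queued at most once per peer, so the extra traffic is bounded by the cluster size.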
@brancz @fabxc @grobie @beorn7
Your explanation makes perfect sense. Have you tried increasing TransmitLimitedQueue.RetransmitMult? That being said, what you described seems like a valid solution.
I changed RetransmitMult to scale based on the size of the list of cluster.peers (with a minimum value of 3), and also set the nodes to gossip any messages they are receiving for the first time.
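A minimal sketch of what such scaling could look like. The one-to-one scaling in `retransmitMult` is an assumption (only the minimum of 3 is stated above), and `retransmitLimit` reflects my reading of memberlist's documented formula, RetransmitMult * ceil(log10(n+1)); both function names are illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// retransmitMult scales with the number of configured peers, never below 3.
// The one-to-one scaling is an assumption made for this sketch.
func retransmitMult(numPeers int) int {
	if numPeers < 3 {
		return 3
	}
	return numPeers
}

// retransmitLimit follows the formula RetransmitMult * ceil(log10(n+1))
// for a cluster of n nodes, as I understand memberlist's documentation.
func retransmitLimit(mult, n int) int {
	return mult * int(math.Ceil(math.Log10(float64(n+1))))
}

func main() {
	for _, n := range []int{2, 4, 20} {
		mult := retransmitMult(n)
		fmt.Printf("peers=%d mult=%d limit=%d\n", n, mult, retransmitLimit(mult, n))
	}
}
```

With the previous multiplier of 3 and a 4-node cluster the formula gives a limit of 3, which matches the 3 sends seen in the logs.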
So far it seems to be working better than before. One weird side effect is that peer0 has had 4 messages sitting in alertmanager_cluster_messages_queued for a while, and they're not being cleared out.
Appears to be working: the point at which all instances stop sending is when I deployed #1389 at SoundCloud.
I'll leave this running over the weekend.
closed by #1389