Alertmanager: Alertmanager memberlist not sending to all peers

Created on 17 May 2018 · 5 Comments · Source: prometheus/alertmanager

Investigating another duplicate (where the pipelines are firing at the correct time):

peer0 sends correctly
peer2 and peer3 receive the nflog notification
peer1 does not receive the nflog notification. 15 seconds pass, and it sends a duplicate.

After reading memberlist code, it seems that this is what's happening:

Every X sec, memberlist.gossip() is called https://github.com/hashicorp/memberlist/blob/master/state.go#L480

A random number of nodes is selected from the available members https://github.com/hashicorp/memberlist/blob/master/state.go#L485 (In our case, up to 3 nodes). From the documentation:

// kRandomNodes is used to select up to k random nodes, excluding any nodes where
// the filter function returns true. It is possible that less than k nodes are
// returned.

Queued messages are sent up to a certain limit (in our case, 3 times).

What I found in the logs is that peer0 gossiped its message to peer2 twice, and peer3 once. After sending its message 3 times, the message was removed from the broadcast queue. At this point, peer1 would only find out about this message as a result of a full state merge (which happens once every 60s).
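To make that failure mode concrete, here is a toy Go calculation (not memberlist code). It assumes each of the three transmissions independently picks one of the three other peers uniformly at random, which is a simplification of how memberlist really selects targets. `missProbability` is a hypothetical helper written for this illustration.

```go
package main

import "fmt"

// missProbability enumerates every possible sequence of `sends`
// transmissions, each going to one of `peers` equally likely targets,
// and returns the fraction of sequences in which at least one peer
// never receives the message. Toy model only, not memberlist behaviour.
func missProbability(peers, sends int) float64 {
	total, missed := 0, 0
	targets := make([]int, sends)

	var walk func(depth int)
	walk = func(depth int) {
		if depth == sends {
			// Count how many distinct peers this sequence reached.
			seen := make(map[int]bool)
			for _, t := range targets {
				seen[t] = true
			}
			total++
			if len(seen) < peers {
				missed++
			}
			return
		}
		for p := 0; p < peers; p++ {
			targets[depth] = p
			walk(depth + 1)
		}
	}
	walk(0)

	return float64(missed) / float64(total)
}

func main() {
	// 3 other peers, message transmitted 3 times before being dequeued.
	fmt.Printf("%.3f\n", missProbability(3, 3)) // prints 0.778
}
```

Under that simplified model, only 6 of the 27 possible send sequences reach all three peers, so a miss like the one peer1 experienced would not be a rare event.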

The easiest solutions are either to set config.GossipNodes to an arbitrarily high number, so that the probability of selecting all peers is essentially 1, or to have Alertmanagers re-gossip messages that they haven't seen before, raising the probability (hopefully to 1) that all nodes see the message before cluster.peer-timeout elapses.

I'm going to investigate both options.

stuartnelson3

All 5 comments

Configuring config.GossipNodes to an arbitrarily high value has no effect, as there is an upper bound on the number of nodes that will be gossiped to. Here's the peer selection code in question: https://play.golang.org/p/xOTwsHKyXkh
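For readers who can't open the playground link, the following is a simplified Go paraphrase of that kind of selection loop, written from memory rather than copied from memberlist. `pickRandomNodes` and the string-based node names are hypothetical stand-ins. It illustrates why asking for a very large k cannot return more nodes than are eligible.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickRandomNodes returns up to k distinct nodes chosen at random,
// skipping any node for which exclude returns true. The number of
// probes is bounded, so fewer than k nodes may come back.
// Simplified sketch; not the library's actual implementation.
func pickRandomNodes(k int, nodes []string, exclude func(string) bool) []string {
	n := len(nodes)
	picked := make([]string, 0, k)

OUTER:
	for i := 0; i < 3*n && len(picked) < k; i++ {
		candidate := nodes[rand.Intn(n)]

		// Let the filter reject the candidate (e.g. the local node).
		if exclude != nil && exclude(candidate) {
			continue
		}
		// Skip nodes that were already selected in this round.
		for _, p := range picked {
			if p == candidate {
				continue OUTER
			}
		}
		picked = append(picked, candidate)
	}
	return picked
}

func main() {
	nodes := []string{"peer0", "peer1", "peer2", "peer3"}
	notSelf := func(name string) bool { return name == "peer0" }

	// Asking for 100 nodes still yields at most the 3 eligible peers.
	got := pickRandomNodes(100, nodes, notSelf)
	fmt.Println(len(got) <= 3) // prints true
}
```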

So, I'm adding message propagation from peers that haven't seen the message before. If a peer receives an nflog message and merges it into its internal state, then it will also gossip the message. This seems like a simple solution that will also halt propagation relatively quickly, but I might be missing something obvious. Am I missing something?
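A minimal Go sketch of that idea, under the assumption that merging a message reports whether it was new. The `peer` type, its fields, and the `receive` method are hypothetical names for illustration and are not taken from Alertmanager's code.

```go
package main

import "fmt"

// peer is a hypothetical stand-in for one Alertmanager instance.
type peer struct {
	name  string
	seen  map[string]bool // message IDs already merged into local state
	queue []string        // messages waiting to be gossiped onward
}

func newPeer(name string) *peer {
	return &peer{name: name, seen: make(map[string]bool)}
}

// receive merges an incoming message. Only a message seen for the
// first time is queued for re-gossip, so propagation stops once every
// peer has merged it.
func (p *peer) receive(id string) bool {
	if p.seen[id] {
		return false // already known: do not gossip it again
	}
	p.seen[id] = true
	p.queue = append(p.queue, id)
	return true
}

func main() {
	p := newPeer("peer2")
	fmt.Println(p.receive("nflog-entry-1"), len(p.queue)) // prints true 1
	fmt.Println(p.receive("nflog-entry-1"), len(p.queue)) // prints false 1
}
```

The second call shows the halting property: a repeated delivery changes nothing and queues nothing.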

@brancz @fabxc @grobie @beorn7

stuartnelson3 on 18 May 2018

Your explanation makes perfect sense. Have you tried increasing TransmitLimitedQueue.RetransmitMult? That being said, what you described seems like a valid solution.

simonpasquier on 18 May 2018

I changed RetransmitMult to scale based on the size of the list of cluster.peers (with a minimum value of 3), and also set the nodes to gossip any messages they are receiving for the first time.
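A rough Go sketch of what such scaling could look like. The halving factor is an assumption (the comment above only says the value scales with the peer list and has a floor of 3), and `scaledRetransmitMult` is a hypothetical name. `retransmitLimit` reflects my understanding of how memberlist turns the multiplier into a per-message transmit limit, and should be checked against the library source.

```go
package main

import (
	"fmt"
	"math"
)

// scaledRetransmitMult grows the multiplier with the number of
// configured peers, never dropping below 3.
// ASSUMPTION: dividing by 2 is illustrative; the thread does not
// state the exact scaling.
func scaledRetransmitMult(peerCount int) int {
	mult := peerCount / 2
	if mult < 3 {
		mult = 3
	}
	return mult
}

// retransmitLimit mirrors, as far as I understand it, how memberlist
// derives the number of times a queued message is transmitted:
// mult * ceil(log10(n + 1)) for a cluster of n nodes.
func retransmitLimit(mult, n int) int {
	nodeScale := math.Ceil(math.Log10(float64(n + 1)))
	return mult * int(nodeScale)
}

func main() {
	for _, peers := range []int{4, 12} {
		mult := scaledRetransmitMult(peers)
		fmt.Println(peers, mult, retransmitLimit(mult, peers))
	}
}
```

With 4 nodes and a multiplier of 3 this yields a limit of 3 transmissions, which matches the "3 times" observed in the original report.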

So far it seems to be working better than before. One weird side-effect is that peer0 has had 4 messages in its alertmanager_cluster_messages_queued for a while, and they're not being cleared out.

stuartnelson3 on 18 May 2018

👍1

Appears to be working: the point at which all instances stop sending is when I deployed #1389 at SoundCloud.

I'll leave this running over the weekend.

stuartnelson3 on 18 May 2018

🎉2

closed by #1389

stuartnelson3 on 8 Jun 2018
