Alertmanager: Release Alertmanager v0.15.0

Created on 23 Apr 2018 · 43 Comments · Source: prometheus/alertmanager

I would like to start the discussion of doing either a v0.15.0 or a v0.15.0-rc.2 release.

@stuartnelson3 @simonpasquier @fabxc are there any blockers that I am not aware of?

If there is consensus I am more than happy to prepare the next release.

All 43 comments

There were two issues I wanted to fix before a next release, #1331 and https://github.com/prometheus/alertmanager/pull/1330. #1330 is blocking @TheTincho from making a debian release, but I believe that still doesn't resolve the issue of our elm frontend .. i.e. even with #1330, he won't be able to release AM as an official debian package.

I haven't had a chance to test the new mesh lib at soundcloud, which I would like to do before making an official release, but I'm also not aware of the level of testing you've done with it over at redhat. @simonpasquier commented on a PR that he thought it was stable, I would be interested in hearing his opinion.

Hi,

Yes, #1330 is indeed a blocker for me now.

The Elm issue has not gone away, but I have basically abandoned any hope of getting Elm into Debian (React does not look better in this aspect). So last week I started working on producing an official release without any web frontend.

My plan would be to upload this to experimental so users can avail of the newer AM, and meanwhile work on backporting the old simpler frontend to AM 0.14.

Correct, my own tests with 0.15 are conclusive (see my original comment). That being said, they are limited to my local environment, and although I've played a bit with ambench, it can't be compared to any (pre-)production setup. We have also had reports from users successfully deploying 0.15 RC versions with the Prometheus operator, which is encouraging. Maybe @iksaif has done some testing too?

My feeling is that people eager to test the RC have done it already and the major issues regarding the clustering have been addressed (except for DNS resolution #1307 but AFAIU it isn't a blocking problem). Getting 0.15 out of the door would help surface new issues, if any. And in case of a blocker, downgrading from 0.15 to 0.14 works fine since the definition of the silences and notification logs on disk hasn't changed (I've just checked this).

In that case, my chief concern then is that the release makes it explicit that mesh configuration requires FQDNs.

EDIT:
There are also breaking changes in amtool.

I have 0.15rc1 running in a limited environment and it seems to work. But it doesn't get much traffic.

Maybe it would be nice to have an integration test that generates a few hundred alerts and silences for ~20min.

--
Corentin Chary
http://xf.iksaif.net

@stuartnelson3 we have been testing the new Alertmanager with the Prometheus Operator test suite and deployed it in some dev environments. No production experience so far.

There are also breaking changes in amtool.

@stuartnelson3 Are these changes in the release-0.15 branch? I would suggest only cherry-picking bug fixes from the master branch into the release branch.

My feeling is that people eager to test the RC have done it already and the major issues regarding the clustering have been addressed (except for DNS resolution #1307 but AFAIU it isn't a blocking problem). Getting 0.15 out of the door would help surface new issues, if any. And in case of a blocker, downgrading from 0.15 to 0.14 works fine since the definition of the silences and notification logs on disk hasn't changed (I've just checked this).

@simonpasquier I agree.

I just started running a cluster at soundcloud and would like to investigate it more thoroughly. For example, it appears that even after having run for a couple hours, all instances are sending notifications. Given that we are using the default peer timeout (15s), and it doesn't appear that we're surpassing that timeout, I would expect just a single instance to be sending notifications based on how clusterWait is defined (https://github.com/prometheus/alertmanager/blob/master/cmd/alertmanager/main.go#L408-L412). I'm dubious that our mesh is so unstable that all instances are sending notifications so regularly.

image

Perhaps I'm misunderstanding something about the mesh and someone can clarify for me what I'm seeing.

@stuartnelson3 Are these changes in the release-0.15 branch? I would suggest only cherry-picking bug fixes from the master branch into the release branch.

@mxinden What's your reasoning? It feels arbitrary not to release changes made to amtool, especially when they're preventing @TheTincho from releasing AM as a debian package, and instead release the new clustering lib then wait some period of time and then release the amtool changes with whatever other changes that have been merged since then.

@stuartnelson3 In regards to all instances sending notifications: Would you mind creating a new issue with your configuration and status page? I would test it out on one of our dev clusters.

In regards to not including new features in release candidates: This strategy is not amtool related. My preference would be to not include new features (only bug fixes) between rc and release to preserve its tested stability.

If the current release-0.15 amtool version is not compatible with the current release-0.15 alertmanager binary, I would see those new amtool features as bug fixes.

This is not a strong opinion just a suggestion. How have we been handling this before?

@mxinden will send the issue soon, it's half-way written on my other laptop :)

In regards to not including new features in release candidates: This strategy is not amtool related. My preference would be to not include new features (only bug fixes) between rc and release to preserve its tested stability.

I have to check the changelog, but I believe all non-bugfixes have been in amtool.

Alertmanager's API hasn't changed, so amtool should be compatible. The changes are a couple flag names, moving to a stable underlying CLI library, and fixing a bug.

This is not a strong opinion just a suggestion. How have we been handling this before?

I'm not sure, unfortunately. I don't think we have a set policy. I would prefer to get the amtool changes out as those won't affect running alertmanager, and whoever does the release can check for changes to alertmanager itself and decide if they're a risk to stability.

Can 0.15 please include https://github.com/prometheus/alertmanager/pull/1339 as a trivial fix?

I've been running rc1 and it's been absolutely great. Thank you, developers.

There is still the generally confusing issue explained in #1341, but having run a version of master that includes all the changes in v0.15.0-rc.1 and some bug fixes, I think we can release this. In general, the multi-send seems to be an issue with pipeline creation that only occurs rarely, and fixing that (i.e. syncing pipeline execution) will take more time to test and verify.

@mxinden do you have time to work on the release?

@stuartnelson3 Sounds good. I will try my best in the upcoming two days.

I have created a release-0.15 branch. https://github.com/prometheus/alertmanager/pull/1363 cherry-picks discussed commits from master into release-0.15. https://github.com/prometheus/alertmanager/pull/1364 bumps VERSION and adds changes to CHANGELOG.md.

Let me know what you think.

According to @grobie, there seems to be a higher rate of more than one peer sending alerts than with the previous mesh library.

I think we should address this before releasing a version that doesn't support some form of synchronization of pipeline execution between peers.

I don't know whether it's multiple peers sending. We definitely see a significantly increased number of duplicated notifications sent with 0.15, while the network and instances are healthy.

Is there anything that looks suspicious when looking at the cluster metrics (e.g. alertmanager_cluster_.+)?

Nope. We spent some hours again this morning investigating the situation, added more debug logging and released the new version. Will work on that again this Friday.

We experienced a network partition over the weekend during which every alertmanager was isolated. After the partition, the mesh did not recover. At the time of this writing, the 4 instances have been running independently for over 48 hours. alertmanager_cluster_health_score and alertmanager_cluster_messages_queued reflect the current bad state of the mesh.

We are actively looking at this and think that an official 0.15.0 release is not ready until we can figure out the root cause.

@mxinden is there a test case for this in the prometheus operator acceptance tests?

Confirmed in a test using iptables between two machines in the same datacenter, once nodes are marked dead they will not rejoin without restarting memberlist.

Indeed once a node has left the cluster (eg on connection time-out), memberlist on its own will never try to reconnect to it. Serf (which is heavily using memberlist) handles the reconnection with a background task:
https://github.com/hashicorp/serf/blob/80ab48778deee28e4ea2dc4ef1ebb2c5f4063996/serf/serf.go#L1426-L1437
https://github.com/hashicorp/serf/blob/80ab48778deee28e4ea2dc4ef1ebb2c5f4063996/serf/serf.go#L1484-L1523

AIUI AlertManager needs to deal with it in a similar fashion.
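A background reconnect of this kind could look roughly like the sketch below. It is not Serf's or Alertmanager's implementation: `reconnector`, `joinFunc`, and the peer addresses are hypothetical, and the join hook is only shaped like memberlist's `Join(existing []string) (int, error)` so that the sketch runs without the library.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// joinFunc is a hypothetical hook that tries to contact the given addresses
// and reports how many of them were reached.
type joinFunc func(addrs []string) (int, error)

// reconnector remembers peers that dropped out of the cluster so they can be
// retried later instead of being forgotten.
type reconnector struct {
	failed map[string]struct{}
	join   joinFunc
}

func newReconnector(join joinFunc) *reconnector {
	return &reconnector{failed: make(map[string]struct{}), join: join}
}

// peerLeft records a peer that was marked dead or timed out.
func (r *reconnector) peerLeft(addr string) {
	r.failed[addr] = struct{}{}
}

// retry attempts to rejoin every failed peer once and returns the sorted
// addresses that are still unreachable.
func (r *reconnector) retry() []string {
	var still []string
	for addr := range r.failed {
		if n, err := r.join([]string{addr}); err != nil || n == 0 {
			still = append(still, addr)
			continue
		}
		delete(r.failed, addr) // rejoined, stop tracking it
	}
	sort.Strings(still)
	return still
}

func main() {
	// Simulated network: only am-1 is reachable at first.
	reachable := map[string]bool{"am-1:9094": true}
	join := func(addrs []string) (int, error) {
		n := 0
		for _, a := range addrs {
			if reachable[a] {
				n++
			}
		}
		if n == 0 {
			return 0, errors.New("no peers reachable")
		}
		return n, nil
	}

	r := newReconnector(join)
	r.peerLeft("am-1:9094")
	r.peerLeft("am-2:9094")
	fmt.Println("still failed:", r.retry())

	reachable["am-2:9094"] = true // partition heals
	fmt.Println("still failed:", r.retry())
}
```

In a real process `retry` would presumably run on a ticker in its own goroutine, with a cap on how long a dead peer is retried before being dropped for good.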

@mxinden is there a test case for this in the prometheus operator acceptance tests?

@stuartnelson3 there is none at the moment, sorry. I will look into improving that for the future.

we are actively looking at this and think that an official 0.15.0 release is not ready until we can figure out the root cause.

I agree. Thanks a lot for looking into this.

Investigating another duplicate (where the pipelines are firing at the correct time):

- peer0 sends correctly
- peer2 and peer3 receive the nflog notification
- peer1 does not receive the nflog notification. 15 seconds pass, and it sends a duplicate.

I'm trying to understand why peer1 doesn't receive the updated state within 15s, which is leading me to read about how memberlist works. I'm trying to figure out:

- how many times a gossip message is sent (TransmitLimitedQueue#GetBroadcasts)
- how to force all nodes to be gossiped to (utils.go#kRandomNodes)

I would prefer not having to do a full state sync every 10s to ensure the messages have propagated, but that would probably stop the duplicates. The issue then is we're not gossiping, but maybe that's fine :man_shrugging:.

edit:
This has been moved to its own issue: https://github.com/prometheus/alertmanager/issues/1387

Any ETA on the new release?

I've been addressing what I see as the two main issues to a new release:

I've just returned from a week's vacation and need to pick the work back up again. I hope to finish these in the next two weeks and get a release out.

How long will v0.15.0-rc.2 be monitored for stability before an official v0.15.0 is released? There's a lot of great stuff in here, so I'm very excited to start using it in an official release.

We're currently running it at SoundCloud. It's looking stable, so I'm hoping we can release it on Wednesday or Thursday.

@mxinden unless you have any objections, would you like to do the release for 0.15.0?

@stuartnelson3 Wednesday or Thursday sounds good.

would you like to do the release for 0.15.0?

For sure. Glad to do so in case there are no issues reported till then. :tada:

One issue we need to deal with:

The internal queue of messages to gossip is currently unbounded, and only messages under a certain size can be sent: currently 1400 bytes (configurable) minus msg_overhead (the memberlist message overhead, usually a few bytes). Any messages that push the total gossip size past this "gossip size limit" remain in the queue and are retried next time ... but they can never be sent, because they're too large.

Looking at our setup, we have some alerts that can greatly exceed this max size for a gossip message:

2018-06-12_12:02:03.97148 [DEBUG] memberlist: message not sent because over limit size_total=156746 overhead=3 size_used=0 size_msg=156743 limit=1398 position=1/2

These are just some internal log lines I added, but the important part is that overhead + size_used + size_msg needs to be LESS than limit. Because size_msg is way larger than limit, this will sit in the queue forever.

The messages slowly pile up:

2018-06-12_12:16:20.77173 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=12/13
2018-06-12_12:16:20.77174 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=11/13
2018-06-12_12:16:20.77175 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=10/13
2018-06-12_12:16:20.77176 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=9/13
2018-06-12_12:16:20.77177 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=8/13
2018-06-12_12:16:20.77178 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=7/13
2018-06-12_12:16:20.77179 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=6/13
2018-06-12_12:16:20.77185 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=5/13
2018-06-12_12:16:20.77186 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=4/13
2018-06-12_12:16:20.77189 [DEBUG] memberlist: message not sent because over limit size_total=157099 overhead=3 size_used=0 size_msg=157096 limit=1398 position=3/13
2018-06-12_12:16:20.77190 [DEBUG] memberlist: message not sent because over limit size_total=156746 overhead=3 size_used=0 size_msg=156743 limit=1398 position=2/13
2018-06-12_12:16:20.77191 [DEBUG] memberlist: message not sent because over limit size_total=156746 overhead=3 size_used=0 size_msg=156743 limit=1398 position=1/13
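The pile-up in the log lines above can be reproduced with a toy model of one gossip round. This is a simplified sketch and not memberlist's queue: `drain` is a hypothetical function, and it only encodes the condition stated above, namely that a message goes out when overhead + size_used + size_msg stays below the limit.

```go
package main

import "fmt"

// drain models one gossip round: it picks queued messages that fit into the
// remaining packet budget and leaves everything else queued for next time.
func drain(queue [][]byte, overhead, limit int) (sent, remaining [][]byte) {
	used := 0
	for _, msg := range queue {
		// Skip the message if adding it would reach or exceed the limit.
		if overhead+used+len(msg) >= limit {
			remaining = append(remaining, msg)
			continue
		}
		used += overhead + len(msg)
		sent = append(sent, msg)
	}
	return sent, remaining
}

func main() {
	queue := [][]byte{
		make([]byte, 156743), // size_msg of the oversized message in the logs
		make([]byte, 200),    // a message of ordinary size
	}
	for round := 1; round <= 3; round++ {
		var sent [][]byte
		sent, queue = drain(queue, 3, 1398)
		fmt.Printf("round %d: sent=%d still queued=%d\n", round, len(sent), len(queue))
	}
}
```

The small message leaves in the first round, while the oversized one is skipped every round and never leaves the queue, matching the growing position counters in the logs.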

Now the hard part ... how do we solve this? Do we not gossip messages that are too large, and raise the limit? Do we hack up any messages that exceed our byte limit and send them piecemeal? Do we attempt to make a hash of hashes for "large" messages, and fall back to just doing a direct comparison? What do we set the limit to?

@brancz @fabxc @simonpasquier @mxinden @brian-brazil

Are we sure this is actually a single message that we're talking about? To me this seems like the Alertmanager instances are trying to gossip the FullState at once, and that's simply too large.

Indeed I can reproduce the problem (for instance by creating a silence with a very long comment). Eventually the silence gets replicated when the nodes do a full-state sync, but the queue never drains the messages that are too large. I can't see an easy solution for this except pruning the queue regularly?

How about a reasonable limit for comments, extracting comments out of the silencelog entry, and doing the same for anything else that's currently not limited? If the message size limit is way too small we should of course adapt that as well.

The logs above were for a single nflog that had a couple hundred alerts in it. I don't think there's a good way to limit the number of alerts we gossip, otherwise we're building a gossip that will always double-notify for large nflogs. I'm thinking a best-effort hashing-the-hashes might be the best option ..?

alertmanager with the latest changes for sending large messages via tcp seems to be doing the right thing.

I'm gone without internet access for the next two weeks. I propose that @mxinden releases the final rc, @grobie keeps an eye on the alertmanager dashboard over the weekend to make sure it's working correctly, and then @mxinden can release 0.15.0 early next week. How does this sound?

@stuartnelson3 That sounds good to me.

@grobie keeps an eye on the alertmanager dashboard over the weekend to make sure it's working correctly

@grobie did you run into any issues with the new v0.15.0-rc3 release candidate in the last couple of days?

I don't see any blocking issues reported by the community so far. :tada:

I mentioned it in #prometheus-dev earlier today, looks all good from what I can see. Go for it!

+1 Sounds great

Who's planning on doing the release?

@brian-brazil I will prepare it today.

Great!

With https://github.com/prometheus/alertmanager/pull/1429 merged, I will close here. Feel free to reopen if there are any questions.
