Bug
We are running 2 node alertmanager cluster. Its been running just fine from past few months, until tomorrow. We noticed that after adding few new receivers and route configurations, and reloading the configuration by (kill -s SIGHUP), the alertmanager immediately spitted out some messages like this:
2018-12-23T10:14:50.812179+00:00 prometheus1 alertmanager: level=info ts=2018-12-23T10:14:50.811699345Z caller=main.go:322 msg="Loading configuration file" file=/opt/alertmanager/alertmanager.yml
2018-12-23T10:14:50.817745+00:00 prometheus1 alertmanager: level=error ts=2018-12-23T10:14:50.817560618Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="context canceled"
2018-12-23T10:14:50.817893+00:00 prometheus1 alertmanager: level=error ts=2018-12-23T10:14:50.817580974Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="context canceled"
2018-12-23T10:14:50.817981+00:00 prometheus1 alertmanager: level=error ts=2018-12-23T10:14:50.817600149Z caller=dispatch.go:280 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="context canceled"
2018-12-23T10:14:50.818057+00:00 prometheus1 alertmanager: level=error ts=2018-12-23T10:14:50.817597885Z caller=dispatch.go:280 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="context canceled"
As I mentioned that we have 2 node cluster, we didn't notice any such error messages on the second server, after reloading configuration. We didn't see those errors prior to this and error messages do not make mention of the alert or receiver which is failing. This happened twice, right after reloading alertmanager.
I have already take a look at #282 and #346 but firstly I think its different because this happened only right after reloads and secondly the error message are of not much help because there is no mention of the alert which wasn't dispatched.
Environment
System information:
Linux 4.4.110-1.el7.elrepo.x86_64 x86_64
Alertmanager version:
alertmanager, version 0.15.2 (branch: HEAD, revision: d19fae3bae451940b8470abb680cfdd59bfa7cfa)
build user: root@3101e5b68a55
build date: 20180814-10:53:39
go version: go1.10.3
Prometheus version:
prometheus, version 2.4.3 (branch: HEAD, revision: 167a4b4e73a8eca8df648d2d2043e21bdb9a7449)
build user: root@1e42b46043e9
build date: 20181004-08:42:02
go version: go1.11.1
Alertmanager configuration file:
Standard configuration working well otherwise.
Standard configuration working well otherwise.
2018-12-23T10:14:50.812179+00:00 prometheus1 alertmanager: level=info ts=2018-12-23T10:14:50.811699345Z caller=main.go:322 msg="Loading configuration file" file=/opt/alertmanager/alertmanager.yml
2018-12-23T10:14:50.817745+00:00 prometheus1 alertmanager: level=error ts=2018-12-23T10:14:50.817560618Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="context canceled"
2018-12-23T10:14:50.817893+00:00 prometheus1 alertmanager: level=error ts=2018-12-23T10:14:50.817580974Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="context canceled"
2018-12-23T10:14:50.817981+00:00 prometheus1 alertmanager: level=error ts=2018-12-23T10:14:50.817600149Z caller=dispatch.go:280 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="context canceled"
2018-12-23T10:14:50.818057+00:00 prometheus1 alertmanager: level=error ts=2018-12-23T10:14:50.817597885Z caller=dispatch.go:280 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="context canceled"
2018-12-24T06:13:27.917149+00:00 prometheus1 alertmanager: level=info ts=2018-12-24T06:13:27.916675987Z caller=main.go:322 msg="Loading configuration file" file=/opt/alertmanager/alertmanager.yml
2018-12-24T06:13:27.921093+00:00 prometheus1 alertmanager: level=error ts=2018-12-24T06:13:27.920944281Z caller=notify.go:332 component=dispatcher msg="Error on notify" err="context canceled"
2018-12-24T06:13:27.921225+00:00 prometheus1 alertmanager: level=error ts=2018-12-24T06:13:27.92097041Z caller=dispatch.go:280 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="context canceled"
2018-12-25T01:17:26.809024+00:00 prometheus1 alertmanager: level=info ts=2018-12-25T01:17:26.808452365Z caller=main.go:322 msg="Loading configuration file" file=/opt/alertmanager/alertmanager.ymlcan you check if the alertmanager_notifications_failed_total metric doesn't change on this?
an alert would look like rate(alertmanager_notifications_failed_total[1m]) > 0 you need to have the /mertrics endpoint of the alertmanager for this metric.
thank you @geekodour, we will start monitoring that metric (alertmanager_notifications_failed_total).
But questions still remains: we need to find out the root cause of this (why this happened out-of-blue on configuration reloads) and we do not know the name of alert which didn't get dispatched. Any ideas?
I get the same errors on reload and have alertmanager_notifications_failed_total 0 going back 12h. It would be very helpful to have more detail in these messages.
My explanation is that when the configuration gets reloaded, AlertManager stops the notification's dispatcher which in turn cancels its context to stop the running aggregation groups. So the logs are "expected" but we need to do a better job at not logging this as error (though real errors still need to be logged as such).
may be related to https://github.com/prometheus/alertmanager/issues/1307 ?
You mention "routes changed", did you also made some change in the DNS ?
caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="gmail-notifications/email[0]: notify retry canceled after 2 attempts: create SMTP client: EOF"
Getting the above error on triggering alerts on GMAIL mail from prometheus via alertManager , any help