Alertmanager: Logging of failed notifications is insufficient

Created on 30 May 2016 · 8 Comments · Source: prometheus/alertmanager

https://github.com/prometheus/alertmanager/blob/master/notify/notify.go#L193
results in messages like

2016-05-30_15:24:49.81027 time="2016-05-30T15:24:49Z" level=warning msg="Notify attempt 8 failed: unexpected status code 500" source="notify.go:193"

which doesn't really help in finding out which notification mechanism failed for which alert.

kind/enhancement

All 8 comments

This is still relevant in v0.5:

2017-04-07_11:40:08.39660 time="2017-04-07T11:40:08Z" level=debug msg="Notify attempt 1 failed: unexpected status code 404" source="notify.go:546" 
2017-04-07_11:40:08.39663 time="2017-04-07T11:40:08Z" level=error msg="Error on notify: Cancelling notify retry due to unrecoverable error: unexpected status code 404" source="notify.go:272" 
2017-04-07_11:40:08.39666 time="2017-04-07T11:40:08Z" level=error msg="Notify for 1 alerts failed: Cancelling notify retry due to unrecoverable error: unexpected status code 404" source="dispatch.go:265" 

(My guess is this particular incident is a Slack notification sent to a non-existent channel. Would be much easier to find out if a meaningful message were logged.)

What should we log then though? There's no upper bound on alerts in a notification. So maybe the grouping labels?
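To illustrate the grouping-labels idea, here is a minimal Go sketch. It is not Alertmanager code: `describeGroup`, the label values, and the message layout are all hypothetical. It only shows that a log line can stay bounded in size by rendering the group's labels plus an alert count, instead of listing every alert.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// describeGroup is a hypothetical helper: it renders grouping labels in a
// stable (sorted) order so a failed notification can be tied to its alert
// group without enumerating every alert in the notification.
func describeGroup(groupLabels map[string]string) string {
	keys := make([]string, 0, len(groupLabels))
	for k := range groupLabels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, groupLabels[k]))
	}
	return "{" + strings.Join(pairs, ", ") + "}"
}

func main() {
	// Example values only; they are not taken from a real deployment.
	labels := map[string]string{"job": "node", "alertname": "up"}
	fmt.Printf("Notify attempt failed: integration=%s receiver=%s group=%s alerts=%d\n",
		"hipchat", "demo-hipchat", describeGroup(labels), 1)
}
```

The message length then depends on the number of grouping labels, not on the number of alerts in the notification.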

In this case, the URL that triggered the 404 would have been very helpful.

We need to be a little careful with some of the URLs, as they can contain auth tokens.
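One conservative way to handle this, sketched below, is to log only the scheme and host of the target URL. This is an assumption about what would be safe to log, not something Alertmanager does: `safeURLForLog` is a hypothetical helper. It drops the path as well as the query string, because some services (Slack incoming webhooks, for example) carry the secret in the path.

```go
package main

import (
	"fmt"
	"net/url"
)

// safeURLForLog is a hypothetical helper that reduces a URL to
// "scheme://host" so that credentials in the userinfo, path, or query
// string never reach the logs.
func safeURLForLog(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil || u.Host == "" {
		// Do not echo the input back: it may itself contain a token.
		return "<invalid URL>"
	}
	return fmt.Sprintf("%s://%s", u.Scheme, u.Host)
}

func main() {
	fmt.Println(safeURLForLog("https://my.nice.website.com/v2/room/42/notification?auth_token=secret"))
	fmt.Println(safeURLForLog("http://127.0.0.1:5001/"))
}
```

Even the host alone would usually be enough to tell a misrouted notification from a service outage.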

With --log.level=debug, the logged message currently shows:

level=debug ts=2018-04-05T12:52:10.854506281Z caller=notify.go:629 
        component=dispatcher msg="Notify attempt failed" attempt=5
        integration=webhook receiver=web.hook
        err="Post http://127.0.0.1:5001/: dial tcp 127.0.0.1:5001: connect: connection refused"

added in 23f31d7d

@brian-brazil maybe a good solution would be to add an option such as --log.urls=true

@stuartnelson3 --log.level=debug doesn't show a URL anymore (with the latest Prometheus v2.3.0 and Alertmanager v0.13.0) :(

alertmanager    | level=debug ts=2018-06-20T14:19:18.524718619Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=up[4b31f10][active]
alertmanager    | level=debug ts=2018-06-20T14:19:18.525358278Z caller=dispatch.go:429 component=dispatcher aggrGroup="{}/{severity=\"hipchat\"}:{alertname=\"up\"}" msg=Flushing alerts=[up[4b31f10][active]]
alertmanager    | level=debug ts=2018-06-20T14:19:18.690725554Z caller=notify.go:605 component=dispatcher msg="Notify attempt failed" attempt=1 integration=hipchat receiver=demo-hipchat err="unexpected status code 404"

@tkrishtop the URL is currently only logged for the webhook receiver. As @brian-brazil noted above, logging it could be a concern because some URLs contain tokens which shouldn't end up in the logs.

@simonpasquier thank you.

For those who follow and might fall into the same trap:

If your HipChat integration curl command is

curl -d '{<body>}' -H 'Content-Type: application/json' 'https://my.nice.website.com/v2/room/<number>/notification?auth_token=<mynicetoken>'

do not include the API version in the HipChat API URL, i.e. write something like this in alertmanager.yml:

hipchat_api_url: 'https://my.nice.website.com/'

and not

hipchat_api_url: 'https://my.nice.website.com/v2/'
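
A likely explanation for the 404, sketched in Go below, is that the notifier appends the versioned path itself. This is an assumption inferred from the behaviour described above, not a copy of Alertmanager's code, and `buildNotificationURL` is a hypothetical name. If the configured base URL already ends in `v2/`, the version segment appears twice.

```go
package main

import "fmt"

// buildNotificationURL is a hypothetical stand-in for a notifier that
// appends "v2/room/<room>/notification" to the configured base URL.
func buildNotificationURL(apiURL, roomID string) string {
	return fmt.Sprintf("%sv2/room/%s/notification", apiURL, roomID)
}

func main() {
	// Base URL without the version: yields the expected endpoint.
	fmt.Println(buildNotificationURL("https://my.nice.website.com/", "42"))
	// Base URL with the version: "v2" is duplicated in the path.
	fmt.Println(buildNotificationURL("https://my.nice.website.com/v2/", "42"))
}
```

The second call produces a path starting with `/v2/v2/`, which would explain a 404 from the server.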