Alertmanager: Alertmanager sends duplicate "resolved"

Created on 29 Aug 2017  路  11Comments  路  Source: prometheus/alertmanager

Hi.
Alertmanager sends 3 messages with "resolved", after "firing".
As I understood, prometheus sends the status of the alert every evaluation period.
So, here is some debug information:
I just created one test job in prometheus (which uses blackbox_exporter as a tcp port check service):
prometheus.yml

global:
  scrape_interval: 10s
  scrape_timeout: 8s
  evaluation_interval: 1m
rule_files:
  - /etc/prometheus/production.rules
scrape_configs:
- job_name: 'local_tcp_check'
  scrape_interval: 30s
  metrics_path: /probe
  params:
    module: [ tcp_connect ]
  static_configs:
  - targets:
    - localhost:9116
    labels:
      nodetype: test
  relabel_configs:
  - source_labels: [__address__]
    regex: (.*):(.*)
    target_label: __param_target
    replacement: ${1}:${2}
  - source_labels: [__param_target]
    regex: (.*)
    target_label: instance
    replacement: ${1}
  - source_labels: []
    regex: .*
    target_label: __address__
    replacement: 127.0.0.1:9115  # Blackbox exporter.

blackbox_exporter.yml

  tcp_connect:
    prober: tcp
    timeout: 5s

/etc/prometheus/production.rules

ALERT LOCAL_TCP
        IF probe_success {job=~"local_tcp_check"} != 1
         FOR 2m
        LABELS { severity = "critical" }
        ANNOTATIONS {
                summary = " pls ignore, duplicate messsages test {{ $labels.instance }}   ",
                description = " pls ignore, duplicate messsages test {{ $labels.instance }}  ",
        }

alertmanager.yml

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'smtp9.service.com:25'
  smtp_from: '[email protected]'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  resolve_timeout: 1h
# The directory from which notification templates are read.

# The root route on which each incoming alert enters.
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5m

route:
 group_by: ['alertname']
 receiver: slack_general
 routes:
  - match:
      severity: sms
    receiver: sms_admins
    continue: true
  - match:
      severity: warning
    receiver: slack_general
    repeat_interval: 1h
    continue: true
  - match:
      severity: warning
    receiver: admins
    repeat_interval: 1h
  - match_re:
      severity: critical|sms
    receiver: slack_general
    continue: true
  - match_re:
      severity: critical|sms
    receiver: admins
receivers:
- name: slack_general
  slack_configs:
  - api_url: https://hooks.slack.com/services/XXX/XXX/XXX
    channel: '#notifications'
    send_resolved: true
    title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] '
    text:  '{{ range $val := .Alerts }}
             Alert: <http://alertmanager.url.com/#/alerts|{{ index $val.Annotations "description" }}>
             {{ end}}'
- name: admins
  webhook_configs:
  - send_resolved: True
    url: http://127.0.0.1:9088/alert/chatid1,chatid2,chatid3
- name: caller
  webhook_configs:
  - send_resolved: True
    url: http://127.0.0.1:9089/alert/handle
- name: sms_admins
  email_configs:
  - to: '[email protected]'
    require_tls: false
    send_resolved: true
    headers:
      subject: 'some subject'
templates:
- '/etc/prometheus/alertmanager/templates/default.tmpl'
#test for rule inhibition, used for free space trigger for now
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance', 'mountpoint']

Then I stopped service, which listens the port 9116 on localhost, to check the alerting flow.
I waited when service passes into a status "firing", then I launched service on localhost:9116 again and with intervals 5 minutes, 3 notification messages with status "resolved" were sent to me in slack.

Here is how it looked:
http://joxi.net/KAg09J9igV5nLr

Here is some debug information from alertmanager:

[root@prometheus ~]# journalctl -f -n 500  -u alertmanager | grep LOCAL_TCP
Aug 29 15:44:40 prometheus alertmanager[13966]: time="2017-08-29T15:44:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][active] component=dispatcher source="dispatch.go:184"
Aug 29 15:45:10 prometheus alertmanager[13966]: time="2017-08-29T15:45:10Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][active]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 15:45:10 prometheus alertmanager[13966]: time="2017-08-29T15:45:10Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][active]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 15:45:40 prometheus alertmanager[13966]: time="2017-08-29T15:45:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][active] component=dispatcher source="dispatch.go:184"
Aug 29 15:46:40 prometheus alertmanager[13966]: time="2017-08-29T15:46:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][active] component=dispatcher source="dispatch.go:184"
Aug 29 15:47:40 prometheus alertmanager[13966]: time="2017-08-29T15:47:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][active] component=dispatcher source="dispatch.go:184"
Aug 29 15:48:40 prometheus alertmanager[13966]: time="2017-08-29T15:48:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][active] component=dispatcher source="dispatch.go:184"
Aug 29 15:49:40 prometheus alertmanager[13966]: time="2017-08-29T15:49:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][active] component=dispatcher source="dispatch.go:184"
Aug 29 15:50:10 prometheus alertmanager[13966]: time="2017-08-29T15:50:10Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][active]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 15:50:10 prometheus alertmanager[13966]: time="2017-08-29T15:50:10Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][active]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 15:50:40 prometheus alertmanager[13966]: time="2017-08-29T15:50:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][active] component=dispatcher source="dispatch.go:184"
Aug 29 15:51:40 prometheus alertmanager[13966]: time="2017-08-29T15:51:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][active] component=dispatcher source="dispatch.go:184"
Aug 29 15:52:40 prometheus alertmanager[13966]: time="2017-08-29T15:52:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][active] component=dispatcher source="dispatch.go:184"
Aug 29 15:53:40 prometheus alertmanager[13966]: time="2017-08-29T15:53:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][active] component=dispatcher source="dispatch.go:184"
Aug 29 15:54:40 prometheus alertmanager[13966]: time="2017-08-29T15:54:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][active] component=dispatcher source="dispatch.go:184"
Aug 29 15:55:10 prometheus alertmanager[13966]: time="2017-08-29T15:55:10Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][active]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 15:55:10 prometheus alertmanager[13966]: time="2017-08-29T15:55:10Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][active]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 15:55:41 prometheus alertmanager[13966]: time="2017-08-29T15:55:41Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][active] component=dispatcher source="dispatch.go:184"
Aug 29 15:56:40 prometheus alertmanager[13966]: time="2017-08-29T15:56:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][active] component=dispatcher source="dispatch.go:184"
Aug 29 15:57:40 prometheus alertmanager[13966]: time="2017-08-29T15:57:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][resolved] component=dispatcher source="dispatch.go:184"
Aug 29 15:58:40 prometheus alertmanager[13966]: time="2017-08-29T15:58:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][resolved] component=dispatcher source="dispatch.go:184"
Aug 29 15:59:40 prometheus alertmanager[13966]: time="2017-08-29T15:59:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][resolved] component=dispatcher source="dispatch.go:184"
Aug 29 16:00:10 prometheus alertmanager[13966]: time="2017-08-29T16:00:10Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:00:10 prometheus alertmanager[13966]: time="2017-08-29T16:00:10Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:00:40 prometheus alertmanager[13966]: time="2017-08-29T16:00:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][resolved] component=dispatcher source="dispatch.go:184"
Aug 29 16:00:40 prometheus alertmanager[13966]: time="2017-08-29T16:00:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:00:40 prometheus alertmanager[13966]: time="2017-08-29T16:00:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:01:40 prometheus alertmanager[13966]: time="2017-08-29T16:01:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][resolved] component=dispatcher source="dispatch.go:184"
Aug 29 16:01:40 prometheus alertmanager[13966]: time="2017-08-29T16:01:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:01:40 prometheus alertmanager[13966]: time="2017-08-29T16:01:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:02:41 prometheus alertmanager[13966]: time="2017-08-29T16:02:41Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][resolved] component=dispatcher source="dispatch.go:184"
Aug 29 16:02:41 prometheus alertmanager[13966]: time="2017-08-29T16:02:41Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:02:41 prometheus alertmanager[13966]: time="2017-08-29T16:02:41Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:03:40 prometheus alertmanager[13966]: time="2017-08-29T16:03:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][resolved] component=dispatcher source="dispatch.go:184"
Aug 29 16:03:40 prometheus alertmanager[13966]: time="2017-08-29T16:03:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:03:40 prometheus alertmanager[13966]: time="2017-08-29T16:03:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:04:40 prometheus alertmanager[13966]: time="2017-08-29T16:04:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][resolved] component=dispatcher source="dispatch.go:184"
Aug 29 16:04:40 prometheus alertmanager[13966]: time="2017-08-29T16:04:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:04:40 prometheus alertmanager[13966]: time="2017-08-29T16:04:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:05:40 prometheus alertmanager[13966]: time="2017-08-29T16:05:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][resolved] component=dispatcher source="dispatch.go:184"
Aug 29 16:05:40 prometheus alertmanager[13966]: time="2017-08-29T16:05:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:05:40 prometheus alertmanager[13966]: time="2017-08-29T16:05:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:06:40 prometheus alertmanager[13966]: time="2017-08-29T16:06:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][resolved] component=dispatcher source="dispatch.go:184"
Aug 29 16:06:40 prometheus alertmanager[13966]: time="2017-08-29T16:06:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:06:40 prometheus alertmanager[13966]: time="2017-08-29T16:06:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:07:40 prometheus alertmanager[13966]: time="2017-08-29T16:07:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][resolved] component=dispatcher source="dispatch.go:184"
Aug 29 16:07:40 prometheus alertmanager[13966]: time="2017-08-29T16:07:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:07:40 prometheus alertmanager[13966]: time="2017-08-29T16:07:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:08:40 prometheus alertmanager[13966]: time="2017-08-29T16:08:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][resolved] component=dispatcher source="dispatch.go:184"
Aug 29 16:08:40 prometheus alertmanager[13966]: time="2017-08-29T16:08:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:08:40 prometheus alertmanager[13966]: time="2017-08-29T16:08:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:09:42 prometheus alertmanager[13966]: time="2017-08-29T16:09:42Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][resolved] component=dispatcher source="dispatch.go:184"
Aug 29 16:09:42 prometheus alertmanager[13966]: time="2017-08-29T16:09:42Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:09:42 prometheus alertmanager[13966]: time="2017-08-29T16:09:42Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:10:40 prometheus alertmanager[13966]: time="2017-08-29T16:10:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][resolved] component=dispatcher source="dispatch.go:184"
Aug 29 16:10:40 prometheus alertmanager[13966]: time="2017-08-29T16:10:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:10:40 prometheus alertmanager[13966]: time="2017-08-29T16:10:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:11:40 prometheus alertmanager[13966]: time="2017-08-29T16:11:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][resolved] component=dispatcher source="dispatch.go:184"
Aug 29 16:11:40 prometheus alertmanager[13966]: time="2017-08-29T16:11:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:11:40 prometheus alertmanager[13966]: time="2017-08-29T16:11:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:12:40 prometheus alertmanager[13966]: time="2017-08-29T16:12:40Z" level=debug msg="Received alert" alert=LOCAL_TCP[5afcd55][resolved] component=dispatcher source="dispatch.go:184"
Aug 29 16:12:40 prometheus alertmanager[13966]: time="2017-08-29T16:12:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"
Aug 29 16:12:40 prometheus alertmanager[13966]: time="2017-08-29T16:12:40Z" level=debug msg="flushing [LOCAL_TCP[5afcd55][resolved]]" aggrGroup={}/{severity=~"^(?:critical|sms)$"}:{alertname="LOCAL_TCP"} source="dispatch.go:426"

In a screenshot from slack, time +3 hours in comparison with alertmanager debug log time.
Here is the body of requests from prometheus to alertmanager:

[{"labels":{"alertname":"LOCAL_TCP","instance":"localhost:9116","job":"local_tcp_check","nodetype":"test","severity":"critical"},"annotations":{"description":" pls ignore, duplicate messsages test localhost:9116  ","summary":" pls ignore, duplicate messsages test localhost:9116   "},"startsAt":"2017-08-29T15:44:37.866Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://alertmanager.uri/graph?g0.expr=probe_success%7Bjob%3D~%22local_tcp_check%22%7D+%21%3D+1\u0026g0.tab=0"}]
[{"labels":{"alertname":"LOCAL_TCP","instance":"localhost:9116","job":"local_tcp_check","nodetype":"test","severity":"critical"},"annotations":{"description":" pls ignore, duplicate messsages test localhost:9116  ","summary":" pls ignore, duplicate messsages test localhost:9116   "},"startsAt":"2017-08-29T15:44:37.866Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://alertmanager.uri/graph?g0.expr=probe_success%7Bjob%3D~%22local_tcp_check%22%7D+%21%3D+1\u0026g0.tab=0"}]
...

Then was an alert with "endsAt" datetime:

[{"labels":{"alertname":"LOCAL_TCP","instance":"localhost:9116","job":"local_tcp_check","nodetype":"test","severity":"critical"},"annotations":{"description":" pls ignore, duplicate messsages test localhost:9116  ","summary":" pls ignore, duplicate messsages test localhost:9116   "},"startsAt":"2017-08-29T15:44:37.866Z","endsAt":"2017-08-29T15:57:39.198Z","generatorURL":"http://alertmanager.uri/graph?g0.expr=probe_success%7Bjob%3D~%22local_tcp_check%22%7D+%21%3D+1\u0026g0.tab=0"}]
[{"labels":{"alertname":"LOCAL_TCP","instance":"localhost:9116","job":"local_tcp_check","nodetype":"test","severity":"critical"},"annotations":{"description":" pls ignore, duplicate messsages test localhost:9116  ","summary":" pls ignore, duplicate messsages test localhost:9116   "},"startsAt":"2017-08-29T15:44:37.866Z","endsAt":"2017-08-29T15:57:39.198Z","generatorURL":"http://alertmanager.uri/graph?g0.expr=probe_success%7Bjob%3D~%22local_tcp_check%22%7D+%21%3D+1\u0026g0.tab=0"}]
[{"labels":{"alertname":"LOCAL_TCP","instance":"localhost:9116","job":"local_tcp_check","nodetype":"test","severity":"critical"},"annotations":{"description":" pls ignore, duplicate messsages test localhost:9116  ","summary":" pls ignore, duplicate messsages test localhost:9116   "},"startsAt":"2017-08-29T15:44:37.866Z","endsAt":"2017-08-29T15:57:39.198Z","generatorURL":"http://alertmanager.uri/graph?g0.expr=probe_success%7Bjob%3D~%22local_tcp_check%22%7D+%21%3D+1\u0026g0.tab=0"}]
[{"labels":{"alertname":"LOCAL_TCP","instance":"localhost:9116","job":"local_tcp_check","nodetype":"test","severity":"critical"},"annotations":{"description":" pls ignore, duplicate messsages test localhost:9116  ","summary":" pls ignore, duplicate messsages test localhost:9116   "},"startsAt":"2017-08-29T15:44:37.866Z","endsAt":"2017-08-29T15:57:39.198Z","generatorURL":"http://alertmanager.uri/graph?g0.expr=probe_success%7Bjob%3D~%22local_tcp_check%22%7D+%21%3D+1\u0026g0.tab=0"}]
[{"labels":{"alertname":"LOCAL_TCP","instance":"localhost:9116","job":"local_tcp_check","nodetype":"test","severity":"critical"},"annotations":{"description":" pls ignore, duplicate messsages test localhost:9116  ","summary":" pls ignore, duplicate messsages test localhost:9116   "},"startsAt":"2017-08-29T15:44:37.866Z","endsAt":"2017-08-29T15:57:39.198Z","generatorURL":"http://alertmanager.uri/graph?g0.expr=probe_success%7Bjob%3D~%22local_tcp_check%22%7D+%21%3D+1\u0026g0.tab=0"}]
...

Versions of software:

[root@prometheus ~]# prometheus -version
prometheus, version 1.6.1 (branch: master, revision: 4666df502c0e239ed4aa1d80abbbfb54f61b23c3)
  build user:       root@ca1b452514b7
  build date:       20170419-13:26:24
  go version:       go1.8.1
[root@prometheus ~]# alertmanager -version
alertmanager, version 0.8.0 (branch: HEAD, revision: 74e7e48d24bddd2e2a80c7840af9b2de271cc74c)
  build user:       root@439065dc2905
  build date:       20170720-14:14:06
  go version:       go1.8.3

If additional information is necessary, I will provide it!
Thank you!

componennotify kinbug

Most helpful comment

Resolved notifications should not be repeated. I don't think this is related to #1005, the result looks a bit similar, but in #1005 the multiple Alertmanager instances send the same notification at the same time, here it's more about repeating resolved notifications incorrectly.

All 11 comments

The same problem is reproduced on the new version of prometheus 2.0.0-beta.4.
The problem is bound to the number of receivers for identical serverity.
For example if i have two receivers for severity "critical":

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'smtp.server.com:25'
  smtp_from: '[email protected]'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  resolve_timeout: 1h
# The directory from which notification templates are read.

# The root route on which each incoming alert enters.
route:
 group_by: ['alertname']
 receiver: slack_general
 routes:
  - match_re:
      severity: critical|sms
    receiver: slack_general
receivers:
- name: slack_general
  slack_configs:
  - api_url: https://hooks.slack.com/services/XXX/XXX/XXX
    send_resolved: true
    title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] '
    text:  '{{ range $val := .Alerts }}
             Alert: <http://server:9095/#/alerts|{{ index $val.Annotations "description" }}>
             {{ end}}'
  webhook_configs:
  - send_resolved: True
    url: http://127.0.0.1:9088/alert/MY_NUMBER

templates:
- '/etc/prometheus/alertmanager/templates/default.tmpl'

As a result we receive two "resolved" messages since in group two receivers and both have flag:

  • send_resolved: True

Additional information: even if to remove send_resolved at one of receivers, the "resolved" message will come twice.

I am seeing the same problem.
Prometheus version 1.7.1 (also tried 1.7.0)
Alertmanager version 0.8.0

Both running in the official docker containers.

Example of the alerts:

14:04 | | [FIRING:1] ContainerStack rancher-api (healthy exporter-metrics critical)
14:05 | | [RESOLVED] ContainerStack rancher-api (healthy exporter-metrics critical)
14:11 | | [RESOLVED] ContainerStack rancher-api (healthy exporter-metrics critical)
14:16 | | [RESOLVED] ContainerStack rancher-api (healthy exporter-metrics critical)

The Debug logs of alertmanager show these two lines repeatedly even after the first RESOLVED message is sent until the last RESOLVED message is sent then they stop:

DEBU[7707] Received alert alert=ContainerStack[0eb0ba1][resolved] component=dispatcher source="dispatch.go:184"
DEBU[7707] flushing [ContainerStack[0eb0ba1][resolved]] aggrGroup={}:{alertname="ContainerStack", job="rancher-api"} source="dispatch.go:426"

The problem gets worse if you have critical/warning alerts for a system, when the critical fires, the warning is inhibited (as expected with the inhibit rules), but once the RESOLVED message comes in for Critical, you then get multiple RESOLVED messages for both the critical and warning alerts because the inhibit rules no longer apply once the critical is "resolved."

I can post the configs if requested but it looks like @artemlive covered all that pretty well. I do only have a single receiver (slack_general), not two, and I removed all routes but the root route, so I don't think it's related to the receivers or routes. It also isn't always "3" for me, I've gotten 4 or more from alerts, but these usually happen when more than one alert was sent for the same alertname (either because repeat_interval passed, or multiple targets alerted).

Hello, @fatalsaint !
Thank you for your reply.

so I don't think it's related to the receivers or routes

I tried to change the number of recipients and the number of "resolved" messages was equal to the number of recipients in my case.

I suppose we might have different problems with the same symptom as I only had the one receiver but I'd imagine whatever the underlying cause is works for both.

In any case, after commenting here and realizing I wasn't the only one with the problem I accepted the noise and setup Alertmanager how we wanted in production. That is, I raised the root route repeat_interval to 24h and made sub routes for severity (like critical) as so:

route:
  receiver: 'general'
  group_by: ['alertname','job','env']
  group_wait: 30s
  group_interval: 1s
  repeat_interval: 24h
# Sub routes
  routes:
  - match:
      severity: critical
    repeat_interval: 1h

Once a day for warning and below was good enough for us, as we have escalation procedures built into Ops Genie. Since I have made that change I've been setting up prometheus with all the targets using EC2 scraping and the like causing a lot of alerts in the process and have not seen this problem resurface.

I had the default repeat_interval set to 2h and 5h before and still had duplicate resolved messages, so I don't know what it needs to be set to but 24h seems to have fixed my problem.

I have a sneaky suspicion that the resolved alerts always follow the default route, they don't match sub routes, but haven't done deep enough testing to know for sure.

What I noticed, which is the same in your screenshot, that the RESOLVED alerts were coming at the same internal as the repeat_interval (~5 minutes in our case). It's worth trying I think, even if you don't want 24h interval, if it works, you can make sub routes to match all your alerts and reduce it as a workaround.

I'm now able to simulate this consistently.

It will happen because you configured the repeat_interval to a very small time, and Prometheus will call alertmanager API a few times for the same alert, does not matter the state (active or resolved). To be very honest I'm not sure this is considered a bug, because as the configuration state:

How long to wait before sending a notification again if it has already 
been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> ]

The repeat_interval is considered for notifications that have already been sent successfully.

So, alertmanager expects that you want to resend the notification (active or resolved) because you did configure the repease_interval for a 5m window.

What's your view on this @stuartnelson3 ?

What would be the use case to sending multiple RESOLVED notifications? I understand what you're saying but it seems odd. Once an alert is RESOLVED why does Prometheus call alertmanager multiple times (is this in case alertmanager is down for the first one?), and why would alertmanager want to repeat RESOLVED notifications?

I think in the majority of cases people would want to be able to repeat open alerts however fast they want, but only want one notice when it is fixed, right?

I did have repeat_interval set to a couple hours in the past and still managed to get a duplicate RESOLVED message so how long does Prometheus send the same resolved "alert" to alertmanager?

In any case, thanks for looking into it and finding the explanation. It makes sense, it just seems odd.

Resolved notifications should not be repeated. I don't think this is related to #1005, the result looks a bit similar, but in #1005 the multiple Alertmanager instances send the same notification at the same time, here it's more about repeating resolved notifications incorrectly.

I have the same problem, after the alert resolved, prometheus keeps sending resolved alerts to alertmanager, it should send only one resolved alert.

I use blackbox exporter to monitor two simple http servers, and a custom webhook receiver.

related information:

$ prometheus --version
prometheus, version 2.0.0 (branch: non-git, revision: non-git)
  build user:       [email protected]
  build date:       20171111-16:10:49
  go version:       go1.9.2
$ alertmanager --version
alertmanager, version 0.11.0 (branch: master, revision: fe9159e6a357b4d594411ca1c189b0e4d57b1d4a)
  build user:       cj@localhost
  build date:       20171205-02:06:03
  go version:       go1.9

prometheus.yml

global:
  scrape_interval:     5s
  evaluation_interval: 5s

rule_files:
  - rule.yml

scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]  # Look for a HTTP 200 response.
    static_configs:
      - targets:
          - http://localhost:8084
          - http://localhost:8085
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # Blackbox exporter.

alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
            - localhost:9093

blackbox.yml

modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      preferred_ip_protocol: "ip4" # defaults to "ip6"
      no_follow_redirects: false

alertmanager.yml, I make the interval very short to test.

global:
  resolve_timeout: 1m

route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 10s
  repeat_interval: 10s
  receiver: webhook

receivers:
- name: webhook
  webhook_configs:
    - send_resolved: true
      url: http://localhost:7006

alertmanager.log

level=info ts=2017-12-05T02:52:01.263601288Z caller=main.go:155 msg="Starting Alertmanager" version="(version=0.11.0, branch=master, revision=fe9159e6a357b4d594411ca1c189b0e4d57b1d4a)"
level=info ts=2017-12-05T02:52:01.26366316Z caller=main.go:156 build_context="(go=go1.9, user=cj@localhost, date=20171205-02:06:03)"
level=info ts=2017-12-05T02:52:01.264721461Z caller=main.go:293 msg="Loading configuration file" file=alertmanager.yml
level=info ts=2017-12-05T02:52:01.269253845Z caller=main.go:368 msg=Listening address=:9093
level=debug ts=2017-12-05T02:52:26.515791137Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=down[163508d][active]
level=debug ts=2017-12-05T02:52:26.516096716Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}:{alertname=\"down\"}" msg=Flushing alerts=[down[163508d][active]]
level=debug ts=2017-12-05T02:52:31.512160704Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=down[163508d][active]
level=debug ts=2017-12-05T02:52:36.510435848Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=down[163508d][resolved]
level=debug ts=2017-12-05T02:52:36.516203567Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}:{alertname=\"down\"}" msg=Flushing alerts=[down[163508d][resolved]]
level=debug ts=2017-12-05T02:52:41.510156141Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=down[163508d][resolved]
level=debug ts=2017-12-05T02:52:46.510621143Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=down[163508d][resolved]
level=debug ts=2017-12-05T02:52:46.515990384Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}:{alertname=\"down\"}" msg=Flushing alerts=[down[163508d][resolved]]
level=debug ts=2017-12-05T02:52:51.510730726Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=down[163508d][resolved]
level=debug ts=2017-12-05T02:52:56.509675541Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=down[163508d][resolved]
level=debug ts=2017-12-05T02:52:56.516753204Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}:{alertname=\"down\"}" msg=Flushing alerts=[down[163508d][resolved]]
level=debug ts=2017-12-05T02:53:01.509621672Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=down[163508d][resolved]
level=debug ts=2017-12-05T02:53:01.509836539Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}:{alertname=\"down\"}" msg=Flushing alerts=[down[163508d][resolved]]
level=debug ts=2017-12-05T02:53:06.510270007Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=down[163508d][resolved]
level=debug ts=2017-12-05T02:53:11.509119999Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=down[163508d][resolved]
level=debug ts=2017-12-05T02:53:11.510115262Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}:{alertname=\"down\"}" msg=Flushing alerts=[down[163508d][resolved]]
level=debug ts=2017-12-05T02:53:16.509571003Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=down[163508d][resolved]
level=debug ts=2017-12-05T02:53:21.509019815Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=down[163508d][resolved]
level=debug ts=2017-12-05T02:53:21.510016178Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}:{alertname=\"down\"}" msg=Flushing alerts=[down[163508d][resolved]]
level=debug ts=2017-12-05T02:53:26.509047408Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=down[163508d][resolved]
level=debug ts=2017-12-05T02:53:31.509263273Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=down[163508d][resolved]
level=debug ts=2017-12-05T02:53:31.510794233Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}:{alertname=\"down\"}" msg=Flushing alerts=[down[163508d][resolved]]
level=debug ts=2017-12-05T02:53:36.509322253Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=down[163508d][resolved]

I increased the repeat_interval to 15 minutes and the problem is no longer reproduced.

@stuartnelson3 as for https://github.com/prometheus/alertmanager/issues/1125#issuecomment-396526604, I think that this issue can be closed.

Was this page helpful?
0 / 5 - 0 ratings