Alertmanager: Resolved alert sent without sending corresponding active alert

Created on 21 Jun 2017 · 19Comments · Source: prometheus/alertmanager

I have alertmanager sending notifications to Slack. Today it sent a "resolved" notification to Slack without ever sending an "active" notification. This is contrary to my expectations - I expect "resolved" to only be sent if an "active" is sent first.

Here is the log:

time="2017-06-21T09:29:15Z" level=info msg="Starting alertmanager (version=0.7.1, branch=master, revision=ab4138299b94c78dc554ea96e2ab28d04b048059)" source="main.go:109"
time="2017-06-21T09:29:16Z" level=info msg="Build context (go=go1.8.3, user=root@97e9539a4c3f, date=20170609-15:31:09)" source="main.go:110"
time="2017-06-21T09:29:16Z" level=info msg="Loading configuration file" file="/etc/alertmanager/config.yml" source="main.go:234"
time="2017-06-21T09:29:16Z" level=info msg="Listening on :9093" source="main.go:308"
time="2017-06-21T09:34:37Z" level=debug msg="Received alert" alert=web_endpoint_is_down[5fe91a8][active] component=dispatcher source="dispatch.go:183"
time="2017-06-21T09:35:37Z" level=debug msg="Received alert" alert=web_endpoint_is_down[5fe91a8][resolved] component=dispatcher source="dispatch.go:183"
time="2017-06-21T09:35:37Z" level=debug msg="flushing [web_endpoint_is_down[5fe91a8][resolved]]" aggrGroup={}:{alertname="web_endpoint_is_down"} source="dispatch.go:425"
time="2017-06-21T09:36:37Z" level=debug msg="Received alert" alert=web_endpoint_is_down[5fe91a8][resolved] component=dispatcher source="dispatch.go:183"
time="2017-06-21T09:36:37Z" level=debug msg="flushing [web_endpoint_is_down[5fe91a8][resolved]]" aggrGroup={}:{alertname="web_endpoint_is_down"} source="dispatch.go:425"

Configuration:

route:
  group_by: ['alertname', 'service']
  group_wait: 1m
  group_interval: 15m
  repeat_interval: 6h
  receiver: 'axinom-slack'

receivers:
- name: 'axinom-slack'
  slack_configs:
  - api_url: "xxx"
    channel: "#service-status"
    send_resolved: true
    text: "Active alerts: {{ range .Alerts }}{{if eq .Status \"firing\" }}{{ .Labels.job }} {{ end }}{{ end }}\n{{ range .CommonAnnotations.SortedPairs }}{{ .Name }} = {{ .Value }}\n{{ end }}"

Prometheus main configuration:

global:
  scrape_interval: 1m
  scrape_timeout: 30s

scrape_configs:
  - job_name: 'xxx'
    metrics_path: /probe
    params:
      module: ['basic_http_get']
      target: ['xxx']
    static_configs:
      - targets: ['yyy:9115']

Prometheus alerts configuration:

ALERT web_endpoint_is_down
    IF probe_success < 1.0
    FOR 3m

componennotify kinbug prioritPmaybe

Source

sandersaares

👍2

All 19 comments

What version of AM are you using?

fabxc on 22 Jun 2017

version=0.7.1, branch=master, revision=ab4138299b94c78dc554ea96e2ab28d04b048059

sandersaares on 22 Jun 2017

We also get this, with the following alert:

ALERT HighCPUUsage
  IF (100 * (1 - avg by(instance)(irate(node_cpu{mode='idle'}[30s])))) > 90
  LABELS {
    severity = "Notice",
  }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} CPU usage is dangerously high",
    description = "This device's cpu usage has exceeded the threshold with a value of {{ $value }}."
  }

For this one we get "RESOLVED" notifications once in a while, even if we didn't get the initial ones.

We're also on 0.7.1, running in Kubernetes 1.5.7, Prometheus 1.7.1.

DonMartin76 on 7 Jul 2017

We're seeing this too on v0.7.0.

domleb on 7 Sep 2017

Also happens in our setup with v0.9.1

agaoglu on 25 Oct 2017

Question for the group: does this only happen with a particular alert? So far I'm only seeing it with an alert such as:

ALERT StatusCode500IncreaseProd
  IF sum(increase(http_request_duration_seconds_count{env="prod", status_code="500"}[2m])) by (instance, app, dc, env, stack) / sum(increase(http_request_duration_seconds_count{env="prod"}[2m])) by (instance, app, dc, env, stack) > 0.02
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "500 status codes detected",
    description = "Increase in 500s (> 2% total traffic) observed over last 2m",
    runbook = "StatusCode500Increase"
  }

Other alerts seem to be fine, so I'm wondering if there is a correlation in the actual alert here.

Running 0.9.1 with mesh-mode, receiving alerts from multiple prometheus instances.

Edit: Nevermind, seeing this on a few different alerts.

alkalinecoffee on 1 Nov 2017

I'm also having this issue: My alert config file is:

global:
resolve_timeout: 5m
route:
repeat_interval: 1m
group_interval: 30s
group_wait: 0s
group_by: [ instance ]
receiver: 'slack'
receivers:
- name: 'slack'
slack_configs:
- send_resolved: true
username: 'Alert'
icon_emoji: 'fire'
channel: 'masaje-criollo'
api_url: 'https://hooks.slack.com/services/...'

And my alert.rules:
ALERT Microservices
IF probe_success{job="microservices"} == 0
LABELS { severity = "critical" }
ANNOTATIONS {
summary = "{{ $labels.instance }} is down",
description = "{{ $labels.instance }} of job {{ $labels.job }} is under high load."
}

cepefernando on 14 Nov 2017

Seeing the same on the latest 0.10.0.

Can we get some input from the alertmanager folks on what we can do to help push this along? It seems that most of our alerts now are being sent as resolved without the accompanying firing alert.

log.level=debug doesn't appear to output anything helpful here.

alkalinecoffee on 17 Nov 2017

@alkalinecoffee Do you have a snippet of the log when this problem happened?

josedonizetti on 17 Nov 2017

@josedonizetti

level=info ts=2017-11-20T21:01:11.802785124Z caller=main.go:155 msg="Starting Alertmanager" version="(version=0.11.0, branch=HEAD, revision=30dd0426c08b6479d9a26259ea5efd63bc1ee273)"
level=info ts=2017-11-20T21:01:11.802858747Z caller=main.go:156 build_context="(go=go1.9.2, user=root@3e103e3fc918, date=20171116-17:45:26)"
level=info ts=2017-11-20T21:01:11.803691405Z caller=main.go:293 msg="Loading configuration file" file=simple.yml
level=info ts=2017-11-20T21:01:11.808676357Z caller=main.go:368 msg=Listening address=:9093
level=debug ts=2017-11-20T21:01:27.807339191Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=TestAlert[a3e06bb][active]
level=debug ts=2017-11-20T21:01:37.807322937Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=TestAlert[a3e06bb][resolved]
level=debug ts=2017-11-20T21:01:37.807447281Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}:{alertname=\"TestAlert\"}" msg=Flushing alerts=[TestAlert[a3e06bb][active]]
level=debug ts=2017-11-20T21:01:37.807689108Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}:{alertname=\"TestAlert\"}" msg=Flushing alerts=[TestAlert[a3e06bb][resolved]]

I do think I found a pattern here.

I think what's happening is the alert fires, and before the alert is actually sent out, the resolved message comes in very soon after (under a second, almost simultaenously), and that's what is sent out instead. This reflects what's seen in the original post here.

So for example, I have an alert for high memory usage over a 3m period. I have an application that sees memory spikes for about 3 minutes at a time. My alertmanager config also has group_wait: 2m. So by the time the alert threshold (3m) + group_wait (2m) periods have expired, the alert is already resolved before the alert firing message is actually sent out. I hope this makes sense.

It seems to me that if the alert fires and is resolved between 3rd and 5th minute, the firing alert is never sent out (which makes sense), but the resolved alert may be sent out (which doesn't make sense).

I'm able to pretty reliably replicate the issue and will work on a setup for others to easily replicate for further debugging.

alkalinecoffee on 17 Nov 2017

I confirm that my experiences as I remember them seem to fit within the parameters described in the comment above.

sandersaares on 17 Nov 2017

@alkalinecoffee Any progress or results on this? I'm also getting duplicate resolved alerts with no active firing ones. I'm on Alertmanager v0.12.0. Here's my config:

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: slack

nsaud01 on 11 Jan 2018

I haven't been able to duplicate it reliably locally, and I haven't looked at it since before the holidays.

From what I can tell, the firing and resolved alerts are sent immediately after the other, and alertmanager appears to only honor the resolved one.

We've since adjusted our alerts with longer time period thresholds, so we just don't see this issue much anymore. It'll be more prevalent if your alerts have low thresholds and fire often.

The important thing is that alerts don't appear to be outright dropped--they are fired and then resolved just as fast.

alkalinecoffee on 11 Jan 2018

I can explain why this happens sometimes.

Assuming this AlertManager configuration with a single route:

[...]
route:
  group_by: [job]
  group_wait: 10s
  group_interval: 10s
  receiver: 'web.hook'
[...]

Here is the workflow:

Prometheus fires an alert with labels {alertname="HighCPU",instance="foo",job="node"} (_alertA_).
After group_wait, AlertManager notifies the receiver and adds an entry to the notification log (nflog). The nflog entry stores the label sets as hash values for the firing/resolved alerts, the hashes are computed at the Dedup stage (here).
After a while, Prometheus resolves the alert.
Wait a bit, AlertManager notifies the receiver and updates the corresponding entry in the nflog.
Then Prometheus fires the same alert but with for another instance {alertname="HighCPU",instance="bar",job="node"} (_alertB_).
Before group_wait has elapsed, Prometheus resolves the alert.
When AlertManager checks whether a notification should be sent, it compares the current firing/resolved alerts with the hashes from the nflog entry. Since _alertB_ hasn't been notified yet, the nflog entry still references the label set of _alertA_ thus AlertManager notifies the receiver.

Now if the configuration is changed to include the instance label:

route:
  group_by: [job, instance]
  group_wait: 10s
  group_interval: 10s
  receiver: 'web.hook'

The duplication doesn't happen since _alertA_ and _alertB_ don't fall in the same aggregation group.

simonpasquier on 12 Jan 2018

@simonpasquier That's interesting. I'll change my group_by value to include instance and see if that fixes the issue. Thanks!

nsaud01 on 12 Jan 2018

I understand it's a workaround, but be aware that adding instance to group_by would cause multiple alerts to be fired (one for each instance) when something goes wrong.

alkalinecoffee on 12 Jan 2018

😕1 👎1

Well that's no good. But it also makes sense the more I think about it. @simonpasquier Is that your experience with that workaround?

nsaud01 on 12 Jan 2018

@nsaud01 @alkalinecoffee yes, if you want to group CPU alerts for all instances in a single notification then you can't use the workaround...

simonpasquier on 12 Jan 2018

I'll try to send a PR this week.

simonpasquier on 15 Jan 2018

Was this page helpful?

0 / 5 - 0 ratings