I have alertmanager sending notifications to Slack. Today it sent a "resolved" notification to Slack without ever sending an "active" notification. This is contrary to my expectations - I expect "resolved" to only be sent if an "active" is sent first.
Here is the log:
time="2017-06-21T09:29:15Z" level=info msg="Starting alertmanager (version=0.7.1, branch=master, revision=ab4138299b94c78dc554ea96e2ab28d04b048059)" source="main.go:109"
time="2017-06-21T09:29:16Z" level=info msg="Build context (go=go1.8.3, user=root@97e9539a4c3f, date=20170609-15:31:09)" source="main.go:110"
time="2017-06-21T09:29:16Z" level=info msg="Loading configuration file" file="/etc/alertmanager/config.yml" source="main.go:234"
time="2017-06-21T09:29:16Z" level=info msg="Listening on :9093" source="main.go:308"
time="2017-06-21T09:34:37Z" level=debug msg="Received alert" alert=web_endpoint_is_down[5fe91a8][active] component=dispatcher source="dispatch.go:183"
time="2017-06-21T09:35:37Z" level=debug msg="Received alert" alert=web_endpoint_is_down[5fe91a8][resolved] component=dispatcher source="dispatch.go:183"
time="2017-06-21T09:35:37Z" level=debug msg="flushing [web_endpoint_is_down[5fe91a8][resolved]]" aggrGroup={}:{alertname="web_endpoint_is_down"} source="dispatch.go:425"
time="2017-06-21T09:36:37Z" level=debug msg="Received alert" alert=web_endpoint_is_down[5fe91a8][resolved] component=dispatcher source="dispatch.go:183"
time="2017-06-21T09:36:37Z" level=debug msg="flushing [web_endpoint_is_down[5fe91a8][resolved]]" aggrGroup={}:{alertname="web_endpoint_is_down"} source="dispatch.go:425"
Configuration:
route:
group_by: ['alertname', 'service']
group_wait: 1m
group_interval: 15m
repeat_interval: 6h
receiver: 'axinom-slack'
receivers:
- name: 'axinom-slack'
slack_configs:
- api_url: "xxx"
channel: "#service-status"
send_resolved: true
text: "Active alerts: {{ range .Alerts }}{{if eq .Status \"firing\" }}{{ .Labels.job }} {{ end }}{{ end }}\n{{ range .CommonAnnotations.SortedPairs }}{{ .Name }} = {{ .Value }}\n{{ end }}"
Prometheus main configuration:
global:
scrape_interval: 1m
scrape_timeout: 30s
scrape_configs:
- job_name: 'xxx'
metrics_path: /probe
params:
module: ['basic_http_get']
target: ['xxx']
static_configs:
- targets: ['yyy:9115']
Prometheus alerts configuration:
ALERT web_endpoint_is_down
IF probe_success < 1.0
FOR 3m
What version of AM are you using?
version=0.7.1, branch=master, revision=ab4138299b94c78dc554ea96e2ab28d04b048059
We also get this, with the following alert:
ALERT HighCPUUsage
IF (100 * (1 - avg by(instance)(irate(node_cpu{mode='idle'}[30s])))) > 90
LABELS {
severity = "Notice",
}
ANNOTATIONS {
summary = "Instance {{ $labels.instance }} CPU usage is dangerously high",
description = "This device's cpu usage has exceeded the threshold with a value of {{ $value }}."
}
For this one we get "RESOLVED" notifications once in a while, even if we didn't get the initial ones.
We're also on 0.7.1, running in Kubernetes 1.5.7, Prometheus 1.7.1.
We're seeing this too on v0.7.0.
Also happens in our setup with v0.9.1
Question for the group: does this only happen with a particular alert? So far I'm only seeing it with an alert such as:
ALERT StatusCode500IncreaseProd
IF sum(increase(http_request_duration_seconds_count{env="prod", status_code="500"}[2m])) by (instance, app, dc, env, stack) / sum(increase(http_request_duration_seconds_count{env="prod"}[2m])) by (instance, app, dc, env, stack) > 0.02
LABELS { severity = "critical" }
ANNOTATIONS {
summary = "500 status codes detected",
description = "Increase in 500s (> 2% total traffic) observed over last 2m",
runbook = "StatusCode500Increase"
}
Other alerts seem to be fine, so I'm wondering if there is a correlation in the actual alert here.
Running 0.9.1 with mesh-mode, receiving alerts from multiple prometheus instances.
Edit: Nevermind, seeing this on a few different alerts.
I'm also having this issue: My alert config file is:
global:
resolve_timeout: 5m
route:
repeat_interval: 1m
group_interval: 30s
group_wait: 0s
group_by: [ instance ]
receiver: 'slack'
receivers:
- name: 'slack'
slack_configs:
- send_resolved: true
username: 'Alert'
icon_emoji: 'fire'
channel: 'masaje-criollo'
api_url: 'https://hooks.slack.com/services/...'
And my alert.rules:
ALERT Microservices
IF probe_success{job="microservices"} == 0
LABELS { severity = "critical" }
ANNOTATIONS {
summary = "{{ $labels.instance }} is down",
description = "{{ $labels.instance }} of job {{ $labels.job }} is under high load."
}
Seeing the same on the latest 0.10.0.
Can we get some input from the alertmanager folks on what we can do to help push this along? It seems that most of our alerts now are being sent as resolved without the accompanying firing alert.
log.level=debug doesn't appear to output anything helpful here.
@alkalinecoffee Do you have a snippet of the log when this problem happened?
@josedonizetti
level=info ts=2017-11-20T21:01:11.802785124Z caller=main.go:155 msg="Starting Alertmanager" version="(version=0.11.0, branch=HEAD, revision=30dd0426c08b6479d9a26259ea5efd63bc1ee273)"
level=info ts=2017-11-20T21:01:11.802858747Z caller=main.go:156 build_context="(go=go1.9.2, user=root@3e103e3fc918, date=20171116-17:45:26)"
level=info ts=2017-11-20T21:01:11.803691405Z caller=main.go:293 msg="Loading configuration file" file=simple.yml
level=info ts=2017-11-20T21:01:11.808676357Z caller=main.go:368 msg=Listening address=:9093
level=debug ts=2017-11-20T21:01:27.807339191Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=TestAlert[a3e06bb][active]
level=debug ts=2017-11-20T21:01:37.807322937Z caller=dispatch.go:188 component=dispatcher msg="Received alert" alert=TestAlert[a3e06bb][resolved]
level=debug ts=2017-11-20T21:01:37.807447281Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}:{alertname=\"TestAlert\"}" msg=Flushing alerts=[TestAlert[a3e06bb][active]]
level=debug ts=2017-11-20T21:01:37.807689108Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}:{alertname=\"TestAlert\"}" msg=Flushing alerts=[TestAlert[a3e06bb][resolved]]
I do think I found a pattern here.
I think what's happening is the alert fires, and before the alert is actually sent out, the resolved message comes in very soon after (under a second, almost simultaenously), and that's what is sent out instead. This reflects what's seen in the original post here.
So for example, I have an alert for high memory usage over a 3m period. I have an application that sees memory spikes for about 3 minutes at a time. My alertmanager config also has group_wait: 2m. So by the time the alert threshold (3m) + group_wait (2m) periods have expired, the alert is already resolved before the alert firing message is actually sent out. I hope this makes sense.
It seems to me that if the alert fires and is resolved between 3rd and 5th minute, the firing alert is never sent out (which makes sense), but the resolved alert may be sent out (which doesn't make sense).
I'm able to pretty reliably replicate the issue and will work on a setup for others to easily replicate for further debugging.
I confirm that my experiences as I remember them seem to fit within the parameters described in the comment above.
@alkalinecoffee Any progress or results on this? I'm also getting duplicate resolved alerts with no active firing ones. I'm on Alertmanager v0.12.0. Here's my config:
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 3h
receiver: slack
I haven't been able to duplicate it reliably locally, and I haven't looked at it since before the holidays.
From what I can tell, the firing and resolved alerts are sent immediately after the other, and alertmanager appears to only honor the resolved one.
We've since adjusted our alerts with longer time period thresholds, so we just don't see this issue much anymore. It'll be more prevalent if your alerts have low thresholds and fire often.
The important thing is that alerts don't appear to be outright dropped--they are fired and then resolved just as fast.
I can explain why this happens sometimes.
Assuming this AlertManager configuration with a single route:
[...]
route:
group_by: [job]
group_wait: 10s
group_interval: 10s
receiver: 'web.hook'
[...]
Here is the workflow:
{alertname="HighCPU",instance="foo",job="node"} (_alertA_).group_wait, AlertManager notifies the receiver and adds an entry to the notification log (nflog). The nflog entry stores the label sets as hash values for the firing/resolved alerts, the hashes are computed at the Dedup stage (here).{alertname="HighCPU",instance="bar",job="node"} (_alertB_).group_wait has elapsed, Prometheus resolves the alert.Now if the configuration is changed to include the instance label:
route:
group_by: [job, instance]
group_wait: 10s
group_interval: 10s
receiver: 'web.hook'
The duplication doesn't happen since _alertA_ and _alertB_ don't fall in the same aggregation group.
@simonpasquier That's interesting. I'll change my group_by value to include instance and see if that fixes the issue. Thanks!
I understand it's a workaround, but be aware that adding instance to group_by would cause multiple alerts to be fired (one for each instance) when something goes wrong.
Well that's no good. But it also makes sense the more I think about it. @simonpasquier Is that your experience with that workaround?
@nsaud01 @alkalinecoffee yes, if you want to group CPU alerts for all instances in a single notification then you can't use the workaround...
I'll try to send a PR this week.