Flap detection and associated threshold would be a welcome addition to avoid potentially excessive alerting.
Here's the Nagios description of their implementation:
https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/flapping.html
Alertmanager handles deduplication and dispatching of alerts.
What triggered an alert, and when, is determined by external components such as Prometheus.
Frequent state changes of an alert can be prevented by using a fitting alerting expression (thresholds, predict_linear, ...) and the FOR clause.
I did consider this, however this seems to me to be a purely alerting-related concern.
On the recording side, we generally want to record actual state. We could potentially store the same data twice, once for actual state and once with flapping suppressed to feed to Alertmanager, but that's not efficient. And as you point out, other systems may also feed Alertmanager; in that case flap detection would have to be built into each of those systems.
The problem with the threshold + FOR suggestion, for example, shows up with a sequence like: >threshold FOR n -> <threshold momentarily -> >threshold FOR n. As the window rolls, you'll trip through alerting, resolved, then alerting again. predict_linear (which is new to me; we're not on 0.16 in prod yet) can help here, but only for certain types of values.
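To make the sequence above concrete, here is a minimal Python sketch of a threshold plus FOR rule. It is a simulation, not Prometheus code: `simulate_for`, the sample values, and counting FOR in evaluation steps rather than in time are all assumptions made for illustration.

```python
def simulate_for(samples, threshold, for_steps):
    """Return the alert state after each evaluation.

    samples:   metric values, one per evaluation interval (made-up data).
    for_steps: number of intervals the condition must keep holding
               before "pending" becomes "firing" (a stand-in for FOR).
    """
    states = []
    held = 0  # consecutive evaluations with the condition true
    for value in samples:
        if value > threshold:
            held += 1
            states.append("firing" if held > for_steps else "pending")
        else:
            held = 0  # one sample below the threshold resets everything
            states.append("ok")
    return states


# Above threshold, one momentary dip, above threshold again.
samples = [9, 9, 9, 9, 1, 9, 9, 9, 9]
states = simulate_for(samples, threshold=5, for_steps=2)
print(states)
```

The single low sample takes the alert from firing to ok, and it then has to pass through pending again before it fires a second time, which is the alerting, resolved, alerting pattern described above.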
This is an alerting concern, and thus it belongs in Prometheus, as that's where our alerting is done. All the information and logic that you need to determine whether a given set of metrics requires an alert is there.
Unlike Nagios, Prometheus alerts aren't based on checks. They're based on metrics, which allow for much richer alerts that tend to flap a lot less. If you do have something check-like in Prometheus with a 0/1 value, a mix of FOR, changes and mean_over_time will allow you to handle this. For richer alerts, if it's firing too often then you usually need to bump the threshold to improve the signal-to-noise ratio.
Alertmanager only gets involved once an alert is firing, and at that point it's about converting alerts into notifications. The only feature requests that would be appropriate in this area on the AM side would be around Alertmanager's repeat_interval/group_interval semantics not handling such flapping alerts well. As far as I can tell, our current logic is about as good as it can be in that area.
@brian-brazil Is there an example somewhere that shows how you combine FOR, changes and mean_over_time (I assume this is now avg_over_time) to accurately detect flapping?
If you have something like:
vector(time()) % 60 < 30
and a rule that says:
ALERT TimeBasedTestAlert
  IF vector(time()) % 60 < 30
  FOR 2m
I don't think there is a way to say "if this is in pending all the time, then trigger".
E.g. I'm trying to detect a docker service task constantly being killed or dying.
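One possible reading of the "changes" part of the earlier suggestion is to alert on how often a 0/1 series flips, rather than on its current value. The Python sketch below only illustrates that idea; the helper names, window size, and limit are hypothetical, and it makes no claim about how the PromQL rule itself should be written.

```python
def changes(samples):
    """Count how many times the value differs from the previous sample."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def is_flapping(samples, window, max_changes):
    """True if the last `window` samples changed more than `max_changes` times."""
    return changes(samples[-window:]) > max_changes


steady = [1] * 10      # a task that stays up
flappy = [1, 0] * 5    # a task that keeps dying and restarting

print(is_flapping(steady, window=10, max_changes=3))
print(is_flapping(flappy, window=10, max_changes=3))
```

A condition like this stays true for as long as the series keeps toggling, so a FOR clause on top of it would not be reset by each individual flip.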
I also encountered this problem and found a solution here: https://stackoverflow.com/questions/45213745/prometheus-how-to-calculate-proportion-of-single-value-over-time
It would be a good idea to provide a new function to meet this demand.
ISTM that for: does a good job of debouncing alerts when they briefly trigger, but not if a firing alert briefly clears.
Suggestion: why not have alerts go through an additional state before returning to 'ok'?
(ok) -> pending -> firing -> resolving -> (ok)
Then have another keyword which defines how long the alert has to be in "resolving" state before it goes back to "ok", e.g.
expr: ifOperStatus != 1
for: 1m
until: 30s # bikeshed: choose a better name
Alerts would be sent in both "firing" and "resolving" states. In the "resolving" state, if the expr triggers then it goes back to "firing". But if it remains clear for the until: time, then state goes back to "ok".
It seems simple enough to me, and I like the symmetry.
Current:
                      T@for
       T              /alert            F
(ok) -----> (pending) -----> (firing) -----> (ok)
^_/  <----'    ^_/             ^_/
 F     F        T            T/alert
Proposed:
                      T@for
       T              /alert          F/alert            F@until
(ok) -----> (pending) -----> (firing) -----> (resolving) ------> (ok)
^_/  <----'    ^_/             ^_/    <----'     ^_/
 F     F        T            T/alert T/alert   F/alert
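The proposed diagram can be sketched as a small state machine. This is a Python illustration only: `step`, `run`, and measuring `for` and `until` in evaluation steps are assumptions made here, and `until` is the hypothetical keyword from this proposal, not an existing option.

```python
def step(state, held, cond, for_steps, until_steps):
    """Advance the proposed state machine by one evaluation.

    state: "ok", "pending", "firing" or "resolving".
    held:  evaluations already spent waiting in "pending" or "resolving".
    cond:  whether the alert expression is true at this evaluation.
    Returns the new (state, held).
    """
    if state == "ok":
        return ("pending", 0) if cond else ("ok", 0)
    if state == "pending":
        if not cond:
            return ("ok", 0)
        held += 1
        return ("firing", 0) if held >= for_steps else ("pending", held)
    if state == "firing":
        return ("firing", 0) if cond else ("resolving", 0)
    # state == "resolving": a true condition goes straight back to firing
    if cond:
        return ("firing", 0)
    held += 1
    return ("ok", 0) if held >= until_steps else ("resolving", held)


def run(conds, for_steps, until_steps):
    """Feed a list of condition results through the machine."""
    state, held, trace = "ok", 0, []
    for cond in conds:
        state, held = step(state, held, cond, for_steps, until_steps)
        trace.append(state)
    return trace


# The condition holds, clears for one evaluation, holds again, then clears.
conds = [True, True, True, True, False, True, True, False, False, False]
trace = run(conds, for_steps=2, until_steps=2)
print(trace)
```

In this run the one-evaluation dip only moves the alert to resolving and back to firing; it reaches ok only after the condition has stayed clear for the whole `until` period.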
FWIW, this problem is biting me at the moment. I have a "probe_success" metric which can be 1 or 0, and an alerting rule like this:
- name: ReverseDNSRules
  rules:
  - alert: DNSQueryFailing
    expr: max_over_time(probe_success{job="revdns"}[5m]) != 1
    for: 2m
This works well if the service is mostly "success", with occasional "fail". Those get muted well.
However if the service is mostly "fail", with occasional "success", then it's very noisy.
For now I will try changing to avg_over_time < 0.5, which I expect will be better as long as it's not hovering around 50% of checks working.
Adding hysteresis would be better, e.g.
But I can't see an obvious way to do that, and it might require something nasty like creating recording rules with new timeseries.
EDIT: unfortunately the average failure rate turns out to be 50% after all :-(
I have tweaked to match, but this is not ideal due to the lag in detecting a problem:
expr: avg_over_time(probe_success{job="revdns"}[30m]) < 0.8
for: 10m
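To illustrate why the max_over_time form is noisy for a mostly failing probe while an average is steadier, here is a rough Python sketch. The probe data, the five-sample window, and the helper names are made up for the example, and the for: clause is ignored.

```python
def rolling(samples, window):
    """Yield the trailing window of up to `window` samples at each step."""
    for i in range(len(samples)):
        yield samples[max(0, i - window + 1): i + 1]


def flips(conditions):
    """Count how often the alert condition changes between evaluations."""
    return sum(1 for a, b in zip(conditions, conditions[1:]) if a != b)


# Mostly failing probe (0) with an occasional success (1).
probe = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

# Analogue of max_over_time(...) != 1: true only if no success is in the window.
max_rule = [max(w) != 1 for w in rolling(probe, 5)]
# Analogue of avg_over_time(...) < 0.5: true if under half the window succeeded.
avg_rule = [sum(w) / len(w) < 0.5 for w in rolling(probe, 5)]

print(flips(max_rule), flips(avg_rule))  # prints: 3 0
```

Each isolated success clears the max-based condition until it leaves the window, whereas the average-based condition stays true throughout this sample.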
It occurs to me there's another way this could be handled: use one expr when the alert is not firing, and a different one when it is. e.g.
expr: avg_over_time(probe_success{job="revdns"}[10m]) < 0.7
expr_firing: avg_over_time(probe_success{job="revdns"}[10m]) < 0.9
That is: the alert fires when the success average over 10 minutes falls below 70%, and stops firing when the average goes over 90%. [^1]
Having a resolving state analogous to pending is orthogonal to this.
I suggest the alert not be triggered until both expr_start and expr_continue are true: this makes things work better if the expressions are not exactly aligned, e.g. expr_continue averages over a different time range.
So another way of thinking about this is that expr_continue is like today's expr, and expr_start is an additional condition which must be true to go from Pending to Firing.
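The two-expression idea amounts to hysteresis, which can be sketched in a few lines of Python. This is an illustration under assumptions: the function name, the sample averages, and the omission of the pending state are all choices made here, with the 0.7 and 0.9 thresholds taken from the example above.

```python
def hysteresis(averages, start_below=0.7, continue_below=0.9):
    """Return whether the alert is active after each evaluation.

    start_below:    threshold for the condition that starts the alert.
    continue_below: threshold for the condition that keeps it active.
    """
    firing = False
    out = []
    for avg in averages:
        if firing:
            # Once active, only the looser "continue" condition matters.
            firing = avg < continue_below
        else:
            # To start, both conditions must be true.
            firing = avg < start_below and avg < continue_below
        out.append(firing)
    return out


averages = [0.95, 0.8, 0.65, 0.8, 0.85, 0.92, 0.8]
result = hysteresis(averages)
print(result)
```

With these values, 0.8 does not start the alert but does keep it active once 0.65 has triggered it, so an average that hovers between the two thresholds does not flip the state.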