Alertmanager: Alerts can inhibit themselves

Created on 23 Mar 2017 · 28Comments · Source: prometheus/alertmanager

I have a bunch of alerts in the form of Prefix_Some_Name. I've set up the following inhibition rule:

- source_match_re:
    alertname: Prefix_.+
  target_match:
    alertname: Prefix_Failsafe_Alert
  equal:
    - label1
    - label2

As far as I can tell, this alert will inhibit itself, and never be delivered. Depending on how you look at it this is a bug in the alertmanager, or an important gotcha that should be documented.

Happy to contribute a patch either way.

Edit: I should add that I empirically verified this against an older alertmanager. I looked through the current master branch, and AFAICT the behaviour is still the same.

areusability componennotify help wanted kinbug low hanging fruit

Source

lmb

👍1

All 28 comments

As far as I can tell, this alert will inhibit itself

Hm yes, that is the behavior I would expect. I guess you expected that alerts generally cannot inhibit themselves?

I think adding a clarfiying sentence to https://github.com/prometheus/docs/blob/master/content/docs/alerting/configuration.md#inhibit_rule would be generally welcome if you think this is something that is confusing.

juliusv on 23 Mar 2017

❤1

I also discovered this behavior the hard way. I agree that it's strictly reasonable within the context of how inhibitions are currently structured, but is this behavior considered desirable?

If alerts never inhibited themselves, certain types of alerting noise could be reduced in a way that's currently infeasible. The only use case I can imagine for self-inhibiting alerts is ensuring they never notify anybody, which would surely be better handled by configuring a matching route with no receivers. But maybe I'm just not thinking big enough?

threemachines on 23 Mar 2017

Yeah, I can't come up with a reason why an alert should be able to inhibit itself either. The question is, is it worth introducing an explicit exception for this case? That's ultimately up to the Alertmanager maintainers (@stuartnelson3 @fabxc).

juliusv on 23 Mar 2017

I think the general expectation here is that there's some other label you can work off to distinguish these, such as a "dependency" label or similar.

The "Prefix_" being the same on both is also a bit of a smell, that should probably be a label.

brian-brazil on 23 Mar 2017

@brian-brazil that might be correct, but doesn't change the underlying issue.

lmb on 24 Mar 2017

This should probably a documented caveat. I think it doesn't really happen for typical inhibition usage as there should be some clearly distinguishing labels. Probably an edge case not worth complicating implementation for.

fabxc on 21 Apr 2017

@lmb Would you mind updating the documentation as suggested by @juliusv in his comment?

mxinden on 13 Aug 2017

Yup can do, feel free to assign to me.

lmb on 14 Aug 2017

@lmb Thanks! Just tag me on the PR. Let us know if you need any help.

mxinden on 14 Aug 2017

The problem here is that the most obvious fix requires negative matches, which the inhibit rule syntax does not support. E.g. you have a NodeDown page that should inhibit _all_ other pages that share the same node label (because you know the node is down so you don't need to be notified that your HAProxy on that node is down and your MySQL and your Prometheus etc.).

Naive rule syntax that inhibits the original NodeDown alert, too:

inhibit_rules:
- source_match:
    alertname: NodeDown
    severity: critical
  target_match:
    severity: critical
  equal:
  - node

This works:

inhibit_rules:
- source_match:
    alertname: NodeDown
    severity: critical
  target_match:
    severity: critical
  target_match_re:
    alertname: '.{0,7}|.{9,}|([^N]|.([^o]|.([^d]|.([^e]|.([^D]|.([^o]|.([^w]|.[^n]))))))).*'
  equal:
  - node

Much nicer would be:

inhibit_rules:
- source_match:
    alertname: NodeDown
    severity: critical
  target_match:
    severity: critical
  target_not_match:
    alertname: NodeDown
  equal:
  - node

beorn7 on 19 Sep 2017

Well, since I seem to have wasted a couple of hours implementing a basic "alert shouldn't inhibit itself" filter that's not wanted, does anyone but @beorn7 think that having something along the lines of his suggested target_not_match is a good idea?

alin-amana on 2 Oct 2017

I don't think it's a good idea. I think the issues presented here are merely the first level issues with this approach, and I believe that going down that route would lead to very fragile and tricky to maintain configs as the second and third level issues come along (think a multi-level set of nested inhibitions).

Using alertnames in the alertmanager config is a smell, severity etc. labels are a better and more maintainable approach.

brian-brazil on 2 Oct 2017

👍1

I can believe that a working "an alert cannot inhibit itself" implementation might turn out tricky.

I think we should either add negative matchers as suggested above, or we could adopt the syntax as already used in the Alertmanager UI (=, !=, =~, !~), which would also allow us to remove the separate ..._re config entries. If we named that differently, e.g. target_matchers and source_matchers, we could have both system in parallel for a while so that we don't break existing configs.

Same would be applicable to the routing tree.

I vaguely remember @fabxc suggested something like that in IRC.

beorn7 on 2 Oct 2017

FYI, the documentation change in prometheus/docs#812 is already live on prometheus.io, so this issue should probably be closed.

alin-amana on 4 Oct 2017

Indeed, discussion of potential changes to routing&inhibition syntax can go in a new issue.

brian-brazil on 4 Oct 2017

I have filed #1023 for extended/alternative matcher syntax.

About the original problem of this issue: So far, there have been zero reported use-cases for an alert actually required to inhibit itself, while there are multiple where it is awkward that an alert inhibits itself and requires work-arounds of varying weirdness, including a number where the issue was discovered the hard way (even if the semantics was well known in principle, i.e. it is not just a documentation issue, it's also a problem of counter-intuition). Thus, if there actually _is_ a sane implementation of preventing alerts from inhibiting themselves, I'd prefer if we used it. It would be great if those maintaining and knowing the Alertmanager codebase could have a look at #1017 and make a call if this is a possible way at all.

beorn7 on 4 Oct 2017

FYI, #1017 has just been merged, solving the basic case of a single inhibition rule possibly inhibiting the source alert(s). Now, if an alert matches both the source and target matchers of a inhibition rule it will not be inhibited __by that rule__.

Multiple inhibition rules (e.g. {_A_ inhibits _B_} and {_B_ inhibits _A_}) will still cause unwanted effects, but that is outside of the scope of this issue.

BTW, the example in the OP will now be a no-op because all possible target matches will also match the source.

alin-amana on 7 Nov 2017

So yeah, in the case that @lmb presented, we are now encountering a different but still surprising effect, as the whole rule doesn't apply anymore. That's happening if a rule is matching a lot of source alerts (while the usual inhibition rule is probably quite specific on the source side and wider on the target side).

I guess the intuitively desired behavior is that the alert would not inhibit itself, but another alert, that matches the source but _not_ the target would inhibit an alert that matches _both_ source and target.

I think #1017 improves things because surprisingly getting alerted is still better than surprisingly _not_ getting alerted, but we should decide if we want to refine the semantics before we do the next release.

beorn7 on 7 Nov 2017

The original report is not resolved, but it's the only one I've seen trying to do that. All the other cases I've seen so far should be handled by #1017.

I guess the intuitively desired behavior is that the alert would not inhibit itself, but another alert, that matches the source but not the target would inhibit an alert that matches both source and target.

I can't quite get my head around that. Do you have an example?

I suspect such logic is too intricate to be understandable in practice. #1017 is understandable, if a tad more complicated than the previous logic.

brian-brazil on 7 Nov 2017

I guess the intuitively desired behavior is that the alert would not inhibit itself, but another alert, that matches the source but not the target would inhibit an alert that matches both source and target.

I can't quite get my head around that. Do you have an example?

Yeah, it's kind of weird. I think the example is exactly the one @lmb gave when opening this issue. I can speculate about the exact use case around it, but perhaps @lmb can explain himself. (My tl;dr take: If Prefix_Failsafe_Alert fires on its own, it should notify. But if any other Prefix_... alert is firing too, Prefix_Failsafe_Alert should be inhibited, presumably because the Prefix_Failsafe_Alert isn't helpful anymore if you also get one of the other Prefix_... alerts.)

beorn7 on 7 Nov 2017

Overview of semantics tried so far with a new suggestion of mine:

Original semantics: If source and target side both match, the alert inhibits itself.

_Pro:_ Clear semantics.

_Con:_ Makes some use cases difficult to implement in practice.

First attempt (rejected): Any given alert cannot inhibit itself.

_Pro:_ Still fairly clear semantics..

_Con:_ If there are multiple alerts that match both source and target side, then they will cross-inhibit each other, which is almost as problematic as the original semantics.

Current semantics: An alert that matches both target and source side of a rule can never be inhibited by that rule.

_Pro:_ Easy to implement. Solves most of the previously problematic cases.

_Con:_ With a widely matching source side, many alerts will never get inhibited.

Next suggestion: An alert that matches both target and source side cannot inhibit alerts for which the same is true.

Difference to current semantics: Alerts that matches both target and source side _can_ be inhibited by alerts that only match the source side but not the target side.

_Pro:_ Matches intuition (arguably…)

_Con:_ Implementation might be hard. Difficult to reason with.

I'd like to note that in the “clean” case where alerts matching the source side and alerts matching the target side have no intersection, any of the different semantics above will yield the same results. Thus, we should recommend that in any case as a best practice to avoid the difficult terrain.

beorn7 on 7 Nov 2017

That's a good summary, and your suggestion sounds at first glance as a logical step if we were to go further down the road of making this more sophisticated. For now I'd suggest giving the new logic a month or two to bake in, and seeing if our current state is good enough.

brian-brazil on 7 Nov 2017

Sounds good. I might also find the time to look at implementation to see how hard or easy it really is.

beorn7 on 7 Nov 2017

Bar 1-2 users using inhibits for things they shouldn't have been, this seems to have worked out fine and confusion in this area has gone away. I think we can close this.

brian-brazil on 28 Mar 2018

This is still on my list to try out. The asymmetry in the current implementation worries me. If the implementation of my final suggestion above turns out to be clean and straight-forward, I'd suggest to go for it.
I have assigned this to myself and would like to keep it open.

beorn7 on 28 Mar 2018

As a bit of an aside, the AM config docs (https://prometheus.io/docs/alerting/configuration/#) still say "Alerts can inhibit themselves. Avoid writing inhibition rules where an alert matches both source and target.", which I believe is incorrect given #1017 has been merged. I'm presuming a doc-fix PR be accepted?