I have a bunch of alerts in the form of Prefix_Some_Name. I've set up the following inhibition rule:
- source_match_re:
alertname: Prefix_.+
target_match:
alertname: Prefix_Failsafe_Alert
equal:
- label1
- label2
As far as I can tell, this alert will inhibit itself, and never be delivered. Depending on how you look at it this is a bug in the alertmanager, or an important gotcha that should be documented.
Happy to contribute a patch either way.
Edit: I should add that I empirically verified this against an older alertmanager. I looked through the current master branch, and AFAICT the behaviour is still the same.
As far as I can tell, this alert will inhibit itself
Hm yes, that is the behavior I would expect. I guess you expected that alerts generally cannot inhibit themselves?
I think adding a clarfiying sentence to https://github.com/prometheus/docs/blob/master/content/docs/alerting/configuration.md#inhibit_rule would be generally welcome if you think this is something that is confusing.
I also discovered this behavior the hard way. I agree that it's strictly reasonable within the context of how inhibitions are currently structured, but is this behavior considered desirable?
If alerts never inhibited themselves, certain types of alerting noise could be reduced in a way that's currently infeasible. The only use case I can imagine for self-inhibiting alerts is ensuring they never notify anybody, which would surely be better handled by configuring a matching route with no receivers. But maybe I'm just not thinking big enough?
Yeah, I can't come up with a reason why an alert should be able to inhibit itself either. The question is, is it worth introducing an explicit exception for this case? That's ultimately up to the Alertmanager maintainers (@stuartnelson3 @fabxc).
I think the general expectation here is that there's some other label you can work off to distinguish these, such as a "dependency" label or similar.
The "Prefix_" being the same on both is also a bit of a smell, that should probably be a label.
@brian-brazil that might be correct, but doesn't change the underlying issue.
This should probably a documented caveat. I think it doesn't really happen for typical inhibition usage as there should be some clearly distinguishing labels. Probably an edge case not worth complicating implementation for.
@lmb Would you mind updating the documentation as suggested by @juliusv in his comment?
Yup can do, feel free to assign to me.
@lmb Thanks! Just tag me on the PR. Let us know if you need any help.
The problem here is that the most obvious fix requires negative matches, which the inhibit rule syntax does not support. E.g. you have a NodeDown page that should inhibit _all_ other pages that share the same node label (because you know the node is down so you don't need to be notified that your HAProxy on that node is down and your MySQL and your Prometheus etc.).
Naive rule syntax that inhibits the original NodeDown alert, too:
inhibit_rules:
- source_match:
alertname: NodeDown
severity: critical
target_match:
severity: critical
equal:
- node
This works:
inhibit_rules:
- source_match:
alertname: NodeDown
severity: critical
target_match:
severity: critical
target_match_re:
alertname: '.{0,7}|.{9,}|([^N]|.([^o]|.([^d]|.([^e]|.([^D]|.([^o]|.([^w]|.[^n]))))))).*'
equal:
- node
Much nicer would be:
inhibit_rules:
- source_match:
alertname: NodeDown
severity: critical
target_match:
severity: critical
target_not_match:
alertname: NodeDown
equal:
- node
Well, since I seem to have wasted a couple of hours implementing a basic "alert shouldn't inhibit itself" filter that's not wanted, does anyone but @beorn7 think that having something along the lines of his suggested target_not_match is a good idea?
I don't think it's a good idea. I think the issues presented here are merely the first level issues with this approach, and I believe that going down that route would lead to very fragile and tricky to maintain configs as the second and third level issues come along (think a multi-level set of nested inhibitions).
Using alertnames in the alertmanager config is a smell, severity etc. labels are a better and more maintainable approach.
I can believe that a working "an alert cannot inhibit itself" implementation might turn out tricky.
I think we should either add negative matchers as suggested above, or we could adopt the syntax as already used in the Alertmanager UI (=, !=, =~, !~), which would also allow us to remove the separate ..._re config entries. If we named that differently, e.g. target_matchers and source_matchers, we could have both system in parallel for a while so that we don't break existing configs.
Same would be applicable to the routing tree.
I vaguely remember @fabxc suggested something like that in IRC.
FYI, the documentation change in prometheus/docs#812 is already live on prometheus.io, so this issue should probably be closed.
Indeed, discussion of potential changes to routing&inhibition syntax can go in a new issue.
I have filed #1023 for extended/alternative matcher syntax.
About the original problem of this issue: So far, there have been zero reported use-cases for an alert actually required to inhibit itself, while there are multiple where it is awkward that an alert inhibits itself and requires work-arounds of varying weirdness, including a number where the issue was discovered the hard way (even if the semantics was well known in principle, i.e. it is not just a documentation issue, it's also a problem of counter-intuition). Thus, if there actually _is_ a sane implementation of preventing alerts from inhibiting themselves, I'd prefer if we used it. It would be great if those maintaining and knowing the Alertmanager codebase could have a look at #1017 and make a call if this is a possible way at all.
FYI, #1017 has just been merged, solving the basic case of a single inhibition rule possibly inhibiting the source alert(s). Now, if an alert matches both the source and target matchers of a inhibition rule it will not be inhibited __by that rule__.
Multiple inhibition rules (e.g. {_A_ inhibits _B_} and {_B_ inhibits _A_}) will still cause unwanted effects, but that is outside of the scope of this issue.
BTW, the example in the OP will now be a no-op because all possible target matches will also match the source.
So yeah, in the case that @lmb presented, we are now encountering a different but still surprising effect, as the whole rule doesn't apply anymore. That's happening if a rule is matching a lot of source alerts (while the usual inhibition rule is probably quite specific on the source side and wider on the target side).
I guess the intuitively desired behavior is that the alert would not inhibit itself, but another alert, that matches the source but _not_ the target would inhibit an alert that matches _both_ source and target.
I think #1017 improves things because surprisingly getting alerted is still better than surprisingly _not_ getting alerted, but we should decide if we want to refine the semantics before we do the next release.
The original report is not resolved, but it's the only one I've seen trying to do that. All the other cases I've seen so far should be handled by #1017.
I guess the intuitively desired behavior is that the alert would not inhibit itself, but another alert, that matches the source but not the target would inhibit an alert that matches both source and target.
I can't quite get my head around that. Do you have an example?
I suspect such logic is too intricate to be understandable in practice. #1017 is understandable, if a tad more complicated than the previous logic.
I guess the intuitively desired behavior is that the alert would not inhibit itself, but another alert, that matches the source but not the target would inhibit an alert that matches both source and target.
I can't quite get my head around that. Do you have an example?
Yeah, it's kind of weird. I think the example is exactly the one @lmb gave when opening this issue. I can speculate about the exact use case around it, but perhaps @lmb can explain himself. (My tl;dr take: If Prefix_Failsafe_Alert fires on its own, it should notify. But if any other Prefix_... alert is firing too, Prefix_Failsafe_Alert should be inhibited, presumably because the Prefix_Failsafe_Alert isn't helpful anymore if you also get one of the other Prefix_... alerts.)
Overview of semantics tried so far with a new suggestion of mine:
_Pro:_ Clear semantics.
_Con:_ Makes some use cases difficult to implement in practice.
_Pro:_ Still fairly clear semantics..
_Con:_ If there are multiple alerts that match both source and target side, then they will cross-inhibit each other, which is almost as problematic as the original semantics.
_Pro:_ Easy to implement. Solves most of the previously problematic cases.
_Con:_ With a widely matching source side, many alerts will never get inhibited.
Difference to current semantics: Alerts that matches both target and source side _can_ be inhibited by alerts that only match the source side but not the target side.
_Pro:_ Matches intuition (arguably…)
_Con:_ Implementation might be hard. Difficult to reason with.
I'd like to note that in the “clean” case where alerts matching the source side and alerts matching the target side have no intersection, any of the different semantics above will yield the same results. Thus, we should recommend that in any case as a best practice to avoid the difficult terrain.
That's a good summary, and your suggestion sounds at first glance as a logical step if we were to go further down the road of making this more sophisticated. For now I'd suggest giving the new logic a month or two to bake in, and seeing if our current state is good enough.
Sounds good. I might also find the time to look at implementation to see how hard or easy it really is.
Bar 1-2 users using inhibits for things they shouldn't have been, this seems to have worked out fine and confusion in this area has gone away. I think we can close this.
This is still on my list to try out. The asymmetry in the current implementation worries me. If the implementation of my final suggestion above turns out to be clean and straight-forward, I'd suggest to go for it.
I have assigned this to myself and would like to keep it open.
As a bit of an aside, the AM config docs (https://prometheus.io/docs/alerting/configuration/#
Yes, that's out of date.
That should be solved for good now. https://github.com/prometheus/docs/pull/1297 to be merged once released.