Kibana: [Meta] UI ability to assign alert actions per action group

Created on 21 Apr 2020 · 26 Comments · Source: elastic/kibana

An alert can use one or many action groups to fire actions.
Currently the UI can only assign alert actions to the defaultActionGroupId.
This is a meta issue to support different action groups in the UI, as the API already does.

Individual Issues:

[x] Define actions under a specific Action Group in the Add / Edit flyouts https://github.com/elastic/kibana/issues/82274
[x] Include Action Group in Event Log when Instances are activated and display Action Group on Alert Instances in Alert Details page https://github.com/elastic/kibana/issues/82275
[x] Conditions fields for configurable Action Group params https://github.com/elastic/kibana/issues/82412

TBD

[x] Include the transition of an Alert Instance from one Action Group to another in the event log https://github.com/elastic/kibana/issues/82792

Alerting Meta Alerting Services enhancement

mikecote

👍1

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

elasticmachine on 21 Apr 2020

One of the questions raised when we previously looked at supporting multiple action groups was how the parameters might change to deal with action groups. For example, say you have a threshold-style alert type that has two action groups - warning and critical. Presumably you want two threshold values there. I believe we were trying to figure out how to put the relevant threshold picker inputs near the action group settings. Which then raised the question - what other parts of the alert parameters would change? Could you pick another index, or time field, or comparator? How would these elements be associated with the action groups?

Now thinking that trying to put the parameters near the action groups isn't right. There is only one set of config/parameters for an alert; they should be grouped together in terms of input controls. For the case of a threshold-style alert with two action groups, the simplest case is just two threshold values - and they should probably be named relevant to the action group. So, warningThreshold and criticalThreshold. And since they are part of the complete set of parameters, they should be in the "parameter" section of the UI.

A more complicated example could be a threshold-style alert with two action groups that took different thresholds AND expressions. In that case, the parameters could have paired parameter values for the threshold, the comparator, the field compared, etc. Still all in the "parameter" section of the UI, and again named relevant to their actionGroup.

This makes the UI a bit wonky, with all the params in one section, and then the action groups in another section, but I'm not sure a pleasing UI is possible where you split these kinds of "paired" parameter values so they are closer to the action group they relate to. I think this is the simplest conceptual model though (all parameters together), so we should try a design like that to see if we think it's reasonable.

pmuellr on 21 Apr 2020
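
To make the "all parameters together" model above concrete, here is a minimal TypeScript sketch. It is not the Kibana alert type API: the `ActionGroup` and `ThresholdParams` shapes and the `pickActionGroup` helper are hypothetical, and only the `warningThreshold` / `criticalThreshold` names come from the comment. The parameters stay in one flat object, and the alert type decides which action group (if any) applies.

```typescript
// Hypothetical shapes for illustration only; not the actual Kibana types.
interface ActionGroup {
  id: string;
  name: string;
}

// One flat parameter object; paired values are named after their action group.
interface ThresholdParams {
  index: string;
  timeField: string;
  comparator: '>' | '<';
  warningThreshold: number;
  criticalThreshold: number;
}

// The alert type declares its action groups statically.
const actionGroups: ActionGroup[] = [
  { id: 'warning', name: 'Warning' },
  { id: 'critical', name: 'Critical' },
];

// Returns the id of the most severe action group whose threshold is crossed,
// or undefined when neither threshold is crossed.
function pickActionGroup(value: number, params: ThresholdParams): string | undefined {
  const crossed = (threshold: number): boolean =>
    params.comparator === '>' ? value > threshold : value < threshold;
  if (crossed(params.criticalThreshold)) return actionGroups[1].id;
  if (crossed(params.warningThreshold)) return actionGroups[0].id;
  return undefined;
}
```

Checking the critical threshold first means a value that crosses both thresholds schedules only the critical group, which is one possible policy rather than a settled decision in this thread.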

I'm wondering if we could convert the existing index threshold alert to allow multiple action groups, in a "progressive" manner. Eg, add an optional second threshold value and second action group. Seems do-able, kinda wondering how confusing it will be though ...

Some interesting UI considerations as well - will need potentially two threshold lines in the preview graph.

In general we need to figure out the constraints between comparators and thresholds. If comparator is >, then presumably threshold 1 > threshold 2. And if threshold 2 fires, then threshold 1 should not fire, I guess. For comparator <, I guess it's kinda reversed? And what about the between comparator? Do you have an "outer" threshold and "inner" one? That doesn't seem like it makes sense, maybe only one threshold is available for between.

pmuellr on 21 Apr 2020
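
One reading of the comparator/threshold constraints above, as a hedged TypeScript sketch. The `validateThresholds` function and its error messages are hypothetical; the rules it encodes (threshold 1 > threshold 2 for `>`, the reverse for `<`, and a single threshold only for `between`) are the tentative ones floated in the comment, not an agreed design.

```typescript
type Comparator = '>' | '<' | 'between';

// Returns a list of validation errors; an empty list means the combination is accepted.
// threshold2 is optional, matching the "progressive" idea of an optional second threshold.
function validateThresholds(
  comparator: Comparator,
  threshold1: number,
  threshold2?: number
): string[] {
  if (threshold2 === undefined) return []; // a single threshold is always fine
  if (comparator === 'between') {
    return ['a second threshold is not available with the between comparator'];
  }
  if (comparator === '>' && !(threshold1 > threshold2)) {
    return ['with >, threshold 1 must be greater than threshold 2'];
  }
  if (comparator === '<' && !(threshold1 < threshold2)) {
    return ['with <, threshold 1 must be less than threshold 2'];
  }
  return [];
}
```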

Below are some questions I keep running into as I try to create concepts for this.

1. Will there be certain parameters that we can create a 'group' off of? And others we cannot?
2. Will the group only be based on a single parameter, or can it be multiple parameters?
3. Am I required to setup separate actions for the group? Or is it also just as likely that I would simply want to run the same actions?
4. If I have 3 actions setup for the main group, and I add another group, can I choose to run only 1 of those actions?
5. How common is it to setup these groupings? Would this need to be an option for the solutions to show the option or not, so it would show for some types and not others?
6. Are there set types of groups? (Critical, Alert, Warning, ... ) And/or can the user create their own label for these? I know we also talked about acknowledgment (or 'Unacknowledged') as a possible group.

I know some of these overlap Patrick's comments, but was just outlining the ones I had

mdefazio on 23 Apr 2020

Will there be certain parameters that we can create a 'group' off of? And others we cannot?

This would be defined by the alert type, but it seems like it would be hard to allow the UI of an alert type to specify a set of parameters applicable to a subset of action groups. So it feels like for now, we shouldn't consider "grouping" the params around certain action groups - let's just leave them all in the same visual location.

Will the group only be based on a single parameter, or can it be multiple parameters?

Again, up to the alert type, but could certainly be multiple.

Am I required to setup separate actions for the group? Or is it also just as likely that I would simply want to run the same actions?

Yes, looking at it from the alerting point of view, the actions in each group are separate. However, it seems likely that folks would want to define the same connector to run in multiple groups. Maybe in one condition (for one action group) you do X, but as the condition worsens you want to do an additional thing (for another action group). It would be nice to "copy" the connector from one group to another. Another way of looking at this is that you could have a list of all the connectors you wanted to run, across all action groups, and then for each connector you could indicate which action group it ran in. Except that doesn't quite work as the connectors in the action groups are ordered, so there wouldn't be a way to indicate the order. I think for now, the simplest thing in this case is to force the customer to create the same connector in all the action groups. We can optimize the UX for this later when we find out how this stuff is used in practice.

If I have 3 actions setup for the main group, and I add another group, can I choose to run only 1 of those actions?

Per my notes in 3. ^^^, the customer will have to add a new connector for that group, and they can choose whatever connectors they want, including connectors not in the main group. Every connector would be under a single action group, and an action group can have multiple connectors. Also, in case it wasn't clear, the alert type defines how many action groups it supports (it's static), and also indicates the "default" one to use, if there are multiple.

How common is it to setup these groupings? Would this need to be an option for the solutions to show the option or not, so it would show for some types and not others?

Not clear how common it will be for alerts to have multiple action groups, but as action groups are defined by the alert type, some will have 1, some will have multiple. We could optimize the case for an alert with a single action group defined, and not show any grouping, just as we do today. If the alert does support multiple action groups, then the area under the graph, where you list the actions, would now have a separate section for each action group (the default one first, and the remaining ones are ordered I believe), where actions can be added to each action group, just like today.

Are there set types of groups? (Critical, Alert, Warning, ... ) And/or can the user create their own label for these? I know we also talked about acknowledgment (or 'Unacknowledged') as a possible group.

Critical/Alert/Warning was the original reason for creating action groups. The user cannot add/change/rename them - they're defined by the alert type. Ack/unack is - I think - out of scope here. I can see how it could potentially fit in here, but there's a lot of other stuff to think about with acks. I've also been thinking about having "built in" action groups - action groups every alert would have - "resolved" and "no data" are the two that I've been thinking of. They're basically common behaviours/states an alert / alert instance could be in, that we'd like to handle generically, if possible. Jury is still out though, it's not coming in 7.8 anyway.

As a somewhat contrived example, but based on a real need, there's a new SSL certificate check alert. It will fire when a cert is going to expire within 30 days. The thought is to allow for additional actions once it's going to expire in 7 days. So, think two action groups "early warning" and "about to expire". As alert parameters, you'd be able to set the 30 and 7 values for those, in the usual parameter section. In the action groups, you add an email action to "early warning", and an email AND slack action to "about to expire" - the idea being to bug them a little bit more the closer you get to the expiration date. You could also just decide you don't want the 30 day warning at all, and not have any actions associated with it. If you had email connectors in both groups, they're independent - they can have different messages.

pmuellr on 23 Apr 2020
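
The SSL certificate example above can be sketched in TypeScript as follows. Everything here is an assumption made for illustration: the `CertAlert` shape, the parameter names `earlyWarningDays` / `aboutToExpireDays`, the group ids, the connector ids, and the `actionsToRun` helper are all hypothetical, and treating "within 30 days" as `<=` is a guess. The point is only that each action belongs to exactly one action group, and the same connector can appear in two groups with independent messages.

```typescript
// Hypothetical action shape: each action belongs to exactly one action group.
interface AlertAction {
  group: string; // action group this action runs in
  id: string; // connector id (made-up values below)
  params: Record<string, unknown>;
}

interface CertAlert {
  params: { earlyWarningDays: number; aboutToExpireDays: number };
  actions: AlertAction[];
}

const certAlert: CertAlert = {
  params: { earlyWarningDays: 30, aboutToExpireDays: 7 },
  actions: [
    // The email connector is added to both groups, each copy with its own message.
    { group: 'earlyWarning', id: 'email-connector', params: { message: 'Certificate expires within 30 days' } },
    { group: 'aboutToExpire', id: 'email-connector', params: { message: 'Certificate expires within 7 days' } },
    { group: 'aboutToExpire', id: 'slack-connector', params: { message: 'Certificate is about to expire' } },
  ],
};

// Picks the single action group for the current state, then returns its actions.
function actionsToRun(alert: CertAlert, daysUntilExpiry: number): AlertAction[] {
  const { earlyWarningDays, aboutToExpireDays } = alert.params;
  let group: string | undefined;
  if (daysUntilExpiry <= aboutToExpireDays) group = 'aboutToExpire';
  else if (daysUntilExpiry <= earlyWarningDays) group = 'earlyWarning';
  // When no group applies, group is undefined and nothing matches.
  return alert.actions.filter((action) => action.group === group);
}
```

Dropping the 30-day warning entirely, as the comment suggests, would just mean leaving the `earlyWarning` group with no actions.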

@mdefazio to paste latest mockups.

@arisonl to ensure these mockups align with what solution teams are expecting.

mikecote on 9 May 2020

Here is the current iteration of the wireframes we've been discussing. Please let me know if anyone sees misalignments with what we've agreed upon. @arisonl

mdefazio on 11 May 2020

@mdefazio Some early feedback from Observability: Metrics expect to need two or three levels in most cases, with two (alert and warning) being the most usual use case and three (e.g. major, minor and warning) being also possible, yet less usual. They would also like the option to receive an alert for each transition, hence not just "resolved" but also once the value falls from warning to normal, from major to minor etc.

arisonl on 12 May 2020

Some initial thoughts on this:
Depending on what statuses/conditions are added in the trigger section, if the user checks "Run actions when resolved", a dropdown will provide options for 'Resolved', 'Alerting -> Warning', or 'Major --> Minor'.

I'm assuming that 'Warning --> Resolved' or 'Alerting --> Resolved' are the same as simply 'Resolved'. Is this correct?

If an alert goes from Warning to Alert, will it simply run the actions that are associated with 'Alert'? Or do we also need to provide transitional actions here as well?

mdefazio on 13 May 2020

Depending on what statuses/conditions are added in the trigger section, if the user checks "Run actions when resolved", a dropdown will provide options for 'Resolved', 'Alerting -> Warning', or 'Major --> Minor'.

Technically, these are "action groups" - [alert, warning], [major, minor], where resolved may end up being a "free" action group you get (and maybe "no data") (eg, [alert, warning, resolved, no-data]). Of course, that's a terrible phrase; status is not a great one either maybe, as we want to have an alert-level thing called "status" which would indicate no-data, error, actively firing kind of info.

I'm not sure "Run actions when resolved" makes sense as a toggle, if it's just another action group.

I'm assuming that 'Warning --> Resolved' or 'Alerting --> Resolved' are the same as simply 'Resolved'. Is this correct?

That was my understanding. There is only "resolved", we won't have "minor->resolved" and "major->resolved" as separate things. I think technically, we could, but not sure we need it, and it makes things more complicated, so I'd say at best we defer that (and open a new issue if we think we need that).

If an alert goes from Warning to Alert, will it simply run the actions that are associated with 'Alert'? Or do we also need to provide transitional actions here as well?

Likewise, my understanding is that we won't have transitional actions like that, so you'd only see the 'Alert' actions run in that case. And as before, we could, but it's just more stuff, so worthy of a new issue (probably one issue for this and the previous note ^^^).

pmuellr on 14 May 2020

Say 1 is the threshold for warning and 2 is the threshold for alert. The current thinking is that you will get:

a warning notification if you go from 0.5 to 1.5 or from 2.5 to 1.5
an alert notification if you go from 0.5 to 2.5 or from 1.5 to 2.5
a resolved notification if you go from 2.5 to 0.5

What I am hearing might be needed is a resolved notification when you get from 1.5 to 0.5, i.e. drop from warning to normal or from any other level to normal (e.g. from minor to normal).

arisonl on 15 May 2020
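
The transitions listed above can be written down as a small TypeScript sketch. The `Level` type and the `levelFor` and `notificationFor` functions are hypothetical, and the `resolveFromWarning` flag is an assumption used to model the requested extra behaviour (a resolved notification when dropping from warning to normal), which the "current thinking" as described does not include.

```typescript
type Level = 'normal' | 'warning' | 'alert';
type Notification = 'warning' | 'alert' | 'resolved';

// Maps a value to a level, given the warning and alert thresholds.
function levelFor(value: number, warningThreshold: number, alertThreshold: number): Level {
  if (value > alertThreshold) return 'alert';
  if (value > warningThreshold) return 'warning';
  return 'normal';
}

// Returns the notification for a change of level, or undefined when none is sent.
// resolveFromWarning = false models the current thinking described in the comment;
// true models the requested behaviour (any level -> normal sends "resolved").
function notificationFor(
  previous: Level,
  current: Level,
  resolveFromWarning: boolean = false
): Notification | undefined {
  if (previous === current) return undefined;
  if (current === 'normal') {
    return previous === 'alert' || resolveFromWarning ? 'resolved' : undefined;
  }
  return current; // entering warning or alert runs that level's notification
}
```

With thresholds 1 and 2, going from 1.5 to 0.5 yields no notification by default and `'resolved'` when `resolveFromWarning` is set.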

In an attempt to visualize this: (Updated to add in No data)

So what I'm gathering is that we only have a Resolved action group? Right?

They would also like the option to receive an alert for each transition, hence not just "resolved" but also once the value falls from warning to normal, from major to minor etc.

And so this is already built-in? I'm guessing we could simply say in the alerting message that 'X server has gone from warning to alert'. Or vice versa.

Do we indicate on the alert detail view the previous status in some way? So you could see it's in the alerting state, and it was previously in warning state/normal state.

mdefazio on 15 May 2020

The dropdown would then show the following

I'm showing 'Minor' as disabled with the thought that if they have not set up the condition in the trigger section, then they cannot choose it from the dropdown. But they would see all the available groups. And to re-state what @pmuellr was saying (so I understand correctly), The Alert, No data and Resolved would be built-in groups, whereas Warning would be defined by the Alert type.

The dropdown options could probably use some better ordering than what I'm showing in the screenshot.

mdefazio on 15 May 2020

The Alert, No data and Resolved would be built-in groups, whereas Warning would be defined by the Alert type.

Only No data and Resolved are "built-in" groups. I assumed Alert was a defined group by the alert type, like Warning and Minor.

Or maybe I'm misunderstanding what the group Alert means ...

pmuellr on 20 May 2020

Moving the discussion about single vs multi select for action group into #67863. It was discussed during the last iteration that we would start with single select when choosing the group and implement multi select capability in the future.

mikecote on 1 Jun 2020

This issue should also change the alert instances list to display something about the action group for each alert instance. Some mention about how it could work here: https://github.com/elastic/kibana/issues/78981#issuecomment-701722595.

mikecote on 1 Oct 2020

Here are the latest mockups for this: Per our recent discussion, I've added a badge to the collapsed state of the action that signifies to which action group it belongs. cc/ @gmmorris @mikecote

mdefazio on 2 Nov 2020

❤3

@mdefazio @gmmorris @mikecote @pmuellr a couple of questions on this design:

1. When I open the alert (now rule) definition (in order to revisit or edit) how do I answer the questions: _what actions happen if it's an alert?_ vs: _if it's a warning?_, and in the future vs: _if it's major, minor_ etc? Do I have to open up and go through each and every action?
2. When I create the alert and start attaching actions, how do I have to think as a user? E.g.:
   i. Create an action-A and it happens when it is an alert. Create action-B and it happens when it is a warning. Create action C and it happens again when it's a warning. Create action-A again, and it happens when it is an alert again but perhaps with different parameters from the previous A action. Vs:
   ii. When it is an alert, create actions A, B, C. When it is warning, create actions X, Y, Z.
   iii. Which of the two is enabled with this design (or maybe it is a different one) and which is the most natural way? To me it feels that an extra cognitive step is required with 1 and 2i. Personally my natural way of thinking this is IF warning THEN action-A1, action-B, action-C| IF alert THEN action-A2. Instead of action-A IF warning| action-B1 IF warning |action-C IF alert |action-B IF warning?
3. If I want to only be notified on state change (btw state change might prove equally important like schedule-based), 2ii feels more natural.

I understand that there are other factors that come into the design but I would like to understand better the answers to these questions too. Maybe I am not reading this design correctly too, so please correct my parsing as described above.

arisonl on 12 Nov 2020

Personally my natural way of thinking this is IF warning THEN action-A1, action-B, action-C| IF alert THEN action-A2. Instead of action-A IF warning| action-B1 IF warning |action-C IF alert |action-B IF warning?

I get that, it is way easier. Unless you wanted to reuse an email action across action groups AND customize who it's sent to. Eg, send to a small # of people on alert, more on warning, even more on error. But I assume you'd still have the opportunity to create a new action per action group anyway, so you could decide to reuse or create new, whichever you want.

I know we've talked about this action "reuse" before, and there is one little wrinkle I haven't given much thought recently (or maybe ever). We don't have a concept of "reusing" actions at the API level, we would HAVE to make a copy. Or add a new way in the API to refer to other actions within an alert definition. If we don't add some kind of reference, then we'd HAVE to copy, and the UI would then have to figure out the "reuse" itself. Probably not hard, just look for equivalent action definitions, treat those as "shared". But it also means if you create two actions that were the same, the UI would end up redisplaying this as "shared". That might be ok, but of course could also be very confusing.

pmuellr on 12 Nov 2020
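
The "look for equivalent action definitions, treat those as shared" idea above could look roughly like this in TypeScript. It is a sketch under assumptions: the `AlertAction` and `SharedAction` shapes and the `collapseShared` helper are hypothetical, and equivalence is approximated by comparing the connector id plus a JSON serialization of the params, which is sensitive to key order.

```typescript
// Hypothetical stored shape: one copy of the action per action group.
interface AlertAction {
  group: string;
  id: string; // connector id
  params: Record<string, unknown>;
}

// Hypothetical UI shape: one entry per distinct definition, with the groups it runs in.
interface SharedAction {
  id: string;
  params: Record<string, unknown>;
  groups: string[];
}

// Collapses copies with the same connector id and params into one "shared" entry.
function collapseShared(actions: AlertAction[]): SharedAction[] {
  const byKey = new Map<string, SharedAction>();
  for (const { group, id, params } of actions) {
    // Assumption: JSON equality is a good enough notion of "equivalent".
    const key = JSON.stringify([id, params]);
    const existing = byKey.get(key);
    if (existing) {
      if (existing.groups.indexOf(group) === -1) existing.groups.push(group);
    } else {
      byKey.set(key, { id, params, groups: [group] });
    }
  }
  return Array.from(byKey.values());
}
```

This also shows the wrinkle mentioned in the comment: two actions that a user created separately but that happen to be identical are indistinguishable from a deliberately reused one, so the UI would display them as shared.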

If I want to only be notified on state change (btw state change might prove equally important like schedule-based), 2ii feels more natural.

By "state change" do you mean state as in ok | active | error, or action group? Since we recently discussed action groups, I'm guessing you mean ok | active | error.

Seems interesting, because it would be way easier for a customer to add this kind of notification if you are just interested in the change - you wouldn't need to add actions for action groups / resolved etc. Just one thing to create.

One potential issue with this is that the mustache variables available for ok and error are going to be wildly different than active - we already know 'ok' (aka 'resolved') won't get ANY of the "context" variables. Error likely won't have any either. Which would make creating a "nice" message to handle all these situations difficult.

pmuellr on 12 Nov 2020

@arisonl We raised some of these questions prior to the implemented PR - I'd suggest catching up on the recording. :)
(I apologise... I'm usually better at summarising our sync calls in the issue, but forgot to do it this time)

I personally think we should be grouping the actions by their action group in the UI as well, but that would clash with the concern around the fact that I need to duplicate an action for each action group... which isn't great either.
We decided to go with the simplest next step that wouldn't lock us into anything as it doesn't require changes to the underlying Saved Object.

We need to figure out how to balance these two problems and then we'll probably have a follow up issue.

gmmorris on 13 Nov 2020

@gmmorris I am aware that some of the above has been raised in the past, ref: "I understand that there are other factors that come into the design". However my understanding is that a big part might be driven by technical decisions, and that's absolutely fine but I am discussing the UX aspects here. Duplicating actions could be relatively easy with some type of cloning functionality. The cognitive overhead of resolving what happens might be more important from a UX perspective. Absolutely fine to be a followup issue.

arisonl on 13 Nov 2020

@arisonl Sorry, sounds like I gave you the impression I'm brushing your concerns off - if so, I apologise, that wasn't my intention. Quite the opposite. I'm confirming we are also worried about this and have been discussing it, just haven't done a good job of documenting it here.

However my understanding is that a big part might be driven by technical decisions, and that's absolutely fine but I am discussing the UX aspects here.

Perhaps @mikecote & @mdefazio can weigh in here, but my understanding was that it's the UX challenges we're currently stuck on, not technical ones.
It isn't clear what UX would be better - to allow a user to select multiple action groups for a single action OR group all actions under their action group and enable the user to duplicate the action into another group.
The existing implementation is meant to take the smallest step that we could in order to enable the work on _Resolved_, without locking us in too soon- leaving the door open for either one of them as a next step.
Whether we prioritise that decision for 7.11 or a later minor is obviously open for discussion. 🤷‍♂️

gmmorris on 13 Nov 2020

@gmmorris

sounds like you think I'm brushing your concerns off

Not at all, we are all good and I am ++ with all you are writing.

It isn't clear what UX would be better

It's not to me either, that's why I posted these questions.

The existing implementation is meant to take the smallest step that we could in order to enable the work on Resolved, without locking us in too soon- leaving the door open for either one of them as a next step.

+1000 on that too, that's the right approach, getting to the right place incrementally and I trust the team 1000% on that.

arisonl on 16 Nov 2020

❤1

@pmuellr I know that you've done a lot of thinking around this and I also cannot agree more with the [release it as fast as possible in order to add value and then iterate to optimise it] approach. And btw thank you for taking the time every time to get into the details and specifics :) My point is that consolidating things does not necessarily make it easier, e.g. if you need to unpack them in order to make sense of what happens under a number of conditions. That's a question for me, I don't have an answer. I would love to revisit in one of the next design meetings and I am sure we will get feedback and learn more as we move forward. I would think that there should be UX options to easily reuse and customise, in order to counter to a certain extent the disadvantage of the alternative approach.

do you mean state as in ok | active | error, or action group?

Sorry for not being clear: I meant "action groups" and some of the recent discussions triggered the questions I posted above. But that's a very good point too: I think that the term "action group" is not very descriptive from a user perspective in the context of alerts with multiple "levels" or severity (e.g. warning, alert etc.). _As a user I want to be notified when an alert changes level/severity_, as opposed to _As a user I want to be notified when an alert changes "action groups"_. I feel that the latter is less intuitive and it might be inspired by how it is implemented technically. If that's so (please correct me if I am wrong), we shouldn't require users to be familiar with it and we shouldn't bias our UX based on that either. Again these are all questions.

arisonl on 16 Nov 2020

Closing now that each individual issue is closed and merged.

mikecote on 20 Nov 2020
