To enrich the user experience within the alerts table (under Kibana management section), we should display the status for each alert.
To make sure we're on the same page about which alert statuses we should have, I've opened this issue for discussion. The UI would display them as a column within the alerts table, and there would be a filter for the status. The statuses would be calculated on read, based on the result of a few queries (activity log, alert instances, etc.).
As a starting point, the mockups contain four potential statuses:
Is there any proposal for different statuses?
cc @elastic/kibana-stack-services @alexfrancoeur @peterschretlen
I think those 4 are sufficient and having a small number is preferable.
No Data will depend on the alert type, but I think for a time series metric it would mean there are no data points in the period being checked (which can happen if a beat is removed or stops sending data, for example).
Say, for a CPU usage alert, if none of my Metricbeat agents have sent data for 1 hour and my alert is "when avg CPU is above 90% over the last 5 minutes", there'd be no documents in Elasticsearch and I would expect this to show the "No Data" state.
No Data sounds like:
Both are actually interesting, but we don't have a mechanism to allow an alert type to return a "No Data" condition as in Peter's definition, as far as I'm aware.
I'd say get rid of No Data for now, or change to something like has not run yet (Mike's definition). No Data sounds a bit confusing and vague to me.
For the remaining, how do we determine these values - the last state when the alert function ran? It either threw an error (Error), ran but scheduled no actions (OK), or ran and scheduled actions (Active). Just the last state seen? If so, perhaps storing that in the alert itself would be appropriate.
Presumably things like _muted_ and _throttled_ show up in a separate column/icon/property indicating those states, so _that_ kind of state isn't appropriate for this "status".
If someone is authoring an alert, what do we expect them to do in the case where they don't have enough data to evaluate the condition? Throw an error? Return and treat it as normal?
No data/missing data is a pretty common scenario and I think it's an important cue. Data often arrives late, and it's not really an error, but I wouldn't consider it OK either. Some systems will also let you notify on no data. A few examples:
If we don't treat it as a state here, we need to account for it somewhere. I understand if we don't have a mechanism for it, but we could create one. It could be an expected type of error, for example, thrown by an alert execution?
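As a rough sketch of that "expected error" option (hedged: `NoDataError` and `runExecutor` are made-up names for illustration, not existing alerting framework APIs), the executor would throw a well-known error type and the framework would map it to a "No Data" status instead of "Error":

```ts
// Hypothetical sketch only; NoDataError and runExecutor are not real framework APIs.
class NoDataError extends Error {
  constructor(message = 'not enough data to evaluate the alert condition') {
    super(message);
    this.name = 'NoDataError';
  }
}

// The framework's task runner could catch it around the executor call and map it
// to a "No Data" status rather than treating it as a real failure.
async function runExecutor(
  executor: () => Promise<Record<string, unknown>>
): Promise<{ status: 'OK' | 'NoData' | 'Error'; state: Record<string, unknown>; error?: Error }> {
  try {
    return { status: 'OK', state: await executor() };
  } catch (err) {
    if (err instanceof NoDataError) {
      // Expected condition: the alert couldn't be evaluated.
      return { status: 'NoData', state: {} };
    }
    return { status: 'Error', state: {}, error: err as Error };
  }
}
```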
One option I can see to add the mechanism to handle the "no data" scenario is to change the return structure of the alert type executor.
Currently it returns something like this:
    return {
      // my updated alert level state
    };
and we could change it to something like this:
    return {
      noData: true,
      state: {
        // my updated alert level state
      },
    };
It should be fairly straightforward to do, and more future-proof if we ever want to return more attributes than state from the executor.
Other options instead of `noData: true` could be `status: 'no-data'` or something like that.
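A rough sketch of what the changed return type could look like (illustrative only; these are not the real alerting plugin types):

```ts
// Illustrative only; not the actual alerting plugin types.
interface AlertExecutorResult<State extends Record<string, unknown>> {
  // Option A: a boolean flag
  noData?: boolean;
  // Option B (alternative): an explicit status string instead of the flag
  // status?: 'ok' | 'no-data';
  state: State;
}

// An executor reporting "no data" would then return something like:
const result: AlertExecutorResult<{ lastCheckedAt: string }> = {
  noData: true,
  state: { lastCheckedAt: new Date().toISOString() },
};
```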
> For the remaining, how do we determine these values - the last state when the alert function ran?
From how I see it, yes it would be based on the last execution / interval.
> If so, perhaps storing that in the alert itself would be appropriate.
I think since we'll have a filter in the UI for statuses, it would make sense to store the status with the alert for searchability. After each execution, we would do an update on the alert document to update its status.
re: the "no data" status
It sounds like this _could_ just be treated as an action group, for alert types that are sensitive to this. Eg, if they didn't have enough data, they'd schedule the action group "no-data", and could have whatever actions they wanted associated with that.
That would at least make that state "actionable" without any existing API changes, but it wouldn't give us the ability to have it show up as a "status" value without some kind of API change, such as what Mike suggested.
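A simplified sketch of the action-group idea (hedged: the alert type and services shapes here are abridged, and `queryMetrics` is a made-up helper):

```ts
// Simplified sketch of handling "no data" via a dedicated action group.
declare function queryMetrics(params: unknown): Promise<unknown[]>; // hypothetical query helper

interface ExecutorServices {
  alertInstanceFactory(id: string): { scheduleActions(group: string): void };
}

const cpuUsageAlertType = {
  id: 'example.cpu-usage', // hypothetical alert type id
  actionGroups: [
    { id: 'default', name: 'Threshold met' },
    { id: 'no-data', name: 'No data' }, // new group for the "no data" case
  ],
  async executor({ services, params, state }: {
    services: ExecutorServices;
    params: unknown;
    state: Record<string, unknown>;
  }) {
    const docs = await queryMetrics(params);
    if (docs.length === 0) {
      // Not enough data: fire the "no-data" group so users can attach actions to it.
      services.alertInstanceFactory('cpu').scheduleActions('no-data');
      return state;
    }
    // ... normal threshold evaluation and scheduling of 'default' here ...
    return state;
  },
};
```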
If we end up making this part of the API signature and the alert status, it feels like "not enough data" is probably a better phrasing for this vs. "no data". Maybe something in the vein of "inconclusive" or such ...
> After each execution, we would do an update on the alert document to update its status
Ya, what I was thinking. Hopefully we can piggy-back this on top of an existing update, like the scheduling of the next run.
This also means we won't need the event log to determine that status ...
Posting this question here instead of Slack:
If an alert is disabled, is the status then also disabled or is it the last status before it was disabled?
Perhaps just No data?
Also, is the warning level a status?
> Also, is the warning level a status?
I think the status would be active (active = has one or more alert instances)?
> If an alert is disabled, is the status then also disabled or is it the last status before it was disabled?
A disabled alert has no status - could it be blank? If we need a value to filter on, then I think disabled as a state is OK. No data has a special meaning; I don't think it works for a disabled alert.
Repeating a comment from https://github.com/elastic/kibana/issues/58366#issue-569963925: we should be able to filter alerts by their status if possible.
One thing not mentioned yet is "alert instance status". It seems like an alert instance can have most of the status values of the alert itself, except perhaps "error", since "error" indicates the alert executor ran into some problem. Note this specifically includes "no data", as some alert types may know the possible domain of their instances and be able to determine if an instance has not produced data. But not all alerts will be able to do this - index threshold, for instance, doesn't know the domain of the possible groupings it uses for its instance IDs.
Mike and I just had a chat about the alert status, trying to figure out when to "calculate" it, and where it should be stored.
We ruled out "storing" it in the event log, or calculating it from the event log, as the event log is a stream of documents that can be difficult to reason over - we always want the "latest" thing, but that could possibly have been deleted by ILM, or delayed because of latency of buffered writing of the event log. The event log can also be disabled.
Seems like the best place is in the alert SO itself. This status would be set/reset when the alert executor is run.
Here's the current shape of the alert SO:
https://github.com/elastic/kibana/blob/6ee2460ebc0ccb673a13368cc05af861df19cd2e/x-pack/plugins/alerts/server/types.ts#L99-L117
We think adding a new top-level object is probably appropriate; I'll name it `executionStatus` here to make it clear this is the status of the execution (eg, nothing to do with muted/throttled/enabled/etc):
    executionStatus: {
      status: 'OK' | 'Active' | 'Error' | 'NoData';
      date: string;
      error: {
        reason: string;
        message: string;
      };
    };
- `status` corresponds with the general statuses we've discussed above.
- `date` is the date the alert executor ran.
- We're only capturing one error now, as we can only have one error right now (the error is thrown, preventing further processing) - in the future, if we hope to capture action execution errors as well, we might want multiple to accommodate multiple action errors (that also basically assumes actions would be run in the same task as the alert executor - something we may need to do in the future for other reasons). Something to think about ... We could certainly make it a `nested` object in terms of ES, if we think we may need it.
- `error.reason` is some kind of string enumeration that will narrow down _where_ the error occurred; eg, `get-failed`, `decryption-failed`, `alert-function-failed` (future thinking, `actions-failed`).
- `error.message` is the thrown error message, probably in cases duplicating some of the `error.reason` part, eg `error decrypting alert: <message from ESO here>`.
We'd calculate this value whenever it was time to run the alert function. We'd like to capture ANY failure, from work before the function runs, including getting the alert, decrypting it, etc; up through the actual execution. Not sure if there's anything we'd want to catch AFTER the alert function runs (ignoring action execution for the moment), like an error during some kind of cleanup.
Once the alert execution is complete, successful or not, we update this field in the alert SO.
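A rough sketch of that flow (hedged: types are simplified, and the saved objects client shape here is a stand-in to show the idea of piggy-backing the status on the post-run update):

```ts
// Sketch of capturing executionStatus around an alert run and persisting it on the alert SO.
interface AlertExecutionStatus {
  status: 'OK' | 'Active' | 'Error' | 'NoData';
  date: string;
  error: { reason: string; message: string } | null;
}

async function runAlertAndRecordStatus(
  alertId: string,
  run: () => Promise<{ status: 'OK' | 'Active' | 'NoData' }>,
  savedObjectsClient: { update(type: string, id: string, attrs: object): Promise<unknown> }
): Promise<void> {
  let executionStatus: AlertExecutionStatus;
  try {
    // Any failure before or during the executor (get, decrypt, execute) lands in the catch.
    const { status } = await run();
    executionStatus = { status, date: new Date().toISOString(), error: null };
  } catch (err) {
    executionStatus = {
      status: 'Error',
      date: new Date().toISOString(),
      // The reason would be narrowed further (get-failed, decryption-failed, ...) in practice.
      error: { reason: 'alert-function-failed', message: (err as Error).message },
    };
  }
  // Successful or not, persist the result on the alert saved object.
  await savedObjectsClient.update('alert', alertId, { executionStatus });
}
```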
Note also that it seems like we'll want almost the exact same thing for actions, calculated during action execution, and stored in the action SO.
To capture the NoData status, I think we should change the alert type executor response. Instead of sending the state at the root, we can nest it under state. The new return shape could look something like:
    noData?: boolean;
    state: Record<string, unknown>;
Though I'm not sure how to handle the case where noData is returned as true and the executor scheduled actions. Maybe ignore the no data indicator and log a warning, or throw an error, ..?
My original thought was to add a service method to indicate the alert was in a "no data" condition, but this would of course work as well. Probably better, as it's more amenable to a quick check after the business logic and an early return.
Ya, I think if noData is true and they've scheduled actions, ignore the noData and log warnings.
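A small sketch of that reconciliation (hedged: `hasScheduledActions` and `logger` are stand-ins for framework internals, not real APIs):

```ts
// Sketch of reconciling the executor's noData flag with scheduled actions.
function resolveStatus(
  result: { noData?: boolean },
  hasScheduledActions: boolean,
  logger: { warn(msg: string): void }
): 'OK' | 'Active' | 'NoData' {
  if (result.noData && hasScheduledActions) {
    // Contradictory signals: trust the scheduled actions and just warn.
    logger.warn('executor returned noData: true but scheduled actions; ignoring noData');
    return 'Active';
  }
  if (result.noData) {
    return 'NoData';
  }
  return hasScheduledActions ? 'Active' : 'OK';
}
```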
Just happened to think in a chat with Mike, we'll have the opportunity to "migrate" old alerts to contain data in this new executionStatus object, but how could we possibly get data to put in it? Presumably we could get some parts of it from the alert state, but I don't think you can access other SOs during a migration (seems horribly complicated!).
I think we're only talking about the status and date fields - the error field can always be null.
And it's not really important what's in the SO itself, but what we return from alertClient methods and http requests. So, do we want these to be optional? What a PITA that would be, when the only possible time they could be null is right after a migration, up until the alert function is executed for the first time after a migration.
Thinking we can have another status value of "unknown", that we can use in a case like this, and may come in handy later as well. We'll want to add a release note about this, if it ends up showing up in the UI - not sure it will or not.
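A rough sketch of what that backfill could look like (hedged: the doc shape here is simplified and this is not the real saved-object migration signature):

```ts
// Sketch of a migration that backfills executionStatus on existing alert saved objects.
interface AlertDoc {
  id: string;
  attributes: Record<string, unknown>;
}

function addDefaultExecutionStatus(doc: AlertDoc): AlertDoc {
  return {
    ...doc,
    attributes: {
      ...doc.attributes,
      // We can't know the real status at migration time, so default to 'unknown';
      // it gets overwritten the first time the alert runs after the upgrade.
      executionStatus: {
        status: 'unknown',
        date: new Date().toISOString(),
        error: null,
      },
    },
  };
}
```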
I don't think we will, looking at the current web ui. But that made me realize we probably want this new status field in the alerts table view:
