Kibana: API to get all active instances from Observability consumers

Created on 29 Jun 2020 · 9 Comments · Source: elastic/kibana

On the new Observability Overview page, we're planning to show two charts to give the user a clear picture of which alerts are active at the moment.

In this chart, we want to show all active instances for all observability plugins (APM/Logs/Uptime/Metrics) grouped by type.
Screenshot 2020-06-25 at 12 40 51

And in this one, we want to show some alert detail and the number of active instances next to it.
Screenshot 2020-06-29 at 10 21 32

Current situation:
With the current API, I first have to call _find to get all created alerts, then filter them by Observability plugin (APM/Logs/Uptime/Metrics), and then make one HTTP call per alert to get its active instances.

What I need:
An API that returns all active instances and the alert details, with the possibility to filter by consumer and alert type.

Example API:

alerting.getInstances({ active: true, consumers: ['apm', 'uptime', 'metrics'] })

Example response:

[
  {
    "id": "b5ef31a1-7c9f-47f5-a0d4-69169fc2f407",
    "params": {
      "threshold": 1,
      "aggregationType": "avg",
      "windowSize": 5,
      "windowUnit": "m",
      "transactionType": "request",
      "environment": "ENVIRONMENT_ALL",
      "serviceName": "opbeans-java"
    },
    "consumer": "apm",
    "alertTypeId": "apm.transaction_duration",
    "schedule": {
      "interval": "10s"
    },
    "actions": [
      {
        "actionTypeId": ".webhook",
        "group": "threshold_met",
        "params": {
          "body": "{\"transaction\": \"transaction\"}"
        },
        "id": "4e6a507f-1238-49c1-8b55-c19e42076543"
      }
    ],
    "tags": ["apm", "service.name:opbeans-java"],
    "name": "Transaction duration | opbeans-java",
    "throttle": "15s",
    "enabled": true,
    "apiKeyOwner": "elastic",
    "createdBy": "elastic",
    "updatedBy": "elastic",
    "createdAt": "2020-06-25T14:27:19.820Z",
    "muteAll": false,
    "mutedInstanceIds": [],
    "scheduledTaskId": "fad2cf20-b6ef-11ea-9623-a57005710a46",
    "updatedAt": "2020-06-25T14:27:21.257Z",

    //All active instances
    "alertInstances": [
      {
        "state": {},
        "meta": {
          "lastScheduledActions": {
            "group": "threshold_met",
            "date": "2020-06-29T08:31:38.802Z"
          }
        }
      }
    ]
  }
]
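
For the first chart, a response in this shape could be reduced to per-type counts on the client. The sketch below is only an illustration: `AlertWithInstances` mirrors a subset of the fields in the example response, and `countActiveInstancesByType` is a hypothetical helper, not part of any existing Kibana API.

```typescript
// Hypothetical subset of the example response; only the fields the charts need.
interface AlertWithInstances {
  id: string;
  name: string;
  consumer: string;
  alertTypeId: string;
  alertInstances: Array<{ state: Record<string, unknown>; meta: Record<string, unknown> }>;
}

// Sum the active instances per alert type, e.g. for a chart grouped by type.
function countActiveInstancesByType(alerts: AlertWithInstances[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const alert of alerts) {
    const previous = counts.get(alert.alertTypeId) ?? 0;
    counts.set(alert.alertTypeId, previous + alert.alertInstances.length);
  }
  return counts;
}

// Usage with data shaped like the example response.
const exampleAlerts: AlertWithInstances[] = [
  {
    id: 'b5ef31a1-7c9f-47f5-a0d4-69169fc2f407',
    name: 'Transaction duration | opbeans-java',
    consumer: 'apm',
    alertTypeId: 'apm.transaction_duration',
    alertInstances: [{ state: {}, meta: {} }],
  },
];
const countsByType = countActiveInstancesByType(exampleAlerts);
```

The second chart could use the same data directly, reading `name` and `alertInstances.length` from each entry.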

Alerting Alerting Services v7.10.0

Source

cauemarcondes

Most helpful comment

We've re-prioritized some work that - I think - will happen to work out very well for this requirement.

We will be formalizing the notion of an alert "status" per issue https://github.com/elastic/kibana/issues/51099 . We'll add a new status object to the alert saved object, which means you should be able to get the status from the alertClient find() API (or equivalent http call), including usual saved object filtering, fields, etc. I think this would mean having to retrieve all the alerts with find(), and manually generating the numbers, based on the alert type and status.

That gets us back down from 1+x or even 2 api calls, down to 1! (but with more data than actually required, I think)
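
As a rough sketch of that client-side tally, assume each alert returned by find() carries its type and a status value. The field name `alertStatus` and the values used here are placeholders, since the actual status object is defined in the linked issue and not in this thread; `tallyByTypeAndStatus` is a hypothetical helper.

```typescript
// Assumed shape: the real status object comes from the linked issue, not from here.
interface AlertWithStatus {
  id: string;
  alertTypeId: string;
  alertStatus: string; // placeholder values such as 'active' or 'ok'
}

// Tally alerts per alert type and status from a single find() result.
function tallyByTypeAndStatus(alerts: AlertWithStatus[]): Map<string, Map<string, number>> {
  const tallies = new Map<string, Map<string, number>>();
  for (const { alertTypeId, alertStatus } of alerts) {
    const byStatus = tallies.get(alertTypeId) ?? new Map<string, number>();
    byStatus.set(alertStatus, (byStatus.get(alertStatus) ?? 0) + 1);
    tallies.set(alertTypeId, byStatus);
  }
  return tallies;
}

// Usage with made-up alerts.
const statusTallies = tallyByTypeAndStatus([
  { id: '1', alertTypeId: 'apm.transaction_duration', alertStatus: 'active' },
  { id: '2', alertTypeId: 'apm.transaction_duration', alertStatus: 'active' },
  { id: '3', alertTypeId: 'apm.transaction_duration', alertStatus: 'ok' },
]);
```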

I'm going to start working on this shortly, will note the PR here once it's under way.

pmuellr on 18 Aug 2020

All 9 comments

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

elasticmachine on 29 Jun 2020

Hello, APM service maps also need this capability for 7.9. We need to be able to display health indicators for all services in the service map that have active alert violations. Right now, we can only get it to work by calling getAlertState in parallel for each id we get from find, but that is prohibitively inefficient, especially for very large service maps. We need a way to get all the alert statuses in one go before we can integrate.

ogupte on 29 Jun 2020

@mikecote I see you've added this to "Long Term". This is something we hope to be able to have available in 7.10. Is that possible?

sqren on 22 Jul 2020

@sqren I went over the recording of the triage session we had for this issue. I think we needed more clarification on whether this issue was still needed or whether your requirements had changed based on the scope adjustment the homepage team made for 7.9 / 7.10. We placed it with the bulk APIs story (long term) and had an approach we believe could work for you without waiting on this API (some email thread from a few weeks ago).

@pmuellr can help on this. The approach that could work for now is to use the alert find API to get all observability related alerts (filter by alert type and/or consumer) and then use the task manager's fetch API for the alert's scheduledTaskId. With that result, each task will contain the state of an alert and you can then extract the instances from there.

We can always revisit and prioritize this issue no problem, probably in the scope of 7.11 once our work for GA is complete.

mikecote on 22 Jul 2020

@XavierM has a PR which adds aggregations to the SavedObjectsClient, can we take advantage of this here?

kobelb on 23 Jul 2020

@mikecote

... The approach that could work for now is to use the alert find API to get all observability related alerts (filter by alert type and/or consumer) and then use the task manager's fetch API for the alert's scheduledTaskId. With that result, each task will contain the state of an alert and you can then extract the instances from there.

_From another thread:_

I believe the workaround that has been suggested was similar to the approach mentioned earlier. It works by fetching the task state for each matching alert returned in the find call. In setups with few alerts configured, it will add a few requests to the page load. But what if we're trying to load a service map with 10, 20, or more services where each has an alert configured? Are we OK adding x number of single requests to our initial page load?

From the alerting plugin context, it might be possible to obtain multiple alert states in a single request, but it would require querying the task manager index filtered by the job ids obtained in the initial find. This would result in the initial page load adding a constant 2 additional requests instead of the suggested 1 + x requests.
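
A minimal sketch of that two-request flow, with both backends injected as functions so that nothing here depends on real Kibana signatures. `findAlerts`, `fetchTasksByIds`, and the document shapes are hypothetical stand-ins for the alert find call and a task manager query filtered by task ids.

```typescript
// Hypothetical shapes; only the fields needed to join alerts with their tasks.
interface FoundAlert {
  id: string;
  alertTypeId: string;
  scheduledTaskId?: string;
}

interface TaskDoc {
  id: string;
  serializedState: string; // task state as stored, still a string
}

// Pair each alert with the raw state of its scheduled task (pure, no I/O).
function joinAlertsWithTaskState(alerts: FoundAlert[], tasks: TaskDoc[]): Map<string, string> {
  const stateByTaskId = new Map<string, string>();
  for (const task of tasks) {
    stateByTaskId.set(task.id, task.serializedState);
  }
  const stateByAlertId = new Map<string, string>();
  for (const alert of alerts) {
    if (alert.scheduledTaskId === undefined) continue;
    const taskState = stateByTaskId.get(alert.scheduledTaskId);
    if (taskState !== undefined) stateByAlertId.set(alert.id, taskState);
  }
  return stateByAlertId;
}

// Request 1: find the alerts. Request 2: one bulk task query.
// The number of requests does not grow with the number of alerts.
async function getAlertStates(
  findAlerts: () => Promise<FoundAlert[]>,
  fetchTasksByIds: (taskIds: string[]) => Promise<TaskDoc[]>
): Promise<Map<string, string>> {
  const alerts = await findAlerts();
  const taskIds = alerts
    .map((alert) => alert.scheduledTaskId)
    .filter((taskId): taskId is string => taskId !== undefined);
  const tasks = await fetchTasksByIds(taskIds);
  return joinAlertsWithTaskState(alerts, tasks);
}
```

Keeping the join separate from the two fetches makes it easy to check the pairing logic without a running Kibana instance.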

ogupte on 28 Jul 2020

@pmuellr Thanks for the heads up - it sounds very exciting!

formgeist on 19 Aug 2020

Doing some bookkeeping, realized I didn't post the PR with the new 'alert status' field - it's here: https://github.com/elastic/kibana/pull/75553

But re-reading this and the original request, I realize that this still doesn't give us instance data, just the alert data. So that still leaves us in a 1 + n requests state: 1 find() request to get the alerts, and then n calls to get the instance data.

It feels to me like we'll end up needing some new APIs, and I don't think we've talked about what those might look like, so here's a rough sketch:

- A new method on the alerts client that takes find() parameters and returns instance data about all the matching alerts. Internally it would use find(), then make a single call (well, it would probably have to deal with pagination, but one "virtual" call) to the event log to query against all the alert SOs returned from find(). We'd likely need to process the events returned to get whatever data we're looking for, much like the current "get instance status" API (which returns instance data for a single alert).
- An HTTP API that calls that new alerts client API.
- Some changes to the event log to bypass the current checks on the saved object being queried for event data. Those checks are done for security reasons (you need to be able to read an alert to see its events), but we've already done that check in the find() call to get the list of alerts.

I should note this would be to get instance data beyond just the current state of known instances (eg, it could return data about _recent_ instances which are no longer active, like the current "get instance status" API). If we only need the current list of instances, or count of instances, it's possible we could do a query over task manager to get the current alert instance data. This also wouldn't contain any instance status data like errors. Here's what that task manager data looks like (note, it's stored as a JSON string today, so we'd need to parse it after fetching and can't search over these "fields"); this shows an alert with one active instance, host-1:

{
  "alertInstances": {
    "host-1": {
      "state": {},
      "meta": {
        "lastScheduledActions": {
          "group": "threshold met",
          "date": "2020-09-30T21:40:14.771Z"
        }
      }
    }
  },
  "previousStartedAt": "2020-09-30T21:40:14.664Z"
}
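
Because the state is stored as a JSON string, reading the active instances means parsing it first. A small sketch, assuming the parsed shape matches the example above; `parseActiveInstanceIds` is a hypothetical helper, and the types cover only the fields shown.

```typescript
// Assumed shape of the parsed task state, based only on the example above.
interface AlertTaskState {
  alertInstances?: Record<string, { state: Record<string, unknown>; meta: Record<string, unknown> }>;
  previousStartedAt?: string;
}

// Parse the serialized task state and return the ids of the instances it lists.
function parseActiveInstanceIds(serializedState: string): string[] {
  const parsed = JSON.parse(serializedState) as AlertTaskState;
  return Object.keys(parsed.alertInstances ?? {});
}

// Usage with the example state, serialized the way it would be stored.
const serializedExample = JSON.stringify({
  alertInstances: {
    'host-1': {
      state: {},
      meta: {
        lastScheduledActions: { group: 'threshold met', date: '2020-09-30T21:40:14.771Z' },
      },
    },
  },
  previousStartedAt: '2020-09-30T21:40:14.664Z',
});
const activeInstanceIds = parseActiveInstanceIds(serializedExample);
```

A count of active instances is then `activeInstanceIds.length`. Filtering or searching on these fields still has to happen after the fetch, since the string is opaque to queries.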

pmuellr on 30 Sep 2020
