Currently it is only possible to query for anomaly data points by job_id. The problem with the job_id is that it's not easy to query for specific attributes; mostly we have to parse the job_id on the client to determine which service or transaction type a data point represents.
Example
A job id might be opbeans-node-request-high_mean_response_time. We can make a helper function that extracts the service name (opbeans-node) and transaction type (request). But a job could span all transaction types and would then not include a transaction type in its id: opbeans-node-high_mean_response_time. Additionally, we are soon going to add support for jobs per environment: opbeans-node-production-high_mean_response_time (where "production" is the environment). This makes parsing the job_id fragile.
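To illustrate why, here is a minimal sketch (TypeScript, with a hypothetical parseJobId helper) of the kind of client-side parsing we would need today; it only works for one particular job id layout and silently returns the wrong values for the other variants mentioned above:

// Hypothetical client-side parser for ids like
// "opbeans-node-request-high_mean_response_time".
const JOB_SUFFIX = '-high_mean_response_time';

function parseJobId(jobId: string) {
  // Strip the well-known suffix.
  const prefix = jobId.replace(JOB_SUFFIX, '');

  // Assume the last dash-separated token is the transaction type.
  const parts = prefix.split('-');
  const transactionType = parts.pop();
  const serviceName = parts.join('-');
  return { serviceName, transactionType };
}

// parseJobId('opbeans-node-request-high_mean_response_time')
//   => { serviceName: 'opbeans-node', transactionType: 'request' }
// parseJobId('opbeans-node-high_mean_response_time')
//   => { serviceName: 'opbeans', transactionType: 'node' }  // wrong: the id has no transaction type
// An environment-specific id like 'opbeans-node-production-high_mean_response_time'
// would be mis-parsed in a similar way ('production' is an environment, not a transaction type).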
Instead I propose that ML data points should contain user-defined tags. This is how I'd like to be able to query for anomaly data:
Get anomaly data:
GET .ml-anomalies-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "result_type": "record" } },
        { "term": { "service.name": "opbeans-node" } },
        { "term": { "service.environment": "production" } },
        { "term": { "transaction.type": "request" } }
      ]
    }
  }
}
Create ML job
This is how I propose the API for creating an ML job should look:
POST /api/ml/modules/setup/apm_transaction
{
  index: 'apm-*',
  tags: {
    "service.name": "opbeans-node",
    "service.environment": "production",
    "transaction.type": "request"
  },
  startDatafeed: true,
  query: {
    bool: {
      filter: {}
    }
  }
}
Pinging @elastic/ml-ui (:ml)
It looks like the proposed change would need to go in the job config and so should be an Elasticsearch issue. @droberts195 would you agree?
One possible way to implement this would be to add such a tags section to custom_settings, which can contain job metadata.
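For example, the tags could live under an arbitrary key inside custom_settings in the job configuration (a sketch only; job_tags is just an example key name, and custom_settings lives on the job config rather than on each result document):

"custom_settings": {
  "job_tags": {
    "service.name": "opbeans-node",
    "service.environment": "production",
    "transaction.type": "request"
  }
}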
It looks like the proposed change would need to go in the job config and so should be an Elasticsearch issue.
Yes, certainly part of the request is on the Elasticsearch side. It's asking for extra fields in every result written by the anomaly detector.
There is another side to this though, which is that once we complete the "ML in spaces" project it won't be desirable for Kibana apps to search the ML results index directly; instead they should go through APIs in the ML UI. In the example of searching results by tag, no job ID is specified, so that implies the ML UI would have to provide a space-aware results endpoint that can search for results by tag while taking into account which jobs are visible in the current space.
So this functionality is non-trivial both on the Elasticsearch side and the Kibana side.
Maybe job groups could achieve what is required here. It's getting late in my day, but another day we should think through more carefully how job groups could be used instead of adding more functionality that is doing something quite similar. If the job groups feature doesn't work as it stands then it may be better to meet this requirement by enhancing job groups rather than adding new overlapping functionality and then having someone in the future ask why we have both tags and job groups.
We discussed this on a Zoom call.
It turns out there shouldn't be a need to aggregate different values of service.environment in the same query - it doesn't make sense to combine results from testing and production for example. So it's OK that there are separate jobs per environment whose results cannot easily be aggregated.
We already add the values of a job's "by" and "partition" fields to the results it writes. Therefore we agreed the requirement can be met by configuring "by_field_name" : "transaction.type" and "partition_field_name" : "service.name" for every detector in each job.
It will then be possible to do terms aggregations or terms filtering on documents with "result_type" : "record" using the fields service.name and transaction.type, which will be present in such documents.
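For example, a sketch of the kind of query this enables against the results index (the service name filter and the aggregation are illustrative):

GET .ml-anomalies-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "result_type": "record" } },
        { "term": { "service.name": "opbeans-node" } }
      ]
    }
  },
  "aggs": {
    "transaction_types": {
      "terms": { "field": "transaction.type" }
    }
  }
}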
@droberts195 Is there any difference between:
"by_field_name" : "transaction.type" and "partition_field_name" : "service.name"
vs
"by_field_name" : "service.name" and "partition_field_name" : "transaction.type"
Asking because by_field_name and partition_field_name seem very similar to me. I'd expect to define it like:
"dimensions": ["service.name", "transaction.type" ]
@sqren By and partition fields behave differently in how the results are aggregated up the results hierarchy.
With the by field, if multiple values are anomalous at the same time then the overall bucket is considered more anomalous. With the partition field we consider individual behaviours. So, use by_field_name if the values are (sometimes) related, use the partition_field_name if they are individual.
Based on what we know about your data it makes more sense for the config
"by_field_name" : "transaction.type" and "partition_field_name" : "service.name"
Thanks for the background @sophiec20.
I still have a few questions - please bear with me :)
So, use by_field_name if the values are (sometimes) related, use the partition_field_name if they are individual.
Based on what we know about your data it makes more sense for the config
"by_field_name" : "transaction.type" and "partition_field_name" : "service.name"
transaction.type and service.name are dimensions in the composite aggregation but are not themselves indicative of anomalies (transaction.duration.us is the anomalous part). So I'm not understanding why transaction.type is grouped by by_field_name and service.name by partition_field_name. I think of them as sibling/equivalent dimensions.
So something like this would intuitively make more sense to me:
by_field_name: ["service.name", "transaction.type"]
Based on what we know about your data it makes more sense for the config
Is this opbeans data, or apm data in general? Just wondering if we are optimizing for sample data, instead of real customer data.
The anomaly detection modelling is complex (see https://github.com/elastic/ml-cpp), so fundamental changes to the way jobs are configured and data is modelled are not trivial. It is not a visualisation of an aggregation, and there are significant bwc implications for both the modelling and the Elasticsearch APIs.
Some bulk APM data was made available to @blaklaybul last week and we are now working through the prototypes for job configurations as we've discussed above. It is always preferred to optimise against real customer data, provided this usage of the data is permitted. We are working with the data provided to us.
Once these prototype job configurations are ready we can walk through and explain results against data examples and show how this can support the requirement given regarding labelled results.
Okay, I just want to make sure we are on the same page.
What we are interested in is very much the same behaviour we get today by starting individual jobs. To simplify the experience for users it would be beneficial if we could start a single job where anomalies are separated by a number of dimensions (service.name, transaction.type and service.environment).
Do you see by_field_name and partition_field_name as temporary workarounds or as the permanent solution towards this goal?
The prototype @blaklaybul has made goes a long way by promoting service.name and transaction.type to first-class fields (via by_field_name and partition_field_name). These are added to the ML job and are propagated to the ML results, which is great! We were however not able to find a similar solution for service.environment, and we still need to be able to query for ML jobs that belong to a particular environment.
We've briefly talked about adding service.environment to ml jobs as a job group. We hoped this would allow us to retrieve jobs by environment but there are two problems with job groups:
Limited character set
According to the ML docs, job groups may only contain "lowercase alphanumeric characters (a-z and 0-9), hyphens, and underscores. It must start and end with alphanumeric characters". Since we don't have a similar restriction for environment names, we would have to encode them before storing them as a job group. We can't use standard encodings like URL or base64 encoding, since both require characters that are not supported (%, =, uppercase letters etc).
Instead we would have to create a custom conversion, like lowercasing all letters and removing special characters (a sketch of such a conversion follows below). This is a lossy, irreversible operation that makes it impossible to retrieve the original value from the job group. Additionally it creates the risk of naming conflicts. Example: if two services have names that differ only in casing, they'll be converted to the same value.
User editable
If a user removes or edits a job group, the integration with APM will break. This would surprise users, so we should avoid it happening. Job groups are user facing and don't come with any warning that editing them might break integrations. This is understandable, since job groups were not made for the purpose we are using them for.
In short: using job groups for metadata is both complex and unreliable.
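To make the character-set problem concrete, here is a sketch (TypeScript, with a hypothetical toJobGroup helper) of the kind of lossy conversion that would be needed, and the collisions it can cause:

// Hypothetical conversion of an arbitrary environment name into a string that
// satisfies the job group restrictions: lowercase a-z, 0-9, hyphens, underscores.
function toJobGroup(environment: string): string {
  return environment
    .toLowerCase()
    .replace(/[^a-z0-9_-]/g, '-')   // replace unsupported characters
    .replace(/^[-_]+|[-_]+$/g, ''); // must start and end with alphanumerics
}

// The conversion is irreversible and can collide:
// toJobGroup('Production')   => 'production'
// toJobGroup('production')   => 'production'  // same group, different original value
// toJobGroup('Prod (EU #1)') => 'prod--eu--1'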
Suggestion
Instead of storing metadata that we want to query for as job groups, I suggest something like the "system tags" that alerting is also looking into.
System tags are similar to user-facing tags (job groups) except they do not restrict the character set any more than Elasticsearch does, and they will not be editable (and perhaps not even displayed) in the UI.
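Purely as an illustration of the idea (the system_tags field name is hypothetical; nothing like it exists in the job configuration today), a job could carry arbitrary, non-user-editable metadata such as:

"system_tags": {
  "service.environment": "Production (EU)"
}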
Timeline
The plan is still to ship the new ML job in 7.9, but we'll need some way to retrieve ML jobs by service.environment. We could ignore the drawbacks listed above and use job groups, but this would introduce complexity and make the integration fragile, and we wouldn't easily be able to migrate to system tags should they become available at a later stage.
Having system tags available in 7.9 is therefore a high priority to APM.
~Another reason something like system tags would be beneficial: since they are indexed, any filtering can be done in ES. Right now in order to implement something similar with job groups, you have to fetch all ML jobs, and do any filtering/matching of group strings in app code.~
My mistake I'm thinking of something else.
For 7.9 APM will use the existing custom_settings field in the ML job to tag the environment, by passing a jobOverrides parameter to the ML modules setup function, in the form:
jobOverrides: [
  {
    custom_settings: {
      job_tags: { environment },
    },
  },
],
This custom_settings field can then be used on the Kibana side to filter by environment as required.
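A sketch of the kind of Kibana-side filtering this enables (the job shape is simplified and the getJobsForEnvironment helper is hypothetical; the real code goes through the ML plugin's APIs):

// Simplified shape of a job as returned to APM for this purpose.
interface ApmAnomalyJob {
  job_id: string;
  custom_settings?: {
    job_tags?: { environment?: string };
  };
}

// Hypothetical helper: keep only the jobs tagged with the selected environment.
function getJobsForEnvironment(jobs: ApmAnomalyJob[], environment: string): ApmAnomalyJob[] {
  return jobs.filter(
    (job) => job.custom_settings?.job_tags?.environment === environment
  );
}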
However, it is acknowledged that this solution is not ideal, so as part of the ongoing project to make ML jobs space-aware, work will start in 7.10 to store ML jobs as Kibana saved objects, which will allow us to store metadata, such as 'system tags', as part of the saved object. This will bring a number of advantages.
This sounds great @peteharverson!
I'll add that in addition to storing metadata in custom_settings we also store it in groups.
Have you thought more about the migration from custom_settings and groups to the saved objects? Would this happen automatically when the user upgrades?
Good question @sqren. Yes, we are planning on adding a number of checks around the Spaces / Saved Objects on start-up when the user upgrades, and this should definitely include checking for the job_tags that APM are using in 7.9 so that they can be added to the saved object metadata.
^^ My ER from 3 years ago now has a chance! :)