Kibana: [ML] Data Frames: Fix automatic date formatting regression in preview tables

Created on 19 Jun 2019 · 4 Comments · Source: elastic/kibana

Currently, when creating a date_histogram-based group-by, the UI creates a configuration that maps each date histogram interval unit to a date format:

export enum DATE_HISTOGRAM_FORMAT {
  ms = 'yyyy-MM-dd HH:mm:ss.SSS',
  s = 'yyyy-MM-dd HH:mm:ss',
  m = 'yyyy-MM-dd HH:mm',
  h = 'yyyy-MM-dd HH:00',
  d = 'yyyy-MM-dd',
  M = 'yyyy-MM-01',
  y = 'yyyy',
}

The available intervals in the UI are 1m, 1h, 1d, 1w, 1M, 1q, 1y. So first of all, this is missing mappings for 1w and 1q.
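The gap can be checked mechanically. The following is a minimal sketch, assuming the lookup works by stripping the leading count from the interval and using the remaining unit as the enum key; `UI_INTERVALS` and `formatForInterval` are hypothetical names for illustration, not the actual Kibana code.

```typescript
// Copied from the issue: date formats keyed by date histogram unit.
enum DATE_HISTOGRAM_FORMAT {
  ms = 'yyyy-MM-dd HH:mm:ss.SSS',
  s = 'yyyy-MM-dd HH:mm:ss',
  m = 'yyyy-MM-dd HH:mm',
  h = 'yyyy-MM-dd HH:00',
  d = 'yyyy-MM-dd',
  M = 'yyyy-MM-01',
  y = 'yyyy',
}

// Intervals offered in the UI, as listed in the issue.
const UI_INTERVALS = ['1m', '1h', '1d', '1w', '1M', '1q', '1y'];

// Hypothetical helper: strip the leading count to get the unit, then look it up.
function formatForInterval(interval: string): string | undefined {
  const unit = interval.replace(/^\d+/, '');
  return Object.prototype.hasOwnProperty.call(DATE_HISTOGRAM_FORMAT, unit)
    ? DATE_HISTOGRAM_FORMAT[unit as keyof typeof DATE_HISTOGRAM_FORMAT]
    : undefined;
}

// Intervals for which no format is defined.
const missing = UI_INTERVALS.filter((i) => formatForInterval(i) === undefined);
console.log(missing); // [ '1w', '1q' ]
```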

Another thing to re-evaluate: do we want the date formats to become less granular as the intervals grow, as in the mappings above? For example, is it even desirable to use yyyy instead of the default, or something finer-grained like yyyy-MM-dd HH:mm:ss.SSS?

Here's the background why this was done this way originally:

  • In the UI, my thinking was that it might be more useful to have a date as short as possible suitable for the group-by interval, for example 2019 instead of 2019-01-01 00:00:00 ... to avoid those long dates in the preview table.
  • An additional thought was that a format like this returns dates based on the interval chosen for the group-by. It allows you to infer/recognize the interval just by looking at the destination index documents. If the format was just epoch_millis or something like yyyy-MM-dd HH:mm:ss.SSS, you could not tell which interval the group-by is based on just by looking at destination index documents. You'd have to look back at the job's group-by config to learn which interval the destination index is based on.

Update 2019-06-27

After discussions, this is what we need to do:

  • [x] Remove all code that adds a format automatically
  • [x] Once the _preview endpoint returns a proposed mapping, update the preview table to properly format date fields to human readable formats.

Labels: :ml, Transforms, bug, v7.3.0, v7.4.0

All 4 comments

Pinging @elastic/ml-ui

Two things are in play here: 1) the visual presentation of information from _preview and 2) the shape of the data that is being transformed. Unfortunately, the formatting of the group-by fields when using a date_histogram affects both.

Looking at the big picture, the transform is the primary functionality, not _preview.

When deducing mappings from the source to dest index, I believe we want to keep date fields as date fields as a general policy. _A user may want to map a source date field to a dest string (for example CY-2019) and this will be possible by creatively using the API, but our "standard" operation should map a date field to a date field._

If we assume we are mapping to a date field, then ideally the data from the aggregation that is indexed should contain the full-fidelity date, i.e. 2019 is not enough; this should be the full epoch_ms value or an ISO date. So far we have seen composite aggs return epoch_ms provided the date format is not specified.

Our UI should be used to create "best practice" data frame configurations. With this in mind, I don't think we should be specifying a format that reduces the fidelity of the data.

This leaves us with a few areas to think about:

  1. _preview should be human read-able
    I don't agree that the date should be as short as possible, but human-readable is something to aim for. 2019-01-01 00:00 is fine; epoch_ms is not. We are better off using a well-recognised date format, as it avoids special handling for 1w and 1q.
    Does this need to be part of the config? Could this be handled in the UI? Could we have _preview?pretty=human?

  2. the deduced mappings logic seems flawed
    Described in https://github.com/elastic/kibana/issues/38926. Because yyyy is the last formatting option, 2019 is assumed to be 2019 ms after epoch start.

  3. people can still use the API and add in custom formats
    Can we add any validations to help avoid failures?
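To make the second point concrete, here is a small sketch of the symptom only: the same value "2019" read as epoch milliseconds versus read as a year. It uses plain JavaScript dates for illustration and does not reproduce Elasticsearch's actual date parsing.

```typescript
// A group-by key formatted with 'yyyy' arrives as the string "2019".
const formatted = '2019';

// Read as epoch milliseconds, it lands about two seconds after the epoch start.
const asEpochMillis = new Date(Number(formatted));
console.log(asEpochMillis.toISOString()); // 1970-01-01T00:00:02.019Z

// Read as a year, which is what the format intended.
const asYear = new Date(Date.UTC(Number(formatted), 0, 1));
console.log(asYear.toISOString()); // 2019-01-01T00:00:00.000Z
```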

We discussed this on the weekly data frame sync call.

the deduced mappings logic seems flawed

This is definitely the case. If we support custom formatting options then the custom format should be the _only_ format in the mappings for that field.

A new consideration that emerged is that, for Kibana timezones to work, dates need to be supplied to Kibana in epoch millisecond format so that Kibana can convert them to text _in the configured timezone_. If the dates are already formatted as text by the backend, the timezone is likely to be wrong in Kibana.

Therefore we decided that it would be best if data frame transforms did not permit custom formats at all, and always used the default output of aggregations, which is epoch milliseconds. The UI then needs to format these to human-readable strings in the correct timezone. To do this, it needs to know which fields represent dates. When displaying search results, this information is in the index mappings, and the problem is no different to displaying any search results in Kibana. For the _preview endpoint, it would be possible for us to return some mappings as part of the _preview response. The current _preview response is of the form:

{
  "preview" : [
    {
     ... doc1 ...
    },
    {
     ... doc2 ...
    },
    ...
  ]
}

This could be expanded to include the mappings:

{
  "mappings" : {
    "field1" : {
      "type" : "date"
    },
    ...
  },
  "preview" : [
    {
     ... doc1 ...
    },
    {
     ... doc2 ...
    },
    ...
  ]
}

To summarise:

  1. Validation will disallow custom formats in date_histograms in data frame transforms
  2. Always store dates in epoch milliseconds format in the data frame destination index
  3. The _preview endpoint will return a preview of the mappings as well as the preview documents
  4. The UI will format timestamps using the configured timezone for fields in the preview output that have date mappings in the preview mappings
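Point 4 could look roughly like the sketch below. It assumes the expanded response shape proposed above with flat field names, and it uses the built-in `Intl.DateTimeFormat` rather than whatever date library Kibana actually uses; `PreviewResponse`, `formatPreview`, and the sample values are hypothetical.

```typescript
// Assumed response shape, following the proposal above.
interface PreviewResponse {
  mappings: Record<string, { type: string }>;
  preview: Array<Record<string, unknown>>;
}

// Hypothetical helper: render epoch-millisecond values of date-mapped fields
// as text in the given IANA time zone; leave all other values untouched.
function formatPreview(
  resp: PreviewResponse,
  timeZone: string
): Array<Record<string, unknown>> {
  const fmt = new Intl.DateTimeFormat('en-GB', {
    timeZone,
    year: 'numeric',
    month: '2-digit',
    day: '2-digit',
    hour: '2-digit',
    minute: '2-digit',
    second: '2-digit',
  });
  return resp.preview.map((doc) => {
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(doc)) {
      const value = doc[key];
      const isDate = resp.mappings[key]?.type === 'date' && typeof value === 'number';
      out[key] = isDate ? fmt.format(new Date(value as number)) : value;
    }
    return out;
  });
}

// Illustrative data: 1546300800000 is 2019-01-01T00:00:00Z.
const sample: PreviewResponse = {
  mappings: { timestamp: { type: 'date' } },
  preview: [{ timestamp: 1546300800000, count: 3 }],
};

console.log(formatPreview(sample, 'UTC'));
console.log(formatPreview(sample, 'America/New_York'));
```

The same stored value renders as a 2019 date in UTC and as a 2018 date in New York, which is the reason for keeping epoch milliseconds in the index and formatting only at display time.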

_preview should be human read-able

We could possibly do this too, adding support for human=true to the _preview response. However, the UI would not use this.

For those finding this issue and looking for a solution to index date_histogram timestamps with a custom date time format:

You can create an ingest pipeline:

PUT _ingest/pipeline/custom_date
{
  "description" : "Set a custom date",
  "processors" : [
    {
      "date" : {
        "field" : "timestamp",
        "target_field" : "my-timestamp",
        "formats" : ["UNIX_MS"],
        "output_format" : "yyyy-MM-dd HH:mm:ss"
      }
    }
  ]
}

and specify it as part of your transform:

"dest":{
  "index":"hourly_data_aggregations",
  "pipeline": "custom_date"
},

Note that you need to create the transform destination index with the right mapping.
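As an illustration of what "the right mapping" might mean here, the sketch below builds a request body for creating the destination index before the transform starts, using the typeless mapping syntax of Elasticsearch 7.x. It assumes the group-by field is named timestamp, as in the pipeline above; the variable name is hypothetical and the field list is not exhaustive.

```typescript
// Hypothetical request body for `PUT hourly_data_aggregations`.
const destinationIndexBody = {
  mappings: {
    properties: {
      // Assumption: the transform's date_histogram group-by is named "timestamp"
      // and is written as epoch milliseconds.
      timestamp: { type: 'date' },
      // The format here should match the pipeline's output_format so that the
      // formatted string is parsed as a date rather than treated as plain text.
      'my-timestamp': { type: 'date', format: 'yyyy-MM-dd HH:mm:ss' },
    },
  },
};

console.log(JSON.stringify(destinationIndexBody, null, 2));
```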
