Currently, when creating a date_histogram based group-by, the UI creates a configuration that maps certain date formats to date histogram intervals:
export enum DATE_HISTOGRAM_FORMAT {
  ms = 'yyyy-MM-dd HH:mm:ss.SSS',
  s = 'yyyy-MM-dd HH:mm:ss',
  m = 'yyyy-MM-dd HH:mm',
  h = 'yyyy-MM-dd HH:00',
  d = 'yyyy-MM-dd',
  M = 'yyyy-MM-01',
  y = 'yyyy',
}
The available intervals in the UI are 1m, 1h, 1d, 1w, 1M, 1q, 1y. So first of all, this is missing mappings for 1w and 1q.
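The gap can be checked mechanically. The following is a minimal sketch, assuming each UI interval is a count followed by a unit and that the unit is looked up as a key of the enum above. The names FORMAT_UNITS, UI_INTERVALS and intervalUnit are hypothetical and not taken from the Kibana code base.

```typescript
// Keys of DATE_HISTOGRAM_FORMAT as listed above.
const FORMAT_UNITS: string[] = ['ms', 's', 'm', 'h', 'd', 'M', 'y'];

// Intervals offered in the UI, as listed above.
const UI_INTERVALS: string[] = ['1m', '1h', '1d', '1w', '1M', '1q', '1y'];

// Hypothetical helper: strip the leading count, e.g. '1w' -> 'w'.
function intervalUnit(interval: string): string {
  return interval.replace(/^\d+/, '');
}

// Intervals whose unit has no entry in the format mapping.
// The lookup is case-sensitive, so 'm' (minute) and 'M' (month) stay distinct.
const unmappedIntervals: string[] = UI_INTERVALS.filter(
  (interval) => FORMAT_UNITS.indexOf(intervalUnit(interval)) === -1
);

console.log(unmappedIntervals); // [ '1w', '1q' ]
```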
Another thing to re-evaluate: Do we want the date formats to become less granular as the intervals get larger, as in the mappings above? For example, is it even desirable to use yyyy instead of the default or something more fine-grained like yyyy-MM-dd HH:mm:ss.SSS?
Here's the background why this was done this way originally:
- We wanted short values like 2019 instead of 2019-01-01 00:00:00 ... to avoid those long dates in the preview table.
- With epoch_millis or something like yyyy-MM-dd HH:mm:ss.SSS, you could not tell which interval the group-by is based on just by looking at destination index documents. You'd have to look back at the job's group-by config to learn which interval the destination index is based on.
Update 2019-06-27
After discussions, this is what we need to do:
- Once the _preview endpoint returns a proposed mapping, update the preview table to format date fields in a human-readable format.
Pinging @elastic/ml-ui
Two things are in play here: 1) the visual presentation of information from _preview, and 2) the shape of the data that is being transformed. Unfortunately, the formatting of the group-by fields when using a date_histogram affects both.
Looking at the big picture, the transform is the primary functionality, not _preview.
When deducing mappings from the source to dest index, I believe we want to keep date fields as date fields as a general policy. _A user may want to map a source date field to a dest string (for example CY-2019) and this will be possible by creatively using the API, but our "standard" operation should map a date field to a date field._
If we assume we are mapping to a date field, then ideally the data from the aggregation that is indexed should contain the full-fidelity date. That is, 2019 is not enough; this should be the full epoch_ms value or an ISO date. So far we have seen composite aggs return epoch_ms provided the date format is not specified.
Our UI should be used to create "best practice" data frame configurations. With this in mind, I don't think we should be specifying a format that reduces the fidelity of the data.
This leaves us with a few areas to think about
_preview should be human read-able
I don't agree that the date should be as short as possible, but human-readable is something to aim for. 2019-01-01 00:00 is fine; epoch_ms is not. We are better off using a well-recognised date format, as it avoids special handling for 1w and 1q.
Does this need to be part of the config? Could this be handled in the UI? Could we have _preview?pretty=human?
the deduced mappings logic seems flawed
Described in https://github.com/elastic/kibana/issues/38926. Because yyyy is the last formatting option, 2019 is assumed to be 2019 ms after epoch start.
people can still use the API and add in custom formats
Can we add any validations to help avoid failures?
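To make the mapping problem above concrete, here is a small sketch of what happens when a value produced by the yyyy format is read as epoch milliseconds. It only illustrates the arithmetic in plain TypeScript; it does not reproduce how Elasticsearch resolves date formats, and the variable names are hypothetical.

```typescript
// The value a 'yyyy' format produces for a yearly bucket.
const formattedBucket: string = '2019';

// Read as epoch milliseconds: 2019 ms after 1970-01-01T00:00:00Z.
const misread: Date = new Date(Number(formattedBucket));

// What was actually meant: the start of the year 2019 (UTC).
const intended: Date = new Date(Date.UTC(2019, 0, 1));

console.log(misread.toISOString());  // 1970-01-01T00:00:02.019Z
console.log(intended.toISOString()); // 2019-01-01T00:00:00.000Z
console.log(intended.getTime());     // 1546300800000
```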
We discussed this on the weekly data frame sync call.
the deduced mappings logic seems flawed
This is definitely the case. If we support custom formatting options then the custom format should be the _only_ format in the mappings for that field.
A new consideration that emerged is that, for Kibana timezones to work, dates need to be supplied to Kibana in epoch millisecond format so that Kibana can convert them to text _in the configured timezone_. If the dates are already formatted as text by the backend then the timezone is likely to be wrong in Kibana.
Therefore we decided that it would be best if data frame transforms did not permit custom formats at all, and always used the default output of aggregations which is epoch milliseconds. Then the UI needs to format these to human readable strings in the correct timezone. To do this it needs to know which fields represent dates. When displaying search results this information is in the index mappings and the problem is no different to displaying any search results in Kibana. For the _preview endpoint it would be possible for us to return some mappings as part of the _preview response. The current _preview response is of the form:
{
  "preview" : [
    {
      ... doc1 ...
    },
    {
      ... doc2 ...
    },
    ...
  ]
}
This could be expanded to include the mappings:
{
  "mappings" : {
    "field1" : {
      "type" : "date"
    },
    ...
  },
  "preview" : [
    {
      ... doc1 ...
    },
    {
      ... doc2 ...
    },
    ...
  ]
}
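As a rough sketch of the UI side, assuming the expanded response shape above, the preview table could use the mappings to find date fields and render their epoch millisecond values in the configured timezone. The names PreviewResponse, formatEpochMillis and formatPreview are hypothetical, the output pattern yyyy-MM-dd HH:mm:ss is only one possible choice, and this is not how Kibana implements it.

```typescript
// Assumed shape of the expanded _preview response sketched above.
interface PreviewResponse {
  mappings: Record<string, { type: string }>;
  preview: Array<Record<string, unknown>>;
}

// Hypothetical helper: render epoch milliseconds as 'yyyy-MM-dd HH:mm:ss'
// in the given IANA time zone, using the standard Intl API.
function formatEpochMillis(epochMillis: number, timeZone: string): string {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone,
    hour12: false,
    year: 'numeric',
    month: '2-digit',
    day: '2-digit',
    hour: '2-digit',
    minute: '2-digit',
    second: '2-digit',
  }).formatToParts(new Date(epochMillis));
  const get = (type: string): string => {
    const part = parts.filter((p) => p.type === type)[0];
    const value = part ? part.value : '';
    // Defensive: left-pad to two digits where the runtime did not.
    return value.length < 2 ? '0' + value : value;
  };
  // Defensive: some runtimes report midnight as hour 24 when hour12 is false.
  const hour = get('hour') === '24' ? '00' : get('hour');
  return `${get('year')}-${get('month')}-${get('day')} ${hour}:${get('minute')}:${get('second')}`;
}

// Format only numeric values of fields that the mappings declare as dates;
// every other value is passed through unchanged.
function formatPreview(
  response: PreviewResponse,
  timeZone: string
): Array<Record<string, unknown>> {
  return response.preview.map((doc) => {
    const row: Record<string, unknown> = {};
    Object.keys(doc).forEach((field) => {
      const mapping = response.mappings[field];
      const value = doc[field];
      row[field] =
        mapping && mapping.type === 'date' && typeof value === 'number'
          ? formatEpochMillis(value, timeZone)
          : value;
    });
    return row;
  });
}
```

Because the documents themselves keep the epoch millisecond values, the same response can be rendered for users in different timezones without changing the transform.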
To summarise:
- Custom formats will not be permitted for date_histograms in data frame transforms
- The _preview endpoint will return a preview of the mappings as well as the preview documents
_preview should be human read-able
We could possibly do this too, adding support for human=true to the _preview response. However, the UI would not use this.
For those finding this issue and looking for a solution to index date_histogram timestamps with a custom date time format:
You can create an ingest pipeline:
PUT _ingest/pipeline/custom_date
{
  "description" : "Set a custom date",
  "processors" : [
    {
      "date" : {
        "field" : "timestamp",
        "target_field" : "my-timestamp",
        "formats" : ["UNIX_MS"],
        "output_format" : "yyyy-MM-dd HH:mm:ss"
      }
    }
  ]
}
and specify it as part of your transform:
"dest": {
  "index": "hourly_data_aggregations",
  "pipeline": "custom_date"
},
Note that you need to create the transform destination index with the right mapping.