Elasticsearch: Support for a fully numeric flattened field

Created on 25 Aug 2020 · 4 Comments · Source: elastic/elasticsearch

This issue is a spinoff of #43805 that focuses on a specific use case: supporting numeric fields in the flattened field.
We've discussed this internally and agreed that it is something that we'd like to provide.
This new field could be considered as the numeric version of the flattened field where all values should be parseable as numbers. The details of the implementation are still unclear but multiple ideas were shared internally:

  • We could reuse the framework added for the rank_feature query where field names could be indexed as terms and values as frequencies.

  • We could use points with multiple dimensions and/or prefixes/suffixes to index the pair field name, value.
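As a rough illustration of the two ideas above, here is a toy Python model. It is not Lucene or Elasticsearch code, and every class and method name is hypothetical. The first class treats each field name as a term and attaches the value to it, the way a term frequency would be attached. The second keeps (field name, value) pairs in sorted order so that a range lookup on one key becomes a binary search. Integer doc ids are assumed.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class TermFrequencyIndex:
    """Idea 1 (toy): the field name is the term, the value is stored like a frequency."""

    def __init__(self):
        # term (field name) -> {doc_id: value}
        self.postings = defaultdict(dict)

    def add(self, doc_id, fields):
        for name, value in fields.items():
            self.postings[name][doc_id] = value

    def value(self, doc_id, name):
        return self.postings[name].get(doc_id)


class PairPointIndex:
    """Idea 2 (toy): index (field name, value) pairs as sortable points."""

    def __init__(self):
        self.points = []  # sorted list of (name, value, doc_id)

    def add(self, doc_id, fields):
        for name, value in fields.items():
            self.points.append((name, value, doc_id))
        self.points.sort()

    def range(self, name, low, high):
        """Return the doc ids whose `name` value lies in [low, high]."""
        lo = bisect_left(self.points, (name, low))
        hi = bisect_right(self.points, (name, high, float("inf")))
        return sorted(doc for _, _, doc in self.points[lo:hi])
```

In this sketch the first encoding answers "what is the value of key K in document D" cheaply, while the second makes range filters on a single key natural. Which trade-off a real implementation would make is still open in this issue.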

This issue is a placeholder to provide feedback and updates on the overall plan (supporting a fully numeric flattened field).

:Search/Mapping >enhancement Search

All 4 comments

Pinging @elastic/es-search (:Search/Mapping)

Once we have this field, I guess the next question will be how to deal with objects that have a mix of strings and numbers. This makes me wonder whether we should try to fold this functionality into the existing flattened field, or start thinking about whether we could have a sort of wrapper that could redirect fields to either flattened or its numeric variant at both index and search time, e.g. something like this:

{
  "foo": {
    "type": "flattened",
    "numeric_field_pattern": [ "*.count" ]
  }
}

so that an object like

{
  "foo": {
    "tags": [ "x", "y" ],
    "count": 42
  },
  "bar": {
    "tags": [ "x" ],
    "count": 100
  }
}

would have its foo.tags/bar.tags fields indexed and searched with flattened while the foo.count/bar.count fields would be indexed and searched with the numeric variant.
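To make the routing idea concrete, here is a minimal Python sketch. It assumes that numeric_field_pattern entries behave like shell-style globs matched against the dotted leaf path; that semantics is a guess, since the parameter is only a proposal in this thread. The helper names flatten and route are hypothetical.

```python
from fnmatch import fnmatchcase


def flatten(obj, prefix=""):
    """Yield (dotted_path, leaf_value) pairs; list elements share their parent's path."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for item in obj:
            yield from flatten(item, prefix)
    else:
        yield prefix, obj


def route(obj, numeric_patterns):
    """Split leaves into keyword-style and numeric-style buckets by path pattern."""
    keyword, numeric = [], []
    for path, value in flatten(obj):
        if any(fnmatchcase(path, p) for p in numeric_patterns):
            numeric.append((path, float(value)))  # raises if not parseable as a number
        else:
            keyword.append((path, str(value)))
    return keyword, numeric


doc = {
    "foo": {"tags": ["x", "y"], "count": 42},
    "bar": {"tags": ["x"], "count": 100},
}
keyword, numeric = route(doc, ["*.count"])
```

With the object from the comment above, the tags leaves land in the keyword bucket and the count leaves in the numeric one. The same pattern check would have to run at search time so that queries on foo.count reach the numeric variant.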

@polyfractal brought up the good point that in some telemetry use cases, all values represent counts. This type of data is similar to a histogram, but with labeled buckets. For example, we could be tracking the usage of every aggregation:

{
  "agg_usage": {
    "terms": 101,
    "date_histogram": 2450,
    ...
  }
}

It would be natural to perform a histogram-like aggregation on agg_usage to sum up the counts for each entry terms, date_histogram, etc. When designing the feature, it'd be good to keep this case in mind -- for example, it could affect whether we want to distinguish long counts vs. arbitrary numerics.
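The aggregation described here can be sketched in plain Python as a per-label sum across documents. This is only a model of the desired result, not an Elasticsearch aggregation. The function name sum_labeled_counts is hypothetical, and the second document's value is made up for illustration.

```python
from collections import Counter


def sum_labeled_counts(docs, field):
    """Sum the count stored under each label of `field` across all documents."""
    totals = Counter()
    for doc in docs:
        for label, count in doc.get(field, {}).items():
            totals[label] += count
    return totals


docs = [
    {"agg_usage": {"terms": 101, "date_histogram": 2450}},
    {"agg_usage": {"terms": 9}},  # invented second document
]
totals = sum_labeled_counts(docs, "agg_usage")
```

Because every value is an integer count, the sums stay exact, which is one reason to treat long counts differently from arbitrary numerics.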

> it could affect whether we want to distinguish long counts vs. arbitrary numerics

In a similar fashion, this feature might be useful for ML use cases. It seems to me that being able to specify the sub-type (long, float, double, ...) would be good. For ML, these vectors can become huge but, on the other hand, don't necessarily require a double. Being able to define the sub-type (e.g. float) would be a way to choose between precision and space.
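The precision versus space trade-off can be illustrated with Python's struct module. This shows only raw encoded sizes for 32-bit and 64-bit IEEE 754 values; it does not model how Elasticsearch or Lucene would store or compress them. The helper names are hypothetical.

```python
import struct


def packed_size(values, fmt):
    """Bytes needed to encode `values` with struct format char `fmt` ('f' or 'd')."""
    return len(struct.pack(f"<{len(values)}{fmt}", *values))


def roundtrip(value, fmt):
    """Encode and decode one value to show what precision survives."""
    return struct.unpack(f"<{fmt}", struct.pack(f"<{fmt}", value))[0]


vector = [0.1] * 1000
float_bytes = packed_size(vector, "f")    # 4 bytes per value
double_bytes = packed_size(vector, "d")   # 8 bytes per value
float_error = abs(roundtrip(0.1, "f") - 0.1)  # small but non-zero
```

For a 1000-element vector, the float encoding takes half the space of the double encoding, at the cost of a small rounding error per value.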
