Elasticsearch: Add 'singleton' flag to field mappers?

Created on 25 Jun 2020 · 10 Comments · Source: elastic/elasticsearch

Datastreams have a first-class concept of a timestamp field. Each document in the datastream must contain exactly one value for the designated timestamp field, so that we know where to route the document when partitioning by time. In #58582 we're adding index-time validation for this requirement. The current implementation is very narrow in scope and adds special document parsing logic just for datastream timestamps.

I was wondering if this validation could be useful more generally. We could add a 'singleton' flag to field mappers -- when it is set, the mapper will verify that it encounters one (and only one) value in each document:

"product_identifier": {
  "type": "keyword",
  "singleton": true
}

Such an option could be helpful for modeling fields that an application requires to have exactly one value, like identifiers, timestamps, content types, etc.
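To make the proposed check concrete, here is a minimal sketch of the validation in Python. It is not the mapper implementation from #58582; the helper names `count_values` and `validate_singleton` are hypothetical, and it assumes that a missing field, an explicit null, and an empty array all count as zero values, and that only flat arrays of top-level fields are considered.

```python
def count_values(doc, field):
    """Count the non-null values a document supplies for a top-level field.

    Assumption for this sketch: missing, null, and [] all count as zero values.
    Nested arrays are not flattened.
    """
    value = doc.get(field)
    if value is None:
        return 0
    if isinstance(value, list):
        return sum(1 for v in value if v is not None)
    return 1


def validate_singleton(doc, field):
    """Raise ValueError unless the document has exactly one value for the field."""
    n = count_values(doc, field)
    if n != 1:
        raise ValueError(
            f"field [{field}] must have exactly one value, found {n}"
        )
```

With this sketch, `{"product_identifier": "abc"}` passes, while a document that omits `product_identifier` or supplies `["abc", "def"]` is rejected.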

:Search/Mapping >enhancement Search

All 10 comments

Pinging @elastic/es-search (:Search/Mapping)

I'd like to clarify one part:

Each document in the datastream must contain exactly one timestamp

I assume this sentence applies only to the field where singleton is set. In your example, @timestamp contains exactly one timestamp, and not an array of timestamps. This I'm totally on board with 👍

However the way this is currently worded makes it sound like no other date fields would be allowed (e.g. an event with @timestamp, event.start and event.end populated). I assume this is not the case here?

Stepping out of the timestamp example, I like this feature in general, not just for datastreams. Elasticsearch has long been flexible with fields containing either a single value or an array. However since ECS came out, we've gradually been clarifying which ECS fields are expected to contain an array of values, in order to make the event format more predictable for consumers of the data.

This new addition would help distinguish three acceptable formats for a field:

  • it must be a single value
  • it must be an array of values
  • it's (still) unspecified (default Elasticsearch behaviour), but most likely a single value

However the way this is currently worded makes it sound like no other date fields would be allowed (e.g. an event with @timestamp, event.start and event.end populated). I assume this is not the case here?

Your assumption is right, I've updated the description to clarify the wording.

If I understand it correctly, constant_keyword today is already a singleton field? In general, ++ on this feature.

If I understand it correctly, constant_keyword today is already a singleton field?

Not exactly, because documents are allowed to not specify a value for constant_keyword. When performing a search, we know to fill in the missing field with the constant value from the mappings. But you're right in that they appear like singleton fields from a search API perspective.

We discussed this in the search meeting and agree that it would be a useful addition to at least some of the field mappers. We think that allow_multi_fields:true|false might be a better parameter name to use as it's more obvious that we're not enforcing that a single value is present.

From the start I've interpreted this as a way to ensure a field would not contain an array of values. E.g. preventing the following:

{ "@timestamp": [ "1597425324", "1597425325" ] }

And enforcing that a field has a single value, e.g.:

{ "@timestamp": "1597425324" }

My understanding is that this feature is unrelated to whether or not a field has additional multi-fields.

I think @romseygeek meant allow_multiple_values:

  • when set true, we will allow multiple values
  • when set false, we will not allow multiple values. A value may be a single value, or it may not exist.

Or is the requirement with singleton that every document must have this field, with exactly one value?

@romseygeek @mayya-sharipova the singleton flag would ensure there is exactly one value (as mentioned in the description), not just 'at most' one value. This is part of the usefulness of the feature -- for example with time-based data, we may want to ensure that every document has a timestamp and can be associated with a bucket based on time.
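The difference between the two readings discussed here ('at most one' versus 'exactly one') only shows up for documents that omit the field. The sketch below is illustrative only: the function names are hypothetical, and it assumes null and an empty array count as no value.

```python
def as_values(raw):
    """Normalize a raw field value to a list of non-null values."""
    if raw is None:
        return []
    if isinstance(raw, list):
        return [v for v in raw if v is not None]
    return [raw]


def accepts_at_most_one(raw):
    """The allow_multiple_values: false reading (zero or one value)."""
    return len(as_values(raw)) <= 1


def accepts_exactly_one(raw):
    """The singleton: true reading (one value, never zero)."""
    return len(as_values(raw)) == 1


# A missing @timestamp passes the first check but fails the second.
doc = {"message": "no timestamp here"}
print(accepts_at_most_one(doc.get("@timestamp")))  # True
print(accepts_exactly_one(doc.get("@timestamp")))  # False
```

Both checks agree on the other two cases: a single value such as "1597425324" is accepted, and an array of two timestamps is rejected.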

We discussed again as a team and agreed that this could be a useful feature. But we didn't see a strong immediate need for it yet, and will leave the issue open to gather more feedback on use cases and priority.

Other topics from our discussion:

  • We would probably want to return the flag in field caps. We would merge the flag across indices, only returning singleton: true if every index had singleton: true.
  • There may be a concern around over-use. Would users see the flag and start applying it everywhere, even when the default behavior would work fine?
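
The field caps merge rule from the first point above can be sketched as a logical AND across indices. This is only an illustration of the rule as stated, not the field caps API: `merge_singleton` is a hypothetical helper, and treating an empty set of indices as false is an assumption of this sketch.

```python
def merge_singleton(flags_by_index):
    """Merge per-index singleton flags into one field caps value.

    flags_by_index maps an index name to that index's singleton flag.
    The result is True only if every index has singleton: true.
    Assumption: with no indices at all, report False.
    """
    return bool(flags_by_index) and all(flags_by_index.values())


print(merge_singleton({"logs-1": True, "logs-2": True}))   # True
print(merge_singleton({"logs-1": True, "logs-2": False}))  # False
```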