Elasticsearch: Formalize dual text/keyword mappings

Created on 5 Mar 2020 · 24 Comments · Source: elastic/elasticsearch

Our default dynamic mapping rules create both a text and a keyword field whenever they encounter a JSON string:

{
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}

And over the years, many clients implemented similar logic:

when an exact query is fired, use the keyword field,
when an aggregation is used, use the keyword field,
otherwise use the text field.
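The client-side routing described above can be sketched roughly as follows. This is only an illustration, assuming the default dynamic mapping where the sub-field is named `keyword`; the names `Usage`, `resolve_field` and `terms_agg` are hypothetical and do not come from any client library.

```python
from enum import Enum


class Usage(Enum):
    """How the caller intends to use a dynamically mapped string field."""
    EXACT_QUERY = "exact_query"
    AGGREGATION = "aggregation"
    FULL_TEXT_QUERY = "full_text_query"


def resolve_field(field: str, usage: Usage, keyword_suffix: str = "keyword") -> str:
    """Return the concrete field name a client would target.

    Exact queries and aggregations go to the keyword sub-field,
    everything else goes to the analyzed text field.
    """
    if usage in (Usage.EXACT_QUERY, Usage.AGGREGATION):
        return f"{field}.{keyword_suffix}"
    return field


def terms_agg(field: str) -> dict:
    """Build a terms aggregation body that targets the keyword sub-field."""
    return {"terms": {"field": resolve_field(field, Usage.AGGREGATION)}}
```

Every client that wants this behavior has to carry a helper like this, which is the duplication the proposal below tries to remove.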

Is this logic that we should embed in Elasticsearch? Maybe we can find better ideas, but here is a proposal to get the discussion started:

Create a new exact_match query, which tries to match against the whole string. It fails for text fields and has the same behavior as match on keyword, numbers, ...
Create a new text_keyword field, which is essentially a wrapper around a text and a keyword field. Running aggregations or an exact_match query against this field uses the keyword sub-field, while match, query_string, multi_match and simple_query_string queries use the text field.
Update default dynamic mappings to create this field for strings instead of the current text + sub keyword mapping.

:Search/Mapping >feature Search

jpountz

Most helpful comment

@webmat The idea here would be that ECS would not need to define multi-fields at all. ECS would define the field type as text_keyword (or whatever name we come up with) with no multi-fields. Internally, Elasticsearch would handle the fact that there is a text field and a keyword field underlying the field type, but to the user it would appear as one field with no multi-fields. Users should not need to worry about whether they need the keyword or text form of the field: they reference the field in one way, and Elasticsearch figures out which underlying field (text or keyword) is the right one to use, depending on whether it's used in an aggregation, a free-text query, an exact match query, etc.

colings86 on 16 Apr 2020

👍2

All 24 comments

Pinging @elastic/es-search (:Search/Mapping)

elasticmachine on 5 Mar 2020

Relates to #53020

jpountz on 10 Mar 2020

Would the solution put into place to address text/keyword multi-fields also handle wildcards? In discussions I've seen it assumed that it would, but I wanted to clarify.

epixa on 12 Apr 2020

Hopefully wildcard is going to be less subject to this issue, as I can't think of many reasons to map a field both as wildcard and keyword, since they support the same operations, or as wildcard and text, since wildcard can do infix search already. That said, if you have plans to use multi-fields with a wildcard field, we'd be interested to know more.

ECS has some fields that are mapped as keyword / text that we plan to migrate as keyword. I believe we'll need to think about dedicated migration paths for this case since removing the text multi-field would be a breaking change. For instance, one idea that has been raised is whether wildcard fields could create a virtual text subfield that emulates the behavior of a text field to ease this transition. This requires more thought and we might end up with a different approach but I thought sharing this example would help explain the kind of solution that we're considering.

jpountz on 12 Apr 2020

Are wildcard fields just a superset of keyword in terms of features? I may have been mistaken, but I thought simply switching from keyword to wildcard would result in feature loss for aggregations and such. Admittedly I don't have a concrete example.

epixa on 13 Apr 2020

@epixa As we've been thinking about how to migrate from existing mappings to wildcard, we agreed that it would be much easier if wildcard supported the same operations as keyword, using the same semantics. It supports match queries, terms aggregations, sorting, etc. with the same semantics as keyword. However, it takes a different approach to indexing that makes it slower at exact queries or aggregations, but faster at wildcard/regexp queries on high-cardinality fields.

jpountz on 14 Apr 2020

I'd like to point out that ECS went with the reverse convention on how to index strings. Since ECS started around monitoring, rather than full text search, the default datatype is keyword for string fields. Then only a few fields have a .text multi-field added (fewer than 20, iirc).

I'm pointing this out because here we're talking about potentially building a shorthand notation that encodes the Elasticsearch default. As the proposal stands, it couldn't be used by users who are trying to build ECS-compatible indices.

Update default dynamic mappings to create this field for strings instead of the current text + sub keyword mapping.

I'm not sure I understand the 3rd point in the body of the issue. "This field": are we talking about wildcard?

webmat on 14 Apr 2020

colings86 on 16 Apr 2020

👍2

This makes sense, and would indeed be a good simplification. But this would force all string fields defined this way to be indexed both ways?

ECS followed the Beats convention of trying to do keyword only as much as possible, for performance reasons.

webmat on 16 Apr 2020

Instead of introducing a new exact_match query, can we use term query for exact matching and make term query fail on text fields with a message like "use match query instead"?

I like the idea of having a new text_keyword field, which is a wrapper around a text and a subfield keyword field and queries/aggs are delegated to one of those fields automatically.

Another way to organize text_keyword is to index both the exact form and the tokenized form into the same Lucene field. This is inspired by the new wildcard field. The exact form can be surrounded by some fake symbols, e.g. "000". If the exact and tokenized forms are the same, we can keep only the exact value, which will allow us to save space for many solution fields that are mostly single-valued.

mayya-sharipova on 5 May 2020

this would force all string fields defined this way to be indexed both ways

Only if you want to support both text search (provided by text) and exact match/sorting/aggregations (provided by keyword).

can we use term query for exact matching and make term query fail on text fields with a message like "use match query instead"?

We could.

index both the exact form and tokenized form into the same Lucene field

The space saving idea is interesting. I wonder if that would cause problems. Preserving support for scoring and multi-term queries would be challenging but I believe it could work?

A problem with the proposal of this issue that we identified when discussing the wildcard field is that we might need variants of the fuzzy, wildcard, regexp, ... queries as well if we want to be consistent in the way that we treat matching individual tokens vs. full values, which might not be scalable.

jpountz on 5 May 2020

A problem with the proposal of this issue that we identified when discussing the wildcard field is that we might need variants of the fuzzy, wildcard, regexp, ... queries as well if we want to be consistent in the way that we treat matching individual tokens vs. full values, which might not be scalable.

This is indeed a substantial problem, and it makes this proposal not worth it.

fuzzy, wildcard, regexp

Speaking of these queries, if we go with a text field and a keyword subfield, do these queries apply only to the keyword subfield, or will they be applied across both fields (boolean OR/multi-match)?

Preserving support for scoring

It seems that most queries used in observability solutions are not concerned about textual scoring, but only filtering.

Another idea to optimize space could be to have a text field and a keyword subfield as we planned, always index a field value into the keyword field, but index it into the text field only if its analyzed version is different.
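A rough sketch of that space-saving idea, under loud assumptions: `analyze` here is a toy stand-in for a real analyzer (lowercase, split on anything that is not a letter or digit), and `plan_indexing` is a hypothetical helper, not how Elasticsearch or Lucene would implement it.

```python
import re
from typing import List, Optional


def analyze(value: str) -> List[str]:
    """Toy analyzer (assumption): lowercase, then split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", value.lower()) if tok]


def plan_indexing(value: str) -> dict:
    """Decide what gets indexed for one string value.

    The keyword form is always indexed. The text form is indexed only
    when analysis produces something other than the original value.
    """
    tokens = analyze(value)
    needs_text = tokens != [value]
    text_tokens: Optional[List[str]] = tokens if needs_text else None
    return {"keyword": value, "text": text_tokens}
```

With this toy analyzer, a value such as `nginx` would be stored once, while `Connection reset by peer` would be stored in both forms. A full-text query would then also have to consult the keyword field for the values that were skipped.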

mayya-sharipova on 5 May 2020

Speaking of these queries, if we go with a text field and a keyword subfield, do these queries apply only to the keyword subfield, or will they be applied across both fields (boolean OR/multi-match)?

This is the question that helped us discover this problem. :)

It seems that most queries used in observability solutions are not concerned about textual scoring, but only filtering.

Agreed.

jpountz on 6 May 2020

We had a team discussion, and we are in favour of proceeding with a text_keyword field:

this will be a single field on the Elasticsearch side
internally it will be mapped to two Lucene fields: a keyword field and a text field
all term-level queries (term, fuzzy, wildcard, prefix, range, new exact_match query) will be delegated to the internal keyword field
all full-text queries will be delegated to the internal text field
aggs will be run on the internal keyword field
the doc_values mapping parameter will be applied to the internal keyword field and can be disabled
index_options will be applied to the internal text field
the enabled parameter will be applied to both fields (if a user needs to disable keyword or text, they should use traditional text/keyword fields)
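The delegation rules in this list could be summarised by a small lookup like the one below. It is a sketch only: `text_keyword` and `exact_match` are proposals from this issue rather than existing features, the full-text query names are the ones listed in the issue description, and `delegate` and `PARAMETER_TARGETS` are hypothetical names.

```python
# Term-level queries named in the list above (exact_match is only proposed).
TERM_LEVEL_QUERIES = {"term", "fuzzy", "wildcard", "prefix", "range", "exact_match"}

# Full-text queries named in the issue description.
FULL_TEXT_QUERIES = {"match", "query_string", "multi_match", "simple_query_string"}

# Which internal field(s) each mapping parameter would apply to.
PARAMETER_TARGETS = {
    "doc_values": ("keyword",),
    "index_options": ("text",),
    "enabled": ("keyword", "text"),
}


def delegate(operation: str) -> str:
    """Return the internal field ("keyword" or "text") that would serve an operation."""
    if operation == "aggregation" or operation in TERM_LEVEL_QUERIES:
        return "keyword"
    if operation in FULL_TEXT_QUERIES:
        return "text"
    raise ValueError(f"no delegation rule for operation: {operation!r}")
```

Operations without a rule raise an error here on purpose, since the thread leaves some cases (such as significant_terms) open.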

Some things still left for the discussion:

should we allow users access to the individual internal fields (for example, a user wants to run a term query on the internal text field)?
what about the significant_terms agg, which can potentially be run on either a text or a keyword field? should we run it only on the keyword field?

mayya-sharipova on 28 May 2020

what about the significant_terms agg, which can potentially be run on either a text or a keyword field? should we run it only on the keyword field?

The significant_text agg is designed for text fields (it is typically run on samples of top hits using the sampler agg, and re-analyzes the source on the fly).
The significant_terms agg typically targets keyword fields (accessing doc values). It can target text fields too, but we advise against it because it requires fielddata.

I'd be happy to see us formalise these patterns by making them only target their respective field types.
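As an illustration of that pairing, the helper below builds the `aggs` part of a search request depending on the field type. The aggregation names (`sampler`, `significant_text`, `significant_terms`) and their `shard_size` and `field` parameters are existing Elasticsearch options; the helper `significance_aggs`, the aggregation labels and the default sample size are assumptions made for this sketch.

```python
def significance_aggs(field: str, field_type: str, sample_size: int = 200) -> dict:
    """Build an `aggs` body using the significance agg suited to the field type.

    text fields    -> significant_text nested under a sampler agg
    keyword fields -> significant_terms reading doc values
    """
    if field_type == "text":
        return {
            "sample": {
                "sampler": {"shard_size": sample_size},
                "aggs": {
                    "significant": {"significant_text": {"field": field}}
                },
            }
        }
    if field_type == "keyword":
        return {"significant": {"significant_terms": {"field": field}}}
    raise ValueError(f"unsupported field type for significance aggs: {field_type!r}")
```

For example, `significance_aggs("message", "text")` wraps significant_text in a sampler, while `significance_aggs("message.keyword", "keyword")` produces a plain significant_terms aggregation.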

markharwood on 29 May 2020

👍1

I had a couple questions about the proposal:

I'm curious about example cases where we see text_keyword being useful. The newly-introduced wildcard field might cover many cases in which the text + keyword multifield was previously helpful. (Perhaps we just really like the out-of-the-box experience it provides and want to use text_keyword as the dynamic mapping type for strings?)
I wonder if we should be adding another string field type whose use + configuration overlaps a lot with existing ones. I'm worried that it's slowly becoming overwhelming for users to understand all the field types we offer and which ones to use to model their data. Did we consider alternatives like adding an option on the text field type, something like index_exact?

jtibshirani on 1 Jun 2020

👍1

I think @jtibshirani makes a good point. In ECS we added a .text multi-field to some keyword fields as a way to work around not having wildcard and query-time case insensitivity. When both of these are widely available, I think the need for these multi-fields in ECS will mostly go away.

One place I think indexing both as keyword and text is still very useful is for initial dataset exploration, however.

webmat on 1 Jun 2020

Another interesting point that @jtibshirani is bringing forth, and which I've been thinking about as well, is the possibly overwhelming, growing list of field types we offer. I _think_ I understand the need for the new field types, and I've been loosely monitoring their addition more or less from a user point of view (finding out a new field type is being added, understanding the need for it on the surface and then just looking at the docs). And there are a lot of "specialized" field types out there, the later ones being added more from an "internal" usage need imho.

I'd be curious if anyone else is thinking the same ^ and if we could better handle the way users look at our growing list of field types in the future. An example would be the way we document the field types. Now, almost all non-core field types are under the "specialized" section. I would argue that the IP field, for example, shouldn't be in the "specialized" section, but maybe in the "core" one. It has a long history, it's fairly easy to understand and it doesn't require an edge case scenario to be used (like it happens with most of the other specialized field types). I would even push this further and suggest a new field types section - "Advanced", maybe - where _flattened_, _constant_keyword_, _histogram_... should be moved.

astefan on 2 Jun 2020

For me the biggest shift I've seen in requirements is the move away from traditional ideas of indexing human-authored text to indexing machine-generated text.
Traditionally tokenisation was useful normalisation that did all of the following:

1) Breaking strings into words by splitting on punctuation
2) Lowercasing
3) Removing plurals, past tense etc (ie stemming)
4) Injecting useful synonyms.

While useful on prose, none of the above is helpful when searching stacktraces, weblogs etc.
We just want matching on arbitrary parts of character sequences - which is where the wildcard field comes in. It marks a break with token-based matching. Users no longer need to think of the indexed terms defined by a choice of Analyzer (whose logic is often a black box to most).

This distinction between indexing for prose and indexing for exact-matching is perhaps the biggest change to reflect in our mapping choices.
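The four normalisation steps listed above can be sketched as a toy analysis chain. Everything here is an assumption made for illustration: the stemmer only strips a trailing "s", the synonym table is made up, and real analyzers are far more elaborate.

```python
import re
from typing import Dict, List

# Hypothetical synonym table, for illustration only.
SYNONYMS: Dict[str, List[str]] = {"quick": ["fast"]}


def analyze(text: str) -> List[str]:
    """Toy analyzer following the four steps from the comment above."""
    # 1) Break the string into words by splitting on punctuation and whitespace.
    tokens = [tok for tok in re.split(r"\W+", text) if tok]
    # 2) Lowercase every token.
    tokens = [tok.lower() for tok in tokens]
    # 3) Very crude stemming: drop a trailing "s" from longer tokens.
    tokens = [tok[:-1] if tok.endswith("s") and len(tok) > 3 else tok for tok in tokens]
    # 4) Inject synonyms right after the token they belong to.
    expanded: List[str] = []
    for tok in tokens:
        expanded.append(tok)
        expanded.extend(SYNONYMS.get(tok, []))
    return expanded
```

On prose such as `Quick dogs` this yields searchable words plus a synonym, but on a fragment like `Foo.java:42` it breaks the value into unrelated pieces, which is the mismatch with machine-generated text described above.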

markharwood on 2 Jun 2020

👍1

@astefan I also think improving the documentation on field types could be a big help to our users. I filed #57548 based on your thoughts -- we can continue the discussion there to keep this issue focused on text/keyword mappings.

jtibshirani on 2 Jun 2020

👍1

I'm curious about example cases where we see text_keyword being useful. The newly-introduced wildcard field might cover many cases in which the text + keyword multifield was previously helpful. (Perhaps we just really like the out-of-the-box experience it provides and want to use text_keyword as the dynamic mapping type for strings?)

@jtibshirani Indeed, wildcard is very useful for ECS and logging solutions, but being a special "keyword" type, it doesn't deal with full-text search. I think the main goal of a new text_keyword field is to substitute the text/keyword multi-field for a dynamic string mapping (as you correctly noticed). The benefits of this are the following:

- Make automatic decisions about which sub-field to use for term queries, full-text queries and aggs, instead of asking the user to decide in Kibana or through the query DSL. This will lead to easier adoption of Elasticsearch for new users.
- Ease the pain of dealing with multi-fields. New users get confused about what the X.keyword field is. Solutions (which may still have some multi-fields in them even if most fields are indexed as wildcard) have to decide which field of the multi-field to use (e.g. SQL *, = operators); these decisions would be handled on the Elasticsearch side.

mayya-sharipova on 3 Jun 2020

I think the main goal of a new text_keyword field is to substitute the text/keyword multi-field for a dynamic string mapping (as you correctly noticed). The benefits of this are the following...

Thanks @mayya-sharipova, this makes sense! If this is the main use case it would be nice to verify that we plan to make this change (starting to use a combined text/keyword for dynamic string mappings, instead of say switching to wildcard).

jtibshirani on 15 Jun 2020

👍1

We had a discussion within the search team and have decided the following:

Use term-based queries for exact match for the keyword family
We are going to add multi-field info to the field caps. This should help to forward exact match queries to specific field types.

I am closing this issue because:

we don't plan to introduce a new exact_match query
with the new multi-field info in the field caps it will be easier to forward queries/aggs to the corresponding fields; this makes the proposal for a joint text_keyword field less attractive. We may later reconsider a text_keyword field type if it brings savings in space.

mayya-sharipova on 18 Jun 2020

I am reopening this issue since we think that the feature request is still valid and could be beneficial for some use cases.
The dynamic mapping that creates two fields (one text and one sub-field named keyword) can be confusing for users, so we'd prefer to have a single field that knows how to behave depending on the context. That would be simpler than exposing multi-field information in field_caps, since Kibana or SQL, for instance, wouldn't need to implement any logic.
They would just pass the new field and Elasticsearch would apply different extraction logic based on the context (exact match query, full text query, aggregations, ...).