Our default dynamic mapping rules create both a text and a keyword field whenever they hit a JSON string:
{
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}
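With this mapping, a client has to decide for each operation whether to address the field itself or its `keyword` sub-field. A minimal Python sketch of that client-side choice, assuming the default dynamic mapping above; the `resolve_field` helper and the operation names are hypothetical and not taken from any client library:

```python
# Hypothetical client-side helper: pick the concrete field name to send to
# Elasticsearch, given the default dynamic mapping for strings.
DEFAULT_STRING_MAPPING = {
    "type": "text",
    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
}

# Operations that need whole-value semantics (illustrative names only).
EXACT_OPERATIONS = {"aggregation", "sort", "exact_match"}


def resolve_field(name: str, mapping: dict, operation: str) -> str:
    """Return `name.keyword` for exact operations when that sub-field exists."""
    sub_fields = mapping.get("fields", {})
    has_keyword = sub_fields.get("keyword", {}).get("type") == "keyword"
    if operation in EXACT_OPERATIONS and mapping.get("type") == "text" and has_keyword:
        return f"{name}.keyword"
    return name


print(resolve_field("message", DEFAULT_STRING_MAPPING, "aggregation"))  # message.keyword
print(resolve_field("message", DEFAULT_STRING_MAPPING, "full_text"))    # message
```

Every client that wants this behavior ends up carrying some version of this branch.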
And over the years, many clients implemented similar logic:
- `keyword` field,
- `keyword` field,
- `text` field.

Is it logic that we should embed in Elasticsearch? Maybe we can find better ideas, but here is a proposal to get the discussion started:
- An `exact_match` query, which tries to match against the whole string. It fails for `text` fields and has the same behavior as `match` on `keyword`, numbers, ...
- A `text_keyword` field, which is essentially a wrapper around a `text` and a `keyword` field. Running aggregations or an `exact_match` query against this field uses the sub `keyword` field while `match`, `query_string`, `multi_match` and `simple_query_string` queries use the `text` field.
- Update default dynamic mappings to create this field for strings instead of the current `text` + sub `keyword` mapping.

Pinging @elastic/es-search (:Search/Mapping)
Relates to #53020
Would the solution put into place to address text/keyword multi-fields also handle wildcards? In discussions I've seen it assumed that it would, but I wanted to clarify.
Hopefully `wildcard` is going to be less subject to this issue, as I can't think of many reasons to map a field both as `wildcard` and `keyword`, since they support the same operations, or `wildcard` and `text`, since `wildcard` can do infix search already. That said, if you have plans to use multi-fields with a `wildcard` field, we'd be interested to know more.
ECS has some fields that are mapped as `keyword` / `text` that we plan to migrate as `keyword`. I believe we'll need to think about dedicated migration paths for this case since removing the `text` multi-field would be a breaking change. For instance, one idea that has been raised is whether `wildcard` fields could create a virtual `text` subfield that emulates the behavior of a `text` field to ease this transition. This requires more thought and we might end up with a different approach but I thought sharing this example would help explain the kind of solution that we're considering.
Are `wildcard` fields just a superset of `keyword` in terms of features? I may have been mistaken, but I thought simply switching from keyword to wildcard would result in feature loss for aggregations and such. Admittedly I don't have a concrete example.
@epixa As we've been thinking about how to migrate from existing mappings to `wildcard`, we agreed that it would be much easier if `wildcard` supported the same operations as `keyword` using the same semantics. It supports `match` queries, `terms` aggregations, sorting, etc. with the same semantics as `keyword`. However, it takes a different approach to indexing that makes it slower at exact queries or aggregations, but faster at wildcard/regexp queries on high-cardinality fields.
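The thread does not spell out what the "different approach to indexing" is. One way to get this trade-off, and as far as I know roughly what `wildcard` does, is to index character n-grams to narrow down candidates and then verify each candidate against the stored full value. The sketch below assumes that approach; `NgramIndex` is a toy, hypothetical class and not Elasticsearch or Lucene code:

```python
import fnmatch
import re
from collections import defaultdict


def ngrams(value: str, n: int = 3) -> set:
    """All character n-grams of `value` (empty if the value is shorter than n)."""
    return {value[i:i + n] for i in range(len(value) - n + 1)}


class NgramIndex:
    """Toy index: n-gram postings for candidate lookup, full values for verification."""

    def __init__(self) -> None:
        self.postings = defaultdict(set)  # n-gram -> doc ids
        self.values = {}                  # doc id -> original value

    def add(self, doc_id: int, value: str) -> None:
        self.values[doc_id] = value
        for gram in ngrams(value):
            self.postings[gram].add(doc_id)

    def wildcard(self, pattern: str) -> list:
        # Step 1: intersect postings of n-grams from the literal parts of the pattern.
        candidates = set(self.values)
        for literal in re.split(r"[*?]", pattern):
            for gram in ngrams(literal):
                candidates &= self.postings.get(gram, set())
        # Step 2: verify each candidate against its full value.
        return sorted(d for d in candidates if fnmatch.fnmatchcase(self.values[d], pattern))


index = NgramIndex()
index.add(1, "/var/log/nginx/error.log")
index.add(2, "/var/log/syslog")
print(index.wildcard("*nginx*"))     # [1]
print(index.wildcard("/var/log/*"))  # [1, 2]
```

The verification step is why an exact lookup costs more here than a single term lookup on a `keyword` field, while a leading-wildcard pattern no longer has to scan every distinct term.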
I'd like to point out that ECS went with the reverse convention on how to index strings. Since ECS started around monitoring, rather than full text search, the default datatype is `keyword` for string fields. Then only a few fields have a `.text` multi-field added (less than 20, iirc).
I'm pointing this out because here we're talking about potentially building a shorthand notation that encodes the Elasticsearch default. As the proposal stands, it couldn't be used by users who are trying to build ECS-compatible indices.
- Update default dynamic mappings to create this field for strings instead of the current text + sub keyword mapping.
I'm not sure I understand the 3rd point in the body of the issue. "This field": are we talking about wildcard?
@webmat The idea here would be that ECS would not need to define multi-fields at all. ECS would define the field type as `text_keyword` (or whatever name we come up with) with no multi-fields. Internally Elasticsearch would handle the fact that there is a text field and a keyword field underlying the field type, but to the user it would appear as one field with no multi-fields. The idea here is that users should not need to worry about whether they need the keyword or text form of the field and so just reference the field in one way, and Elasticsearch should figure out which underlying field (text or keyword) is the right one to use (so whether it's used in an aggregation, in a free text query, an exact match query, etc.)
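The routing described in the proposal can be sketched as a small lookup. This is a hypothetical model of the proposed behavior, written in Python for illustration; `underlying_field` and the context names are assumptions, not Elasticsearch internals:

```python
# Hypothetical model of how a `text_keyword` field could pick its delegate.
# The two sets mirror the proposal: full-text queries go to the text field,
# aggregations and exact matching go to the keyword field.
TEXT_QUERIES = {"match", "query_string", "multi_match", "simple_query_string"}
KEYWORD_CONTEXTS = {"aggregation", "exact_match"}


def underlying_field(context: str) -> str:
    """Return which internal field ('text' or 'keyword') serves `context`."""
    if context in TEXT_QUERIES:
        return "text"
    if context in KEYWORD_CONTEXTS:
        return "keyword"
    # The proposal does not say how other contexts would be routed.
    raise ValueError(f"routing for {context!r} is not specified in the proposal")


print(underlying_field("match"))        # text
print(underlying_field("aggregation"))  # keyword
```

The user only ever names one field; the choice between the two internal fields happens inside this lookup.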
This makes sense, and would indeed be a good simplification. But this would force all string fields defined this way to be indexed both ways?
ECS followed the Beats convention of trying to do `keyword` only as much as possible, for performance reasons.
Instead of introducing a new `exact_match` query, can we use `term` query for exact matching and make `term` query fail on text fields with a message like "use match query instead"?
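A minimal sketch of that alternative, assuming a validation step that runs before the query executes. The `check_term_query` function, the exception class and the exact message wording are hypothetical:

```python
class QueryValidationError(ValueError):
    """Hypothetical error raised when a query is used on an unsupported field type."""


def check_term_query(field_name: str, field_type: str) -> None:
    """Reject `term` queries on text fields and point the user at `match`."""
    if field_type == "text":
        raise QueryValidationError(
            f"term query is not supported on text field [{field_name}]; "
            "use match query instead"
        )


check_term_query("status", "keyword")  # passes silently
try:
    check_term_query("message", "text")
except QueryValidationError as err:
    print(err)
```

This keeps the query DSL unchanged and turns a silent mismatch into an explicit error.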
I like the idea of having a new `text_keyword` field, which is a wrapper around a text and a subfield `keyword` field and queries/aggs are delegated to one of those fields automatically.
Another way to organize `text_keyword` is to index both the exact form and tokenized form into the same Lucene field. This is inspired by the new `wildcard` field. The exact form can be surrounded by some fake symbols, e.g. "0
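A rough sketch of the single-field idea, assuming a toy analyzer that lowercases and splits on whitespace. The marker characters below are placeholders chosen for illustration only, since the comment above does not show which symbols were meant:

```python
# Placeholder markers (assumption): any characters the analyzer never emits would do.
START, END = "\x02", "\x03"


def exact_term(value: str) -> str:
    """Wrap the untokenized value so it cannot collide with a regular token."""
    return f"{START}{value}{END}"


def index_terms(value: str) -> list:
    """Terms written to the single field: analyzed tokens plus the wrapped exact form."""
    tokens = value.lower().split()
    return tokens + [exact_term(value)]


terms = index_terms("Quick Fox")
print("quick" in terms)                  # True: full-text lookup on a token
print(exact_term("Quick Fox") in terms)  # True: exact lookup on the whole value
print(exact_term("quick") in terms)      # False: a token is not an exact match
```

Full-text queries would look up plain tokens while exact queries would look up the wrapped term, so both forms can share one set of postings.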
this would force all string fields defined this way to be indexed both ways
Only if you want to support both text search (provided by `text`) and exact match/sorting/aggregations (provided by `keyword`).
can we use term query for exact matching and make term query fail on text fields with a message like "use match query instead"?
We could.
index both the exact form and tokenized form into the same Lucene field
The space saving idea is interesting. I wonder if that would cause problems. Preserving support for scoring and multi-term queries would be challenging but I believe it could work?
A problem with the proposal of this issue that we identified when discussing the `wildcard` field is that we might need variants of the fuzzy, wildcard, regexp, ... queries as well if we want to be consistent in the way that we treat matching individual tokens vs. full values, which might not be scalable.
A problem with the proposal of this issue that we identified when discussing the wildcard field is that we might need variants of the fuzzy, wildcard, regexp, ... queries as well if we want to be consistent in the way that we treat matching individual tokens vs. full values, which might not be scalable.
This is indeed a substantial problem, and it makes this proposal not worth it.
fuzzy, wildcard, regexp
Speaking of these queries, if we go with a text field and a `keyword` subfield, do these queries apply only to the `keyword` subfield, or will they be applied across the two fields (boolean OR/multi-match)?
Preserving support for scoring
It seems that most queries used in observability solutions are not concerned about textual scoring, but only filtering.
Another idea to optimize space could be to have a text field and a `keyword` subfield as we planned, always index a field value into the keyword field, but index it into the text field only if its analyzed version is different.
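A minimal sketch of that rule, assuming a toy analyzer (lowercase, split on non-word characters) in place of a real one. `analyze` and `fields_to_index` are hypothetical names:

```python
import re


def analyze(value: str) -> list:
    """Toy analyzer (assumption): lowercase and split on non-word characters."""
    return re.findall(r"\w+", value.lower())


def fields_to_index(value: str) -> dict:
    """Always index the keyword form; add the text form only if analysis changes the value."""
    fields = {"keyword": value}
    tokens = analyze(value)
    if tokens != [value]:
        fields["text"] = tokens
    return fields


print(fields_to_index("error"))      # {'keyword': 'error'}
print(fields_to_index("Disk Full"))  # {'keyword': 'Disk Full', 'text': ['disk', 'full']}
```

Values that are already a single lowercase token, which is common for machine-generated identifiers, would be stored once instead of twice.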
Speaking of these queries, if we go with a text field and a keyword subfield, do these queries apply only to keyword subfield? or it will be applied across two fields (boolean OR/multi-match)?
This is the question that helped us discover this problem. :)
It seems that most queries used in observability solutions are not concerned about textual scoring, but only filtering.
Agreed.
We had a team discussion, and we are in favour of proceeding with a `text_keyword` field:
- `doc_values` mapping parameter will be applied to the internal keyword field and can be disabled
- `index_options` will be applied to the text field
- `enabled` parameter will be applied to both fields (if a user needs to disable keyword or text, they should use traditional text/keyword fields).

Some things still left for the discussion:
- what about `significant_terms` agg that can potentially be run both on text or keyword field? should we run it only on the keyword field?

what about significant_terms agg that can potentially be run both on text or keyword field? should we run it only on the keyword field?
The `significant_text` agg is designed for `text` fields (typically based on samples of top hits using `sampler` agg, re-analyzes source on the fly).
The `significant_terms` agg typically targets `keyword` fields (accessing doc values). It can target `text` fields too but we advise against it because it requires fielddata.
I'd be happy to see us formalise these patterns by making them only target their respective field types.
I had a couple questions about the proposal:
- I'm curious about example cases where we see `text_keyword` being useful. The newly-introduced `wildcard` field might cover many cases in which the text + keyword multifield was previously helpful. (Perhaps we just really like the out-of-the-box experience it provides and want to use `text_keyword` as the dynamic mapping type for strings?)
- `text` field type, something like `index_exact`?

I think @jtibshirani makes a good point. In ECS we added a `.text` multi-field to some keyword fields as a way to work around not having `wildcard` and query-time case insensitivity. When both of these are widely available, I think the need for these multi-fields in ECS will mostly go away.
One place I think indexing both as `keyword` and `text` is still very useful is for initial dataset exploration, however.
Another interesting point that @jtibshirani is bringing forth, and which I've been thinking about as well, is the possibly overwhelming, growing list of field types we offer. I _think_ I understand the need for the new field types, and I've been loosely monitoring their addition more or less from a user's point of view (finding out a new field type is being added, understanding the need for it on the surface and then just looking at the docs). And there are a lot of "specialized" field types out there, the later ones being added more from an "internal" usage need imho.
I'd be curious if anyone else is thinking the same ^ and if we could better handle the way users look at our growing list of field types in the future. An example would be the way we document the field types. Now, almost all non-core field types are under the "specialized" section. I would argue that the IP field, for example, shouldn't be in the "specialized" section, but maybe in the "core" one. It has a long history, it's fairly easy to understand and it doesn't require an edge case scenario to be used (like it happens with most of the other specialized field types). I would even push this further and suggest a new field types section - "Advanced", maybe - where _flattened_, _constant_keyword_, _histogram_... should be moved.
For me the biggest shift I've seen in requirements is the move away from traditional ideas of indexing human-authored text to indexing machine-generated text.
Traditionally tokenisation was useful normalisation that did all of the following:
1) Breaking strings into words by splitting on punctuation
2) Lowercasing
3) Removing plurals, past tense etc (ie stemming)
4) Injecting useful synonyms.
While useful on prose, none of the above is helpful when searching stacktraces, weblogs etc.
We just want matching on arbitrary parts of character sequences - which is where the wildcard field comes in. It marks a break with token-based matching. Users no longer need to think of the indexed terms defined by a choice of Analyzer (whose logic is often a black box to most).
This distinction between indexing for prose and indexing for exact-matching is perhaps the biggest change to reflect in our mapping choices.
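To make the contrast concrete, here is a small sketch comparing token-based matching with plain substring matching on a stack trace line. The toy analyzer only covers steps 1 and 2 above (splitting on punctuation and lowercasing); the function names and the sample line are made up for illustration:

```python
import re


def toy_analyze(text: str) -> list:
    """Split on anything that is not a letter or digit, then lowercase."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def token_match(query: str, text: str) -> bool:
    """True if every analyzed query token is also an indexed token (order ignored)."""
    indexed = set(toy_analyze(text))
    return all(token in indexed for token in toy_analyze(query))


def substring_match(query: str, text: str) -> bool:
    """True if the raw query occurs anywhere in the raw text."""
    return query in text


line = "java.lang.NullPointerException at com.example.Foo.bar(Foo.java:42)"

print(token_match("PointerExc", line))      # False: not a whole token
print(substring_match("PointerExc", line))  # True
print(token_match("42 java Foo", line))     # True: token order and punctuation are lost
print(substring_match("42 java Foo", line)) # False
```

Token matching misses arbitrary fragments and also accepts reordered tokens, which is the opposite of what someone searching logs usually wants.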
@astefan I also think improving the documentation on field types could be a big help to our users. I filed #57548 based on your thoughts -- we can continue the discussion there to keep this issue focused on text/keyword mappings.
I'm curious about example cases where we see text_keyword being useful. The newly-introduced wildcard field might cover many cases in which the text + keyword multifield was previously helpful. (Perhaps we just really like the out-of-the-box experience it provides and want to use text_keyword as the dynamic mapping type for strings?)
@jtibshirani Indeed, `wildcard` is very useful for ECS and logging solutions, but being a special "keyword" type, it doesn't deal with full-text search. I think the main goal of a new `text_keyword` field is to replace the text/keyword multi-field in the dynamic string mapping (as you correctly noticed). The benefits of this are the following:
I think with the main goal of a new text_keyword field is to substitute text/keyword multi-field for a dynamic string mapping (as you correctly noticed). The benefits of this are following...
Thanks @mayya-sharipova, this makes sense! If this is the main use case it would be nice to verify that we plan to make this change (starting to use a combined text/keyword for dynamic string mappings, instead of say switching to `wildcard`).
We had a discussion within the search team and have decided the following:
I am closing this issue because:
`text_keyword` field less attractive. We may later re-consider the `text_keyword` field type if it brings savings in space.

I am reopening this issue since we think that the feature request is still valid and could be beneficial for some use cases.
The dynamic mapping that creates two fields (one text and one sub-field named keyword) can be confusing for users, so we'd prefer to have a single field that knows how to behave depending on the context. That would be simpler than exposing multi-field information in field_caps, since Kibana or SQL, for instance, wouldn't need to implement any logic.
They would just pass the new field and Elasticsearch would apply different extraction logic based on the context (exact match query, full text query, aggregations, ...).