Elasticsearch: sub keyword field to string dynamic mappings - name and intent discussion

Created on 7 May 2016 · 13 Comments · Source: elastic/elasticsearch

As discussed with @jpountz in https://github.com/elastic/elasticsearch/pull/17188#issuecomment-215742185, I'm opening a separate ticket for discussion here.

Some items for consideration:

Defaulting this way will continue the pattern of users seeing increased disk utilization out of the box as they upgrade versions of Elasticsearch.
By using keyword for the multi-field name we are tightly coupling it to what tokenizer is used. For example, if we ever rename the keyword tokenizer to noop (which I would love to see, since it more accurately describes what it does and is also how we tend to explain it to folks) then the multi-field option.


djschny

All 13 comments

In the original issue (https://github.com/elastic/elasticsearch/issues/12394) I went into great detail to explain the reasoning behind this change, but to address your questions here:

Defaulting this way will continue the pattern of users seeing increased disk utilization out of the box as they upgrade versions of Elasticsearch.

In the past, the string field could be used for full text search and for aggregations, by loading all the terms into the heap in fielddata. The behaviour of these fields depended largely on the type of value that was specified, e.g. "The quick brown fox..." implied the use of full text search (but not aggregations or sorting), while "London" might be a single identifier used for single-term lookups, aggregations and sorting. But "New York", which is probably intended for the second use case, could actually only be used for the first.

We can't deduce which use case a user intends when we receive a string field - it could be either. The solution for this is to provide a main text field for full text search (with fielddata disabled so that users don't unwittingly flood their heap by trying to run aggregations or sorting on that field), and a sub-field of type keyword for the single-term lookup, sorting, and aggregations use case.
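For illustration, here is a minimal Python sketch of the mapping described above: a main text field plus a keyword sub-field. The field name `city` and the helper `default_string_mapping` are hypothetical; the `ignore_above: 256` setting is the one mentioned later in this thread. It only builds the JSON body and does not talk to a cluster.

```python
import json


def default_string_mapping(ignore_above=256):
    """Build the mapping a dynamically detected string field would get.

    The main field is full-text `text`; the `keyword` sub-field keeps the
    whole value as a single term for lookups, sorting and aggregations.
    """
    return {
        "type": "text",
        "fields": {
            "keyword": {
                "type": "keyword",
                # Values longer than this are skipped in the sub-field.
                "ignore_above": ignore_above,
            }
        },
    }


# Hypothetical field name `city`, used only for illustration.
properties = {"city": default_string_mapping()}
print(json.dumps(properties, indent=2))
```

With a mapping of this shape, full text queries would target `city` while aggregations and sorting would target `city.keyword`.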

The benefit of this is that, without any config, you get both access patterns for string fields out of the box. The downside is that you index string values twice. This is exactly the same pattern that Logstash has used for string fields for a long time so users of Logstash are unlikely to see any change.

It is very easy to optimize disk space usage here: just map your fields as text or keyword, or add a dynamic mapping for text that specifies whether a field should be only text or only keyword.
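As a rough sketch of the second option, the snippet below builds a dynamic template that maps every newly seen string field to a single type, so values are indexed only once. The helper `strings_as` and the template name are hypothetical; `dynamic_templates` and `match_mapping_type` are the standard mapping keys.

```python
import json


def strings_as(field_type):
    """Return a dynamic template mapping every new string field to one type."""
    if field_type not in ("text", "keyword"):
        raise ValueError("field_type must be 'text' or 'keyword'")
    return {
        "dynamic_templates": [
            {
                # The template name is arbitrary; this one is hypothetical.
                "strings_as_" + field_type: {
                    "match_mapping_type": "string",
                    "mapping": {"type": field_type},
                }
            }
        ]
    }


keyword_only = strings_as("keyword")
print(json.dumps(keyword_only, indent=2))
```

The trade-off is the one described above: a keyword-only field cannot be used for full text search, and a text-only field cannot be aggregated on without enabling fielddata.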

By using keyword for the multi-field name we are tightly coupling it to what tokenizer is used. For example, if we ever rename the keyword tokenizer to noop (which I would love to see, since it more accurately describes what it does and is also how we tend to explain it to folks) then the multi-field option.

No, we aren't. This field is not named after the keyword analyzer, it is named after the field type keyword. The field type got its name in the same way as the keyword analyzer did: we don't want full text, we want to treat this value as a single keyword. What other name would you recommend to describe the datatype for this field?

And keyword fields in the future will not be restricted to the keyword analyzer. We will add support for limited analysis which allows, e.g., lowercasing, Unicode normalization, or Unicode collations.

For me, the only debate is whether this sub-field should be called keyword or raw, which is the name used today in Logstash. For bwc, raw would probably be better, but I think that keyword is more descriptive. My current feeling is that we should continue to use keyword. Logstash is free to keep their index template which uses raw instead.

clintongormley on 7 May 2016

👍1

+1 to what Clinton said. The fact that we did not map strings both for text search and keyword search/aggs in the past caused bad out-of-the-box experiences since you almost certainly had to reindex once you realized that you could not aggregate on whole string values.

Regarding disk usage, it will be higher with default mappings for sure, but the problem is mitigated by the use of ignore_above: 256. There is a trade-off for sure, but I think having to reindex to run aggregations is more disappointing than higher-than-expected disk usage.
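To make the effect of `ignore_above: 256` concrete, here is a simplified Python model of it: values longer than the limit are left out of the keyword sub-field, while the main text field still indexes them in full. The function name is hypothetical and the length check is a simplification of what Elasticsearch does internally.

```python
def indexed_in_keyword_subfield(value, ignore_above=256):
    """Simplified model: values longer than the limit are skipped."""
    return len(value) <= ignore_above


short_value = "New York"
long_value = "x" * 300

print(indexed_in_keyword_subfield(short_value))  # True
print(indexed_in_keyword_subfield(long_value))   # False
```

Short identifiers stay available for aggregations, while long free-text values do not get a second copy in the sub-field, which is how the extra disk usage stays bounded.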

However I'm also open to changing the name to either raw like logstash or original like @rjernst suggested. I have a slight preference for keyword though.

jpountz on 8 May 2016

Discussed it in Fix it Friday - we prefer the keyword field. Logstash can continue to use raw with dynamic templates, should they so choose.

I will improve the docs to explain that we're optimising for the OOB experience, but disk usage can be improved with some simple mappings.

clintongormley on 13 May 2016

What other name would you recommend to describe the datatype for this field?

not_tokenized

djschny on 19 May 2016

👎2

Logstash can continue to use raw

Much of the road to 5.0 has been a theme of consistency. We've used raw for a long long time, and now are suddenly calling this thing keyword -- this is inconsistent. Logstash should not keep inconsistency and is looking at fixing that very soon, which is why I'm here talking about our new friend keyword. I do not believe Logstash can continue using raw because after 5.0 this becomes a user experience problem that ES uses keyword for strings where Logstash uses raw.

That said, for me personally, keyword is the wrong name. "United States" is two _words_, "San Jose Sharks" is three _words_, and yet the keyword name implies a singular word. A user agent string is even further from something I would consider a keyword, and yet I use Logstash's raw feature to allow me to do aggregations on user agents. My chief concern when naming things is how much I expect the name to confuse users.

With the hands-on-workshop, I teach people about analyzers/tokenizers by showing what happens to a string by default in Elasticsearch, then we talk about treating these entire strings as a single field value (or "term"). Because we're on the topic of analyzers, it is easy to say "We solve this by using this thing called not_analyzed, and logstash calls this field the 'raw' value". It is early for this keyword feature, but I have trouble coming up with such a story for teaching.

jordansissel on 4 Aug 2016

And raw is a shorter name :)

I think consistency is a good point here.

But I'd like to be able to apply some token filters on this type of field at some point, so I don't think that having "raw" + an analyzer would make sense in terms of meaning.
"Keyword" + an analyzer has more meaning IMO.

I think we should mark this discussion as a blocker for the next release because it will be hard to change after we release the beta.

dadoonet on 5 Aug 2016

👍1

I've been thinking the past few days how to find a way to convince myself that keyword is the right name. Here is the story on how I can explain to myself why keyword might be the right name:

I thought keyword was poor because I view Elasticsearch field mappings as a way to say "The data is of this type". This worked well for me to understand and explain various obvious-to-me data types in Elasticsearch such as dates, longs, floats, strings, etc.

In this model, I was telling Elasticsearch _what the data is_, and trying to distinguish strings vs keyword vs text was not fitting my mental model.

The Elasticsearch documentation on mappings says this:

Mapping is the process of defining how a document, and the fields it contains, are stored and indexed.

In this description, it seems that the mapping is presented as _how_ Elasticsearch uses the data, not _what_ the data is. If I view things with the _how_ in mind, instead of the _what_, I think keyword makes sense -- I can tell Elasticsearch _how_ to treat something like "United States" (such as text or keyword).

The above explanation may be confusing, but I think I can use this model -- _how_ instead of _what_ -- to tell stories in trainings, etc, about reasons for using text vs keyword. "Treat it as a keyword", for example.

I am still nervous about the difficult schema change this will require on the Logstash side; in the battle for consistency, Logstash will want to change the multifield .raw to match what Elasticsearch uses: .keyword.

jordansissel on 8 Aug 2016

If this proves to be a challenge to logstash, I'd personally be ok with keeping the field called keyword but having it named xxx.raw in the default mappings. Am I right to assume this is something you'd be happy with?

jpountz on 9 Aug 2016

There are a lot of users with massive amounts of data ingested through Logstash where the current .raw field convention is used. Changing the default from .raw has the potential to unnecessarily break a lot of systems and cause problems for users using the default templates or custom index templates based on these. Please take this into consideration before deciding to change the existing .raw field naming convention.

cdahlqvist on 9 Aug 2016

@cdahlqvist We're discussing the options and impacts of .raw vs .keyword over on https://github.com/logstash-plugins/logstash-output-elasticsearch/pull/462

I have a rough draft of a proposal here: https://github.com/logstash-plugins/logstash-output-elasticsearch/pull/462#issuecomment-238376557

jordansissel on 10 Aug 2016

@jpountz I'd be OK having ES's default to xxx.raw, yes. The benefit there is to not divide users across the release boundary of 5.0 (new users and old users would both get .raw if we did this)

jordansissel on 10 Aug 2016

@jordansissel I agree with the conclusion you reached in https://github.com/elastic/elasticsearch/issues/18195#issuecomment-238372868 and I think that keyword is fundamentally the right name for this field (including for the reasons cited in https://github.com/elastic/elasticsearch/issues/18195#issuecomment-237749646). Long term it makes the purpose of the field easier to explain.

While I'm not completely against keeping the field as raw, I think that (unfettered by history) we'd choose keyword today instead.

All that said, I obviously recognise that this makes for a painful transition in Logstash. I don't have great suggestions for how to make this easier, but the options are probably as follows:

New users - use keyword from the outset
Existing users with custom templates - most of these won't be much impacted
Existing users with short retention periods - could use raw and keyword for the duration of the transition
Existing users with long retention periods - could change the template to just use raw going forwards
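The short-retention option above could be sketched as a dynamic template that emits both sub-fields during the transition. This is only an illustration, not the actual Logstash index template: the helper `transition_template`, the template name `string_fields`, and the choice of `ignore_above` are assumptions.

```python
import json


def transition_template(subfield_names=("raw", "keyword"), ignore_above=256):
    """Build a dynamic template whose string fields carry several
    keyword sub-fields, e.g. both `.raw` and `.keyword`."""
    fields = {
        name: {"type": "keyword", "ignore_above": ignore_above}
        for name in subfield_names
    }
    return {
        "dynamic_templates": [
            {
                # Hypothetical template name.
                "string_fields": {
                    "match_mapping_type": "string",
                    "mapping": {"type": "text", "fields": fields},
                }
            }
        ]
    }


both = transition_template()
raw_only = transition_template(subfield_names=("raw",))
print(json.dumps(both, indent=2))
```

Carrying both sub-fields means each string value is indexed three times, so this would only be reasonable while old indices age out; `raw_only` corresponds to the long-retention option of staying on raw.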

clintongormley on 10 Aug 2016

👍1

+1 clint's comments and keeping 'keyword'.

I think we can help users through this period of transition. It may be hard, but I think it's the right direction.
