Elasticsearch: Support analyzer for keyword type

Created on 29 Apr 2016 · 12 comments · Source: elastic/elasticsearch

Sometimes you want to analyze text to make it consistent when running aggregations on top of it.

For example, let's say I have a city field mapped as a keyword.

This field can contain San Francisco, SAN FRANCISCO, San francisco...

If I build a terms aggregation on top of it, I will end up with

San Francisco: 1
SAN FRANCISCO: 1
San francisco: 1

I'd like to be able to analyze this text before it gets indexed. Of course I could use a text field instead and set fielddata: true, but that would not create doc values for this field.

I can imagine allowing an analyzer at index time for this field.

We could restrict its usage if we wished and only allow analyzers that use tokenizers like lowercase, keyword, or path, but I would let the user decide.

If we allow setting analyzer: simple, for example, my aggregation will become:

san francisco: 3

Same applies for path tokenizer.

Let's say I'm building a directory tree like:

/tmp/dir1/file1.txt
/tmp/dir1/file2.txt
/tmp/dir2/file3.txt
/tmp/dir2/file4.txt

Applying a path tokenizer would help me generate an aggregation like:

/tmp/dir1: 2
/tmp/dir2: 2
/tmp: 4
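For reference, Elasticsearch already ships a path_hierarchy tokenizer that produces this kind of prefix expansion; a quick _analyze sketch:

```json
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/tmp/dir1/file1.txt"
}
```

This emits the tokens /tmp, /tmp/dir1, and /tmp/dir1/file1.txt, which is exactly the expansion the aggregation above relies on.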
:Search/Mapping >enhancement discuss

Most helpful comment

@wgerlach : I've added an example for lowercase/asciifolding normalizer on elastic forum: https://discuss.elastic.co/t/wildcard-case-insensitive-query-string/75050/5

All 12 comments

Most of the work needed to implement this feature has been merged into Lucene and will be available in 6.2. Analyzer gained a new method called normalize that applies only the subset of the analysis chain that deals with normalization (and not, e.g., stemming): https://issues.apache.org/jira/browse/LUCENE-7355.

Note that it would NOT work for the path tokenization use-case mentioned above, since normalization is restricted to producing a single token; such use-cases would have to be handled differently, e.g. using an ingest processor.

I am wondering if we should use a different property name than analyzer since the analyzer will not be used for tokenizing. I am currently thinking about:

"my_field": {
  "type": "keyword",
  "normalizer": "standard"
}

This would avoid potential confusion about what would happen with analyzers that generate multiple tokens, and make it clearer that only normalization is applied?
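A sketch of how a custom normalizer could be wired up under this proposal (the normalizer name, filters, and field are illustrative assumptions, not a confirmed API):

```json
PUT my_index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "city": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}
```

The field would keep doc values (so aggregations and sorting still work), while every value is lowercased and ASCII-folded both at index time and at query time.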

Note that it would NOT work for the path tokenization use-case mentioned above, since normalization is restricted to producing a single token; such use-cases would have to be handled differently, e.g. using an ingest processor.

That would complicate the process but I guess we have to live with that. At least, we have a workaround.

I am wondering if we should use a different property name than analyzer since the analyzer will not be used for tokenizing.

Totally agree.

Instead of calling it a "normalizer", I'd call it by its name, token_filters, and accept an array of token filters. I don't think analyzers should be used here, since they imply the use of a tokenizer.

I think I agree with that. I initially thought that maybe integration with https://issues.apache.org/jira/browse/LUCENE-7355 would make sense, but maybe we should just apply a list of token filters manually, this would probably be simpler.

Yeah, I think it's a much simpler approach than involving a queryparser here. No need for one IMO. Also please note that order matters in the token_filters array.

What about character filters? They can also be useful here. My initial thought was to keep it as analyzers and to only allow analyzers which use the keyword tokenizer. But normalizers would work too...

hi guys, great to see you have an enhancement for this requirement!

Any idea how can I support case insensitive search on a "keyword" type field (which I also use for aggregations) for v5.0?

In ES 2.3 I used:

"analyzer_keyword": {
  "tokenizer": "keyword",
  "filter": "lowercase"
}

But that does not seem to work without enabling fielddata in ES 5.

Any workaround I can use for now?

You can use an ingest pipeline to lowercase your field.
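As a minimal sketch of that workaround (the pipeline name and field are illustrative), define a pipeline with the lowercase processor and reference it when indexing:

```json
PUT _ingest/pipeline/lowercase_city
{
  "description": "Lowercase the city field before indexing",
  "processors": [
    { "lowercase": { "field": "city" } }
  ]
}
```

Then index documents with ?pipeline=lowercase_city, so the keyword field stores the already-lowercased value and the terms aggregation buckets merge, without enabling fielddata.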

Hi guys,
Since Lucene added custom analyzer normalization in the 6.2 release (https://issues.apache.org/jira/browse/LUCENE-7355), I'm wondering whether this feature will be available in Elasticsearch soon?
Our application makes heavy use of aggregations over lowercase-filtered fields backed by doc values.

I understand from this thread that the ability to sort case-insensitively has been added. But how? Is there documentation or an example available?

@wgerlach : I've added an example for lowercase/asciifolding normalizer on elastic forum: https://discuss.elastic.co/t/wildcard-case-insensitive-query-string/75050/5

Thanks a million!
