Elasticsearch: Ngram/Edgengram filters don't work with keyword repeat filters

Created on 6 Jan 2017 · 15Comments · Source: elastic/elasticsearch

Elasticsearch version: 2.3.3

Plugins installed: [analysis-icu analysis-smartcn delete-by-query lang-javascript whatson analysis-kuromoji analysis-stempel elasticsearch-inquisitor head langdetect statsd/]

Description of the problem including expected versus actual behavior:

I want to index edgengrams from 3 to 15 chars, but also keep the original token in the field as well. This is being used for search as you type functionality. For both speed and relevancy reasons we've settled on 3 being the min num of chars that makes sense, but it leaves some gaps for non-whitespace separated languages and for words like 'pi'.

I thought I could do this using keyword_repeat and unique filters in my analyzer, but that doesn't seem to work with edgengram filters. Maybe I'm doing it wrong, but I haven't come up with a workaround yet.

Steps to reproduce:

PUT test_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edgengram_analyzer": {
          "filter": [
            "icu_normalizer",
            "icu_folding",
            "keyword_repeat",
            "edgengram_filter",
            "unique_filter"
          ],
          "type": "custom",
          "tokenizer": "icu_tokenizer"
        },
        "default": {
          "filter": [
            "icu_normalizer",
            "icu_folding"
          ],
          "type": "custom",
          "tokenizer": "icu_tokenizer"
        }
      },
      "filter": {
        "unique_filter": {
          "type": "unique",
          "only_on_same_position": "true"
        },
        "edgengram_filter": {
          "type": "edgeNGram",
          "min_gram": "3",
          "max_gram": "15"
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_text": {
          "type": "string",
          "similarity": "BM25",
          "analyzer": "default",
          "fields": {
            "ngram": {
              "type": "string",
              "term_vector": "with_positions_offsets",
              "similarity": "BM25",
              "analyzer": "edgengram_analyzer",
              "search_analyzer": "default"
            },
            "word_count": {
              "type": "token_count",
              "analyzer": "default"
            }
          }
        }
      }
    }
  }
}

GET test_analyzer/_analyze 
{
  "analyzer": "edgengram_analyzer", 
  "text":     "Is this déjà vu?"
}

Output:

{
  "tokens": [
    {
      "token": "thi",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "this",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "dej",
      "start_offset": 8,
      "end_offset": 12,
      "type": "word",
      "position": 2
    },
    {
      "token": "deja",
      "start_offset": 8,
      "end_offset": 12,
      "type": "word",
      "position": 2
    }
  ]
}

I'd expect to get the tokens: is, thi, this, dej, deja, vu

The problem gets worse when looking at non-whitespace languages where many characters are tokenized into one character per token.

I could search across multiple fields, but that prevents me from matching on phrases and using those phrase matches to boost results. For instance if the user types in "hi ther" we should be able to match instances where the content had "hi there" and use that to boost those exact matches. We do this by adding a simple should clause:

            "bool": {
              "must": [
                {
                  "multi_match": {
                    "fields": [
                      "mlt_content.default.ngram"
                    ],
                    "query": "hi ther",
                    "operator": "and",
                    "type": "cross_fields"
                  }
                }
              ],
              "should": [
                {
                  "multi_match": {
                    "type": "phrase",
                    "fields": [
                      "mlt_content.default.ngram"
                    ],
                    "query": "hi ther"
                  }
                }
              ]
            }
          },

:SearcAnalysis >enhancement

Source

gibrown

All 15 comments

I'd expect to get the tokens: is, thi, this, dej, deja, vu

Since you use the icu_tokenizer your text is being split into four tokens:

{
  "tokens" : [ {
    "token" : "Is",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "this",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "deja",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "vu",
    "start_offset" : 13,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

And then if you do the folding and keyword_repeat:

{
  "tokens" : [ {
    "token" : "is",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "is",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "this",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "this",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "deja",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "deja",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "vu",
    "start_offset" : 13,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "vu",
    "start_offset" : 13,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}

If you then try to do edge ngrams for the is and vu terms, they are below
the min_gram threshold of 3 in your configuration, so they are dropped.

If you want to keep the whitespace, perhaps inject a shingle token filter (with
two shingles) in there so that is this becomes a token including the
whitespace, which you can then analyze with edgengrams to get is, is t, is th, is this.

dakrone on 6 Jan 2017

I've run into this same issue before. keyword_repeat only works for stemmers, but I wonder if this functionality should be extended to edge-ngrams.

@mikemccand what do you think?

clintongormley on 9 Jan 2017

FWIW, my current work around is to do always use the lang specific analysis field when I think I'm searching in a non-whitespace separated language (but I don't really trust my lang detection), and I use the lang specific field anytime the text is less than 3 chars, or if the trailing word is less than three chars (eg a search like "math pi" ).

Shingle tokens as a work around will still have the same problem of not letting me have sub 3 char tokens i think. I also suspect it would blow up the index size even more than including 1 and 2 char edgengrams would.

BTW, if we change this, can it be easily back ported to 2.x ;)

gibrown on 9 Jan 2017

Heh, and now we've found a case where my workarounds don't work: "Game of Thrones".

Any updates here? I guess I could just adjust to edgengrams starting from 1 char just seems likely to cause lots of inefficiencies.

Shingle tokens sounds interesting (and maybe improves relevancy) but will also significantly increase index size.

gibrown on 20 Apr 2017

👍1

Another idea (for anyone following along). I could have one edgegrams field per language and then specify a language analyzer that has stop words for that language. Would fix the worst cases, but still not fix something like "pi".

gibrown on 20 Apr 2017

@gibrown Can you please confirm what tokens do you expect when you index ""Is this déjà vu?"
Are you expecting ngrams (3-15) as well?

"Is t"
"Is th"
...
Can you index using edge ngram tokenizer?
And if you need original tokens as well, can you use another field for this?

keyword_repeat is specifically designed to be followed by some stem filter. It is not relevant for edge ngrams.

mayya-sharipova on 20 Mar 2018

❤1

cc @elastic/es-search-aggs

mayya-sharipova on 20 Mar 2018

For edgengrams on "Is this déjà vu?" I would only expect the following tokens:
"is","thi","this","dej","deja","vu"

"is t" and "is th" would not be in the index.

Can you index using edge ngram tokenizer?

No we are using the icu_tokenizer. We are doing indexing across all languages. Technically we should even be using special tokenization for Japanese, Korean, and Chinese so we can get the tokenization correct there.

Thanks for taking a look.

Our workaround that we have deployed is to search both the edgengram field and an icu tokenized field that doesn't have any ngrams. We do this with a multi_match query that uses the cross_fields and AND as the operator. Makes for a more expensive query but it kinda works.

gibrown on 20 Mar 2018

@gibrown If you found the workaround, would you mind if I close this issue?

mayya-sharipova on 20 Mar 2018

I still think that some way to index edgengrams from X-Y plus also the original token would be a very worthwhile improvement. I would use it if it were available. I still think the keyword_repeat is the closest approximation. My workaround breaks if i am trying to do a phrase match. For instance: "is this dej"

Technically what I would love is a clearer language that lets me have multiple flows for extracting tokens:

extract the original token
extract a stemmed version of the token
extract the edgengrams for a token.

This lets me do an AND match on multiple tokens as well as a phrase match. Having them be in multiple fields has a number of drawbacks.

gibrown on 20 Mar 2018

I've been doing some work on making branches possible in TokenStreams (see https://issues.apache.org/jira/browse/LUCENE-8273). If that were combined with a generalisation of KeywordRepeatFilter, we could build an analysis chain that looked something like:

KeywordRepeatFilter(none, stem, ngram) -> repeats each token three times with a different keyword set
if (keyword == stem) then apply Stemmer
if (keyword == ngram) then apply EdgeNGramFilter

romseygeek on 21 May 2018

@romseygeek I love the idea of being able to have multiple paths of processing tokens. This would help in a lot of cases I've seen I think.

It feels like the analysis syntax would need a bit more structure than it currently has to handle this sort of thing.

gibrown on 5 Jun 2018

We had exactly the same issue, problem is that not all filters support the keyword attribute. We ended up adding a new Token filter in a plugin we maintain to work around this limitation.
It would be great to to have such support upstream (either by making all filters aware the keyword attribute or by providing another way to really emit the original token).