Elasticsearch: token position problems of word_delimiter token filter

Created on 22 Aug 2014 · 7 comments · Source: elastic/elasticsearch

Hi all,

I have the following index settings:

{
    "settings": {
        "index": {
            "number_of_shards": 5,
            "number_of_replicas": 0,
            "analysis": {   
                "analyzer": {
                    "fielda_index": {
                         "type": "custom",
                         "tokenizer": "icu_tokenizer",
                         "filter": [ "words_delimiter", "icu_normalizer", "icu_folding"]
                    },
                    "fielda_search": {
                         "type": "custom",
                         "tokenizer": "icu_tokenizer",
                         "filter": ["dot_delimiter", "icu_normalizer", "icu_folding"]
                    }
                },
                "filter": {
                    "dot_delimiter":
                    {
                        "type": "word_delimiter",
                        "generate_word_parts": true,
                        "generate_number_parts": true,
                        "split_on_case_change": false,
                        "preserve_original": true,
                        "split_on_numerics": true
                    },
                    "words_delimiter":
                    {
                        "type": "word_delimiter",
                        "generate_word_parts": true,
                        "generate_number_parts": true,
                        "split_on_case_change": true,
                        "preserve_original": true,
                        "split_on_numerics": true
                    }

                }
            }
        }
    },
    "mappings": {
        "main": {
            "_source": {"enabled": true},
            "dynamic_date_formats": ["basic_date_time_no_millis"],
            "properties": {
                "name": { "type": "string", "index": "analyzed", "index_analyzer": "fielda_index", "search_analyzer": "fielda_search", "include_in_all": true}
            }
        }
    }
}
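For anyone reproducing this, the settings above can be applied when creating the index. A minimal sketch, assuming they are saved as settings.json and the index is named test (the index name is an assumption; see the gist linked below for a full script):

curl -XPUT "http://localhost:9200/test" -d @settings.json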

I ran the word "PowerShot" through the two analyzers; here is the result:

fielda_index:   PowerShot(1) Power(1) Shot(2)
fielda_search:  PowerShot(1)

The number in parentheses is the token position.
My question is: why is the token position of "Shot" 2? I would expect all tokens generated by the word_delimiter token filter to share the same position. Any ideas?
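For reference, the listings above can be reproduced by running each analyzer through the _analyze API. A minimal sketch, assuming the settings were applied to an index named test:

curl -XGET "http://localhost:9200/test/_analyze?analyzer=fielda_index&text=PowerShot"
curl -XGET "http://localhost:9200/test/_analyze?analyzer=fielda_search&text=PowerShot"

Each token in the response carries a position field, which is the number shown in parentheses above.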

Because of this, I run into a problem when performing a match_phrase query.
As we know, the match_phrase query matches not only the tokens but also their positions.

So when I index a document,

{"name": "Canon PowerShot D500"}

I cannot use the query

{"from": 0, "size": 100, "query":{"match_phrase": {"name":"Canon PowerShot D500"}}}

to find the document I just indexed, because the token positions do not match.

The token output of the two analyzers is:

fielda_index    Canon(1) PowerShot(2) Power(2) Shot(3) D500(4) D(4) 500(5)
fielda_search   Canon(1) PowerShot(2) D500(3) D(3) 500(4)

Obviously, position 3 of fielda_search is "D500", but the "D500" token of fielda_index sits at position 4, so the phrase query cannot find the desired document.

The reproducible gist script is https://gist.github.com/hxuanji/b94d9c3514d7b08005d2

So is there a reason why the positions of tokens generated by the word_delimiter filter behave like this?
Since the extra tokens generated by word_delimiter are just "expanded" forms of the original token, I think they should keep the original token's position. Am I misunderstanding something, or is there another reason?

Best,
Ivan


All 7 comments

Hi @hxuanji

You are, unfortunately, correct. The WDF (word_delimiter filter) does generate new positions, which breaks the token filter contract. This is how it works in Lucene, and there are currently no plans to change it there.

You can't use phrase queries with WDF.

You may be able to achieve what you want with the pattern_capture token filter instead.

Hi @clintongormley,

I have another question about this. Assume I change the filter settings to:

"dot_delimiter":
                    {
                        "type" : "pattern_capture",
                        "preserve_original" : 1,
                        "patterns" : [
                          "([\\p{Ll}\\p{Lu}]+\\d*|\\d+)"
                       ]
                    },
                    "words_delimiter":
                    {
                        "type" : "pattern_capture",
                        "preserve_original" : 1,
                        "patterns" : [
                          "(\\p{Ll}+|\\p{Lt}+|\\p{Lu}+\\p{Ll}+|\\p{Lu}+)",
                          "(\\d+)"
                       ]
                    }

Now the token positions should all be the same.
If I index the document:

{"name": "942430__n.jpg"}

Its token output from the two analyzers would be:

fielda_index    942430__n.jpg(1) 942430(1) n(1) jpg(1)
fielda_search   942430__n.jpg(1) 942430(1) n(1) jpg(1)
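The same _analyze check as before confirms these positions, again assuming an index named test with the updated filters:

curl -XGET "http://localhost:9200/test/_analyze?analyzer=fielda_index&text=942430__n.jpg"
curl -XGET "http://localhost:9200/test/_analyze?analyzer=fielda_search&text=942430__n.jpg"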

As we can see, the tokens are all at position 1.
But in this situation, when I run the query:

{"from": 0, "size": 100, "query":{"match": {"name":{"query":"942430__n.jpg", "operator" : "and"}}}}

why do the results include documents whose only token is "n", such as {"name": "n"}?

The reproducible gist: https://gist.github.com/hxuanji/8e58c0ffb391ced49439

Although I specify the "and" operator, it seems to enforce the condition across positions rather than across tokens. Does that make sense?

It seems I have some misunderstanding about the matching rules of the "and" operator.

Thanks a lot.

Hi @hxuanji

A trick for figuring out exactly what the query is doing is to use the validate-query API with the explain option:

curl -XPOST "http://localhost:9200/test/main/_validate/query?explain" -d'
{
  "query": {
    "match": {
      "name": {
        "query": "942430__n.jpg",
        "operator": "and"
      }
    }
  }
}'

This outputs:

     "explanation": "filtered(name:942430__n.jpg name:942430 name:n name:jpg)->cache(_type:main)"

So any one of the terms in the same position is allowed to match. The and operator doesn't apply across "stacked" terms. The reason is that these terms behave like synonyms: you require one of the synonyms to be at that position, but not all of them.
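To see this concretely, here is a minimal reproduction, assuming the pattern_capture settings from the previous comment in an index named test. A document whose only token is "n" still matches, because "n" satisfies one of the stacked terms at that position:

curl -XPUT "http://localhost:9200/test/main/1" -d '{"name": "n"}'
curl -XPOST "http://localhost:9200/test/_refresh"
curl -XPOST "http://localhost:9200/test/main/_search" -d'
{"query": {"match": {"name": {"query": "942430__n.jpg", "operator": "and"}}}}'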

Hi, @clintongormley
I got it! Thanks for your help.

Ivan

@clintongormley I think this problem with the positions of the word_delimiter filter should be mentioned on the relevant reference/guide pages... I just ran into the same thing.

I am trying to fix this issue in Lucene: https://issues.apache.org/jira/browse/LUCENE-7619

It would mean you need to include WordDelimiterGraphFilter (once it's released) in your search-time analyzer.

WordDelimiterGraphFilter is now released and available in v5.4. FYI to those who stumble upon this thread. thanks @mikemccand for this!!

V
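For anyone updating an index along these lines: a minimal sketch of how the original filter definition might look with the graph filter, assuming Elasticsearch 5.4+ where it is exposed as the word_delimiter_graph token filter (it accepts the same options as word_delimiter). As noted above, it belongs in the search-time analyzer:

"words_delimiter": {
    "type": "word_delimiter_graph",
    "generate_word_parts": true,
    "generate_number_parts": true,
    "split_on_case_change": true,
    "preserve_original": true,
    "split_on_numerics": true
}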
