Hi all,
I have the following index settings:
{
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 0,
      "analysis": {
        "analyzer": {
          "fielda_index": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["words_delimiter", "icu_normalizer", "icu_folding"]
          },
          "fielda_search": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["dot_delimiter", "icu_normalizer", "icu_folding"]
          }
        },
        "filter": {
          "dot_delimiter": {
            "type": "word_delimiter",
            "generate_word_parts": true,
            "generate_number_parts": true,
            "split_on_case_change": false,
            "preserve_original": true,
            "split_on_numerics": true
          },
          "words_delimiter": {
            "type": "word_delimiter",
            "generate_word_parts": true,
            "generate_number_parts": true,
            "split_on_case_change": true,
            "preserve_original": true,
            "split_on_numerics": true
          }
        }
      }
    }
  },
  "mappings": {
    "main": {
      "_source": {"enabled": true},
      "dynamic_date_formats": ["basic_date_time_no_millis"],
      "properties": {
        "name": {
          "type": "string",
          "index": "analyzed",
          "index_analyzer": "fielda_index",
          "search_analyzer": "fielda_search",
          "include_in_all": true
        }
      }
    }
  }
}
I ran the word "PowerShot" through the two analyzers, and here are the results:
fielda_index: PowerShot(1) Power(1) Shot(2)
fielda_search: PowerShot(1)
The number in parentheses is the token position.
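These outputs can be reproduced with the _analyze API. A minimal sketch, assuming the index above was created as "test":

curl -XGET "http://localhost:9200/test/_analyze?analyzer=fielda_index&text=PowerShot"
curl -XGET "http://localhost:9200/test/_analyze?analyzer=fielda_search&text=PowerShot"

Each response lists the tokens together with their position values.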
My question is: why is the token position of "Shot" 2? I think the positions of the tokens generated by the word_delimiter token filter should all be the same. Any ideas?
Because of this, I encounter a problem when performing a match_phrase query.
As we know, a match_phrase query not only matches the tokens but also checks their positions.
So when I index a document,
{"name": "Canon PowerShot D500"}
I cannot use the query
{"from": 0, "size": 100, "query":{"match_phrase": {"name":"Canon PowerShot D500"}}}
to find the document I just indexed, because the token positions do not match.
The token results of the two analyzers are:
fielda_index Canon(1) PowerShot(2) Power(2) Shot(3) D500(4) D(4) 500(5)
fielda_search Canon(1) PowerShot(2) D500(3) D(3) 500(4)
Obviously, in fielda_search the token at position 3 is "D500", but in fielda_index the "D500" token is at position 4, so the desired document cannot be found.
A gist script to reproduce this: https://gist.github.com/hxuanji/b94d9c3514d7b08005d2
So is there any reason why the positions of the tokens generated by the word_delimiter filter behave like this?
Since the extra tokens generated by word_delimiter are just "expanded" forms of the original token, I think their positions should remain the same as the original's. Am I misunderstanding something, or is there another reason?
Best,
Ivan
Hi @hxuanji
You are, unfortunately, correct. The WDF does generate new positions, which breaks the token filter contract. This is how it works in Lucene, and there are currently no plans to change it.
You can't use phrase queries with WDF.
You may be able to achieve what you want with the pattern_capture token filter instead.
Hi @clintongormley,
I have another question about this. Suppose I change the filter settings to:
"dot_delimiter":
{
"type" : "pattern_capture",
"preserve_original" : 1,
"patterns" : [
"([\\p{Ll}\\p{Lu}]+\\d*|\\d+)"
]
},
"words_delimiter":
{
"type" : "pattern_capture",
"preserve_original" : 1,
"patterns" : [
"(\\p{Ll}+|\\p{Lt}+|\\p{Lu}+\\p{Ll}+|\\p{Lu}+)",
"(\\d+)"
]
}
Now the token positions should all be the same.
If I index the document:
{"name": "942430__n.jpg"}
its token results from the two analyzers would be:
fielda_index 942430__n.jpg(1) 942430(1) n(1) jpg(1)
fielda_search 942430__n.jpg(1) 942430(1) n(1) jpg(1)
As we can see, the tokens are all at position 1.
But in this situation, when I use the query:
{"from": 0, "size": 100, "query":{"match": {"name":{"query":"942430__n.jpg", "operator" : "and"}}}}
to search, why do the results include some documents whose tokens contain only "n", such as {"name": "n"}?
The reproducible gist: https://gist.github.com/hxuanji/8e58c0ffb391ced49439
Although I specify the "and" operator, it seems to only enforce the condition on positions, not on tokens. Does that make sense?
It seems I misunderstand the matching rules of the "and" operator.
Thanks a lot.
Hi @hxuanji
A trick for figuring out exactly what the query is doing is to use the validate-query API with the explain option:
curl -XPOST "http://localhost:9200/test/main/_validate/query?explain" -d '
{
  "query": {
    "match": {
      "name": {
        "query": "942430__n.jpg",
        "operator": "and"
      }
    }
  }
}'
This outputs:
"explanation": "filtered(name:942430__n.jpg name:942430 name:n name:jpg)->cache(_type:main)"
So any of the terms in the same position is allowed. The and operator doesn't affect "stacked" terms. The reason is that these terms are treated like synonyms: you require one of the synonyms to be at position 0, but not all of them.
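For example, a minimal sketch of reproducing this (the index and type names test/main are assumed from the thread):

# index a document whose only token is "n", then make it searchable
curl -XPUT "http://localhost:9200/test/main/1" -d '{"name": "n"}'
curl -XPOST "http://localhost:9200/test/_refresh"

# search with the "and" operator
curl -XPOST "http://localhost:9200/test/main/_search" -d '
{
  "query": {
    "match": {
      "name": {
        "query": "942430__n.jpg",
        "operator": "and"
      }
    }
  }
}'

The document {"name": "n"} is returned: matching any one of the stacked terms (here name:n) satisfies the query, even with the and operator.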
Hi, @clintongormley
I got it! Thanks for your help.
Ivan
@clintongormley I think this problem with the positions of the word_delimiter filter should be mentioned on the respective reference / guide pages... Just ran into the same thing.
I am trying to fix this issue in Lucene: https://issues.apache.org/jira/browse/LUCENE-7619
It would mean you need to include WordDelimiterGraphFilter (once it's released) in your search-time analyzer.
WordDelimiterGraphFilter is now released and available in v5.4. FYI to those who stumble upon this thread. Thanks @mikemccand for this!!
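For anyone updating their settings: Elasticsearch exposes this Lucene filter as the word_delimiter_graph token filter. A minimal sketch of swapping it into the search-time filter from the original settings (the same options are assumed to carry over):

"dot_delimiter": {
  "type": "word_delimiter_graph",
  "generate_word_parts": true,
  "generate_number_parts": true,
  "split_on_case_change": false,
  "preserve_original": true,
  "split_on_numerics": true
}

As @mikemccand noted above, the graph filter goes in the search-time analyzer, so that phrase queries can consume the token graph it produces.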
V