Elasticsearch: Bug when using the synonym_graph and stop token filters together

Created on 27 Feb 2018 · 6 Comments · Source: elastic/elasticsearch

Elasticsearch 6.2.0

Description:
When the stop and synonym_graph token filters are used together, a document that should match does not match, and highlighting does not work as it should.

Steps to reproduce:

Mapping

{  
   "settings":{  
      "analysis":{  
         "analyzer":{  
            "english_analyzer":{  
               "type":"custom",
               "filter":[  
                  "lowercase",
                  "english_stopwords_tokenfilter"
               ],
               "tokenizer":"standard"
            },
            "english_search_analyzer":{  
               "type":"custom",
               "filter":[  
                  "lowercase",
                  "synonym_graph_tokenfilter",
                  "english_stopwords_tokenfilter"
               ],
               "tokenizer":"standard"
            }
         },
         "filter":{  
            "english_stopwords_tokenfilter":{  
               "type":"stop",
               "stopwords":"_english_"
            },
            "synonym_graph_tokenfilter":{  
               "type":"synonym_graph",
               "synonyms":[  
                  "world of war, wow"
               ]
            }
         }
      }
   },
   "mappings":{  
      "doc":{  
         "properties":{  
            "title":{  
               "type":"text",
               "analyzer":"english_analyzer",
               "search_analyzer":"english_search_analyzer"
            }
         }
      }
   }
}

Indexing 3 documents

{  "title":"world of war"}
{  "title":"wow"}
{  "title":"world of war. wow"}

Search

{  
   "query":{  
      "match":{  
         "title":"world of war"
      }
   },
   "highlight":{  
      "fields":{  
         "title":{  
            "fragment_size":0,
            "type":"unified"
         }
      }
   }
}

Search Result:

{  
   "took":1,
   "timed_out":false,
   "_shards":{  
      "total":5,
      "successful":5,
      "skipped":0,
      "failed":0
   },
   "hits":{  
      "total":2,
      "max_score":0.2876821,
      "hits":[  
         {  
            "_index":"test",
            "_type":"doc",
            "_id":"2",
            "_score":0.2876821,
            "_source":{  
               "title":"world of war. wow"
            },
            "highlight":{  
               "title":[  
                  "world of war. <em>wow</em>"
               ]
            }
         },
         {  
            "_index":"test",
            "_type":"doc",
            "_id":"1",
            "_score":0.2876821,
            "_source":{  
               "title":"wow"
            },
            "highlight":{  
               "title":[  
                  "<em>wow</em>"
               ]
            }
         }
      ]
   }
}

Problems:
Bug 1. Document { "title":"world of war"} does not match, but it should.
Bug 2. Highlighter does not highlight "world of war".
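
One way to see what the search analyzer actually produces for the query text is the `_analyze` API. The request below is a sketch that assumes the index was created as `test` (the name shown in the search result above). The response lists each token with its position, which makes it possible to check how the synonym graph and the removal of "of" interact.

```console
GET /test/_analyze
{
  "analyzer": "english_search_analyzer",
  "text": "world of war"
}
```

Running the same request with "analyzer": "english_analyzer" shows the index-time tokens for comparison.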

I have also tried putting synonym_graph_tokenfilter after english_stopwords_tokenfilter, but then I get:

{  
   "error":{  
      "root_cause":[  
         {  
            "type":"illegal_argument_exception",
            "reason":"failed to build synonyms"
         }
      ],
      "type":"illegal_argument_exception",
      "reason":"failed to build synonyms",
      "caused_by":{  
         "type":"parse_exception",
         "reason":"Invalid synonym rule at line 1",
         "caused_by":{  
            "type":"illegal_argument_exception",
            "reason":"term: world of war analyzed to a token (war) with position increment != 1 (got: 2)"
         }
      }
   },
   "status":400
}
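
The position increment mentioned in the error can be observed without creating an index. This is a minimal sketch, assuming the built-in `stop` filter (which defaults to the English stop word list) behaves like english_stopwords_tokenfilter above. It analyzes the left-hand side of the synonym rule with the filters that precede the synonym filter in the failing configuration.

```console
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "world of war"
}
```

If "of" is removed, the `position` values of the remaining tokens should show a gap where it used to be, which would correspond to the increment of 2 reported for `war`.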

All 6 comments

cc @elastic/es-search-aggs

@romseygeek Could you take a look at this?

This is a known issue in Lucene and we're currently discussing different options for the fix:
https://issues.apache.org/jira/browse/LUCENE-8137
The only workaround for now is to not use the stop word filter when using the synonym_graph or to remove the stop words manually from the synonyms defined for the filter.

I will be closing this issue, as the problem is at the Lucene level (an issue has been opened there and is currently in progress), and there is nothing we can do at the Elasticsearch level.

Hey @jimczi - just wanted to follow up on this. I'm getting a similar issue. The exact bug above (where only 2 out of 3 matches are found) no longer occurs (I'm using ES 7.6.0) - good news. And if you switch the order of the stopword and synonym_graph filters, you still get the illegal_argument_exception as expected (the Lucene bug has not been fixed). HOWEVER, with the filters in the new order, the workaround described above does not work:

This is a known issue in Lucene and we're currently discussing different options for the fix:
https://issues.apache.org/jira/browse/LUCENE-8137
The only workaround for now is to not use the stop word filter when using the synonym_graph or to remove the stop words manually from the synonyms defined for the filter.

If, in the example above, you put the synonym graph filter AFTER the stopwords filter AND manually remove stopwords from the synonyms (i.e. now synonyms=["world war, wow"]), then a query with "world of war" CANNOT match text with "world of war". Did I misunderstand the workaround? (That's very likely, because I imagine lots of people use synonym_graph with stopwords.)

Thanks in advance!

(PS: the reason I need to put synonym_graph AFTER stopwords is that the stopwords are case sensitive whereas the synonyms are not case sensitive)

If helpful, here are the requests I'm running:

PUT /test-xxx
{  
   "settings":{  
      "analysis":{  
         "analyzer":{  
            "english_analyzer":{  
               "type":"custom",
               "filter":[  
                  "lowercase",
                  "english_stopwords_tokenfilter"
               ],
               "tokenizer":"standard"
            },
            "english_search_analyzer":{  
               "type":"custom",
               "filter":[  
                  "lowercase",
                  "english_stopwords_tokenfilter",
                  "synonym_graph_tokenfilter"
               ],
               "tokenizer":"standard"
            }
         },
         "filter":{  
            "english_stopwords_tokenfilter":{  
               "type":"stop",
               "stopwords":"_english_"
            },
            "synonym_graph_tokenfilter":{  
               "type":"synonym_graph",
               "synonyms":[  
                  "world war, wow"
               ]
            }
         }
      }
   },
   "mappings":{  
     "properties":{  
        "title":{  
           "type":"text",
           "analyzer":"english_analyzer",
           "search_analyzer":"english_search_analyzer"
        }
     }
   }
}

POST _bulk
{ "index" : { "_index" : "test-xxx" } }
{ "title":"world of war" }
{ "index" : { "_index" : "test-xxx" } }
{ "title":"wow" }
{ "index" : { "_index" : "test-xxx" } }
{ "title":"world of war. wow" }

GET /test-xxx/_search
{  
   "query":{  
      "match":{  
         "title":"world of war"
      }
   },
   "highlight":{  
      "fields":{  
         "title":{  
            "fragment_size":0,
            "type":"unified"
         }
      }
   }
}
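
To check how the match query is rewritten for this setup, the validate API can be run against the same index before it is deleted. This is a diagnostic sketch only. It assumes the `test-xxx` index from the requests above still exists, and it does not change the behaviour.

```console
GET /test-xxx/_validate/query?explain=true
{
  "query": {
    "match": {
      "title": "world of war"
    }
  }
}
```

The `explanations` entry in the response contains the Lucene query built by english_search_analyzer, which can be compared with the tokens that english_analyzer produced at index time.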

DELETE /test-xxx

I am reopening this issue since it's a long-standing bug and it's not resolved in Lucene.
The only workaround that works at the moment is to not use stop words, at index and query time.
You can define rules with and without stop words, for instance:
"world of war, world war, wow" should match all variations.
Removing terms in a filter before or after the synonym graph should be avoided until the bug is resolved.
We want to solve this situation but it is not likely to happen before a major release considering the changes that are required on the analysis chain.
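
A sketch of the workaround described above, written in the 7.x request format used earlier in the thread. The index name `test-no-stop` is hypothetical. The stop filter is dropped from both analyzers, and the synonym rule lists the variants with and without "of".

```console
PUT /test-no-stop
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        },
        "english_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "synonym_graph_tokenfilter"]
        }
      },
      "filter": {
        "synonym_graph_tokenfilter": {
          "type": "synonym_graph",
          "synonyms": ["world of war, world war, wow"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english_analyzer",
        "search_analyzer": "english_search_analyzer"
      }
    }
  }
}
```

This configuration has not been verified here. Re-running the bulk and search requests from the earlier comment against this index would show whether all three documents are returned.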
