Elasticsearch: 'Failed to build synonyms' when using delimiter_graph + synonym_graph - 6.2.3

Created on 9 Apr 2018 · 22 comments · Source: elastic/elasticsearch

Hi there,

In the process of upgrading one of my clients from ES 5.5.1 to ES 6.2.3, I ran into an issue when trying to create an index in ES 6. I worked out a small snippet to highlight the problem:

PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "delimiter_search": {
          "type": "word_delimiter_graph",
          "catenate_all": "true"
        },
        "synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "test1=>test"
          ]
        }
      },
      "analyzer": {
        "match_analyzer_search": {
          "tokenizer": "whitespace",
          "filter": [
            "trim",
            "asciifolding",
            "delimiter_search",
            "lowercase",
            "synonyms"
          ]
        }
      }
    }
  }
}
Generates the following error:

{ "error": { "root_cause": [ { "type": "illegal_argument_exception", "reason": "failed to build synonyms" } ], "type": "illegal_argument_exception", "reason": "failed to build synonyms", "caused_by": { "type": "parse_exception", "reason": "Invalid synonym rule at line 1", "caused_by": { "type": "illegal_argument_exception", "reason": "term: test1 analyzed to a token (test) with position increment != 1 (got: 0)" } } }, "status": 400 }

The actual error comes from my custom synonym file, but I managed to reproduce it with a single term.
If I remove delimiter_search from the analyzer, the index is created without problems.
The above works in ES 5.5.1.

Labels: :Search/Analysis, >enhancement, Search

Most helpful comment

I am having the same problem after upgrading from ES 5 to ES 7: [word_delimiter] cannot be used to parse synonyms. I think ES 5 only emitted warnings, but this stops working in ES 7.
Does anyone have a solution?
The multiplexer solution @romseygeek mentioned works, but it generates different tokens.

All 22 comments

Thanks for your report! The root cause seems to be that since 6.0, the synonym_graph filter tokenizes synonyms with the tokenizer / token filters that appear before it in the chain (see the docs). Because word_delimiter_graph is set to catenate_all = true, the above error occurs.
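
For illustration, the output of the preceding chain for the left-hand side of the rule can be inspected with the _analyze API (a sketch with the same filters defined inline; on 6.x I would expect it to return test1, test and 1, with test sharing a position with test1, which is exactly what the synonym parser rejects):

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "trim",
    "asciifolding",
    {
      "type": "word_delimiter_graph",
      "catenate_all": true
    },
    "lowercase"
  ],
  "text": "test1"
}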

Honestly, I am not sure whether the behaviour here is intended or not, hence deferring to the search / aggs team for a definitive answer.

Pinging @elastic/es-search-aggs

@romseygeek could you take a look at this one?

@danielmitterdorfer is correct. We now use the preceding tokenizer chain to analyze terms in the synonym map, and word_delimiter_graph is producing multiple tokens at the same position, which the map builder doesn't know how to handle.

In the case above, removing the term1=>term mapping should still work, because the delimiter filter is in effect already doing exactly that: term1 produces term, 1 and term1. For other entries you may need to reduce the left hand side of the mapping down to just the part of the term that the delimiter filter outputs.
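
To make that advice concrete, here is a hypothetical sketch (the ipad2 => tablet rule and the test_reduced index name are made up, not from this thread). A raw rule such as ipad2 => tablet would fail because the delimiter splits ipad2 into ipad, 2 and the catenated ipad2; reducing the left-hand side to the token the filter actually emits lets the map build:

PUT test_reduced
{
  "settings": {
    "analysis": {
      "filter": {
        "delimiter_search": {
          "type": "word_delimiter_graph",
          "catenate_all": "true"
        },
        "synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "ipad => tablet"
          ]
        }
      },
      "analyzer": {
        "match_analyzer_search": {
          "tokenizer": "whitespace",
          "filter": [
            "trim",
            "asciifolding",
            "delimiter_search",
            "lowercase",
            "synonyms"
          ]
        }
      }
    }
  }
}

Whether a reduced rule still expresses the intended mapping obviously depends on the data, which is the crux of the follow-up question below.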

Thank you for your reply @romseygeek

Synonyms are created by people in the organization and loaded into a new index every 3 hours. Since we're not in full control of this (huge) file, and the people who enter them don't have any knowledge of Elasticsearch internals, it's hard to filter out these synonyms before creating an index. This is currently an automated process.
Is there any other way to fix this, other than having to delete the synonyms?

Is there any other way to fix this, other than having to delete the synonyms?

I think it would be possible to extend the SynonymMap parsing so that it could handle graph token streams, but it wouldn't be simple. The other immediate workaround would be to check whether you really need the word delimiter filter in there.

I have exactly the same problem when using stopwords + synonym_graph.

Is this issue related to #30968?

Yes, I think #30968 will fix this

Or at least provide a workaround for cases where it's difficult to control and/or sanitise the synonyms list.

@romseygeek Not sure whether I should add this to this issue or not, but I think just adding a flag to ignore exceptions doesn't fully cover the problem introduced by this check.

Take the following setup (a stripped-down version of an actual production mapping in ES 6.3.2):

PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "delimiter": {
          "type": "word_delimiter",
          "catenate_all": true,
          "split_on_numerics": "true",
          "preserve_original": "true"
        },
        "word_breaks": {
          "type": "synonym",
          "synonyms": [
            "snowboard,snow board=>snow_board"
          ]
        }
      },
      "analyzer": {
        "match_analyzer_index": {
          "tokenizer": "whitespace",
          "filter": [
            "asciifolding",
            "lowercase",
            "delimiter",
            "word_breaks"
          ]
        },
        "match_analyzer_search": {
          "tokenizer": "whitespace",
          "filter": [
            "asciifolding",
            "lowercase",
            "delimiter",
            "word_breaks"
          ]
        }
      }
    }
  }
}

Trying to create this index fails with the following error: "term: snow_board analyzed to a token (snow) with position increment != 1 (got: 0)". Does it make sense for the new parsing to also apply tokenization to synonyms on the right-hand side of the arrow?
Note that this setup is without the graph versions of the delimiter & synonym filters.

Trying to create this index fails with the following error: "term: snow_board analyzed to a token (snow) with position increment != 1 (got: 0)". Does it make sense for the new parsing to also apply tokenization to synonyms on the right-hand side of the arrow?

It is required, otherwise you would index terms (snow_board) that you cannot search. I think your problem here is different: you want to apply a word_delimiter and a synonym filter in the same chain, but they don't work well together. The synonym and synonym_graph filters are not able to properly consume a stream that contains multiple terms at the same position (which is what the word_delimiter produces when preserve_original is set to true). You'll need to make sure that your synonym rules already contain delimited input/output.
Regarding the lenient option, it works fine in this case: it ignores the snow_board rule when set to true and fails with an exception when set to false.
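
For reference, this is what enabling that option looks like on the word_breaks filter from the snippet above (lenient is the flag discussed around #30968; as described, it drops the offending rule rather than making it work):

        "word_breaks": {
          "type": "synonym",
          "lenient": true,
          "synonyms": [
            "snowboard,snow board=>snow_board"
          ]
        }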

This issue also occurs when you have a filter like hunspell which can, for some words, produce multiple variants of the same token. In our case, using the nl_NL locale for hunspell and the alias rule fiets, stalen ros, this completely breaks even though stalen and ros are valid tokens that hunspell doesn't remove. Instead it adds additional tokens (stal and staal) into the stream.

This check therefore doesn't prevent invalid unsearchable tokens, but instead prevents a perfectly valid synonym from being used.

Also please note that with index-time synonyms it is quite common for people to use a different search_analyzer without the synonyms, which can produce different tokens, so our assumptions about what you cannot search can be wrong (see the sketch after the reproduction below).

reproduction:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "test": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stem",
            "synonyms"
          ]
        }
      },
      "filter": {
        "stem": {
          "type": "hunspell",
          "locale": "nl_NL"
        },
        "synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "fiets, stalen ros"
          ]
        }
      }
    }
  }
}
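
As a hypothetical sketch of the search_analyzer point above (the index, field and analyzer names are made up, the hunspell filter is left out, and a 7.x-style typeless mapping is assumed): the field is indexed with the synonyms applied, while the search analyzer omits them, so the two sides can legitimately produce different tokens:

PUT synonym_field_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "fiets, stalen ros"
          ]
        }
      },
      "analyzer": {
        "index_with_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonyms"
          ]
        },
        "search_without_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "index_with_synonyms",
        "search_analyzer": "search_without_synonyms"
      }
    }
  }
}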

@HonzaKral I think I chose a poor example, as my actual problem occurs with the Dutch language.
@jimczi This is a much better explanation of the same issue I've been having. This change in synonym analysis limits the way synonyms can be used.

The problem is not limited to tokenization; it occurs whenever there is any preceding graph filter. For example, if the synonyms have been broken into multiple files:
layer_1.txt:
dog => dog, canine

layer_2.txt:
dogfood, dog food

Then you will encounter the same error without the "lenient" option. With lenient, the decompounding rule is not added but silently ignored. This example is a little artificial, but that's my 2 cents.

PUT test/_settings
{
  "settings": {
    "analysis": {
      "analyzer": {
        "test_synonyms": {
          "type": "custom",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "asciifolding",
            "synonym_layer_1",
            "flatten_graph",
            "synonym_layer_2"
          ],
          "tokenizer": "whitespace"
        }
      },
      "filter": {
        "synonym_layer_1": {
          "type": "synonym_graph",
          "synonyms_path": "layer_1.txt"
        },
        "synonym_layer_2": {
          "type": "synonym_graph",
          "synonyms_path": "layer_2.txt"
        }
      }
    }
  }
}

Using the multiplexer filter might help here, I think. If we want to apply both word_delimiter and synonyms, but avoid them interacting with each other, we can put them into separate branches; rewriting the settings in the opening post yields this:

PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "delimiter_search": {
          "type": "word_delimiter_graph",
          "catenate_all": "true",
          "adjust_offsets" : "false"
        },
        "synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "test1=>test"
          ]
        },
        "split" : {
          "type" : "multiplexer",
          "filters" : [
            "delimiter_search,lowercase", 
            "lowercase,synonyms"]
        }
      },
      "analyzer": {
        "match_analyzer_search": {
          "tokenizer": "whitespace",
          "filter": [
            "trim",
            "asciifolding",
            "split"
          ]
        }
      }
    }
  }
}

Note that we need to set adjust_offsets to false in the delimiter_search filter, as otherwise we end up with backwards offsets. This happily tokenizes test1 into the tokens test1, test, 1 and test - the last one being the synonym.
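
Assuming the index above was created, the resulting tokens can be checked with the _analyze API (a usage sketch; I would expect the four tokens listed above, but the exact positions and offsets are worth verifying):

GET test/_analyze
{
  "analyzer": "match_analyzer_search",
  "text": "test1"
}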

Did somebody find a solution for this? This still seems to be a problem with ES 7.1

@dkln did you try the solution using the multiplexer detailed above?

Yes, but that didn't seem to work. I ended up using a char_filter instead.
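
For what it's worth, a guess at what a char_filter-based workaround could look like (this is not the commenter's actual configuration; the index name, the pattern and the test => exam rule are all made up). A pattern_replace char filter splits letters from digits before tokenization, so no word delimiter filter is needed in front of the synonym filter:

PUT test_charfilter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "split_letters_digits": {
          "type": "pattern_replace",
          "pattern": "(\\p{L})(\\d)",
          "replacement": "$1 $2"
        }
      },
      "filter": {
        "synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "test => exam"
          ]
        }
      },
      "analyzer": {
        "match_analyzer_search": {
          "char_filter": [
            "split_letters_digits"
          ],
          "tokenizer": "whitespace",
          "filter": [
            "trim",
            "asciifolding",
            "lowercase",
            "synonyms"
          ]
        }
      }
    }
  }
}

This only covers simple letter/digit splitting; whether a char filter can replace the delimiter filter depends on which word_delimiter features are actually needed.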

In my case I could resolve the errors by setting "lenient": true for the synonym filter and "adjust_offsets": false for the delimiter filter; I did not need the multiplexer.

Before, I got this error (with only "lenient": true):

startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=62,endOffset=69,lastStartOffset=63 for field ...

or without lenient:

term: analyzed (...) to a token (...) with position increment != 1 (got: 2)
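
A sketch of that combination, using the settings from the opening post as a base (lenient and adjust_offsets are both real options; the test_lenient index name is made up, and note that lenient drops the rules it cannot parse rather than making them work):

PUT test_lenient
{
  "settings": {
    "analysis": {
      "filter": {
        "delimiter_search": {
          "type": "word_delimiter_graph",
          "catenate_all": "true",
          "adjust_offsets": "false"
        },
        "synonyms": {
          "type": "synonym_graph",
          "lenient": true,
          "synonyms": [
            "test1=>test"
          ]
        }
      },
      "analyzer": {
        "match_analyzer_search": {
          "tokenizer": "whitespace",
          "filter": [
            "trim",
            "asciifolding",
            "delimiter_search",
            "lowercase",
            "synonyms"
          ]
        }
      }
    }
  }
}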

I am having the same problem after upgrading from ES 5 to ES 7: [word_delimiter] cannot be used to parse synonyms. I think ES 5 only emitted warnings, but this stops working in ES 7.
Does anyone have a solution?
The multiplexer solution @romseygeek mentioned works, but it generates different tokens.

Got the same problem after upgrading from 6.5 to 7.
In 6.5 it works as expected. The documentation does not cover this case.

I tried the multiplexer solution and got a lot of: IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=*,endOffset=*,lastStartOffset=* for field ...
