Elasticsearch: Error on reindex using WordNet synonyms file

Created on 13 Dec 2017 · 11 Comments · Source: elastic/elasticsearch

Elasticsearch version (bin/elasticsearch --version): 6.0.0

Plugins installed: []

JVM version (java -version): 1.8.0_51-b16

OS version (uname -a if on a Unix-like system): macOS 10.12.6 (I know it's not supported, see below)

Description of the problem including expected versus actual behavior:

I'm encountering the following error on indexing when trying to use the wn_s.pl synonyms file (which I've moved to /usr/local/etc/elasticsearch):

{
    "error": {
        "root_cause": [{
            "type": "illegal_argument_exception",
            "reason": "failed to build synonyms"
        }],
        "type": "illegal_argument_exception",
        "reason": "failed to build synonyms",
        "caused_by": {
            "type": "parse_exception",
            "reason": "Invalid synonym rule at line 2",
            "caused_by": {
                "type": "illegal_argument_exception",
                "reason": "term: physical entity analyzed to a token with posinc != 1"
            }
        }
    }
}

Here's the line it's objecting to:

s(100001930,1,'physical entity',n,1,0). 

I'm using the WordNet Prolog synonyms file from http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz2
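For anyone unfamiliar with WordNet support in Elasticsearch, a synonym filter referencing this file is typically declared in the index settings along these lines (an illustrative sketch with made-up analyzer and filter names, not my actual settings):

curl -XPUT '[elasticsearch-url]/[index]?pretty' -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "filter": {
        "wordnet_synonyms": {
          "type": "synonym",
          "format": "wordnet",
          "synonyms_path": "wn_s.pl"
        }
      },
      "analyzer": {
        "my_synonym_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "wordnet_synonyms"]
        }
      }
    }
  }
}
'

(synonyms_path is resolved relative to the Elasticsearch config directory, which is /usr/local/etc/elasticsearch in a Homebrew install.)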

Downgrading to 5.6 resolved the issue, so the bug appears to have been introduced in the 6.0.0 release. Although this is happening on an unsupported system, I opened this issue at the suggestion of @spinscale in the forums.

Happy to provide any other relevant info!



All 11 comments

Looks like this is not an ES issue; it comes from Apache Lucene (ES 5.6 used a different Lucene version). Specifically, the error is raised by Lucene's WordnetSynonymParser and SynonymMap classes.

You should raise this issue in: https://issues.apache.org/jira/projects/LUCENE

Thanks for running that down! Not sure when I'll have time to report that to them though, TBH.

For anybody else that needs WordNet support, though, the workaround at the moment is to just downgrade to 5.6.

@techpeace thanks. I have created a Lucene issue for this problem: https://issues.apache.org/jira/browse/LUCENE-8100

@techpeace What analyzers are you using for indexing synonyms?

++ to see the analysis chain; e.g. are you running StopFilter before the syn filter?

It looks as though the problem is the analyzer used to parse the synonyms file. SynonymMap checks that the analysis output is a continuous stream rather than a graph, and the (fairly cryptic!) error message @techpeace reports above indicates that either there are extra tokens or gaps in the stream.

In 5.6, Elasticsearch parsed synonym rules with a predefined tokenizer (whitespace by default) and optionally lowercased everything; in 6.0 it uses the analysis chain defined on that field, up to the synonym filter. So at a guess your analysis chain is producing a graph somehow, which then makes the SynonymMap build fail.

It can also happen with a token filter that repeats a term, such as a phonetic filter that preserves the original term or the repeat token filter. Stacked tokens are not allowed when analyzing synonym rules; see https://github.com/elastic/elasticsearch/issues/27481#issuecomment-346806566.
Either way, your analysis chain contains the answer...
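For example, with the analysis-phonetic plugin installed, settings like these (illustrative names, not the reporter's configuration) fail in exactly the same way, because the phonetic filter keeps the original term and emits the encoded form at the same position while the synonym rules are being parsed:

curl -XPUT '[elasticsearch-url]/[index]?pretty' -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "filter": {
        "my_phonetic": {
          "type": "phonetic",
          "encoder": "double_metaphone",
          "replace": false
        },
        "my_synonyms": {
          "type": "synonym",
          "format": "wordnet",
          "synonyms_path": "wn_s.pl"
        }
      },
      "analyzer": {
        "broken_synonym_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "my_phonetic", "my_synonyms"]
        }
      }
    }
  }
}
'

Moving my_synonyms ahead of my_phonetic in the filter list (or keeping any token-stacking filter out of the chain that feeds the synonym filter) lets the file load, because the rules are then parsed from a plain token stream.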

I'm new to Elasticsearch, so I will admit to not knowing what the analysis chain is, nor how to report its contents. What I can say is that this was a stock Elasticsearch install, and I did nothing (that I'm aware of) to configure the analysis chain. I was using the searchkick library inside a Rails app.

I'd be happy to stand up another instance running 6.0.0 and report the contents of the analysis chain if somebody could fill me in on how to do that. I can also open an issue on searchkick if it's determined that's where the problem resides.

If you run this against your ES instance it should tell us how the synonym has been analyzed. You'll need to replace [elasticsearch-url], [index] and [field] with the relevant values from your setup.

curl -XGET '[elasticsearch-url]/[index]/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
  "field" : "[field]",
  "text" : "physical entity"
}
'
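If the chain is stacking tokens, you'll see more than one token sharing the same position in the response. For instance, a chain that includes a shingle-style filter would return something roughly like this for that text (an illustrative response, not output from your cluster):

{
  "tokens": [
    {
      "token": "physical",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    },
    {
      "token": "physical entity",
      "start_offset": 0,
      "end_offset": 15,
      "type": "shingle",
      "position": 0
    },
    {
      "token": "entity",
      "start_offset": 9,
      "end_offset": 15,
      "type": "word",
      "position": 1
    }
  ]
}

The two tokens at position 0 are exactly the kind of stacked pair the synonym parser rejects with the "posinc != 1" message.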

I checked the code in https://github.com/ankane/searchkick; each field is defined with a shingle filter that builds n-grams of size 1 and 2. This means that the synonym rule physical entity produces three tokens: physical, entity and physical entity (the shingle of size 2). This filter is set before the synonym filter, so it breaks the synonym map with an ambiguous rule; synonym inputs cannot have multiple forms. I hope you don't mind if I close this issue. The breaking change is documented, so you should open an issue in the searchkick repository directly (you can add a link to this issue to provide more info for the searchkick maintainers).
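In other words, the difference comes down to the order of the token filters on that field. With illustrative names (not searchkick's actual ones):

"filter": ["lowercase", "my_shingle", "my_synonyms"]   <- fails: shingles are stacked before the synonym parser sees the rule
"filter": ["lowercase", "my_synonyms", "my_shingle"]   <- loads: synonym rules are parsed from the plain token stream

Whether reordering the chain (or parsing synonyms with a simpler analyzer) is the right fix for searchkick is for its maintainers to decide.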

Thanks so much for the help, everyone! I'll open an issue on searchkick directly.

