Elasticsearch: custom word-splitting index setting which works in 5.x fails in 6.x with "illegal_argument_exception"

Created on 1 Feb 2018 · 6 comments · Source: elastic/elasticsearch

Elasticsearch version (bin/elasticsearch --version): 6.1, 6.2

Plugins installed: []

JVM version (java -version): 1.8

OS version (uname -a if on a Unix-like system): Ubuntu 14

Description of the problem including expected versus actual behavior:

The following index-settings and mapping script works in versions 5.3.x and 5.6.x but fails when run against 6.1 and 6.2:

DELETE test
PUT test
{
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
            "analysis": {
                "filter": {
                    "split_words": {
                        "split_on_numerics": false,
                        "generate_word_parts": true,
                        "type": "word_delimiter",
                        "preserve_original": true,
                        "stem_english_possessive": false
                    }
                },
                "analyzer": {
                    "path": {
                        "filter": [
                            "split_words"
                        ],
                        "tokenizer": "file_path_tokenizer"
                    }
                },
                "tokenizer": {
                    "file_path_tokenizer": {
                        "reverse": "true",
                        "type": "path_hierarchy"
                    }
                }
            }
        }
    },
    "mappings": {
        "doc": {
            "dynamic": "false",
            "properties": {
                "name": {
                    "type": "text",
                    "analyzer": "path"
                }
            }
        }
    }
}

POST test/doc/1
{"name": "HKLM/SOFTWARE/soft"}

GET /test/_analyze
{
  "field": "name",
  "text": [
    "HKLM/SOFTWARE/soft"
  ]
}

Comments

  • The above script is a simplified version of the original script that reproduces the problem.

  • When run against Elasticsearch 5.x, the POST command in the above script indexes the document; when run against 6.1 or 6.2, the same command returns the following error:

> {"name": "HKLM\\SOFTWARE\\soft"}'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=5,endOffset=18,lastStartOffset=14 for field 'name'"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=5,endOffset=18,lastStartOffset=14 for field 'name'"
  },
  "status" : 400
}

  • The error does not reproduce if the "split_words" filter is removed from the settings clause; a minimal illustration of the interaction is sketched below.
Labels: :Search/Analysis, >bug, v6.1.0, v6.1.2
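
For illustration, the backwards offsets can be observed directly with the _analyze API by defining the two analysis components inline (a sketch, not part of the original report):

GET _analyze
{
  "tokenizer": {
    "type": "path_hierarchy",
    "reverse": true
  },
  "filter": [
    {
      "type": "word_delimiter",
      "split_on_numerics": false,
      "generate_word_parts": true,
      "preserve_original": true,
      "stem_english_possessive": false
    }
  ],
  "text": "HKLM/SOFTWARE/soft"
}

The reversed path_hierarchy tokenizer emits "HKLM/SOFTWARE/soft" (offsets 0-18), then "SOFTWARE/soft" (5-18), then "soft" (14-18). With preserve_original, word_delimiter re-emits "SOFTWARE/soft" at startOffset 5 after having already emitted a token starting at offset 14, which matches the startOffset=5 / lastStartOffset=14 values in the error above.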

All 6 comments

@romseygeek Would you be able to have a look at this?

Lucene 7.0 made the IndexWriter more aggressive about rejecting backwards offsets, which can cause exceptions in highlighting (https://issues.apache.org/jira/browse/LUCENE-7626). To fix this, you should be able to replace word_delimiter with word_delimiter_graph, which has offset-correction logic; a sketch of the change follows.
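
As a concrete sketch of that change (assuming the settings from the original script; only the filter type switches, and the parameter names are the same for the graph filter):

DELETE test
PUT test
{
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
            "analysis": {
                "filter": {
                    "split_words": {
                        "type": "word_delimiter_graph",
                        "split_on_numerics": false,
                        "generate_word_parts": true,
                        "preserve_original": true,
                        "stem_english_possessive": false
                    }
                },
                "analyzer": {
                    "path": {
                        "filter": [
                            "split_words"
                        ],
                        "tokenizer": "file_path_tokenizer"
                    }
                },
                "tokenizer": {
                    "file_path_tokenizer": {
                        "reverse": "true",
                        "type": "path_hierarchy"
                    }
                }
            }
        }
    },
    "mappings": {
        "doc": {
            "dynamic": "false",
            "properties": {
                "name": {
                    "type": "text",
                    "analyzer": "path"
                }
            }
        }
    }
}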

@romseygeek but this is still a bug, correct? And the use of word_delimiter_graph is only a workaround.

It's a pre-existing bug in word_delimiter that has only been exposed at index time recently, yes. WordDelimiterFilter (the underlying Lucene token filter here) has been deprecated in favour of WordDelimiterGraphFilter, so it's unlikely to be fixed. We should also deprecate it in Elasticsearch and point users to word_delimiter_graph instead.

We can confirm that the problem appears to be solved by using "word_delimiter_graph"; a quick check is sketched below. Based on the exchange above, I'll leave it to you to decide when to close this GitHub entry, if we agree that this is the workaround.
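
For reference, re-running the failing steps against an index created with the word_delimiter_graph settings sketched above should now succeed without the backwards-offset error:

POST test/doc/1
{"name": "HKLM/SOFTWARE/soft"}

GET /test/_analyze
{
  "field": "name",
  "text": [
    "HKLM/SOFTWARE/soft"
  ]
}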

I'm closing this for now, and have opened a new issue to deprecate word_delimiter (#29061)
