Elasticsearch: custom word-splitting index setting which works in 5.x fails in 6.x with "illegal_argument_exception"

Created on 1 Feb 2018 · 6 comments · Source: elastic/elasticsearch

Elasticsearch version (bin/elasticsearch --version): 6.1, 6.2

Plugins installed: []

JVM version (java -version): 1.8

OS version (uname -a if on a Unix-like system): Ubuntu 14

Description of the problem including expected versus actual behavior:

The following index-settings and mapping script works in versions 5.3.x and 5.6.x but fails when run against 6.1 and 6.2:

DELETE test
PUT test
{
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
            "analysis": {
                "filter": {
                    "split_words": {
                        "split_on_numerics": false,
                        "generate_word_parts": true,
                        "type": "word_delimiter",
                        "preserve_original": true,
                        "stem_english_possessive": false
                    }
                },
                "analyzer": {
                    "path": {
                        "filter": [
                            "split_words"
                        ],
                        "tokenizer": "file_path_tokenizer"
                    }
                },
                "tokenizer": {
                    "file_path_tokenizer": {
                        "reverse": "true",
                        "type": "path_hierarchy"
                    }
                }
            }
        }
    },
    "mappings": {
        "doc": {
            "dynamic": "false",
            "properties": {
                "name": {
                    "type": "text",
                    "analyzer": "path"
                }
            }
        }
    }
}

POST test/doc/1
{"name": "HKLM/SOFTWARE/soft"}

GET /test/_analyze
{
  "field": "name",
  "text": [
    "HKLM/SOFTWARE/soft"
  ]
}

Comments

  • The above script is a simplified version of the original script that reproduces the problem.

  • When run against Elasticsearch 5.x, the POST command in the above script indexes the document; when run against 6.1 or 6.2, the same command returns the following error:

> {"name": "HKLM\\SOFTWARE\\soft"}'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=5,endOffset=18,lastStartOffset=14 for field 'name'"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=5,endOffset=18,lastStartOffset=14 for field 'name'"
  },
  "status" : 400
}

  • The error does not reproduce if the "split_words" filter is removed from the settings clause; a minimal illustration of the interaction is sketched below.
Labels: :Search/Analysis, >bug, v6.1.0, v6.1.2
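
For illustration, the backwards offsets can be observed directly with the _analyze API by defining the two analysis components inline (a sketch, not part of the original report):

GET _analyze
{
  "tokenizer": {
    "type": "path_hierarchy",
    "reverse": true
  },
  "filter": [
    {
      "type": "word_delimiter",
      "split_on_numerics": false,
      "generate_word_parts": true,
      "preserve_original": true,
      "stem_english_possessive": false
    }
  ],
  "text": "HKLM/SOFTWARE/soft"
}

The reversed path_hierarchy tokenizer emits "HKLM/SOFTWARE/soft" (offsets 0-18), then "SOFTWARE/soft" (5-18), then "soft" (14-18). With preserve_original, word_delimiter re-emits "SOFTWARE/soft" at startOffset 5 after having already emitted a token starting at offset 14, which matches the startOffset=5 / lastStartOffset=14 values in the error above.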

All 6 comments

@romseygeek Would you be able to have a look at this?

Lucene 7.0 made the IndexWriter more aggressive about rejecting backwards offsets, which can cause exceptions in highlighting (https://issues.apache.org/jira/browse/LUCENE-7626). To fix this, you should be able to replace word_delimiter with word_delimiter_graph, which has offset-correction logic; a sketch of the change follows.
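
As a concrete sketch of that change (assuming the settings from the original script; only the filter type switches, and the parameter names are the same for the graph filter):

DELETE test
PUT test
{
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
            "analysis": {
                "filter": {
                    "split_words": {
                        "type": "word_delimiter_graph",
                        "split_on_numerics": false,
                        "generate_word_parts": true,
                        "preserve_original": true,
                        "stem_english_possessive": false
                    }
                },
                "analyzer": {
                    "path": {
                        "filter": [
                            "split_words"
                        ],
                        "tokenizer": "file_path_tokenizer"
                    }
                },
                "tokenizer": {
                    "file_path_tokenizer": {
                        "reverse": "true",
                        "type": "path_hierarchy"
                    }
                }
            }
        }
    },
    "mappings": {
        "doc": {
            "dynamic": "false",
            "properties": {
                "name": {
                    "type": "text",
                    "analyzer": "path"
                }
            }
        }
    }
}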

@romseygeek but this is still a bug, correct? And the use of word_delimiter_graph is only a workaround.

It's a pre-existing bug in word_delimiter that has only been exposed at index time recently, yes. WordDelimiterFilter (the underlying Lucene token filter here) has been deprecated in favour of WordDelimiterGraphFilter, so it's unlikely to be fixed. We should also deprecate it in Elasticsearch and point users to word_delimiter_graph instead.

We can confirm that the problem appears to be solved by using "word_delimiter_graph"; a quick check is sketched below. Based on the exchange above, I'll leave it to you to decide when to close this GitHub entry, if we agree that this is the workaround.
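
For reference, re-running the failing steps against an index created with the word_delimiter_graph settings sketched above should now succeed without the backwards-offset error:

POST test/doc/1
{"name": "HKLM/SOFTWARE/soft"}

GET /test/_analyze
{
  "field": "name",
  "text": [
    "HKLM/SOFTWARE/soft"
  ]
}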

I'm closing this for now, and have opened a new issue to deprecate word_delimiter (#29061)
