**Elasticsearch version** (`bin/elasticsearch --version`): 6.1, 6.2
**Plugins installed**: []
**JVM version** (`java -version`): 1.8
**OS version** (`uname -a` if on a Unix-like system): Ubuntu 14
**Description of the problem including expected versus actual behavior**:
The following index-settings and mapping script works in versions 5.3.x and 5.6.x but fails when run against 6.1 and 6.2:
```
DELETE test

PUT test
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
        "filter": {
          "split_words": {
            "split_on_numerics": false,
            "generate_word_parts": true,
            "type": "word_delimiter",
            "preserve_original": true,
            "stem_english_possessive": false
          }
        },
        "analyzer": {
          "path": {
            "filter": [
              "split_words"
            ],
            "tokenizer": "file_path_tokenizer"
          }
        },
        "tokenizer": {
          "file_path_tokenizer": {
            "reverse": "true",
            "type": "path_hierarchy"
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "dynamic": "false",
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "path"
        }
      }
    }
  }
}

POST test/doc/1
{"name": "HKLM/SOFTWARE/soft"}

GET /test/_analyze
{
  "field": "name",
  "text": [
    "HKLM/SOFTWARE/soft"
  ]
}
```
Comments
The script above is a simplified version of the original script that reproduces the problem.
When run against a 5.x cluster, the POST command in the script indexes the document; when run against 6.1 or 6.2, the same command returns the following error:
```json
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=5,endOffset=18,lastStartOffset=14 for field 'name'"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=5,endOffset=18,lastStartOffset=14 for field 'name'"
  },
  "status" : 400
}
```
@romseygeek Would you be able to have a look at this?
Lucene 7.0 made the `IndexWriter` more aggressive about rejecting backwards offsets, which could previously cause exceptions in highlighting (https://issues.apache.org/jira/browse/LUCENE-7626). To fix this, you should be able to replace `word_delimiter` with `word_delimiter_graph`, which has offset correction logic.
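Applied to the script above, the suggested fix would only change the filter's `type`; as far as I can tell, `word_delimiter_graph` accepts the same options used here, so the rest of the definition should carry over unchanged (a sketch, not verified against every version):

```json
"filter": {
  "split_words": {
    "split_on_numerics": false,
    "generate_word_parts": true,
    "type": "word_delimiter_graph",
    "preserve_original": true,
    "stem_english_possessive": false
  }
}
```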
@romseygeek But this is still a bug, correct? And using `word_delimiter_graph` is only a workaround.
It's a pre-existing bug in `word_delimiter` that's only been exposed at index time recently, yes. `WordDelimiterFilter` (the underlying Lucene tokenfilter here) has been deprecated in favour of `WordDelimiterGraphFilter`, so it's unlikely to be fixed. We should also deprecate it in elasticsearch, and point users to `word_delimiter_graph` instead.
We can confirm that the problem is resolved by using `word_delimiter_graph`. Based on the exchange above, I'll leave it to you to decide when to close this GitHub issue, if we agree that this is the workaround.
I'm closing this for now, and have opened a new issue to deprecate word_delimiter (#29061)