This effects 7.7.0+
This is a pretty particular error to hit, so I am assuming the regression was not intentional. It appears that for an invalid scripted field, it is failing only one of the shards and skipping the rest which is resulting in a hits response instead of an error.
This change broke a test we had in Kibana, preventing the promotion of our nightly snapshot. Looking for guidance on if this is an accidental regression or we should update our checks accordingly.
Here is the minimum reproduction I could come up with:
cURL friendly commands
curl -XPUT "http://localhost:9200/test-1" -H 'Content-Type: application/json' -d'{ "settings": { "number_of_shards": 1 }, "mappings": { "properties": { "@timestamp": { "type": "date" } } }}'
curl -XPUT "http://localhost:9200/test-2" -H 'Content-Type: application/json' -d'{ "settings": { "number_of_shards": 1 }, "mappings": { "properties": { "@timestamp": { "type": "date" } } }}'
curl -XPOST "http://localhost:9200/test-*/_search" -H 'Content-Type: application/json' -d'{ "version": true, "size": 500, "sort": [ { "@timestamp": { "order": "desc", "unmapped_type": "boolean" } } ], "script_fields": { "invalid_scripted_field": { "script": { "source": "invalid", "lang": "painless" } } }, "query": { "bool": { "must": [], "filter": [ { "match_all": {} }, { "range": { "@timestamp": { "gte": "2020-03-25T20:26:57.120Z", "lte": "2020-03-25T20:41:57.120Z", "format": "strict_date_optional_time" } } } ] } }}'
PUT /test-1
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
}
}
}
}
PUT /test-2
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
}
}
}
}
POST /test-*/_search
{
"version": true,
"size": 500,
"sort": [
{
"@timestamp": {
"order": "desc",
"unmapped_type": "boolean"
}
}
],
"script_fields": {
"invalid_scripted_field": {
"script": {
"source": "invalid",
"lang": "painless"
}
}
},
"query": {
"bool": {
"must": [],
"filter": [
{
"match_all": {}
},
{
"range": {
"@timestamp": {
"gte": "2020-03-25T20:26:57.120Z",
"lte": "2020-03-25T20:41:57.120Z",
"format": "strict_date_optional_time"
}
}
}
]
}
}
}
This failed search attempt is returning _shards and hits, where the only mention of a failure is within the _shards.failures.
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 2,
"successful" : 1,
"skipped" : 1,
"failed" : 1,
"failures" : [
{
"shard" : 0,
"index" : "test-1",
"node" : "3XFk8qtTTu-C0cBdpLOBPw",
"reason" : {
"type" : "script_exception",
"reason" : "compile error",
"script_stack" : [
"invalid",
"^---- HERE"
],
"script" : "invalid",
"lang" : "painless",
"position" : {
"offset" : 0,
"start" : 0,
"end" : 7
},
"caused_by" : {
"type" : "illegal_argument_exception",
"reason" : "variable [invalid] is not defined"
}
}
}
]
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : 0.0,
"hits" : [ ]
}
}
Compare this to the previous behavior, where the response was an error with a root_cause and failed_shards:
{
"error" : {
"root_cause" : [
{
"type" : "script_exception",
"reason" : "compile error",
"script_stack" : [
"invalid",
"^---- HERE"
],
"script" : "invalid",
"lang" : "painless",
"position" : {
"offset" : 0,
"start" : 0,
"end" : 7
}
}
],
"type" : "search_phase_execution_exception",
"reason" : "all shards failed",
"phase" : "query",
"grouped" : true,
"failed_shards" : [
{
"shard" : 0,
"index" : "test-1",
"node" : "3XFk8qtTTu-C0cBdpLOBPw",
"reason" : {
"type" : "script_exception",
"reason" : "compile error",
"script_stack" : [
"invalid",
"^---- HERE"
],
"script" : "invalid",
"lang" : "painless",
"position" : {
"offset" : 0,
"start" : 0,
"end" : 7
},
"caused_by" : {
"type" : "illegal_argument_exception",
"reason" : "variable [invalid] is not defined"
}
}
}
]
},
"status" : 400
}
Pinging @elastic/es-search (:Search/Search)
This is a side effect of https://github.com/elastic/elasticsearch/pull/53873 that changes the default for pre_filter_shard_size. We try to skip shards that cannot match the query more aggressively by default in 7.7.
However you can reproduce easily in previous versions by setting pre_filter_shard_size to 1 in your request:
POST /test-*/_search?pre_filter_shard_size=1
So this is not a regression per say, but I agree that it is confusing.
Currently we consider that a skipped shard is a successful shard.
This is why the response contains:
"_shards" : {
"total" : 2,
"successful" : 1,
"skipped" : 1,
"failed" : 1,
}
By default a search request is considered successful if at least one shard is successful.
To make the behavior more consistent with queries that don't skip shards, we should change the heuristic and consider a request successful only if if it has a non-skipped successful shard:
successful > 0 && successful > skipped.
It's currently impossible to skip all shards since we force at least one execution so total is always greater than skipped.
++ on failing the request if the only shard counted successful was in fact skipped.
I also wonder if we should look at improving the _shards section long-term. I know there are edge cases and that total is not always the sum of successful and failed, but the fact that skipped are also counted as successful is possibly counterintuitive?
That was fixed in https://github.com/elastic/elasticsearch/pull/64337, hence closing
Most helpful comment
This is a side effect of https://github.com/elastic/elasticsearch/pull/53873 that changes the default for
pre_filter_shard_size. We try to skip shards that cannot match the query more aggressively by default in 7.7.However you can reproduce easily in previous versions by setting
pre_filter_shard_sizeto 1 in your request:POST /test-*/_search?pre_filter_shard_size=1So this is not a regression per say, but I agree that it is confusing.
Currently we consider that a skipped shard is a successful shard.
This is why the response contains:
"_shards" : { "total" : 2, "successful" : 1, "skipped" : 1, "failed" : 1, }By default a search request is considered successful if at least one shard is successful.
To make the behavior more consistent with queries that don't skip shards, we should change the heuristic and consider a request successful only if if it has a non-skipped successful shard:
successful > 0 && successful > skipped.It's currently impossible to skip all shards since we force at least one execution so
totalis always greater thanskipped.