Elasticsearch version: 5.2.2
Plugins installed: none
JVM version: 1.8.0_111
OS version: Windows 10
Description of the problem including expected versus actual behavior:
Just upgraded from 5.0.1 and now hitting this issue.
When a document's nested array property is empty and we use source filtering to exclude any property of the nested documents, the empty array property is missing from the result entirely.
Steps to reproduce:
2. Add a couple of documents, one of which should have an empty array of nested documents:
{
"strprop": "document string property",
"boolprop": true,
"intprop": 54321,
"mynesteddocuments": []
}
3. Search for the documents, using source filtering to exclude, say, the "intprop" property of the nested documents:
{
"_source": {
"excludes": [
"mynesteddocuments.intprop"
]
},
"query": {
"match_all": {}
}
}
I believe this may be somewhat related to #22557 and #22593.
Full script to reproduce:
PUT test
{
"mappings": {
"mydocument": {
"properties": {
"mynesteddocuments": {
"type": "nested"
}
}
}
}
}
POST test/mydocument/1
{
"strprop": "document string property",
"boolprop": true,
"intprop": 54321,
"mynesteddocuments": []
}
POST test/mydocument/2
{
"strprop": "document string property",
"boolprop": true,
"intprop": 54322,
"mynesteddocuments": [{ "foo:": "bar"}]
}
POST test/mydocument/3
{
"strprop": "document string property",
"boolprop": true,
"intprop": 54323,
"mynesteddocuments": [{ "foo:": "bar"}]
}
GET test/_search
{
"_source": {
"excludes": [
"mynesteddocuments.intprop"
]
},
"query": {
"match_all": {}
}
}
Note that in the above, the source filtering excludes mynesteddocuments.intprop, which does not exist on any of the nested documents. If you instead exclude intprop (see below), the document with no nested docs is returned with "mynesteddocuments": [], so the mynesteddocuments array is omitted only when the source excludes target fields inside the nested documents.
GET test/_search
{
"_source": {
"excludes": [
"intprop"
]
},
"query": {
"match_all": {}
}
}
Also note that this is not specific to nested documents. If you index the array as an embedded object (See below) you can reproduce the same thing:
DELETE test
POST test/mydocument/1
{
"strprop": "document string property",
"boolprop": true,
"intprop": 54321,
"mynesteddocuments": []
}
POST test/mydocument/2
{
"strprop": "document string property",
"boolprop": true,
"intprop": 54322,
"mynesteddocuments": [{ "foo:": "bar"}]
}
POST test/mydocument/3
{
"strprop": "document string property",
"boolprop": true,
"intprop": 54323,
"mynesteddocuments": [{ "foo:": "bar"}]
}
GET test/_search
{
"_source": {
"excludes": [
"mynesteddocuments.intprop"
]
},
"query": {
"match_all": {}
}
}
This needs more investigation: we need to figure out the edge cases before we can decide how to make things more consistent.
What should we do with dots in field names, e.g. when a field is named foo.bar.baz and you exclude foo.bar?
Hi!
We have a lot of issues in our API because of this breaking change… Instead of just receiving an empty array, we now have to handle the case where the field is missing and replace it with an empty array on the fly, to prevent all our applications from crashing when they test the length of the array… 😞
Would it be acceptable to restore the previous behavior in a fix release?
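For illustration only, the on-the-fly patching described above can be quite small. This is a minimal Python sketch, not part of any Elasticsearch client: the function name `restore_empty_arrays` and the list of array fields are assumptions, and it only handles top-level fields.

```python
from typing import Any, Dict, Iterable


def restore_empty_arrays(source: Dict[str, Any], array_fields: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of `source` in which every missing array field is set to []."""
    patched = dict(source)  # shallow copy, the original hit is left untouched
    for field in array_fields:
        patched.setdefault(field, [])  # existing values are never overwritten
    return patched


# _source of document 1 as returned with the exclude on mynesteddocuments.intprop
hit_source = {"strprop": "document string property", "boolprop": True, "intprop": 54321}
patched = restore_empty_arrays(hit_source, ["mynesteddocuments"])
print(len(patched["mynesteddocuments"]))  # prints 0 instead of raising KeyError
```

The obvious drawback is that the caller has to know in advance which fields are arrays.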
@b-viguier yeah, this is a tricky area where we tried to fix things, only to end up breaking something else (in this case your use case, sorry for that). We have decided to take a step back and rethink the entire source filtering logic, including more edge cases. No ETA is known at the moment.
@bleskes Thank you very much for this feedback and your work about this.
We stay tuned for any news 👍
@elastic/es-search-aggs
For anyone else suffering from this issue, we were able to set up a workaround to return the empty lists by including the name of the nested parent in the _source in 6.3.0
Example:
For indexed model
{
"foo": {
"id": "value1",
"bar": []
}
}
$ curl elasticsearch:9200/foo/_search -d '{"_source": ["id", "bar.id"]}' -H 'Content-Type: application/json'
{
"_shards": {
"failed": 0,
"skipped": 0,
"successful": 5,
"total": 5
},
"hits": {
"hits": [
{
...
"_source": {
"id": "value1"
},
"_type": "foo"
}
],
"max_score": 1.0,
"total": 1
}
}
We were able to fix this by adding "bar" to the _source:
$ curl elasticsearch:9200/foo/_search -d '{"_source": ["id", "bar", "bar.id"]}' -H 'Content-Type: application/json'
{
"_shards": {
"failed": 0,
"skipped": 0,
"successful": 5,
"total": 5
},
"hits": {
"hits": [
{
...
"_source": {
"id": "value1",
"bar": []
},
"_type": "foo"
}
],
"max_score": 1.0,
"total": 1
}
}
While our solution described by @MacMcIrish works, it also returns more data that was not explicitly requested. We now prune the result in application logic again as a second step.
Overall this is quite a costly bug for us. Would really like to see this fixed.
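A rough idea of what such a second pruning step can look like, as a Python sketch. The helper names `build_tree` and `prune` and the `payload` field are made up for illustration; the point is only that lists are pruned element by element, so an empty list survives as `[]`.

```python
from typing import Any, Dict, List


def build_tree(paths: List[str]) -> Dict[str, dict]:
    """Turn dotted paths such as "bar.id" into a nested dict of path segments."""
    tree: Dict[str, dict] = {}
    for path in paths:
        node = tree
        for part in path.split("."):
            node = node.setdefault(part, {})
    return tree


def prune(value: Any, tree: Dict[str, dict]) -> Any:
    """Keep only the requested paths; lists are pruned element-wise, so [] stays []."""
    if not tree:  # end of a requested path: keep the value untouched
        return value
    if isinstance(value, list):
        return [prune(item, tree) for item in value]
    if isinstance(value, dict):
        return {key: prune(child, tree[key]) for key, child in value.items() if key in tree}
    return value  # scalar where an object was expected: left as is


requested = ["id", "bar.id"]  # what the application actually wants
# Hypothetical _source values returned for "_source": ["id", "bar", "bar.id"]
sources = [
    {"id": "value1", "bar": []},
    {"id": "value2", "bar": [{"id": "b1", "payload": "large blob"}]},
]
pruned = [prune(source, build_tree(requested)) for source in sources]
print(pruned)
# [{'id': 'value1', 'bar': []}, {'id': 'value2', 'bar': [{'id': 'b1'}]}]
```

This trims what the application sees, but the full "bar" objects still travel over the wire.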
Any update on this?
This issue has become a problem for us again: we have some very large nested fields that we cannot reliably trim out of our response because we need the above workaround. This has led to huge response sizes and long JSON parsing times.
We have now improved our workaround from above with a multi-stage fix.
0) This requires maintaining a local schema that is kept in sync with Elasticsearch and differentiates between single and array nested docs
1) Furthermore, all (nested) docs must have an id field (or some other field that is always present)
2) Before a request is made: for the fields requested, also request the ids of all implicitly requested parent docs (this is necessary for multi-nested array targets)
3) For all nested docs in your local schema that are arrays, where the direct parent is present in the result but the nested docs are not, inject an empty array
4) Prune all ids that were not originally requested
Shortcomings: this cannot differentiate between an empty array and an array that was not indexed. For us this is not an issue, but this depends on your use case.
Not great, but this seems to work so far :crossed_fingers:
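The stages above can be sketched roughly as follows in Python. This is a simplified illustration under several assumptions, not our production code: `ARRAY_PATHS` stands in for the local schema, `id` is the always-present field, all helper names are invented, and the `response_source` value is a hypothetical result for the expanded request. The sketch also only injects arrays that lie on a requested path.

```python
from typing import Any, List, Set

ID_FIELD = "id"                               # stage 1: assumed always-present field
ARRAY_PATHS = {"Users", "Users.Address"}      # stage 0: hypothetical local schema


def requested_parents(requested: List[str]) -> Set[str]:
    """All parent paths implied by the requested dotted fields."""
    parents: Set[str] = set()
    for path in requested:
        parts = path.split(".")
        for depth in range(1, len(parts)):
            parents.add(".".join(parts[:depth]))
    return parents


def expand_includes(requested: List[str]) -> List[str]:
    """Stage 2: also request the id of every implicitly requested parent."""
    extra = [f"{parent}.{ID_FIELD}" for parent in sorted(requested_parents(requested))]
    return requested + [path for path in extra if path not in requested]


def inject_empty_arrays(node: Any, wanted: Set[str], prefix: str = "") -> None:
    """Stage 3: add [] where a known array is missing but its parent is present."""
    if isinstance(node, list):
        for item in node:
            inject_empty_arrays(item, wanted, prefix)
    elif isinstance(node, dict):
        for path in ARRAY_PATHS & wanted:
            parent, _, name = path.rpartition(".")
            if parent == prefix and name not in node:
                node[name] = []
        for key, child in node.items():
            inject_empty_arrays(child, wanted, f"{prefix}.{key}" if prefix else key)


def prune_ids(node: Any, requested: Set[str], prefix: str = "") -> None:
    """Stage 4: drop ids that were only requested as markers."""
    if isinstance(node, list):
        for item in node:
            prune_ids(item, requested, prefix)
    elif isinstance(node, dict):
        id_path = f"{prefix}.{ID_FIELD}" if prefix else ID_FIELD
        if id_path not in requested:
            node.pop(ID_FIELD, None)
        for key, child in node.items():
            prune_ids(child, requested, f"{prefix}.{key}" if prefix else key)


requested = ["Users.Address.country"]
includes = expand_includes(requested)  # value to send as "_source"
# Hypothetical _source returned for `includes`: the second user lost its empty Address
response_source = {"Users": [
    {"id": "u1", "Address": [{"id": "a1", "country": "country1"}]},
    {"id": "u2"},
    {"id": "u3", "Address": [{"id": "a2", "country": "country2"}]},
]}
inject_empty_arrays(response_source, requested_parents(requested))
prune_ids(response_source, set(requested))
print(response_source)
```

After both passes every user is back at its original index and the second one carries "Address": [] again.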
Do we have any update on this issue?
I can see that the changed source filtering behavior still exists in the latest version 7.3.6.
Could we not implement a flag-based configuration where the user can choose between the new and the old behavior?
I'll chime in since this has been open since 2017 with no resolution and is a problematic and unexpected behavior that differs from the expectation: that source excludes would exclude only the _explicit_ exclusion keys in the document and _not_ omit keys not explicitly requested for exclusion.
Seems like this should be a priority issue, as it requires consumers to code around behavior and minimizes the value of source exclusions in streamlining or redacting documents.
@jclausen Very much agreed. We use our workaround at scale (millions of queries per day) and it works really well.
However resolving this properly would allow us to remove quite a bit of hacky code.
@javanna
While exclusions seem to be fixed in version 7.9.0, I am wondering whether a similar change is planned for inclusions as well.
Consider the following scenario.
`POST /test/_doc/1
{
"field" : "value",
"array": [],
"object" : {}
}
POST /test/_doc/2
{
"field" : "value",
"array": [{ "exclude": "bar"}],
"object": { "exclude": "bar" }
}
POST /test/_doc/3
{
"field" : "value",
"array": [{ "exclude": "bar"}, {"include" : "bar"}],
"object": { "exclude": "bar", "include" : "bar" }
}
POST /test/_search?pretty
{
"_source": {"includes": ["name","object.include","array.include"] }
}
{
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "82HEBnQBxtHMu64O7lFB",
"_score": 1,
"_source": {}
},
{
"_index": "test",
"_type": "_doc",
"_id": "9GHFBnQBxtHMu64OElFU",
"_score": 1,
"_source": {}
},
{
"_index": "test",
"_type": "_doc",
"_id": "9WHFBnQBxtHMu64ONFGE",
"_score": 1,
"_source": {
"array": [
{
"include": "bar"
}
],
"object": {
"include": "bar"
}
}
}
]
}`
Shouldn't the first two results include the empty arrays/objects as well? Not including them creates a bigger problem for the nested scenario shown below:
`PUT /test1
{
"mappings": {
"properties": {
"product": { "type": "text" },
"Users": {
"type": "nested",
"properties": {
"name": { "type": "text" },
"Address": {
"type": "nested",
"properties": {
"country": { "type": "text" }
}
}
}
}
}
}
}
POST /test1/_doc/1
{
"product": "Laptop",
"Users": [
{ "name": "user1", "Address": [{ "country": "country1" }] },
{ "name": "user2", "Address": [] },
{ "name": "user3", "Address": [{ "country": "country2" }] },
{ "name": "user4", "Address": [{ "country": "country3" }] }
]
}
POST /test1/_doc/_search
{
"query": {
"nested": {
"path": "Users.Address",
"inner_hits":{},
"query": {
"match" :{ "Users.Address.country": "country2"}
}
}
},
"_source": ["Users.Address.country"]
}
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.9808291,
"hits": [
{
"_index": "test1",
"_type": "_doc",
"_id": "1",
"_score": 0.9808291,
"_source": {
"Users": [
{
"Address": [
{
"country": "country1"
}
]
},
{
"Address": [
{
"country": "country2"
}
]
},
{
"Address": [
{
"country": "country3"
}
]
}
]
},
"inner_hits": {
"Users.Address": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.9808291,
"hits": [
{
"_index": "test1",
"_type": "_doc",
"_id": "1",
"_nested": {
"field": "Users",
"offset": 2,
"_nested": {
"field": "Address",
"offset": 0
}
},
"_score": 0.9808291,
"_source": {
"country": "country2"
}
}
]
}
}
}
}
]
}`
In the above response, inner_hits reports Users[2].Address[0] as the match for the condition country = "country2". However, in the _source object that user appears at index 1 of the Users array, because the user with the empty Address array was dropped. This makes the offset attribute unreliable. In Elasticsearch versions prior to 5.x, we used to get an empty [] if no nested attribute was found, and offsets could be reliably used to pull data from the _source object.
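To make the mismatch concrete, here is a small Python sketch that follows the `_nested` descriptor from the inner hit into two versions of the document: the one that was indexed and the filtered `_source` from the response above. The helper name `resolve_nested` is made up for illustration; the data is copied from the example.

```python
from typing import Any, Dict, Optional


def resolve_nested(source: Dict[str, Any], nested: Optional[Dict[str, Any]]) -> Any:
    """Follow an inner hit's _nested descriptor (field + offset, possibly chained)."""
    node: Any = source
    while nested is not None:
        node = node[nested["field"]][nested["offset"]]
        nested = nested.get("_nested")
    return node


indexed = {"product": "Laptop", "Users": [
    {"name": "user1", "Address": [{"country": "country1"}]},
    {"name": "user2", "Address": []},
    {"name": "user3", "Address": [{"country": "country2"}]},
    {"name": "user4", "Address": [{"country": "country3"}]},
]}
# _source as returned for "_source": ["Users.Address.country"]: user2 is gone
filtered = {"Users": [
    {"Address": [{"country": "country1"}]},
    {"Address": [{"country": "country2"}]},
    {"Address": [{"country": "country3"}]},
]}
nested = {"field": "Users", "offset": 2, "_nested": {"field": "Address", "offset": 0}}

print(resolve_nested(indexed, nested))   # {'country': 'country2'}
print(resolve_nested(filtered, nested))  # {'country': 'country3'}
```

The same offsets point at country2 in the indexed document but at country3 in the filtered _source.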
heya @somdevmehta sorry it took me so long to reply to your comment. I did some digging on this, and I think with inclusions you can have empty arrays preserved by including them explicitly. In your example you would do "_source": {"includes": ["name","object","object.include","array","array.include"] }. The point is that when including, only what you explicitly included will be added to the response. For exclusions it is a bit different as you are excluding and expect all the rest to stay, while before my fix we would remove arrays and objects when they were left empty.
Hey @javanna, thanks for your reply. The problem I see with including the parent attribute ("array" in the above case) in _source is that it causes a lot of unnecessary attributes to appear in the response. In my example, if every array element is a JSON object with 50+ attributes, all of them will be returned in the response even though only one or two are actually needed. Also, no such workaround was required in versions before 5.x.
You are right, thanks for the feedback @somdevmehta. Would you mind opening a new issue to track this, please?