{
    "_source": {
        "base_info": {
            "profile": null,
            "name": null,
            "headline": null,
            "industry": null,
            "avatar": null,
            "email": "******@gmail.com",
            "location": null
        }
    }
}
This is the data I have stored in ES. I can fetch it with the RESTful API, which returns the full document.
When I fetch the doc with elasticsearch-dsl-py, the data in search_result<response> is also complete: {'location': None, 'name': None, 'avatar': None, 'industry': None, 'profile': None, 'headline': None, 'email': '******@gmail.com'}. But when I run result_list = [item.to_dict() for item in search_result],
the returned dict is {'email': '******@gmail.com'}: the items whose value is None are gone!
I finally traced this to elasticsearch_dsl/utils.py, line 371:
# don't serialize empty values
Can anyone tell me the reason?
These lines are there to ensure we are not storing empty fields that might have been created by, for example, just accessing a property that has multi=True, which implicitly creates a list under the hood.
May I ask why you care about those fields and index null values into elasticsearch? What is the use case? Thanks!
In my documents, some fields may be None, so I index null values in ES to keep the data structure consistent.
Whether your fields are None or not present makes no difference when using the dsl library. Before 19649863ea92e748c83a6e3b4ccc75110d988030 there was a difference (discovered because of this issue, thank you!) where missing string fields would return '' instead of None, but that is now fixed, so the behavior should be consistent.
How does returning None instead of '' when accessing missing string fields imply not returning empty values from to_dict()? That seems quite unrelated to me.
@erosennin it is not directly related. The old behavior meant there was no way to tell whether a string field was empty or not defined, which are two different states. That is why it was fixed.
I am closing this issue since it is not clear why there is a need to store empty fields and it is inefficient. Please feel free to reopen with additional details. Thank you!
I'm reviving this issue for a proper discussion on why None values are removed in to_dict() utility function.
I see 3 states here:
1. The key exists and has a value.
2. The key exists and its value is None.
3. The key does not exist.

to_dict() makes an incorrect assumption that if a field has a value of None, it should be placed in state 3, along with the keys that don't exist in the result object. This is not correct.
I can't think of any other serializer that implicitly removes a key if its value is falsy. SQL databases return the key, and REST API serializers (DRF, for example) return the key with a None value. Even ES stores null keys, and the ES API itself returns the keys with null values. There's a reason for that, and it's called consistency. A document (or a result, in this case) is an object, and keys are part of the object's state.
Changing an object's state on the assumption that returning empty fields is inefficient is wrong. At the very least, to_dict() should accept a remove_empty param and leave it up to the user to decide whether it's efficient or not.
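To make the proposal concrete, here is a minimal sketch of a dict serializer with such a flag. The function name `to_plain_dict` is hypothetical, and `remove_empty` is only the name suggested in this comment; this is not the library's code, just an illustration of letting the caller choose.

```python
def to_plain_dict(data, remove_empty=True):
    """Recursively copy ``data``; optionally drop None, [] and {} values."""
    out = {}
    for key, value in data.items():
        if isinstance(value, dict):
            value = to_plain_dict(value, remove_empty)
        elif isinstance(value, list):
            value = [
                to_plain_dict(v, remove_empty) if isinstance(v, dict) else v
                for v in value
            ]
        # Only None, empty lists and empty dicts count as "empty" here;
        # 0, False and '' are real values and are always kept.
        if remove_empty and (value is None or value == [] or value == {}):
            continue
        out[key] = value
    return out


base_info = {"name": None, "email": "user@example.com", "tags": []}
stripped = to_plain_dict(base_info)                      # empty values dropped
complete = to_plain_dict(base_info, remove_empty=False)  # every key kept
```

With the flag on, only `email` survives; with it off, the result has the same keys as the input.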
Hi @GeorgeLubaretsi, thanks for your questions.
May I ask what is the difference for you, where does it matter that those empty fields get removed?
From the point of view of the Python API and of Elasticsearch, I don't know of any difference, and this code prevents surprises from side effects: for example, accessing an undefined field with multi=True sets that field to an empty list (to allow people to do things like doc.tags.append("aa")).
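The side effect described here can be illustrated with a toy class. `Doc` and `multi_fields` below are made up for this sketch and are not how the library is implemented; the point is only that a plain read can leave an empty list behind in the stored data.

```python
class Doc:
    """Toy document: fields named in ``multi_fields`` behave like lists."""

    multi_fields = {"tags"}

    def __init__(self, **fields):
        self._d = dict(fields)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if name in self._d:
            return self._d[name]
        if name in self.multi_fields:
            # Store an empty list so that doc.tags.append(...) works.
            return self._d.setdefault(name, [])
        return None  # missing single-valued fields read as None


doc = Doc(title="Title 3")
keys_before = sorted(doc._d)
_ = doc.tags                  # just a read, but it stores an empty list
keys_after = sorted(doc._d)
doc.tags.append("aa")         # works because the list was stored above
```

After the read, the stored data has a `tags` key that was never explicitly set, which is the kind of empty field the serializer tries not to write out.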
Hi @HonzaKral, thanks for the quick response!
ES is the backend for our API endpoint and is integrated with Django Rest Framework. DRF expects a dictionary with all the keys present for the data model to correctly serialize them with appropriate types.
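One possible workaround on the consumer side, sketched under the assumption that the full list of expected keys is known up front, is to merge each stripped dict over a set of defaults before handing it to the serializer. The helper `with_defaults` and the `BASE_INFO_DEFAULTS` mapping are hypothetical names for this illustration, not part of DRF or es-dsl-py.

```python
import copy

# Assumed defaults for every key the serializer expects to see.
BASE_INFO_DEFAULTS = {
    "name": None,
    "headline": None,
    "email": None,
    "location": None,
}


def with_defaults(doc, defaults):
    """Return a new dict holding every default key, overridden by ``doc``."""
    out = copy.deepcopy(defaults)  # deep copy so mutable defaults are not shared
    out.update(doc)
    return out


stripped = {"email": "user@example.com"}
restored = with_defaults(stripped, BASE_INFO_DEFAULTS)
```

This restores the keys but duplicates the schema outside the document class, which is part of why a flag on to_dict() is the cleaner fix.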
I think compatibility with Django and DRF should be considered, since a big chunk of es-dsl-py users are going to be users of these frameworks, but I don't want to make this issue about compatibility with them. It's a more generic issue. Let's say we have an endpoint that returns articles and their tags as nested objects:
class Article(DocType):
    title = Text()
    body = Text()
    tags = Nested(properties={
        'tag_name': String(),
        'tag_href': String(),
        'attributes_list': String(),
    })
Now if we index some data and query ES, API returns something similar to this:
```json
[
  {
    "title": "Title 1",
    "body": "Body 1",
    "tags": [{
      "tag_name": "Tag 1",
      "tag_href": null,
      "attributes_list": []
    }]
  },
  {
    "title": "Title 2",
    "body": "Body 2",
    "tags": [{
      "tag_name": "Tag 2",
      "tag_href": "/tags/tag2",
      "attributes_list": ["attr1", "attr2"]
    }]
  },
  {
    "title": "Title 3",
    "body": null,
    "tags": []
  }
]
```
But after converting results to dictionary using `to_dict` and then converting exact same results back to JSON, here's what we get:
```json
[
  {
    "title": "Title 1",
    "body": "Body 1",
    "tags": [{
      "tag_name": "Tag 1"
    }]
  },
  {
    "title": "Title 2",
    "body": "Body 2",
    "tags": [{
      "tag_name": "Tag 2",
      "tag_href": "/tags/tag2",
      "attributes_list": ["attr1", "attr2"]
    }]
  },
  {
    "title": "Title 3"
  }
]
```
We now have different data even though we didn't change anything. In the first object, attributes_list was an empty array and tag_href was null; now both are missing. The same goes for body and tags in the last object.
Basically, what I'm saying is: [] !== undefined
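The same distinction can be checked in plain Python with the `json` module. This is only a sketch: the dict comprehension stands in for the stripping step and is not the library's code.

```python
import json

original = json.loads('{"title": "Title 3", "body": null, "tags": []}')

# Stand-in for the stripping step: drop None and empty-list values.
stripped = {k: v for k, v in original.items() if v is not None and v != []}

round_tripped = json.loads(json.dumps(stripped))

body_was_present = "body" in original        # key existed, with value None
body_is_present = "body" in round_tripped    # key no longer exists at all
```

A consumer that does `doc["tags"]` works on `original` and raises `KeyError` on `round_tripped`, even though nothing in the data was edited.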
I understand your issue. The reason why we decided to ignore those empty fields is:
- From the Python perspective there is no difference: tags will return an empty list if not present on the document, and accessing body would return None.
- Some empty fields are created just by being accessed (for example, tags on an empty document), and we don't want those to then bloat the document and the index.

That said, we could totally include a flag to control that behavior for when the data is shared with other systems that don't have similar properties.
implemented in 0ee7feeadcc1fdcdb0817e5e0c6816c5f9cff728
@HonzaKral Question: why isn't the skip_empty parameter implemented in the update function as well? I have a populated field that needs to be updated with an empty value.