Elasticsearch-dsl-py: data lost when calling to_dict() on a result

Created on 11 Sep 2016 · 12 Comments · Source: elastic/elasticsearch-dsl-py

{
  "_source": {
    "base_info": {
      "profile": null,
      "name": null,
      "headline": null,
      "industry": null,
      "avatar": null,
      "email": "******@gmail.com",
      "location": null
    }
  }
}

This is the data I have stored in ES. I can retrieve it with the RESTful API, which returns the full document.
When I fetch the doc using elasticsearch-dsl-py, I can also see the data in search_result (the response), like this: {'location': None, 'name': None, 'avatar': None, 'industry': None, 'profile': None, 'headline': None, 'email': '*****@gmail.com'}. But when I call result_list = [item.to_dict() for item in search_result],
I find that the returned dict is {'email': '******@gmail.com'} and the items whose value is None are gone!

thisisx7

All 12 comments

I finally found it in elasticsearch_dsl/utils.py, line 371:

# don't serialize empty values

Can anyone tell me the reason?
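
As a rough illustration of what a filter like that does, here is a simplified stand-in, not the library's actual code. The function name `serialize_skipping_empty` and the exact emptiness rule (None, empty list, empty dict) are assumptions made for this sketch.

```python
def serialize_skipping_empty(data):
    """Copy a flat dict, dropping keys whose value is None, [] or {}.

    Hypothetical stand-in for the "don't serialize empty values" step;
    the real check in elasticsearch_dsl/utils.py may differ.
    """
    return {key: value for key, value in data.items()
            if value not in (None, [], {})}


base_info = {
    "profile": None,
    "name": None,
    "headline": None,
    "industry": None,
    "avatar": None,
    "email": "******@gmail.com",
    "location": None,
}

print(serialize_skipping_empty(base_info))  # {'email': '******@gmail.com'}
```

Every key whose value is None is dropped, which matches the output reported above.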

thisisx7 on 11 Sep 2016

These lines are there to ensure we are not storing empty fields that might have been created by, for example, just accessing a property that has multi=True, which implicitly creates a list under the hood.

May I ask why you care about those fields and index null values into elasticsearch? What is the use case? Thanks!

HonzaKral on 7 Oct 2016

In my documents, some fields may be None, so I use null values in ES to keep the data structure consistent.

thisisx7 on 9 Oct 2016

Whether your fields are None or not present makes no difference when using the dsl library. Before 19649863ea92e748c83a6e3b4ccc75110d988030 there was a difference (discovered because of this issue, thank you!): missing string fields would return '' instead of None. That is now fixed, so the behavior should be consistent.

HonzaKral on 10 Oct 2016

How does returning None instead of '' for missing string fields justify dropping empty values from to_dict()? The two seem quite unrelated to me.

erosennin on 13 Oct 2016

@erosennin It is not directly related. The old behavior meant there was no way to tell whether a string field was empty or not defined, which are two different states. That is why it was fixed.

I am closing this issue since it is not clear why there is a need to store empty fields and it is inefficient. Please feel free to reopen with additional details. Thank you!

HonzaKral on 30 Nov 2016

I'm reviving this issue for a proper discussion of why None values are removed in the to_dict() utility function.

I see 3 states here:

1. Field is defined in the index and has a value of None.
2. Field is defined and has an empty string as a value.
3. Field is not defined.

to_dict() makes the incorrect assumption that a field with a value of None belongs in state 3, along with the keys that don't exist in the result object. This is not correct.
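
The three states listed above can be told apart on a plain Python dict, and a serializer that drops None values collapses state 1 into state 3. The snippet below is only a sketch: `field_state` is a hypothetical helper and the sample document is made up for illustration.

```python
def field_state(doc, name):
    """Classify a field of a plain dict as 'missing', 'null', 'empty' or 'set'."""
    if name not in doc:
        return "missing"      # state 3: the key is not defined at all
    if doc[name] is None:
        return "null"         # state 1: the key exists with a None value
    if doc[name] == "":
        return "empty"        # state 2: the key exists with an empty string
    return "set"


# Made-up sample document, for illustration only.
original = {"headline": None, "name": "", "email": "user@example.com"}

# Simulate a serializer that drops None values.
stripped = {k: v for k, v in original.items() if v is not None}

print(field_state(original, "headline"))  # null
print(field_state(stripped, "headline"))  # missing
```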

I can't think of any other serializer that implicitly removes a key if its value is falsy. SQL databases return the key, and REST API serializers (DRF for example) return the key with a None value. Even ES stores null keys, and the ES API itself returns the keys with null values. There's a reason for that and it's called consistency. A document (or a result in this case) is an object, and keys are part of the object's state.

Changing an object's state on the assumption that returning empty fields is inefficient is wrong. At the very least, to_dict() should accept a remove_empty param and leave it up to the user to decide whether it's efficient or not.

GeorgeLubaretsi on 15 Jan 2018

Hi @GeorgeLubaretsi, thanks for your questions.

May I ask what is the difference for you, where does it matter that those empty fields get removed?

From the point of view of the Python API and Elasticsearch, I don't know of any difference, and this code prevents surprises from side effects. For example, accessing an undefined field with multi=True sets that field to an empty list (to allow people to do things like doc.tags.append("aa")).
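
The side effect described here can be mimicked with a toy class. This is a sketch only: `LazyDoc` and `MULTI_FIELDS` are hypothetical names and this is not how elasticsearch-dsl is implemented. It merely shows why skipping empty values keeps a plain read from leaking into the serialized document.

```python
class LazyDoc:
    """Toy document; NOT the elasticsearch-dsl implementation."""

    MULTI_FIELDS = {"tags"}  # hypothetical: fields declared with multi=True

    def __init__(self, **fields):
        self._data = dict(fields)

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. for document fields.
        if name.startswith("_"):
            raise AttributeError(name)
        if name in self.MULTI_FIELDS:
            # Reading a multi field creates an empty list as a side effect,
            # so that doc.tags.append(...) works on a fresh document.
            return self._data.setdefault(name, [])
        return self._data.get(name)  # missing scalar fields read as None

    def to_dict(self):
        # Skip empty values so a mere read does not add "tags": [] to the output.
        return {k: v for k, v in self._data.items() if v not in (None, [], {})}


doc = LazyDoc(title="Title 1")
doc.tags                 # a read, yet it stores an empty list under "tags"
print(doc._data)         # {'title': 'Title 1', 'tags': []}
print(doc.to_dict())     # {'title': 'Title 1'}
doc.tags.append("aa")
print(doc.to_dict())     # {'title': 'Title 1', 'tags': ['aa']}
```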

HonzaKral on 15 Jan 2018

Hi @HonzaKral, thanks for the quick response!

ES is the backend for our API endpoint and is integrated with Django Rest Framework. DRF expects a dictionary with all the keys present for the data model to correctly serialize them with appropriate types.

Even though I think that compatibility with Django and DRF should be considered, since a big chunk of es-dsl-py users are going to be users of these frameworks, I don't want to make this issue about compatibility with them. It's a more generic issue. Let's say we have an endpoint that returns articles and their tags as nested objects:

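from elasticsearch_dsl import DocType, Keyword, Nested, Text


class Article(DocType):
    title = Text()
    body = Text()
    # Each tag is a nested object; Keyword stands in for the older String type.
    tags = Nested(properties={
        'tag_name': Keyword(),
        'tag_href': Keyword(),
        'attributes_list': Keyword(),  # may hold a list of values
    })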

Now if we index some data and query ES, the API returns something similar to this:

[
  {
    "title": "Title 1",
    "body": "Body 1",
    "tags": [{
      "tag_name": "Tag 1",
      "tag_href": null,
      "attributes_list": []
    }]
  },
  {
    "title": "Title 2",
    "body": "Body 2",
    "tags": [{
      "tag_name": "Tag 2",
      "tag_href": "/tags/tag2",
      "attributes_list": ["attr1", "attr2"]
    }]
  },
  {
    "title": "Title 3",
    "body": null,
    "tags": []
  }
]

But after converting the results to dictionaries using to_dict() and then converting those same results back to JSON, here's what we get:

[
  {
    "title": "Title 1",
    "body": "Body 1",
    "tags": [{
      "tag_name": "Tag 1"
    }]
  },
  {
    "title": "Title 2",
    "body": "Body 2",
    "tags": [{
      "tag_name": "Tag 2",
      "tag_href": "/tags/tag2",
      "attributes_list": ["attr1", "attr2"]
    }]
  },
  {
    "title": "Title 3"
  }
]

We now have different data even though we didn't change anything. In the first object, attributes_list was an empty array; now it's missing. The same goes for tag_href in the first object and for body and tags in the last one.

Basically, what I'm saying is: [] !== undefined

GeorgeLubaretsi on 15 Jan 2018

I understand your issue. The reasons why we decided to ignore those empty fields are:

- From the dsl point of view there is no difference: accessing tags will return an empty list if it is not present on the document, and accessing body would return None.
- From the point of view of Elasticsearch, there is again no difference.
- Finally, sometimes we create those values as a side effect (like accessing tags on an empty document) and we don't want those to then bloat the document and the index.

That said, we could totally include a flag to control that behavior for when the data is shared with other systems that don't have similar properties.
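
A flag of that kind could look roughly like the following. This is a hedged sketch on plain dicts and lists, not the code from the commit referenced in this thread; the function name `to_plain_dict` and the parameter name `skip_empty` are assumptions chosen for the example.

```python
def to_plain_dict(value, skip_empty=True):
    """Recursively copy nested dicts/lists, optionally dropping empty values."""
    if isinstance(value, dict):
        result = {}
        for key, item in value.items():
            item = to_plain_dict(item, skip_empty=skip_empty)
            if skip_empty and item in (None, [], {}):
                continue  # default behaviour: drop empty values
            result[key] = item
        return result
    if isinstance(value, list):
        return [to_plain_dict(item, skip_empty=skip_empty) for item in value]
    return value


article = {
    "title": "Title 1",
    "body": "Body 1",
    "tags": [{"tag_name": "Tag 1", "tag_href": None, "attributes_list": []}],
}

print(to_plain_dict(article))
# {'title': 'Title 1', 'body': 'Body 1', 'tags': [{'tag_name': 'Tag 1'}]}
print(to_plain_dict(article, skip_empty=False) == article)  # True
```

Keeping the default at "skip" preserves the existing indexing behaviour, while callers that hand the data to another system can opt out per call.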

HonzaKral on 19 Jan 2018

implemented in 0ee7feeadcc1fdcdb0817e5e0c6816c5f9cff728

HonzaKral on 19 Jan 2018

@HonzaKral Question: why isn't the skip_empty parameter implemented in the update function as well? I have a populated field that needs to be updated with an empty value.