Elasticsearch-dsl-py: Truncated Search Results

Created on 9 Feb 2016 · 10 Comments · Source: elastic/elasticsearch-dsl-py

Great project but I am running into issues with what appear to be truncated search results:

I currently have the elasticsearch client 1.9.0 and elasticsearch-dsl 0.1.1 installed via pip.

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

client = Elasticsearch("client_address")

s = Search(using=client, index="index_name")\
    .query("match", field1="field1_value")

response = s.execute()

for hit in response:
    print(hit)

I am getting the hits I would expect in return but the results are being truncated for some reason.

{u'is_response_valid_json': True, u'request_attributes': {},...}

Just wanted to check if this is a known issue or potential user error on my part.

erstwild

All 10 comments

@Erstwild The default behaviour is to return 10 hits.

If you need more, you can do so like this:

response = s[0:100].execute()

If you want all the hits (be careful not to ask for too much at once):

count = s.count()
response = s[0:count].execute()

If you want to paginate, just follow the instructions in the documentation.
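As a rough illustration of the slicing above, here is a minimal sketch of turning a page number into slice offsets. The helper name page_bounds and the page size are made up for this example, and a plain list stands in for the Search object so the snippet runs without a cluster.

```python
def page_bounds(page, per_page=10):
    """Return (start, stop) offsets for a 1-based page number."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    start = (page - 1) * per_page
    return start, start + per_page


# A plain list stands in for the Search object in this sketch.
hits = list(range(95))

start, stop = page_bounds(3, per_page=25)
third_page = hits[start:stop]  # with a Search object: s[start:stop].execute()
```

The same start and stop values can be used in s[start:stop], keeping each request small instead of asking for everything at once.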

Hope this helps.

njoannin on 9 Feb 2016

@njoannin

That is good to know as well. What I meant is that the responses themselves appear to be truncated, as if I am only getting the metadata for the returned documents. Is there a way to easily access/view/pretty-print the entire documents in the search response without having to specify all the field elements?

For example:

for hit in response:
    print(hit.response)

Allows me to access/partially print one object called "response"

erstwild on 9 Feb 2016

@Erstwild The result object has everything in it.

Try:

for hit in response:
   print(hit.to_dict())
   print(hit.meta.to_dict())

That should show you everything you want to see.

However, you don't need to do that to access each field in your document. If you have the following doc_type:

>>> from elasticsearch_dsl import DocType, String, Date
>>>
>>> class MyDocs(DocType):
...     title = String()
...     created_at = Date()
...
>>> MyDocs.init()
>>>
>>> my_doc = MyDocs(title="my doc title")
>>> my_doc.save()

Then I can directly do things like this:

>>> r = MyDocs.search().execute()
>>> hit = r[0]  # (assuming it's the only doc in my index)
>>> hit.title
"my doc title"
>>> hit.title = "my changed doc title"
>>> hit.save()

In other words, you can use the hit.to_dict() for debugging or whatever, but it's not really needed otherwise.

njoannin on 9 Feb 2016

@njoannin

Excellent! Thanks for putting me on the right path. My only other question is whether there is anything to be mindful of when filtering dates/datetimes?

import json

client = Elasticsearch("client_address")

s = Search(using=client, index="index_name")\
    .filter("term", response_datetime="2015-12-11 19:13:25,151")\
    .query("match", field1="field1_value")

response = s.execute()

for hit in response:
    print(response.hits.total)
    print(json.dumps(hit.to_dict(), sort_keys=True, indent=4))
    print(json.dumps(hit.meta.to_dict(), sort_keys=True, indent=4))

Everything works perfectly except with the addition of the datetime filter. I have validated that the datetime value exists in the document where the field1="field1_value" case is true.

erstwild on 10 Feb 2016

The best way is to just use Python's datetime objects; that way it should work. Also note that exact filtering on datetimes might be a bit tricky, since they are stored internally as longs (millisecond timestamps), so it might require you to use that value.
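To make the "millisecond timestamp" point concrete, here is a minimal standard-library sketch that parses the timestamp string from the question and converts it to epoch milliseconds. The helper name to_epoch_millis is made up for this example, and treating the string as UTC is an assumption; the thread does not say which time zone the field uses.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(raw, fmt="%Y-%m-%d %H:%M:%S,%f"):
    """Parse a timestamp string and return milliseconds since the Unix epoch."""
    # Assumption: the string represents a UTC time.
    parsed = datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


millis = to_epoch_millis("2015-12-11 19:13:25,151")
```

Whether an exact match on this value finds the document still depends on how the field was mapped and indexed, so a range filter may be the safer first test.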

Hope this helps

HonzaKral on 10 Feb 2016

@HonzaKral

Hello Honza, thanks for the pointer. I ended up using Python's datetime module, but I still seem to be getting the same behavior for some reason:

from elasticsearch_dsl import Search, Q
from elasticsearch import Elasticsearch
import datetime
import json

target_datetime_string = "2015-12-11 19:13:25,151"

def format_time(datetime_string):
    # Parse the string, then format it back with millisecond precision.
    temp_datetime_1 = datetime.datetime.strptime(datetime_string, "%Y-%m-%d %H:%M:%S,%f")
    temp_datetime_2 = datetime.datetime.strftime(temp_datetime_1, "%Y-%m-%d %H:%M:%S,%f")[:-3]
    return temp_datetime_2

target_datetime = format_time(target_datetime_string)

client = Elasticsearch("client_address")

s = Search(using=client, index="index_name")\
    .filter("term", response_datetime=target_datetime)\
    .query("match", field1="field1_value")

response = s.execute()

for hit in response:
    print(response.hits.total)
    print(json.dumps(hit.to_dict(), sort_keys=True, indent=4))
    print(json.dumps(hit.meta.to_dict(), sort_keys=True, indent=4))

This still ends up returning no hits for me when I have validated it should. Works fine with the filter criteria removed and just the match criteria.

erstwild on 10 Feb 2016

@Erstwild Have you tried filtering for a range of datetime values?

s = Search(using=client, index="index_name")\
      .filter("range", response_datetime={'gte': your_datetime - timedelta(seconds=1), 'lt': your_datetime + timedelta(seconds=1)})\
      .query("match", field1="field1_value")

(Here your_datetime is a datetime.datetime object and timedelta comes from the datetime module.)

That should at least pull out the hit you're looking for (positive control), and if it doesn't, it might help you figure out what is going on...

njoannin on 10 Feb 2016

@njoannin @HonzaKral

Not sure what to conclude at this point:

from elasticsearch_dsl import Search, Q
from elasticsearch import Elasticsearch
import datetime
import json

target_datetime_string = "2015-12-11 19:13:10,262"

def format_time_gte(datetime_string):
    # Lower bound: three seconds before the target, formatted as a string.
    temp_datetime_1 = datetime.datetime.strptime(datetime_string, "%Y-%m-%d %H:%M:%S,%f") - datetime.timedelta(seconds=3)
    temp_datetime_2 = datetime.datetime.strftime(temp_datetime_1, "%Y-%m-%d %H:%M:%S,%f")[:-3]
    return temp_datetime_2

def format_time_lte(datetime_string):
    # Upper bound: three seconds after the target, formatted as a string.
    temp_datetime_1 = datetime.datetime.strptime(datetime_string, "%Y-%m-%d %H:%M:%S,%f") + datetime.timedelta(seconds=3)
    temp_datetime_2 = datetime.datetime.strftime(temp_datetime_1, "%Y-%m-%d %H:%M:%S,%f")[:-3]
    return temp_datetime_2

target_datetime_gte = format_time_gte(target_datetime_string)
target_datetime_lte = format_time_lte(target_datetime_string)

client = Elasticsearch("client")
s = Search(using=client, index="index_name")\
    .filter("range", response_datetime={"gte": target_datetime_gte, "lte": target_datetime_lte})\
    .query("match", field1="field1_value")

response = s.execute()

for hit in response:
    print(response.hits.total)
    print(json.dumps(hit.to_dict(), sort_keys=True, indent=4))
    print(json.dumps(hit.meta.to_dict(), sort_keys=True, indent=4))

Still same story unfortunately

erstwild on 10 Feb 2016

@Erstwild

Unless I am mistaken, your format_time_* functions return strings, which is what @HonzaKral was telling you not to use for specific searches. You could certainly simplify them by simply returning the actual datetime objects (in your functions, returning temp_datetime_1 would do the trick).

That said, I'm not quite sure that would cause the problem in the range filtering...

Could you confirm that your response_datetime field is correctly defined as a Date in your index?
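The suggestion to return datetime objects rather than strings could look roughly like the sketch below. The helper name datetime_window is made up for this example; it only builds the two bounds and does not talk to Elasticsearch.

```python
from datetime import datetime, timedelta


def datetime_window(raw, seconds=3, fmt="%Y-%m-%d %H:%M:%S,%f"):
    """Return (lower, upper) datetime objects around the parsed timestamp."""
    center = datetime.strptime(raw, fmt)
    delta = timedelta(seconds=seconds)
    return center - delta, center + delta


gte, lte = datetime_window("2015-12-11 19:13:10,262")

# These bounds would then be passed to the range filter, e.g.
# .filter("range", response_datetime={"gte": gte, "lte": lte})
range_bounds = {"gte": gte, "lte": lte}
```

One function with a seconds parameter replaces the two near-identical format_time_* helpers, and the values stay as datetime objects all the way to the filter.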

njoannin on 11 Feb 2016

I know this is closed but I recently came across what I believe is the same issue. After poking around it looks like it's due to AttrDict.__repr__.

https://github.com/elastic/elasticsearch-dsl-py/blob/master/elasticsearch_dsl/utils.py#L102

At first I thought it was just on the search side when returning results in order to make them pretty, but if I'm reading this correctly, it's also called when you save a document, so it gets stored in elasticsearch truncated.

Better idea of what I'm experiencing (and what it sounds like the OP was experiencing):

In [3]: res = es.search(index='messages', doc_type='socket_out', body={'query': {'match': {'<sensitive stuff here>'}}})

In [4]: res['hits']['total']
Out[4]: 203

In [5]: for doc in res['hits']['hits']:
   ...:     print doc['_id'], doc['_source']['kwargs']
   ...:
AV1e1iH2g0FpsVnHXI4- {'source': '597049ea04472147f51d37e3', 'to': '59704bd0044721...}
AV1e1iINaQ35CgFhhkoa {'source': '597049ea04472147f51d37e3', 'to': '59704bd0044721...}
AV1e19ZIi21-tS_pAv7f {'source': '54f64fb1d966207209915af7', 'to': '54f652c4d96620...}
AV1e2wYag0FpsVnHXI7e {'source': '54f652c4d96620720a2f1899', 'to': '54f64fb1d96620...}
AV1e5WxMaQ35CgFhhkwR {'source': '552de97be1cc881623725658', 'to': '556f2c48c5f7f6...}
AV1fADXDi21-tS_pAwN6 {'source': '597049ea04472147f51d37e3', 'video_id': u'2_MX40N...}
AV1fAhXTi21-tS_pAwPX {'source': '597049ea04472147f51d37e3', 'to': '59704bd0044721...}
AV1fAu0Ni21-tS_pAwQB {'source': '597049ea04472147f51d37e3', 'to': '597049ea044721...}
AV1fBOBjaQ35CgFhhk_Z {'source': '5594238c268a0c1ff3ea7f1a', 'video_id': u'2_MX40N...}
AV1fCN3ni21-tS_pAwUq {'source': '5594238c268a0c1ff3ea7f1a', 'to': '55943b4960032e...}

Messages are being saved via elasticsearch-dsl, and the kwargs field in this case is supposed to be serialized JSON; instead, we're stuck with a bunch of truncated strings.
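The effect described here can be reproduced with a small stand-in class. This is only a sketch: TruncatingDict is a made-up class, not the library's AttrDict, the 60-character cutoff and the payload are chosen for illustration, and it does not establish whether the library itself calls __repr__ when saving. It does show how calling str() on an object with a truncating __repr__ yields a shortened string, while serializing the underlying dict keeps everything.

```python
import json


class TruncatingDict(object):
    """Stand-in wrapper whose repr is shortened, for illustration only."""

    def __init__(self, data):
        self._d = data

    def __repr__(self):
        r = repr(self._d)
        if len(r) > 60:
            r = r[:60] + "...}"  # cut off, with a hint that more exists
        return r

    def to_dict(self):
        return self._d


# Made-up payload; the real field contents are not shown in the thread.
kwargs = TruncatingDict({"source": "a" * 24, "to": "b" * 24, "note": "made-up payload"})

shown = str(kwargs)                   # no __str__ defined, so this uses __repr__
full = json.dumps(kwargs.to_dict())   # complete, valid JSON
```

If a field value is built with str() or repr() on such an object before indexing, the truncated text is what gets stored; serializing the result of to_dict() avoids that.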