Elasticsearch-dsl-py: http.client.LineTooLong while using `scan()`

Created on 27 May 2018 · 4 comments · Source: elastic/elasticsearch-dsl-py

While using `scan()` I get an `http.client.LineTooLong` exception.
It is non-deterministic, and from what I have read the problem is in Elasticsearch.

How can I fix it?

Thank you

The stack trace:

GET http://docker21751-env-8566034.hidora.com:9200/documents/doc/_search?scroll=5m&size=1000 [status:N/A request:0.189s]
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 387, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 383, in _make_request
    httplib_response = conn.getresponse()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1331, in getresponse
    response.begin()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 260, in _read_status
    raise LineTooLong("status line")
http.client.LineTooLong: got more than 65536 bytes when reading status line

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 166, in perform_request
    response = self.pool.urlopen(method, url, body, retries=False, headers=request_headers, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/util/retry.py", line 333, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 387, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 383, in _make_request
    httplib_response = conn.getresponse()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1331, in getresponse
    response.begin()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 260, in _read_status
    raise LineTooLong("status line")
urllib3.exceptions.ProtocolError: ('Connection aborted.', LineTooLong('got more than 65536 bytes when reading status line',))

All 4 comments

What does the code that produced this exception look like? Are you using fork or multiprocessing?

Hi Honza, thank you for the reply.

I am using ProcessPoolExecutor (built on multiprocessing), and each process executes scan().

My documents might contain very long texts, so I assume the exception is raised because the response is too long.

In addition to the scan() exception above, I am also seeing:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/elasticsearch/serializer.py", line 38, in loads
    return json.loads(s)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ':' delimiter: line 1 column 69432 (char 69431)

Traceback (most recent call last):
  File "/Users/amihollander/Eclipse/WorkspaceMain/pii-analysis/piianalysis/piianalysis/analyse_service.py", line 29, in batch_process
    for hit in search:
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/elasticsearch_dsl/search.py", line 701, in scan
    **self._params
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/elasticsearch/helpers/__init__.py", line 364, in scan
    request_timeout=request_timeout, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/elasticsearch/client/utils.py", line 76, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/elasticsearch/client/__init__.py", line 655, in search
    doc_type, '_search'), params=params, body=body)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/elasticsearch/transport.py", line 345, in perform_request
    data = self.deserializer.loads(data, headers_response.get('content-type'))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/elasticsearch/serializer.py", line 81, in loads
    return deserializer.loads(s)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/elasticsearch/serializer.py", line 40, in loads
    raise SerializationError(s, e)
elasticsearch.exceptions.SerializationError:

The data around the offset is (I sanitized the urls due to sensitivity):

{"_index":"documents","_type":"doc","_id":"http://x.html","_score":null,"_source":
{"id":"http://x.html"},"sort":[2082862]},{"_icle.aspx/21532"}"sort":[2084432]},
{"_index":"documents","_type":"doc","_id":"http://y","_score":null,"_source":{"id":"http://y"},"sort":
[2084433]},{"_index":"documents","_type":"doc","_id":"http://z","_score":null,"_source":
{"id":"http://z"},"sort":[2084486]}

Column 69432 falls in the second line, at {"_icle.aspx/21532"}

What is happening is that once your process forks, both resulting processes share the same socket connection to Elasticsearch, so they can both read from the same stream, which results in parse failures.

The proper solution here is to (re)open the connection only inside the worker processes.
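
A minimal sketch of one way to do that, assuming a ProcessPoolExecutor setup like the one described above. The names `get_client`, `make_es_client`, `count_hits`, and `run_all`, as well as the host URL, are hypothetical and not taken from the original code. The idea is to keep one client per process ID, so a forked worker never reuses a socket opened by its parent:

```python
import os
from concurrent.futures import ProcessPoolExecutor

# One client per process ID. A forked child inherits this dict, but its own
# PID is not in it yet, so it opens a fresh connection instead of reusing
# the parent's socket.
_clients = {}


def get_client(factory):
    """Return this process's client, creating it on first use."""
    pid = os.getpid()
    if pid not in _clients:
        _clients[pid] = factory()
    return _clients[pid]


def make_es_client():
    # Local import keeps get_client() usable without the package installed.
    from elasticsearch import Elasticsearch
    return Elasticsearch(["http://localhost:9200"])  # hypothetical host


def count_hits(index_name):
    """Worker body: runs inside a child process and scans one index."""
    from elasticsearch_dsl import Search
    search = Search(using=get_client(make_es_client), index=index_name)
    return sum(1 for _ in search.scan())


def run_all(index_names, workers=4):
    """Fan the scans out; each worker builds its own client lazily."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_hits, index_names))
```

On Python 3.7 and later, the `initializer` argument of ProcessPoolExecutor could be used to open the connection when each worker starts. The traceback above shows Python 3.6, where that argument is not available, which is why this sketch creates the client lazily instead.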

Hope this helps!
