While using scan() I get an http.client.LineTooLong exception.
The exception is non-deterministic, and from what I learned here the problem is in Elasticsearch.
How can I fix it?
Thank you.
The stack trace:
GET http://docker21751-env-8566034.hidora.com:9200/documents/doc/_search?scroll=5m&size=1000 [status:N/A request:0.189s]
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
chunked=chunked)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 387, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 383, in _make_request
httplib_response = conn.getresponse()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1331, in getresponse
response.begin()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 297, in begin
version, status, reason = self._read_status()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 260, in _read_status
raise LineTooLong("status line")
http.client.LineTooLong: got more than 65536 bytes when reading status line
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 166, in perform_request
response = self.pool.urlopen(method, url, body, retries=False, headers=request_headers, **kw)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 639, in urlopen
_stacktrace=sys.exc_info()[2])
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/util/retry.py", line 333, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/packages/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
chunked=chunked)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 387, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py", line 383, in _make_request
httplib_response = conn.getresponse()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 1331, in getresponse
response.begin()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 297, in begin
version, status, reason = self._read_status()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 260, in _read_status
raise LineTooLong("status line")
urllib3.exceptions.ProtocolError: ('Connection aborted.', LineTooLong('got more than 65536 bytes when reading status line',))
What does the code that produced this exception look like? Are you using any fork or multiprocessing?
Hi Honza, thank you for the reply.
I am using ProcessPoolExecutor (built on multiprocessing), and each process executes scan().
My documents might contain very long texts, and I assumed that an overly long response raises that exception.
Continuing with scan() exceptions, I am also facing:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/elasticsearch/serializer.py", line 38, in loads
return json.loads(s)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/decoder.py", line 355, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ':' delimiter: line 1 column 69432 (char 69431)
Traceback (most recent call last):
File "/Users/amihollander/Eclipse/WorkspaceMain/pii-analysis/piianalysis/piianalysis/analyse_service.py", line 29, in batch_process
for hit in search:
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/elasticsearch_dsl/search.py", line 701, in scan
**self._params
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/elasticsearch/helpers/__init__.py", line 364, in scan
request_timeout=request_timeout, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/elasticsearch/client/utils.py", line 76, in _wrapped
return func(*args, params=params, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/elasticsearch/client/__init__.py", line 655, in search
doc_type, '_search'), params=params, body=body)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/elasticsearch/transport.py", line 345, in perform_request
data = self.deserializer.loads(data, headers_response.get('content-type'))
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/elasticsearch/serializer.py", line 81, in loads
return deserializer.loads(s)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/elasticsearch/serializer.py", line 40, in loads
raise SerializationError(s, e)
elasticsearch.exceptions.SerializationError:
The data around the offset is as follows (I sanitized the URLs due to sensitivity):
{"_index":"documents","_type":"doc","_id":"http://x.html","_score":null,"_source":
{"id":"http://x.html"},"sort":[2082862]},{"_icle.aspx/21532"}"sort":[2084432]},
{"_index":"documents","_type":"doc","_id":"http://y","_score":null,"_source":{"id":"http://y"},"sort":
[2084433]},{"_index":"documents","_type":"doc","_id":"http://z","_score":null,"_source":
{"id":"http://z"},"sort":[2084486]}
Column 69432 is in the second line, at {"_icle.aspx/21532"}
What is happening is that once your process forks, both resulting processes share the same socket connection to Elasticsearch, so they can start reading the same stream, which results in parse failures.
The proper solution here is to (re)open the connection only inside the worker processes.
Hope this helps!