Elasticsearch version (bin/elasticsearch --version):
Version: 6.1.2, Build: 5b1fea5/2018-01-10T02:35:59.208Z, JVM: 1.8.0_181
Plugins installed: [] None
JVM version (java -version):
1.8.0_181
OS version (uname -a if on a Unix-like system):
CentOS Linux release 7.5.1804 (Core)
Description of the problem including expected versus actual behavior:
org.apache.http.concurrent.BasicFuture objects keep building up in the JVM heap, consuming most of the heap space and eventually leading to an OutOfMemory error, when Elasticsearch is under load and either does not respond to the performRequest call or takes more than 30 seconds to respond.
The error reported by the library in this case is:
java.io.IOException: listener timeout after waiting for [30000] ms
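For context, a minimal sketch of the call pattern that hits this timeout (host, index name and bulk payload are placeholders, not the actual application code):

// Minimal sketch of the call pattern that hits the listener timeout.
// Host, index name and payload are placeholders, not the real application code.
import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

import java.util.Collections;

public class BulkCaller {
    public static void main(String[] args) throws Exception {
        RestClient client = RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")).build();
        String bulkBody = "{\"index\":{\"_index\":\"test\",\"_type\":\"doc\"}}\n{\"field\":\"value\"}\n";
        NStringEntity entity = new NStringEntity(bulkBody, ContentType.APPLICATION_JSON);
        // If the server takes longer than maxRetryTimeoutMillis (30 s by default),
        // this throws "listener timeout after waiting for [30000] ms".
        Response response = client.performRequest("POST", "/_bulk",
                Collections.emptyMap(), entity);
        System.out.println(response.getStatusLine());
        client.close();
    }
}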
Steps to reproduce:
var http = require('http');
var timeout = 100000; // sleep 100 seconds
http.createServer(function (req, res) {
  setTimeout(function () {
    res.writeHead(200, {'Content-Type': 'text/plain'});
    res.end("Hello I am awake");
  }, timeout);
}).listen(9200);
Provide logs (if relevant):
Workaround:
This issue is not seen when RestClient.setMaxRetryTimeoutMillis(100000) is used, which results in a SocketTimeoutException instead of the listener timeout, as shown below.
Socket timeout Exception:
11:11:17.216 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.client.InternalIODispatch - http-outgoing-0 [ACTIVE] Timeout
11:11:17.216 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.conn.ManagedNHttpClientConnectionImpl - http-outgoing-0 127.0.0.1:52928<->127.0.0.1:9200[ACTIVE] [rw:w]: Shutdown
11:11:17.218 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.client.InternalHttpAsyncClient - [exchange: 1] connection aborted
11:11:17.218 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager - Releasing connection: [id: http-outgoing-0][route: {}->http://127.0.0.1:9200][total kept alive: 0; route allocated: 1 of 10; total allocated: 1 of 30]
11:11:17.218 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager - Connection released: [id: http-outgoing-0][route: {}->http://127.0.0.1:9200][total kept alive: 0; route allocated: 0 of 10; total allocated: 0 of 30]
11:11:17.222 [I/O dispatcher 1] DEBUG org.elasticsearch.client.RestClient - request [POST http://127.0.0.1:9200/_bulk] failed
java.net.SocketTimeoutException: null
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:375)
    at org.apache.http.impl.nio.client.InternalRequestExecutor.timeout(InternalRequestExecutor.java:116)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:263)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:492)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:213)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
    at java.lang.Thread.run(Thread.java:748)
11:11:17.223 [I/O dispatcher 1] DEBUG org.elasticsearch.client.RestClient - added host [http://127.0.0.1:9200] to blacklist
11:11:17.224 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.conn.ManagedNHttpClientConnectionImpl - http-outgoing-0 0.0.0.0:52928<->127.0.0.1:9200[CLOSED][]: Shutdown
11:11:17.224 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.client.InternalIODispatch - http-outgoing-0 [CLOSED]: Disconnected
Pinging @elastic/es-core-infra
@appadesai
Monitor the heap objects every 15 minutes and you will notice a buildup of BasicFuture objects with strong references from the GC root that do not get released.
Do the individual objects never get released (i.e. if you add logic to your tests that eventually stops the firing of requests in a hot loop), or is the overall number of these futures simply never going down?
It seems the former would be a problem, while the latter is somewhat expected given the way the async HTTP client works in the background (you can enqueue requests quicker than the server can handle them -> your queue grows).
The objects eventually get cleaned up if I stop sending the requests, but this does not happen while there are continuous requests to Elasticsearch under load. The cleanup happens quickly if the timeout (RestClient.setMaxRetryTimeoutMillis) is increased to 100 seconds, even with continuous load.
I am not sure what you are referring to as the former and the latter.
@appadesai
The cleanup happens quickly if the timeout (RestClient.setMaxRetryTimeoutMillis ) is increased to 100 seconds even with continuous load.
Interesting, will take a look and see if I can reproduce this :)
@appadesai I can't find any different behaviour here depending on the client-side timeout (30s vs 100s). What's the Java code you use for testing here? (Maybe you're leaking resources somehow in your tests and trigger the different behaviour that way?)
@original-brownbear I have uploaded test code that should help you reproduce the issue.
You will have to install Node.js from https://nodejs.org/en/download/ and start the dummy server as follows:
bin/node node_app.js
I have added two timeouts in the test client code (ESClientTest.java)
private final int maxRetryTimeoutMillis = 30000; //Timeout to reproduce leak
//private final int maxRetryTimeoutMillis = 200000; //Timeout to fix leak
Uncomment each one to reproduce or avoid the issue.
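The uploaded test code is not inlined in the thread; a rough sketch of what the hot loop presumably looks like (the class name, loop, and request body are assumptions based on the description above, not the actual uploaded file):

// Rough sketch of ESClientTest.java based on the description above; names and
// request bodies are assumptions, not the uploaded test code.
import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.RestClient;

import java.util.Collections;

public class ESClientTest {
    private final int maxRetryTimeoutMillis = 30000;   // Timeout to reproduce leak
    //private final int maxRetryTimeoutMillis = 200000; // Timeout to fix leak

    public void run() throws Exception {
        RestClient client = RestClient.builder(new HttpHost("127.0.0.1", 9200, "http"))
                .setMaxRetryTimeoutMillis(maxRetryTimeoutMillis)
                .build();
        NStringEntity entity = new NStringEntity(
                "{\"index\":{\"_index\":\"test\",\"_type\":\"doc\"}}\n{\"field\":\"value\"}\n",
                ContentType.APPLICATION_JSON);
        while (true) {
            try {
                // Each call waits maxRetryTimeoutMillis and then fails with the
                // listener timeout, while the dummy server keeps sleeping for 100 s.
                client.performRequest("POST", "/_bulk", Collections.emptyMap(), entity);
            } catch (Exception e) {
                // keep sending; the retained BasicFuture objects accumulate over time
            }
        }
    }

    public static void main(String[] args) throws Exception {
        new ESClientTest().run();
    }
}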
I have also added a screenshot of the profiler showing the growth of BasicFuture objects. Notice that these objects are strongly reachable.

Ok, I can confirm that the behaviour here is somewhat sub-optimal.
If we set a timeout smaller than the time taken by the request, we do not interrupt the request at all; we only throw an exception from org.elasticsearch.client.RestClient.SyncResponseListener#get.
=> we're leaking memory here if user code retries requests quicker than they time out at the HTTP client level, because we never interrupt any requests.
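A simplified illustration of that mechanism (this is not the actual RestClient source, just the shape of the problem): the synchronous listener gives up after maxRetryTimeoutMillis, but nothing cancels the request that is still queued inside the Apache async client, so its BasicFuture stays strongly referenced.

// Simplified illustration of the mechanism described above; not the actual
// RestClient source, just the shape of the problem.
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class SyncResponseListenerSketch {
    private final CountDownLatch latch = new CountDownLatch(1);
    private final long timeoutMillis;

    SyncResponseListenerSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void onResponse() {
        latch.countDown();
    }

    void get() throws IOException, InterruptedException {
        // If the async request has not completed within timeoutMillis, the caller
        // gets an exception -- but nothing cancels the request that is still queued
        // inside the async HTTP client, so its BasicFuture stays referenced.
        if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
            throw new IOException("listener timeout after waiting for [" + timeoutMillis + "] ms");
        }
    }
}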
Is there any workaround for this? I am experiencing this exact same problem. After some time, PoolingNHttpClientConnectionManager was 3GB in size! This was accumulated in the CPool leasingRequests LinkedList, containing more than 200,000 instances of BasicFuture, each at over 11KB.
I need to find a solution to prevent this behavior, so that memory is cleaned up appropriately.
We changed the maxRetryTimeout on the RestClient from the default 30 seconds to 200 seconds as a workaround for this issue:
RestClient.setMaxRetryTimeoutMillis(200000)
I'm using ES client version 6.3.2, and I don't see a setMaxRetryTimeoutMillis() method in RestClient. However, I do see a similarly-named method in RestClientBuilder, so hopefully setting that to 200000 will have the same effect.
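For reference, a sketch of the workaround on the 6.x low-level client via the builder (the 200000 value is the figure from the workaround above, not a recommendation):

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class ClientFactory {
    // Sketch of the workaround on the 6.x low-level client: the retry timeout is
    // configured on RestClientBuilder rather than on RestClient itself.
    public static RestClient build() {
        return RestClient.builder(new HttpHost("127.0.0.1", 9200, "http"))
                .setMaxRetryTimeoutMillis(200000) // raise from the 30 s default
                .build();
    }
}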
Increasing the maxRetryTimeout seems to have fixed our problem. When using the longer timeout, the memory leak does not occur. I would say that this is not a case of "sub-optimal performance", but rather should be considered a true bug and should be fixed.
Agreed @nathankopp, see #38425 and #38085.
Thanks for the detailed explanation. We've got other reports of issues caused by the max retry timeout, and the more I looked into it the more I realized that this timeout mechanism did not play well with the internals of the underlying http client that we rely on. I think it's better to rely on the different timeouts exposed by the http client directly and remove this additional mechanism that ends up causing more harm than good. We have deprecated max retry timeout in 6.x, as well as increased its default value to 90 seconds (#38425) and removed it in 7.0 (#38085).
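For anyone migrating, a sketch of relying on the underlying http client timeouts directly via the builder (the timeout values here are illustrative only):

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class TimeoutConfiguredClient {
    // Sketch of configuring the underlying Apache HTTP client timeouts directly,
    // instead of the deprecated/removed max retry timeout; values are illustrative.
    public static RestClient build() {
        return RestClient.builder(new HttpHost("127.0.0.1", 9200, "http"))
                .setRequestConfigCallback(requestConfigBuilder -> requestConfigBuilder
                        .setConnectTimeout(5000)    // time to establish the connection
                        .setSocketTimeout(120000))  // time to wait for a response
                .build();
    }
}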