Elasticsearch version (bin/elasticsearch --version):
Version: 6.1.2, Build: 5b1fea5/2018-01-10T02:35:59.208Z, JVM: 1.8.0_181
Plugins installed: [] None
JVM version (java -version):
1.8.0_181
OS version (uname -a if on a Unix-like system):
CentOS Linux release 7.5.1804 (Core)
Description of the problem including expected versus actual behavior:
org.apache.http.concurrent.BasicFuture objects keep building up in the JVM heap, consuming most of the heap space and eventually leading to an OutOfMemory error, when Elasticsearch is under load and either does not respond to the performRequest call or takes more than 30 seconds to respond.
The error reported by the library in this case is:
java.io.IOException: listener timeout after waiting for [30000] ms
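For context, a minimal sketch of the call pattern that hits this timeout (host, index name and bulk payload are placeholders, not the actual application code):

// Minimal sketch of the call pattern that hits the listener timeout.
// Host, index name and payload are placeholders, not the real application code.
import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

import java.util.Collections;

public class BulkCaller {
    public static void main(String[] args) throws Exception {
        RestClient client = RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")).build();
        String bulkBody = "{\"index\":{\"_index\":\"test\",\"_type\":\"doc\"}}\n{\"field\":\"value\"}\n";
        NStringEntity entity = new NStringEntity(bulkBody, ContentType.APPLICATION_JSON);
        // If the server takes longer than maxRetryTimeoutMillis (30 s by default),
        // this throws "listener timeout after waiting for [30000] ms".
        Response response = client.performRequest("POST", "/_bulk",
                Collections.emptyMap(), entity);
        System.out.println(response.getStatusLine());
        client.close();
    }
}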
Steps to reproduce:
var http = require('http');
var timeout = 100000; // sleep 100 seconds
http.createServer(function (req, res) {
  setTimeout(function () {
    res.writeHead(200, {'Content-Type': 'text/plain'});
    res.end("Hello I am awake");
  }, timeout);
}).listen(9200);
Provide logs (if relevant):
Workaround:
This issue is not seen when RestClient.setMaxRetryTimeoutMillis(100000) is used, which results in a SocketTimeoutException instead of the listener timeout, as shown below.
Socket timeout Exception:
11:11:17.216 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.client.InternalIODispatch - http-outgoing-0 [ACTIVE] Timeout
11:11:17.216 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.conn.ManagedNHttpClientConnectionImpl - http-outgoing-0 127.0.0.1:52928<->127.0.0.1:9200[ACTIVE] [rw:w]: Shutdown
11:11:17.218 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.client.InternalHttpAsyncClient - [exchange: 1] connection aborted
11:11:17.218 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager - Releasing connection: [id: http-outgoing-0][route: {}->http://127.0.0.1:9200][total kept alive: 0; route allocated: 1 of 10; total allocated: 1 of 30]
11:11:17.218 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager - Connection released: [id: http-outgoing-0][route: {}->http://127.0.0.1:9200][total kept alive: 0; route allocated: 0 of 10; total allocated: 0 of 30]
11:11:17.222 [I/O dispatcher 1] DEBUG org.elasticsearch.client.RestClient - request [POST http://127.0.0.1:9200/_bulk] failed
java.net.SocketTimeoutException: null
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:375)
    at org.apache.http.impl.nio.client.InternalRequestExecutor.timeout(InternalRequestExecutor.java:116)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:263)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:492)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:213)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
    at java.lang.Thread.run(Thread.java:748)
11:11:17.223 [I/O dispatcher 1] DEBUG org.elasticsearch.client.RestClient - added host [http://127.0.0.1:9200] to blacklist
11:11:17.224 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.conn.ManagedNHttpClientConnectionImpl - http-outgoing-0 0.0.0.0:52928<->127.0.0.1:9200[CLOSED][]: Shutdown
11:11:17.224 [I/O dispatcher 1] DEBUG org.apache.http.impl.nio.client.InternalIODispatch - http-outgoing-0 [CLOSED]: Disconnected
Pinging @elastic/es-core-infra
@appadesai
Monitor the heap objects every 15 minutes and you will notice a buildup of BasicFuture objects with strong references from the GC root that do not get released.
Do the individual objects never get released (i.e. if you add logic to your tests that eventually stops the firing of requests in a hot loop), or is the overall number of these futures simply never going down?
It seems the former would be a problem, while the latter is somewhat expected given the way the async HTTP client works in the background (you can enqueue requests quicker than the server can handle them -> your queue grows).
The objects eventually get cleaned up if I stop sending the requests, but this does not happen while there are continuous requests to Elasticsearch under load. The cleanup happens quickly if the timeout (RestClient.setMaxRetryTimeoutMillis) is increased to 100 seconds, even with continuous load.
I am not sure what you are referring to as the former and the latter.
@appadesai
The cleanup happens quickly if the timeout (RestClient.setMaxRetryTimeoutMillis ) is increased to 100 seconds even with continuous load.
Interesting, will take a look and see if I can reproduce this :)
@appadesai I can't find any different behaviour here depending on the client-side timeout (30s vs 100s). What's the Java code you use for testing here? (Maybe you're leaking resources somehow in your tests and trigger the different behaviour that way?)
@original-brownbear I have uploaded test code that should help you reproduce the issue.
You will have to install Node.js from https://nodejs.org/en/download/ and start the dummy server as follows:
bin/node node_app.js
I have added two timeouts in the test client code (ESClientTest.java)
private final int maxRetryTimeoutMillis = 30000; //Timeout to reproduce leak
//private final int maxRetryTimeoutMillis = 200000; //Timeout to fix leak
Uncomment each one to reproduce or avoid the issue.
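The uploaded test code is not inlined in the thread; a rough sketch of what the hot loop presumably looks like (the class name, loop, and request body are assumptions based on the description above, not the actual uploaded file):

// Rough sketch of ESClientTest.java based on the description above; names and
// request bodies are assumptions, not the uploaded test code.
import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.RestClient;

import java.util.Collections;

public class ESClientTest {
    private final int maxRetryTimeoutMillis = 30000;   // Timeout to reproduce leak
    //private final int maxRetryTimeoutMillis = 200000; // Timeout to fix leak

    public void run() throws Exception {
        RestClient client = RestClient.builder(new HttpHost("127.0.0.1", 9200, "http"))
                .setMaxRetryTimeoutMillis(maxRetryTimeoutMillis)
                .build();
        NStringEntity entity = new NStringEntity(
                "{\"index\":{\"_index\":\"test\",\"_type\":\"doc\"}}\n{\"field\":\"value\"}\n",
                ContentType.APPLICATION_JSON);
        while (true) {
            try {
                // Each call waits maxRetryTimeoutMillis and then fails with the
                // listener timeout, while the dummy server keeps sleeping for 100 s.
                client.performRequest("POST", "/_bulk", Collections.emptyMap(), entity);
            } catch (Exception e) {
                // keep sending; the retained BasicFuture objects accumulate over time
            }
        }
    }

    public static void main(String[] args) throws Exception {
        new ESClientTest().run();
    }
}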
I have also added a screenshot of the profiler showing the growth of BasicFuture objects. Notice that these objects are strongly reachable.

Ok, I can confirm that the behaviour here is somewhat sub-optimal.
If we set a timeout smaller than the time taken by the request, we do not interrupt the request at all; we only throw an exception from org.elasticsearch.client.RestClient.SyncResponseListener#get.
=> we're leaking memory here if user code retries requests quicker than they time out at the HTTP client level, because we never interrupt any requests.
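A simplified illustration of that mechanism (this is not the actual RestClient source, just the shape of the problem): the synchronous listener gives up after maxRetryTimeoutMillis, but nothing cancels the request that is still queued inside the Apache async client, so its BasicFuture stays strongly referenced.

// Simplified illustration of the mechanism described above; not the actual
// RestClient source, just the shape of the problem.
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class SyncResponseListenerSketch {
    private final CountDownLatch latch = new CountDownLatch(1);
    private final long timeoutMillis;

    SyncResponseListenerSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void onResponse() {
        latch.countDown();
    }

    void get() throws IOException, InterruptedException {
        // If the async request has not completed within timeoutMillis, the caller
        // gets an exception -- but nothing cancels the request that is still queued
        // inside the async HTTP client, so its BasicFuture stays referenced.
        if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
            throw new IOException("listener timeout after waiting for [" + timeoutMillis + "] ms");
        }
    }
}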
Is there any workaround for this? I am experiencing this exact same problem. After some time, PoolingNHttpClientConnectionManager was 3GB in size! This was accumulated in the CPool leasingRequests LinkedList, containing more than 200,000 instances of BasicFuture, each at over 11KB.
I need to find a solution to prevent this behavior, so that memory is cleaned up appropriately.
We changed the maxRetryTimeout on the RestClient from the default 30 seconds to 200 seconds as a workaround for this issue:
RestClient.setMaxRetryTimeoutMillis(200000)
I'm using ES client version 6.3.2, and I don't see a setMaxRetryTimeoutMillis() method in RestClient. However, I do see a similarly-named method in RestClientBuilder, so hopefully setting that to 200000 will have the same effect.
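For reference, a sketch of the workaround on the 6.x low-level client via the builder (the 200000 value is the figure from the workaround above, not a recommendation):

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class ClientFactory {
    // Sketch of the workaround on the 6.x low-level client: the retry timeout is
    // configured on RestClientBuilder rather than on RestClient itself.
    public static RestClient build() {
        return RestClient.builder(new HttpHost("127.0.0.1", 9200, "http"))
                .setMaxRetryTimeoutMillis(200000) // raise from the 30 s default
                .build();
    }
}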
Increasing the maxRetryTimeout seems to have fixed our problem. When using the longer timeout, the memory leak does not occur. I would say that this is not a case of "sub-optimal performance", but rather should be considered a true bug and should be fixed.
Agreed @nathankopp, see #38425 and #38085.
Thanks for the detailed explanation. We've got other reports of issues caused by the max retry timeout, and the more I looked into it the more I realized that this timeout mechanism did not play well with the internals of the underlying http client that we rely on. I think it's better to rely on the different timeouts exposed by the http client directly and remove this additional mechanism that ends up causing more harm than good. We have deprecated max retry timeout in 6.x, as well as increased its default value to 90 seconds (#38425) and removed it in 7.0 (#38085).
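For anyone migrating, a sketch of relying on the underlying http client timeouts directly via the builder (the timeout values here are illustrative only):

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class TimeoutConfiguredClient {
    // Sketch of configuring the underlying Apache HTTP client timeouts directly,
    // instead of the deprecated/removed max retry timeout; values are illustrative.
    public static RestClient build() {
        return RestClient.builder(new HttpHost("127.0.0.1", 9200, "http"))
                .setRequestConfigCallback(requestConfigBuilder -> requestConfigBuilder
                        .setConnectTimeout(5000)    // time to establish the connection
                        .setSocketTimeout(120000))  // time to wait for a response
                .build();
    }
}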