Elasticsearch: Lots of ReplicationTasks sitting around

Created on 4 Aug 2016 · 12 Comments · Source: elastic/elasticsearch

User on the forum was seeing memory creeping up and up and up slowly, having to restart nodes. It turns out that he's accumulating ReplicationTasks in the task manager's map. We currently have no idea _why_ they are sticking around, but they are.

I'm marking this as a task manager issue even though it may very well be a "something else" issue. The task manager, by dint of being the lens through which we can look at Elasticsearch's guts, is where we are starting.

:Distributed/Task Management >bug v2.3.3

All 12 comments

This seems suspicious: https://github.com/elastic/elasticsearch/blob/v2.3.3/core/src/main/java/org/elasticsearch/transport/RequestHandlerRegistry.java#L78-L80

This section only unregisters the task if it fails. The only other way unregister gets called for a task is when the handler calls channel.sendResponse (because TransportChannelWrapper unregisters it then). Are there any cases where sendResponse might not be getting called by the handler?
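
A minimal, self-contained sketch of the flow described above (plain Java with simplified stand-ins, not the real RequestHandlerRegistry or TransportChannelWrapper classes): the registry removes the task when the handler throws, and otherwise relies on the wrapped channel's sendResponse to remove it, so a handler that never responds leaves the task in the map indefinitely.

```java
// Illustrative sketch only -- simplified stand-ins, not the actual Elasticsearch classes.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class TaskRegistrySketch {
    interface Channel { void sendResponse(Object response); }
    interface Handler { void handle(Object request, Channel channel) throws Exception; }

    private final Map<Long, Object> tasks = new ConcurrentHashMap<>();
    private final AtomicLong idGenerator = new AtomicLong();

    void processRequest(Object request, Handler handler, Channel rawChannel) {
        final long taskId = idGenerator.incrementAndGet();
        tasks.put(taskId, request); // register the task

        // Wrap the channel so that sending a response also unregisters the task,
        // mirroring the TransportChannelWrapper behaviour mentioned above.
        Channel wrapped = response -> {
            tasks.remove(taskId);   // unregister on response
            rawChannel.sendResponse(response);
        };

        try {
            handler.handle(request, wrapped);
        } catch (Exception e) {
            tasks.remove(taskId);   // unregister on failure
        }
        // If the handler neither throws nor calls sendResponse(),
        // the task is never removed from the map -- the leak discussed in this issue.
    }
}
```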

@wadey If sendResponse is not getting called it's a problem in itself. We can figure out where it happens by getting a list of currently running/leaking tasks or a heapdump.
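
As a rough illustration of what "a list of currently running/leaking tasks" could look like, here is a hypothetical helper; the RegisteredTask type and the map are stand-ins invented for this sketch, not Elasticsearch APIs. In practice the same information came from a heap dump, as the following comments show.

```java
// Hypothetical helper for illustration; RegisteredTask and the map are stand-ins.
import java.util.Map;
import java.util.concurrent.TimeUnit;

class LeakedTaskReport {
    static final class RegisteredTask {
        final long id;
        final String action;
        final long registeredAtMillis;

        RegisteredTask(long id, String action, long registeredAtMillis) {
            this.id = id;
            this.action = action;
            this.registeredAtMillis = registeredAtMillis;
        }
    }

    // Print every task that has been registered for longer than maxAgeMillis.
    static void reportLeaks(Map<Long, RegisteredTask> tasks, long maxAgeMillis) {
        long now = System.currentTimeMillis();
        for (RegisteredTask task : tasks.values()) {
            long age = now - task.registeredAtMillis;
            if (age > maxAgeMillis) {
                System.out.printf("possible leak: id=%d action=%s age=%d min%n",
                        task.id, task.action, TimeUnit.MILLISECONDS.toMinutes(age));
            }
        }
    }
}
```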

This section only unregisters the task if it fails. The only other way unregister gets called for a task is when the handler calls channel.sendResponse (because TransportChannelWrapper unregisters it then). Are there any cases where sendResponse might not be getting called by the handler?

We always send a response, even in the case of an exception. I wonder if they pile up when a node connection drops and we wait for a cluster state update to retry? Do they go away at some point? We wait, I think, 60 seconds by default, so they might pile up in the case of a hiccup?
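
To make the 60-second hypothesis above concrete, here is a minimal sketch (plain Java, not the actual Elasticsearch replication retry code) of "wait for a cluster state update, then retry or give up". Tasks parked in such a wait would linger only for the duration of a hiccup, not for days.

```java
// Illustrative sketch of a timed wait for a cluster state change, not real Elasticsearch code.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class RetryOnClusterStateSketch {
    private final CountDownLatch clusterStateChanged = new CountDownLatch(1);

    // Hypothetical hook invoked when a new cluster state arrives.
    void onClusterStateUpdate() {
        clusterStateChanged.countDown();
    }

    // Returns true if the retry ran, false if the 60-second wait timed out.
    boolean awaitAndRetry(Runnable retry) throws InterruptedException {
        if (clusterStateChanged.await(60, TimeUnit.SECONDS)) {
            retry.run();   // cluster state changed in time: retry the operation
            return true;
        }
        return false;      // timed out: the operation fails and its task is unregistered
    }
}
```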

We can figure out where it happens by getting a list of currently running/leaking tasks or a heapdump.

@imotov I don't feel comfortable sharing the heapdump as it contains a lot of internal information, but I'm trying to look through it and help debug this (I work with Max, the original reporter, on the forums). I am going to try to enable debug/trace logging for the related Replication class files to see if I can see any errors or retries happening.

Do they go away at some point? We wait, I think, 60 seconds by default, so they might pile up in the case of a hiccup?

@s1monw There are ReplicationTasks that are multiple days old inside of the tasks map. Example:

[VisualVM screenshot: a ReplicationTask instance in the tasks map, several days old]

@wadey Thanks! This is useful. Do most of these tasks say "indices:data/write/bulk[s]" in the action field? Could you give any more information about the state of the cluster over the last few days? Are all shards available for this index? Have you seen any failures? Did you have a lot of node restarts?

The original forum post indicates that the following plugins are used: searchguard-ssl, stats, and head.

Hey @imotov. The cluster state has been green (all shards are available) since the last rolling restart, which was ~1 week ago. We did restart 1 node yesterday after we made a request to the _task api which caused an OOM on the server, but in general we don't see restarts until we need to do a rolling restart of the whole cluster due to memory pressure. Nothing in the logs indicates any failures: no connection drops, no master elections, just create mapping and update mapping lines from when the log index rolls over. We use 3 plugins: elasticsearch-statsd, elasticsearch-head, and search-guard-ssl. We have 2 clusters: a staging cluster with 3 servers and a production cluster with 5 servers. The staging cluster has ~400 million documents and the production one has ~1 billion. We normally get ~2.5 weeks in staging before we need to rolling-restart the cluster, and ~1.5 weeks in production.

Looking into the action field question.

Do most of these tasks say "indices:data/write/bulk[s]" in the action field?

@imotov I used MAT to do a group by, here are the ReplicationTasks grouped by action:

[Eclipse Memory Analyzer screenshot: ReplicationTask instances grouped by action]

@wadey and @dopey thanks a lot for providing an awesome bug report. The information you provided and your timely responses were really helpful, and I was able to reproduce the issue locally. It looks like the issue is in the searchguard-ssl plugin that you are using. I opened floragunncom/search-guard-ssl/pull/31, which fixes the issue for me. Since it doesn't seem to be an Elasticsearch bug, I am going to close this issue for now. Please feel free to reopen if needed.

The bug is fixed in version 15 of Search Guard SSL (released today). Thx @imotov for providing a fix.

Thanks @imotov @nik9000 @floragunncom (and anyone else I missed) for the quick work in investigating and getting this resolved!

Hi, any idea when this fix will be made available in a new version of Elasticsearch? Thanks

@trentcioran The issue is with the plugin, not Elasticsearch, and a fix for the plugin has been made available.
