Elasticsearch: Lots of ReplicationTasks sitting around

Created on 4 Aug 2016 · 12 Comments · Source: elastic/elasticsearch

User on the forum was seeing memory creeping up and up and up slowly, having to restart nodes. It turns out that he's accumulating ReplicationTasks in the task manager's map. We currently have no idea _why_ they are sticking around, but they are.

I'm marking this as a task manager issue even though it may very well be a "something else" issue. The task manager, by dint of being the lens through which we can look at Elasticsearch's guts, is where we are starting.

:Distributed/Task Management >bug v2.3.3

All 12 comments

This seems suspicious: https://github.com/elastic/elasticsearch/blob/v2.3.3/core/src/main/java/org/elasticsearch/transport/RequestHandlerRegistry.java#L78-L80

This section only unregisters the task if it fails. The only other way unregister gets called for a task is when the handler calls channel.sendResponse (because TransportChannelWrapper unregisters it then). Are there any cases where sendResponse might not be getting called by the handler?
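
A minimal, self-contained sketch of the flow described above (plain Java with simplified stand-ins, not the real RequestHandlerRegistry or TransportChannelWrapper classes): the registry removes the task when the handler throws, and otherwise relies on the wrapped channel's sendResponse to remove it, so a handler that never responds leaves the task in the map indefinitely.

```java
// Illustrative sketch only -- simplified stand-ins, not the actual Elasticsearch classes.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class TaskRegistrySketch {
    interface Channel { void sendResponse(Object response); }
    interface Handler { void handle(Object request, Channel channel) throws Exception; }

    private final Map<Long, Object> tasks = new ConcurrentHashMap<>();
    private final AtomicLong idGenerator = new AtomicLong();

    void processRequest(Object request, Handler handler, Channel rawChannel) {
        final long taskId = idGenerator.incrementAndGet();
        tasks.put(taskId, request); // register the task

        // Wrap the channel so that sending a response also unregisters the task,
        // mirroring the TransportChannelWrapper behaviour mentioned above.
        Channel wrapped = response -> {
            tasks.remove(taskId);   // unregister on response
            rawChannel.sendResponse(response);
        };

        try {
            handler.handle(request, wrapped);
        } catch (Exception e) {
            tasks.remove(taskId);   // unregister on failure
        }
        // If the handler neither throws nor calls sendResponse(),
        // the task is never removed from the map -- the leak discussed in this issue.
    }
}
```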

@wadey If sendResponse is not getting called it's a problem in itself. We can figure out where it happens by getting a list of currently running/leaking tasks or a heapdump.
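
As a rough illustration of what "a list of currently running/leaking tasks" could look like, here is a hypothetical helper; the RegisteredTask type and the map are stand-ins invented for this sketch, not Elasticsearch APIs. In practice the same information came from a heap dump, as the following comments show.

```java
// Hypothetical helper for illustration; RegisteredTask and the map are stand-ins.
import java.util.Map;
import java.util.concurrent.TimeUnit;

class LeakedTaskReport {
    static final class RegisteredTask {
        final long id;
        final String action;
        final long registeredAtMillis;

        RegisteredTask(long id, String action, long registeredAtMillis) {
            this.id = id;
            this.action = action;
            this.registeredAtMillis = registeredAtMillis;
        }
    }

    // Print every task that has been registered for longer than maxAgeMillis.
    static void reportLeaks(Map<Long, RegisteredTask> tasks, long maxAgeMillis) {
        long now = System.currentTimeMillis();
        for (RegisteredTask task : tasks.values()) {
            long age = now - task.registeredAtMillis;
            if (age > maxAgeMillis) {
                System.out.printf("possible leak: id=%d action=%s age=%d min%n",
                        task.id, task.action, TimeUnit.MILLISECONDS.toMinutes(age));
            }
        }
    }
}
```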

This section only unregisters the task if it fails. The only other way unregister gets called for a task is when the handler calls channel.sendResponse (because TransportChannelWrapper unregisters it then). Are there any cases where sendResponse might not be getting called by the handler?

We always send a response, even in the case of an exception. I wonder if they pile up when a node connection drops and we wait for a cluster state update to retry? Do they go away at some point? We wait, I think, 60 seconds by default, so they might pile up in the case of a hiccup?
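
To make the 60-second hypothesis above concrete, here is a minimal sketch (plain Java, not the actual Elasticsearch replication retry code) of "wait for a cluster state update, then retry or give up". Tasks parked in such a wait would linger only for the duration of a hiccup, not for days.

```java
// Illustrative sketch of a timed wait for a cluster state change, not real Elasticsearch code.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class RetryOnClusterStateSketch {
    private final CountDownLatch clusterStateChanged = new CountDownLatch(1);

    // Hypothetical hook invoked when a new cluster state arrives.
    void onClusterStateUpdate() {
        clusterStateChanged.countDown();
    }

    // Returns true if the retry ran, false if the 60-second wait timed out.
    boolean awaitAndRetry(Runnable retry) throws InterruptedException {
        if (clusterStateChanged.await(60, TimeUnit.SECONDS)) {
            retry.run();   // cluster state changed in time: retry the operation
            return true;
        }
        return false;      // timed out: the operation fails and its task is unregistered
    }
}
```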

We can figure out where it happens by getting a list of currently running/leaking tasks or a heapdump.

@imotov I don't feel comfortable sharing the heapdump as it contains a lot of internal information, but I'm trying to look through it and help debug this (I work with Max, the original reporter, on the forums). I am going to try to enable debug/trace logging for the related Replication class files to see if I can see any errors or retries happening.

Do they go away at some point? We wait, I think, 60 seconds by default, so they might pile up in the case of a hiccup?

@s1monw There are ReplicationTasks that are multiple days old inside of the tasks map. Example:

[VisualVM screenshot: a ReplicationTask instance in the tasks map, several days old]

@wadey Thanks! This is useful. Do most of these tasks say "indices:data/write/bulk[s]" in the action field? Could you give any more information about the state of the cluster over the last few days? Are all shards available for this index? Have you seen any failures? Did you have a lot of node restarts?

The original forum post indicates that the following plugins are used: searchguard-ssl, stats, and head.

Hey @imotov. The cluster state has been green (all shards are available) since the last rolling restart, which was ~1 week ago. We did restart 1 node yesterday after we made a request to the _task api which caused an OOM on the server, but in general we don't see restarts until we need to do a rolling restart of the whole cluster due to memory pressure. Nothing in the logs indicates any failures: no connection drops, no master elections, just create mapping and update mapping lines from when the log index rolls over. We use 3 plugins: elasticsearch-statsd, elasticsearch-head, and search-guard-ssl. We have 2 clusters: a staging cluster with 3 servers and a production cluster with 5 servers. The staging cluster has ~400 million documents and the production one has ~1 billion. We normally get ~2.5 weeks in staging before we need to rolling-restart the cluster, and ~1.5 weeks in production.

Looking into the action field question.

Do most of these tasks say "indices:data/write/bulk[s]" in the action field?

@imotov I used MAT to do a group by, here are the ReplicationTasks grouped by action:

[Eclipse Memory Analyzer screenshot: ReplicationTask instances grouped by action]

@wadey and @dopey thanks a lot for providing an awesome bug report. The information you provided and your timely responses were really helpful, and I was able to reproduce the issue locally. It looks like the issue is in the searchguard-ssl plugin that you are using. I opened floragunncom/search-guard-ssl/pull/31, which fixes the issue for me. Since it doesn't seem to be an Elasticsearch bug, I am going to close this issue for now. Please feel free to reopen if needed.

The bug is fixed in version 15 of Search Guard SSL (released today). Thx @imotov for providing a fix.

Thanks @imotov @nik9000 @floragunncom (and anyone else I missed) for the quick work in investigating and getting this resolved!

Hi, any idea when this fix will be made available in a new version of Elasticsearch? Thanks

@trentcioran The issue is with the plugin, not Elasticsearch, and a fix for the plugin has been made available.
