Node-slack-sdk: Memory leak on RTM client automatic reconnects

Created on 18 Jul 2016 · 15 Comments · Source: slackapi/node-slack-sdk

I'm still looking into the root cause of this, but noticed that there appears to be a memory leak when an RTM client loses its connection and performs automatic reconnects.

needs info v3

mpcowan

All 15 comments

Thanks for the heads-up! Let us know what your research turns up!

DEGoodmanWilson on 18 Jul 2016

I just upgraded my client from 2.0.6 to 3.5.1, and I now notice the memory leak as well. This is very annoying, as the app always ends up crashing against the Heroku memory limit. I'm trying to fix it and will report back if I find something.

mvaragnat on 25 Jul 2016

For the record, adding autoReconnect: false to the RTM client does not change the problem

mvaragnat on 25 Jul 2016

Downgrading back to 2.0.6 has solved the problem (@DEGoodmanWilson, if I were you I'd bump up the priority of that issue)

mvaragnat on 25 Jul 2016

Well poo.

DEGoodmanWilson on 25 Jul 2016

😄3

Is this issue still happening? I'm looking at node-slack-sdk to implement a bot with a very large number of connections, so any feedback is most welcome :smile: @mvaragnat are you still on 2.0.X? Thanks!

thbar on 10 Oct 2016

Also maybe worth renaming the issue from "Memory leak on RTM client automatic reconnects" to "Memory leak on RTM", if it is confirmed that the leak occurs even without automatic reconnect?

thbar on 10 Oct 2016

Hello @thbar, I think the title is appropriate because most memory issues I have seen of late are linked to Slack API disconnects. When Slack forces many connections to close (several hundred at once, like a server hiccup), my app usually explodes in memory usage and crashes. However, that's kind of a good thing, because it cleans the state and sets up a clean reconnect...

I am still on 2.0 but more because of "if it works, don't fix it". I think I traced the most aggravating issue to the New Relic monitoring module (go debug a memory leak when it's caused by the monitoring tool!)

Ping @DEGoodmanWilson by the way

mvaragnat on 10 Oct 2016

👍1

it sounds like what we need is a good reproduction case. if anyone who has experienced this issue (@mvaragnat, @mpcowan) has any guidance on how i'd be able to set up a test case for this, would you mind adding some implementation details? i could try spinning up a Docker container with a tiny mem limit, set up a couple hundred connections to the RTM API in my node process, and then simulate a disconnect. Does that sound adequate?

aoberoi on 2 Nov 2016

I think that would be a good way. I suggest you test with 1000 connections disconnecting at the exact same moment (like when the RTM server freezes, pong gets too old, and all connections reset at the exact same time).

Try to compare 3.6 to 2.0.6 in terms of memory management, perhaps. I found that 2.0.6 was not giving nearly as many problems as 3.X
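The shape of such a test (many clients dropping in the same tick, repeated over several cycles, with a heap sample taken each time) could be sketched roughly as below. This is only a scaffold under stated assumptions: `FakeRtmClient` is a hypothetical stand-in built on Node's `EventEmitter`, not the node-slack-sdk API, and `runMassDisconnect` and `heapUsedMb` are illustrative names. A real reproduction would swap in actual RTM clients and a real network interruption.

```javascript
const { EventEmitter } = require('events');

// Hypothetical stand-in for an RTM client. This is NOT the node-slack-sdk API;
// it only mimics the connect / lose-connection state changes discussed above.
class FakeRtmClient extends EventEmitter {
  constructor() {
    super();
    this.state = 'DISCONNECTED';
    this.reconnectAttempts = 0;
  }

  connect() {
    this.state = 'CONNECTED';
    this.emit('connected');
  }

  dropConnection() {
    this.state = 'ATTEMPTING_RECONNECT';
    this.reconnectAttempts += 1;
    this.emit('disconnected');
  }
}

// Current heap usage in megabytes, from Node's built-in process.memoryUsage().
function heapUsedMb() {
  return process.memoryUsage().heapUsed / (1024 * 1024);
}

// Connects `count` clients, then drops and restores all of them together,
// `cycles` times, recording one heap sample per cycle.
function runMassDisconnect(count, cycles) {
  const clients = Array.from({ length: count }, () => new FakeRtmClient());
  clients.forEach((client) => client.connect());

  const samples = [];
  for (let i = 0; i < cycles; i += 1) {
    clients.forEach((client) => client.dropConnection());
    // global.gc only exists when node is started with --expose-gc.
    if (typeof global.gc === 'function') global.gc();
    samples.push(heapUsedMb());
    clients.forEach((client) => client.connect());
  }
  return { clients, samples };
}
```

Running the same harness against two SDK versions and comparing the `samples` arrays is one way to approach the 2.0.6 versus 3.x comparison suggested above.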

mvaragnat on 2 Nov 2016

👍2

So I ran an experiment and I have some results to share:

[Screenshot, 2016-11-11: graph of memory usage during the experiment]

I ran my experiment program locally to analyze the memory behavior. What happens is that 1000 RTM connections are made, and as they all finally connect, i disconnect the network. Then when they are all in an ATTEMPTING_RECONNECT state, and a couple of GC events have occurred, I reconnect the network. I did this cycle ~7 times to observe the above.

If there were a memory leak, I would expect to see the memory usage rise unboundedly. Instead, the memory usage seems stable, except for during the first occurrence of reconnection attempts. I believe this is the expected behavior, and since the memory usage does not seem unbounded in other retries, I don't think we have a memory leak.
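The distinction drawn here, a one-time jump followed by a flat baseline versus growth on every cycle, can be checked numerically. The sketch below is one possible way to do that, not the analysis used for the graph: it fits a least-squares slope to per-cycle heap samples after skipping a warmup period. The function names, the `warmup` and `tolerance` defaults, and the sample numbers are all assumptions made for illustration.

```javascript
// Least-squares slope of the samples against their index (units per cycle).
function slope(samples) {
  const n = samples.length;
  const meanX = (n - 1) / 2;
  const meanY = samples.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  samples.forEach((y, x) => {
    num += (x - meanX) * (y - meanY);
    den += (x - meanX) ** 2;
  });
  return den === 0 ? 0 : num / den;
}

// Skips the first `warmup` samples (the initial jump), then reports a likely
// leak if memory grows by more than `tolerance` (a fraction of the mean) per cycle.
// Both defaults are arbitrary illustrative thresholds.
function looksLikeLeak(samples, { warmup = 1, tolerance = 0.02 } = {}) {
  const steady = samples.slice(warmup);
  if (steady.length < 3) {
    throw new Error('need at least 3 samples after warmup');
  }
  const mean = steady.reduce((a, b) => a + b, 0) / steady.length;
  return slope(steady) / mean > tolerance;
}

// Made-up heap sizes in MB, not the measurements from the experiment.
const boundedSpike = [100, 125, 124, 126, 125, 124, 126];
const steadyGrowth = [100, 120, 140, 160, 180, 200, 220];
console.log(looksLikeLeak(boundedSpike)); // false
console.log(looksLikeLeak(steadyGrowth)); // true
```

Samples are most meaningful when taken after a garbage collection, since heap usage between collections fluctuates for reasons unrelated to leaks.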

You can take a look at my raw results here: https://docs.google.com/spreadsheets/d/1wCKtZtOyTMFgIwG0fVoJCVN0DeJFWrErZHQcbucYLxM/edit?usp=sharing.

Feel free to give the experiment a go yourself, or even better, take a look at it and see if I am missing anything. I'd love to see your results.

As far as next steps go:

1. Can you all tell me on which versions of the SDK and of node you have observed your issues? I used the latest SDK (3.6.0) and node v7.0.0 for this graph.
2. Do you believe it's possible that your dynos/containers/VMs/droplets/instances are just under-provisioned? Perhaps you are running into the memory ceiling of the resources you have in your environment, and the process is crashing simply because it cannot service this type of demand. The data I collected suggests that you will need ~20-25% headroom on memory to deal with a massive disconnect event, but that the memory usage will return to a normal level afterwards.
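As a rough worked example of the headroom figure, the arithmetic can be written out as below. It assumes the headroom is a fraction of steady-state usage, which is one reading of the estimate above. The function names and the 400, 450, and 512 MB figures are illustrative only.

```javascript
// Expected peak during a mass disconnect, assuming the spike is `headroom`
// (a fraction, e.g. 0.25 for 25%) on top of steady-state usage.
function peakMemoryMb(steadyStateMb, headroom = 0.25) {
  return steadyStateMb * (1 + headroom);
}

// Whether that peak still fits under the memory limit of the environment.
function fitsInLimit(steadyStateMb, limitMb, headroom = 0.25) {
  return peakMemoryMb(steadyStateMb, headroom) <= limitMb;
}

// Illustrative numbers: a process idling at 400 MB under a 512 MB limit.
console.log(peakMemoryMb(400)); // 500
console.log(fitsInLimit(400, 512)); // true
console.log(fitsInLimit(450, 512)); // false (peak would be 562.5 MB)
```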

cc @mvaragnat @thbar @DEGoodmanWilson @mpcowan

aoberoi on 12 Nov 2016

👍2

Thank you @aoberoi for this great analysis. I agree that it's quite possible that the memory spike could crash the app. Perhaps, if my app is already running high on dyno memory (and hence slower to respond), or if there is a long-running task, it could skip sending a "ping" to the server, and the server would think the connection is dead. Reconnection attempts would then increase the memory load, slowing the app even more, until it crashes. Would that make sense?

One more thing I'd like to ask you, perhaps, would be to run these tests again on SDK 2.0.6. I saw far fewer issues with 2.0.6 than with 3.x, and I have the impression that something is different in terms of memory usage. I use Node 5 in both cases.

mvaragnat on 16 Nov 2016

i just wanted to leave a link here to a tool i thought might be useful for any future analysis: https://github.com/andywer/leakage

aoberoi on 28 Dec 2016

seeing as how we haven't had any other recent reports of this issue, i'm moving it to the "needs feedback" category. i'll need more data or reports of this issue in the wild in order to make progress.

aoberoi on 3 Oct 2017

i'm going to close this issue because it hasn't gotten any engagement in a long time, and it's probably only relevant to v3. if you do find this issue is impacting you, feel free to comment and i will reopen.

aoberoi on 10 Mar 2018
