Ws: Per-Message Deflate Memory Leak

Created on 11 Aug 2019  路  26Comments  路  Source: websockets/ws

_sorry for not following the issue template, this way is easier to explain the issue, I will be happy to elaborate more_

Description

In a production server after running a long time / mass-disconnection, I have noticed memory leak according to zlib & websockets.

This happens to all servers, but extremely happened to one of our servers which got mass-disconnect from 2.5k connected users to 50 connceted users. memory dump on this server showed:

  • We have 87k instances of WebSocket class
  • AsyncLimiter Queue have thousands of items

I managed to reproduce locally via:

  • socket.io server (this my setup) with per-message deflate enabled
  • connected with 200 users simultaneously and disconnected at once - memory stays the same, did manual garbage collection, wait for 2 for minutes, taking memory dump shows server still have WebSocket classes in memory.
  • Connected again with 20 users and now I can't connect - not getting responses from the server. the server is stuck.

Root cause

Because of AsyncLimiter have 10 pending jobs that never finished, this happens as a result of a race - calling close on inflate/deflate won't close any further flushes.

Example:

inflate.write(...);
inflate.close(); // happened async
inflate.flush(() => /* won't be called */);

I made some patches in the code to make sure I call the callback, but I didn't like the solution of attaching a callback to inflate/deflate objects, since the object can handle multiple operations at the same time and we can have further racers.

In my opinion, we should decouple the inflate/deflate from per-message deflate and create a class which will have an internal queue and call to inflate/deflate, one cleanup it will flush the queue - this way we can the solve the issues I wrote above.

I can submit a pull request for the above approach.

Reproducible in:

  • ws version: 6.4.1
  • socket.io version: 2.2.0
  • Node.js version(s): v10.15.3
  • OS version(s): Mac (local) / Linux (production)

Memory dump screenshot

Screen Shot 2019-08-11 at 16 12 25

The picture above is a memory dump for the server which had mass-disconnect from 2.5k to 50 sockets:

  • 88,992 instance of PerMessageDeflate
  • 178,012 instances of WebSocket
  • Array / Queue - is AsyncLimiter queue

All 26 comments

async-limiter was added in https://github.com/websockets/ws/pull/1204 to work around this issue https://github.com/nodejs/node/issues/8871, specifically https://github.com/nodejs/node/issues/8871#issuecomment-250915913.

If I understand correctly the issue now is that it keeps stuff in its queue indefinitely due to never completing jobs (because flush() callback might never be called) right?

@lpinca hi!

I read both issues and the history behind it, but yes basically the issue that jobs are never completing and this way we don't clean up properly upon disconnection.

Are you sure this happens with inflate streams and not deflate? Asking because due to how the WebSocket is closed, close() should not be called on the deflate stream until there is data to read.

For deflate streams instead close() can also be called while data is being compressed.

@lpinca this happens on deflate, sorry

Ok it makes sense and I can see why jobs might never complete. This also made things worse

https://github.com/websockets/ws/blob/fa991731cca990f40ecedb120918d14d08129673/lib/permessage-deflate.js#L424-L432

but reverting it would cause other problems. I'm not sure how to properly fix this.

@lpinca in my tests the flush callback is never called after you call close, I suspected at this code but it wasn't relevant

If you are using Node.js 10 it is expected, it was called before. Anyway we have to store the callback somewhere and call it when the final close() completes without actually calling this other callback

https://github.com/websockets/ws/blob/fa991731cca990f40ecedb120918d14d08129673/lib/permessage-deflate.js#L315

if _deflate was nulled by the final close.

Thank you for the report. If you want to try to fix it please do. I'm not sure when I can do it.

@lpinca I have some fix currently and I don't like it.

Which is putting the callback in inflate as well, and on PerMessageDeflate#cleanup call those callbacks. but this is problematic IMO because we are settings callback on shared-resource.

A better fix IMO is to have a class which does deflate/inflate and have in its queue - data (to compress) and promise, this way we don't have a direct reference to WebSocket.

I'll play with this approach and update you

Which is putting the callback in inflate as well, and on PerMessageDeflate#cleanup call those callbacks. but this is problematic IMO because we are settings callback on shared-resource.

It is not a big problem. There is one PerMessageDeflate instance per WebSocket and one inflate/deflate stream per PerMessageDeflate instance. Furthermore there can only be a single write/compress at time so we have to save/call only the latest callback.

@lpinca how do you guarantee that you have single write at the time?

and what about reads (deflate) ?

They should not have the issue described here because close() should never be called on inflate streams while data is being decompressed due to how the WebSocket is closed.

Cool @lpinca ! I will make a PR.

I started re-making my old, the question is what should I pass in the arguments to the callback, passing data their I think is in-correct, I think I should pass Error make sense?

I would just call it without arguments. This callback

https://github.com/websockets/ws/blob/fa991731cca990f40ecedb120918d14d08129673/lib/permessage-deflate.js#L315

must not be called after PerMessageDeflate#cleanup() is called or we will end up writing on a closed socket. I think we need a guard for this.

We only have to ensure this

https://github.com/websockets/ws/blob/fa991731cca990f40ecedb120918d14d08129673/lib/permessage-deflate.js#L314

is called to signal that the job completed.

Umm, I actually thought about adding code in PerMessageDeflate#cleanup that will call the callback.

I am not sure how can I call the done() there without calling the callback? or I am not understanding you correctly?

Yes do in PerMessageDeflate#cleanup() but also add a guard before calling the callback so that the outer callback on line 351 will not be called. Makes sense?

Oh, you want here - https://github.com/websockets/ws/blob/fa991731cca990f40ecedb120918d14d08129673/lib/permessage-deflate.js#L314

something like:

    zlibLimiter.push((done) => {
      this._compress(data, fin, (err, result) => {
        done();
        if(result) { // call only if we got result
            callback(err, result);
        }
      });
    });

and in cleanup - call the callback like callback() ?

Something like this:

diff --git a/lib/permessage-deflate.js b/lib/permessage-deflate.js
index 29881b3..96fcf8b 100644
--- a/lib/permessage-deflate.js
+++ b/lib/permessage-deflate.js
@@ -134,6 +134,10 @@ class PerMessageDeflate {

     if (this._deflate) {
       this._deflate.close();
+      if (this._deflate.cb) {
+        this._doNotCallOuterCallback = true;
+        this._deflate.cb();
+      }
       this._deflate = null;
     }
   }
@@ -312,6 +316,8 @@ class PerMessageDeflate {
     zlibLimiter.push((done) => {
       this._compress(data, fin, (err, result) => {
         done();
+        if (this._doNotCallOuterCallback) return;
+
         callback(err, result);
       });
     });

with this._deflate.cb set where needed and possible cleaner :) Hope it makes sense.

make sense, thanks I'll PR later!

Here is something cleaner without using an additional flag:

diff --git a/lib/permessage-deflate.js b/lib/permessage-deflate.js
index 29881b3..204816e 100644
--- a/lib/permessage-deflate.js
+++ b/lib/permessage-deflate.js
@@ -134,6 +134,7 @@ class PerMessageDeflate {

     if (this._deflate) {
       this._deflate.close();
+      if (this._deflate[kCallback] !== null) this._deflate[kCallback]();
       this._deflate = null;
     }
   }
@@ -312,7 +313,7 @@ class PerMessageDeflate {
     zlibLimiter.push((done) => {
       this._compress(data, fin, (err, result) => {
         done();
-        callback(err, result);
+        if (err || result) callback(err, result);
       });
     });
   }
@@ -419,6 +420,8 @@ class PerMessageDeflate {
       this._deflate.on('data', deflateOnData);
     }

+    this._deflate[kCallback] = callback;
+
     this._deflate.write(data);
     this._deflate.flush(zlib.Z_SYNC_FLUSH, () => {
       if (!this._deflate) {
@@ -431,6 +434,8 @@ class PerMessageDeflate {
         return;
       }

+      this._deflate[kCallback] = null;
+
       let data = bufferUtil.concat(
         this._deflate[kBuffers],
         this._deflate[kTotalLength]

@lpinca liked your approach, removing the flag and updating the PR.

I'm not sure if this is related, but I'm having an issue where after having the ws server running for a few days, suddenly the server just starts ignoring all messages from the client (except for the connection message). After 4 hours of debugging, I finally found that the messages disappear in the call to decompress because they are just pushed into the zlibLimiter queue but the queue is never consumed, so the callback is never called.

image

Now in general, it seems insane to me to do decompression of like 20byte segments asynchronously?? but ignoring that I have no idea why this happens. Is this something that might be fixed with this patch? I'm using version [email protected] that does not contain it.

@phiresky the scenario I wrote in the PR is the same as yours, please try to upgrade to latest version and report back.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

sherikapotein picture sherikapotein  路  3Comments

bartosz-m picture bartosz-m  路  3Comments

makc picture makc  路  4Comments

ORESoftware picture ORESoftware  路  3Comments

HanHtoonAung picture HanHtoonAung  路  3Comments